feat: GSD context optimization with model routing and context masking

* docs: add context optimization design spec, implementation plan, and pi-layer research

- Spec: 6-change design for GSD extension context optimization
- Plan: 9-task TDD implementation plan with exact file paths and code
- Pi-layer doc: 10 infrastructure opportunities (research only, not planned)

Part of #3171, #3406, #3452, #3433.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(context): add observation masking for auto-mode sessions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(context): add phase handoff anchors for auto-mode

Introduces PhaseAnchor read/write utilities so downstream agents can
inherit decisions, blockers, and intent written at phase boundaries
without re-inferring from conversation history.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(context): add capability-aware model routing and context management preferences

Implement ADR-004 Phase 2 capability scoring with 7-dimension model
profiles, task requirement vectors, and weighted scoring. Add
ContextManagementConfig preferences for observation masking thresholds.
Wire capability scoring into auto-model-selection dispatch path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(context): wire observation masking, phase anchors, and tool truncation

Register observation masker in before_provider_request hook to replace
old tool results with placeholders during auto-mode. Add tool result
truncation (configurable via context_management.tool_result_max_chars).
Inject phase handoff anchors into prompt builders so downstream phases
inherit decisions from research/planning. Write anchors after successful
phase completion. Update ADR-004 status to Implemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: remove internal planning artifacts from PR

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add capability routing, observation masking, and context management

Update dynamic-model-routing.md with capability-aware scoring section.
Update token-optimization.md with observation masking, tool truncation,
and phase handoff anchor documentation. Update configuration.md with
context_management preference block and capability_routing flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Merge branch 'main' into feat/gsd-context-optimization

* fix: add context_management to known keys and prevent tool truncation state corruption

- Add missing 'context_management' to KNOWN_PREFERENCE_KEYS set so users
  don't get spurious unknown-key warnings when configuring it.
- Replace in-place mutation of tool result content with immutable spread
  to prevent corrupting shared conversation message objects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add stop and backtrack to triage-ui classification labels

The Classification type gained stop and backtrack variants from main
but triage-ui.ts was not updated, causing a TypeScript build failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: context masker and tool truncation operate on correct pi-ai message format

The observation masker and tool result truncation in before_provider_request
were checking m.type === "toolResult" but the actual pi-ai payload uses
m.role === "toolResult" with content as TextContent[] arrays (not strings).
bashExecution messages are converted to {role:"user"} by convertToLlm before
the hook fires, so checking m.type === "bashExecution" was a no-op.

- Fix context-masker to match on role, handle array content, detect bash
  results by their "Ran `" prefix
- Fix register-hooks truncation to operate on role:"toolResult" with
  array content blocks
- Update tests to use correct pi-ai LLM payload format

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Tom Boucher, 2026-04-04 01:02:35 -04:00, committed by GitHub
parent bb47f5a087 · commit 7d5bf63b2d
22 changed files with 1068 additions and 28 deletions


@ -1,8 +1,8 @@
# ADR-004: Capability-Aware Model Routing
**Status:** Proposed (Revised)
**Status:** Implemented (Phase 2)
**Date:** 2026-03-26
**Revised:** 2026-03-26
**Revised:** 2026-04-03
**Deciders:** Jeremy McSpadden
**Related:** ADR-003 (pipeline simplification), [Issue #2655](https://github.com/gsd-build/gsd-2/issues/2655), `docs/dynamic-model-routing.md`


@ -686,6 +686,7 @@ Complexity-based model routing. See [Dynamic Model Routing](./dynamic-model-rout
```yaml
dynamic_routing:
  enabled: true
  capability_routing: true # score models by task capability (v2.59)
  tier_models:
    light: claude-haiku-4-5
    standard: claude-sonnet-4-6
@ -695,6 +696,18 @@ dynamic_routing:
  cross_provider: true
```
### `context_management` (v2.59)
Controls observation masking and tool result truncation during auto-mode sessions. Reduces context bloat between compactions with zero LLM overhead.
```yaml
context_management:
  observation_masking: true # replace old tool results with placeholders (default: true)
  observation_mask_turns: 8 # keep results from last N user turns (1-50, default: 8)
  compaction_threshold_percent: 0.70 # target compaction at 70% context usage (0.5-0.95, default: 0.70)
  tool_result_max_chars: 800 # cap individual tool result content (200-10000, default: 800)
```
### `service_tier` (v2.42)
OpenAI service tier preference for supported models. Toggle with `/gsd fast`.


@ -70,6 +70,36 @@ When approaching the budget ceiling, the router progressively downgrades:
When enabled, the router may select models from providers other than your primary. This uses the built-in cost table to find the cheapest model at each tier. Requires the target provider to be configured.
## Capability-Aware Scoring
*Introduced in v2.59.0 (ADR-004 Phase 2)*
When `capability_routing` is enabled, the router goes beyond tier classification and scores models against task-specific capability requirements. Each known model has a 7-dimension profile:
| Dimension | What It Measures |
|-----------|-----------------|
| `coding` | Code generation, refactoring, implementation quality |
| `debugging` | Error diagnosis, fix accuracy |
| `research` | Information gathering, codebase exploration |
| `reasoning` | Multi-step logic, architectural decisions |
| `speed` | Response latency (inverse of cost) |
| `longContext` | Performance with large context windows |
| `instruction` | Adherence to structured instructions and templates |
Each unit type maps to a weighted requirement vector. For example, `execute-task` weights `coding: 0.9, reasoning: 0.6, debugging: 0.5` while `research-slice` weights `research: 0.9, reasoning: 0.7, longContext: 0.5`.
For `execute-task` units, the classifier also inspects task metadata (tags, description) to refine requirements. Documentation tasks boost `instruction` and lower `coding`; test tasks boost `debugging`.
Enable capability routing:
```yaml
dynamic_routing:
  enabled: true
  capability_routing: true
```
When enabled, models within the target tier are ranked by capability score rather than selected arbitrarily. When disabled (the default), the existing tier-only selection applies.
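A minimal sketch of how such weighted scoring could work — the profile values, helper names, and normalization below are illustrative, not the extension's actual implementation; only the `execute-task` requirement weights come from the text above:

```typescript
// Illustrative capability-weighted scoring (profile numbers are made up).
type CapabilityProfile = Record<string, number>; // 0-1 score per dimension

const profiles: Record<string, CapabilityProfile> = {
  "claude-sonnet-4-6": { coding: 0.9, debugging: 0.85, research: 0.8, reasoning: 0.85, speed: 0.7, longContext: 0.8, instruction: 0.9 },
  "claude-haiku-4-5": { coding: 0.7, debugging: 0.65, research: 0.7, reasoning: 0.6, speed: 0.95, longContext: 0.7, instruction: 0.8 },
};

// Requirement vector for execute-task, per the weights described above.
const executeTaskReq: CapabilityProfile = { coding: 0.9, reasoning: 0.6, debugging: 0.5 };

// Weighted average: sum(weight * profile value) / sum(weights).
function capabilityScore(profile: CapabilityProfile, req: CapabilityProfile): number {
  let num = 0, den = 0;
  for (const [dim, weight] of Object.entries(req)) {
    num += weight * (profile[dim] ?? 0);
    den += weight;
  }
  return den === 0 ? 0 : num / den;
}

// Rank candidate models within a tier by their score against the task vector.
function rankByCapability(modelIds: string[], req: CapabilityProfile): string[] {
  return [...modelIds].sort(
    (a, b) => capabilityScore(profiles[b] ?? {}, req) - capabilityScore(profiles[a] ?? {}, req),
  );
}
```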
## Complexity Classification
Units are classified using pure heuristics — no LLM calls, sub-millisecond:


@ -0,0 +1,198 @@
# pi-coding-agent: Context Optimization Opportunities
> **Status**: Research only — not planned for implementation.
> Scope: `packages/pi-coding-agent` and `packages/pi-agent-core` infrastructure.
> These changes would benefit every consumer of the pi engine, not just GSD.
---
## 1. Prompt Caching (`cache_control`) — Highest Impact
**Current state**: Every LLM call re-pays full input token cost for the system prompt, tool definitions, and context files. No `cache_control` breakpoints are set anywhere in the API call path.
**Opportunity**: Anthropic's KV cache delivers 90% cost reduction on cached tokens (0.1x input rate). Claude Code achieves 92-98% cache hit rates by placing stable content before volatile content.
**Where to instrument** (`packages/pi-ai/src/providers/anthropic.ts`):
- Set `cache_control: { type: "ephemeral" }` on the last tool definition block
- Set `cache_control` after the static system prompt sections (base boilerplate + context files)
- Leave the per-turn user message uncached
**Critical constraint**: The cache breakpoint must be placed *after* all static content and *before* any dynamic content (timestamps, per-request variables). Moving a timestamp before a cache breakpoint defeats it on every call.
**Cache hierarchy**: Tools → system → messages. Changing a tool definition invalidates system and message caches. Tool definitions should be sorted deterministically (alphabetically) to prevent spurious cache misses.
**Expected savings**: 80-90% reduction in input token cost for multi-turn sessions (the dominant cost pattern in GSD auto-mode).
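The breakpoint placement rules above can be sketched as a system-block builder. The `cache_control` field shape follows Anthropic's published prompt-caching format; the builder itself is illustrative:

```typescript
// Illustrative: arrange system content so everything before the cache
// breakpoint is stable and everything after it may change per request.
interface ContentBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

function buildSystemBlocks(
  staticPrompt: string,
  contextFiles: string[],
  dynamicSuffix: string,
): ContentBlock[] {
  const blocks: ContentBlock[] = [
    { type: "text", text: staticPrompt },
    ...contextFiles.map((text): ContentBlock => ({ type: "text", text })),
  ];
  // Breakpoint on the LAST static block: the provider caches everything up to here.
  blocks[blocks.length - 1].cache_control = { type: "ephemeral" };
  // Dynamic content (timestamps, per-request state) must come AFTER the
  // breakpoint, or it invalidates the cache on every call.
  if (dynamicSuffix) blocks.push({ type: "text", text: dynamicSuffix });
  return blocks;
}
```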
---
## 2. Observation Masking in the Message Pipeline
**Current state**: `agent-loop.ts` passes the full `context.messages` array to the LLM on every turn. Tool results from 50 turns ago are re-read in full on every subsequent call. The `transformContext` hook exists on `AgentContext` and fires before every LLM call, but has no default implementation — extensions are responsible for any pruning.
**Opportunity**: Replace old tool result content with lightweight placeholders after N turns. JetBrains Research tested this on SWE-bench Verified (500 tasks, up to 250-turn trajectories) and found:
- 50%+ cost reduction vs. unmanaged history
- Performance matched or slightly exceeded LLM summarization
- Zero overhead (no extra LLM call required)
**Proposed implementation** (default `transformContext` in `pi-agent-core`):
```typescript
// Keep last KEEP_RECENT_TURNS verbatim; mask older tool results
const KEEP_RECENT_TURNS = 8;

function defaultObservationMask(messages: AgentMessage[]): AgentMessage[] {
  const cutoff = findTurnBoundary(messages, KEEP_RECENT_TURNS);
  return messages.map((m, i) => {
    if (i >= cutoff) return m;
    if (m.type === "toolResult" || m.type === "bashExecution") {
      return { ...m, content: "[result masked — within summarized history]", excludeFromContext: false };
    }
    return m;
  });
}
```
**Compaction interaction**: Observation masking reduces the token accumulation rate, pushing the compaction threshold further out. The two mechanisms are complementary — masking handles the steady state, compaction handles the rare deep-session case.
---
## 3. Earlier Compaction Threshold
**Current state** (`packages/pi-coding-agent/src/core/constants.ts`):
```typescript
COMPACTION_RESERVE_TOKENS = 16_384 // triggers at contextWindow - 16K
COMPACTION_KEEP_RECENT_TOKENS = 20_000
```
For a 200K context window, compaction fires at ~183K tokens — 91.5% utilization.
**Problem**: Context drift (not raw exhaustion) causes ~65% of enterprise agent failures. Performance degrades measurably beyond ~30K tokens per Zylos production data. The current threshold lets sessions run degraded for a long stretch before compaction fires.
**Opportunity**: Lower the trigger to 70% utilization. For a 200K window, this means compacting at ~140K tokens — 43K tokens earlier.
```typescript
// Proposed
COMPACTION_THRESHOLD_PERCENT = 0.70 // fire at 70% of contextWindow
COMPACTION_RESERVE_TOKENS = contextWindow * (1 - COMPACTION_THRESHOLD_PERCENT)
```
**Trade-off**: More frequent compactions, each happening earlier when there's more "fresh" content to keep. Summary quality improves because less material needs to be discarded at each cut.
---
## 4. Tool Result Truncation at Write Time
**Current state**: `TOOL_RESULT_MAX_CHARS = 2_000` in `constants.ts`, but this limit is only applied *during compaction summarization*, not when the tool result enters the message store. A bash result returning 50KB of log output is stored and re-sent verbatim until compaction fires.
**Opportunity**: Truncate at write time in `messages.ts`'s `convertToLlm()` or in the tool result handler. Two strategies:
- **Hard truncation**: Slice at N chars, append `"\n[truncated — {original_length} chars]"`. Simple, zero overhead.
- **Semantic head/tail**: Keep first 500 chars (context, command echo) + last 1000 chars (final output, errors). Better for bash results where the end contains the error.
**Recommendation**: Semantic head/tail as the default, configurable per tool type. File read results benefit from head; bash/test output benefits from head+tail.
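A sketch of the semantic head/tail strategy — the limits and the marker text are illustrative:

```typescript
// Keep the head (context, command echo) and tail (final output, errors);
// drop the middle with an explicit marker so the model knows content was cut.
function truncateHeadTail(text: string, head = 500, tail = 1000): string {
  if (text.length <= head + tail) return text;
  const omitted = text.length - head - tail;
  return (
    text.slice(0, head) +
    `\n[... ${omitted} chars truncated ...]\n` +
    text.slice(text.length - tail)
  );
}
```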
---
## 5. Context File Deduplication and Trim
**Current state** (`packages/pi-coding-agent/src/core/resource-loader.ts`, lines 84-109):
- Searches from `~/.gsd/agent/` → ancestor dirs → cwd
- Deduplicates by *file path* but not by *content*
- Entire file content concatenated verbatim into system prompt — no trimming, no summarization
**Anti-pattern**: A project with AGENTS.md at 3 ancestor levels (repo root, workspace, home) injects all three in full. If they share common boilerplate, that content is re-injected multiple times.
**Opportunities**:
1. **Content deduplication**: Hash paragraph-level chunks; skip any chunk already seen in a previously-loaded file
2. **Section-aware loading**: Parse `## ` headings in AGENTS.md; only include sections relevant to the current task type (e.g., `## Testing` section only when running tests)
3. **Token budget enforcement**: If total context files exceed N tokens, summarize oldest/most-distant file rather than including verbatim
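Strategy 1 could look roughly like this — chunking on blank lines and SHA-256 hashing are illustrative choices, not a committed design:

```typescript
import { createHash } from "node:crypto";

// Drop any paragraph-level chunk whose content was already emitted by an
// earlier (higher-priority) context file.
function dedupeContextFiles(files: string[]): string[] {
  const seen = new Set<string>();
  return files.map((file) =>
    file
      .split(/\n{2,}/) // paragraph-level chunks
      .filter((chunk) => {
        const h = createHash("sha256").update(chunk.trim()).digest("hex");
        if (seen.has(h)) return false; // duplicate boilerplate: skip
        seen.add(h);
        return true;
      })
      .join("\n\n"),
  );
}
```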
---
## 6. Skill Content Lazy Loading and Summarization
**Current state**: When `/skill:name` is invoked, the full skill file content is injected inline as `<skill>...</skill>` in the user message. No chunking, no summarization. A 10KB skill file adds ~2,500 tokens to that turn.
**Opportunity**:
- **Cached skill injection**: If the same skill is used across multiple turns (rare but possible), it's re-injected each time. Cache with `cache_control` after first injection.
- **Skill digest mode**: Inject a 200-token summary of the skill on first reference; full content only if the model requests it via a `get_skill_detail` tool call. Reduces cost for skills that don't end up being followed.
- **Skill prefetching**: Before a known long session (e.g., auto-mode start), pre-inject all likely skills with `cache_control` so they're cached for the entire session.
---
## 7. Token Estimation Accuracy
**Current state** (`compaction.ts`, line 216): `chars / 4` heuristic. At ~3.5 chars/token, English prose needs more tokens than this predicts, and code with short identifiers or Unicode diverges even further.
**Opportunity**: Use a proper tokenizer.
- `@anthropic-ai/tokenizer` (tiktoken-compatible, ships with the SDK) — accurate but ~5ms per call
- Tiered approach: use chars/4 for display; use proper tokenizer only for compaction threshold decisions (where accuracy matters)
**Impact**: More accurate compaction timing, fewer unnecessary compactions, slightly better `COMPACTION_KEEP_RECENT_TOKENS` boundary placement.
---
## 8. Format: Markdown over XML for Internal Context
**Current state**: The message pipeline uses `<skill>`, `<summary>`, `<compaction>` XML wrappers in several places. System prompt sections are largely prose Markdown.
**Findings**: XML tags carry 15-40% more tokens than equivalent Markdown for the same semantic content, due to paired open/close tags. However, Claude was optimized for XML and shows higher accuracy on tasks requiring precise section parsing.
**Recommendation**: Audit XML usage in the pipeline and convert to Markdown where the content is:
- Non-nested (flat instructions, status messages)
- Human-readable rather than machine-parsed by the model
- Not requiring precise boundary detection
Keep XML for: few-shot examples with ambiguous boundaries, skill content (requires precise isolation from surrounding text), compaction summaries that the model must treat as authoritative history.
**Estimated savings**: 5-15% reduction in system prompt token count.
---
## 9. Dynamic Tool Set Delivery
**Current state**: All tool definitions are included in every LLM request. Tool descriptions consume 60-80% of input tokens in static configurations. As new extensions register tools, the baseline grows linearly.
**Opportunity** (higher complexity): Implement the three-function Dynamic Toolset pattern:
1. `search_tools(query)` — semantic search over tool catalog
2. `describe_tools(ids[])` — fetch full schemas on demand
3. `execute_tool(id, params)` — unchanged execution
Speakeasy measured 91-97% token reduction with 100% task success rate. Trade-off: 2-3x more tool calls, ~50% longer wall time. Net cost is dramatically lower.
**Feasibility for pi**: The tool registry (`packages/pi-coding-agent/src/core/tool-registry.ts`) already stores tool metadata separately from definitions. The primary engineering work is the semantic search index and the `describe_tools` / `search_tools` tool implementations.
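Under the assumption of an in-memory catalog (with substring matching standing in for the semantic index), the first two functions of the pattern might look like:

```typescript
// Illustrative three-function toolset facade: only tool *summaries* travel in
// the base prompt; full schemas are fetched on demand.
interface ToolEntry {
  id: string;
  summary: string;
  schema: Record<string, unknown>;
}

class DynamicToolset {
  constructor(private catalog: ToolEntry[]) {}

  // search_tools(query): candidates by keyword (a real index would be semantic)
  searchTools(query: string): Array<{ id: string; summary: string }> {
    const q = query.toLowerCase();
    return this.catalog
      .filter((t) => t.id.includes(q) || t.summary.toLowerCase().includes(q))
      .map(({ id, summary }) => ({ id, summary }));
  }

  // describe_tools(ids): full schemas only for the tools the model selected
  describeTools(ids: string[]): ToolEntry[] {
    return this.catalog.filter((t) => ids.includes(t.id));
  }
}
```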
---
## 10. Cost Attribution and Per-Phase Reporting
**Current state**: `SessionManager.getUsageTotals()` accumulates cost across the entire session. No per-phase or per-agent breakdown is stored. Cost visibility is limited to the footer total and `GSD_SHOW_TOKEN_COST=1` per-turn display.
**Opportunity**: Emit structured cost events that extensions can subscribe to:
```typescript
interface CostCheckpointEvent {
  type: "cost_checkpoint";
  label: string; // "discuss-phase", "execute-slice-3"
  deltaTokens: Usage; // tokens since last checkpoint
  cumulativeTokens: Usage;
  cumulativeCost: number;
}
```
GSD extension could consume these events to surface per-milestone cost in `/gsd stats` and flag milestones that are disproportionately expensive — enabling budget-aware planning.
---
## Implementation Ordering (if pursued)
| Priority | Item | Effort | Expected Impact |
|----------|------|--------|-----------------|
| 1 | Prompt caching (`cache_control`) | Low | 80-90% input cost reduction |
| 2 | Earlier compaction threshold (70%) | Trivial | Reduces drift in long sessions |
| 3 | Tool result truncation at write time | Low | Reduces context bloat between compactions |
| 4 | Context file deduplication | Medium | Variable — high for multi-level AGENTS.md setups |
| 5 | Observation masking (default `transformContext`) | Medium | 50%+ on long-running agents |
| 6 | Token estimation (proper tokenizer) | Low | Accuracy improvement, minor cost impact |
| 7 | Markdown over XML audit | Low | 5-15% system prompt reduction |
| 8 | Skill caching with `cache_control` | Low | Meaningful for skill-heavy sessions |
| 9 | Dynamic tool set delivery | High | 90%+ on large tool catalogs; major architecture change |
| 10 | Per-phase cost attribution events | Medium | Visibility only; enables future budget routing |


@ -262,15 +262,59 @@ PREFERENCES.md
├─ resolveProfileDefaults() → model defaults + phase skip defaults
├─ resolveInlineLevel() → standard
│ └─ prompt builders gate context inclusion by level
└─ classifyUnitComplexity() → routes to execution/execution_simple model
├─ task plan analysis (steps, files, signals)
├─ unit type defaults
├─ budget pressure adjustment
└─ adaptive learning from routing-history.json
├─ classifyUnitComplexity() → routes to execution/execution_simple model
│ ├─ task plan analysis (steps, files, signals)
│ ├─ unit type defaults
│ ├─ budget pressure adjustment
│ ├─ adaptive learning from routing-history.json
│ └─ capability scoring (when capability_routing: true)
│ └─ 7-dimension model profiles × task requirement vectors
└─ context_management
├─ observation masking (before_provider_request hook)
├─ tool result truncation (tool_result_max_chars)
└─ phase handoff anchors (injected into prompt builders)
```
The profile is resolved once and flows through the entire dispatch pipeline. Explicit preferences override profile defaults at every layer.
## Observation Masking
*Introduced in v2.59.0*
During auto-mode sessions, tool results accumulate in the conversation history and consume context window space. Observation masking replaces tool result content older than N user turns with a lightweight placeholder before each LLM call. This reduces token usage with zero LLM overhead — no summarization calls, no latency.
Masking is enabled by default during auto-mode. Configure via preferences:
```yaml
context_management:
  observation_masking: true # default: true (set false to disable)
  observation_mask_turns: 8 # keep results from last 8 user turns (range: 1-50)
  tool_result_max_chars: 800 # truncate individual tool results beyond this length
```
### How It Works
1. Before each provider request, the `before_provider_request` hook inspects the messages array
2. Tool results (`toolResult`, `bashExecution`) older than the configured turn threshold are replaced with `[result masked — within summarized history]`
3. Recent tool results (within the keep window) are preserved in full
4. All assistant and user messages are always preserved — only tool result content is masked
This pairs with the existing compaction system: masking reduces context pressure between compactions, and compaction handles the full context reset when the window fills.
### Tool Result Truncation
Individual tool results that exceed `tool_result_max_chars` (default: 800) are truncated with a `…[truncated]` marker. This prevents a single large tool output from dominating the context window.
## Phase Handoff Anchors
*Introduced in v2.59.0*
When auto-mode transitions between phases (research → planning → execution), structured JSON anchors are written to `.gsd/milestones/<mid>/anchors/<phase>.json`. Downstream prompt builders inject these anchors so the next phase inherits intent, decisions, blockers, and next steps without re-inferring from artifact files.
This targets context drift, the failure mode behind roughly 65% of enterprise agent failures: agents losing track of prior decisions across phase boundaries.
Anchors are written automatically after successful completion of `research-milestone`, `research-slice`, `plan-milestone`, and `plan-slice` units. No configuration needed.
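Based on the anchor schema written in this PR, a freshly written anchor file looks roughly like this (the milestone and unit IDs are made up for illustration):

```json
{
  "phase": "research-slice",
  "milestoneId": "m03",
  "generatedAt": "2026-04-03T18:24:00.000Z",
  "intent": "Completed research-slice for m03-s01",
  "decisions": [],
  "blockers": [],
  "nextSteps": []
}
```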
## Prompt Compression
*Introduced in v2.29.0*


@ -9,7 +9,7 @@ import type { ExtensionAPI, ExtensionContext } from "@gsd/pi-coding-agent";
import type { GSDPreferences } from "./preferences.js";
import { resolveModelWithFallbacksForUnit, resolveDynamicRoutingConfig } from "./preferences.js";
import type { ComplexityTier } from "./complexity-classifier.js";
import { classifyUnitComplexity, tierLabel } from "./complexity-classifier.js";
import { classifyUnitComplexity, tierLabel, extractTaskMetadata } from "./complexity-classifier.js";
import { resolveModelForComplexity, escalateTier } from "./model-router.js";
import { getLedger, getProjectTotals } from "./metrics.js";
import { unitPhaseLabel } from "./auto-dashboard.js";
@ -107,7 +107,15 @@ export async function selectAndApplyModel(
}
}
const routingResult = resolveModelForComplexity(classification, modelConfig, routingConfig, availableModelIds);
// Extract task metadata for capability scoring
const taskMeta = unitType === "execute-task"
? extractTaskMetadata(unitId, basePath)
: undefined;
const routingResult = resolveModelForComplexity(
classification, modelConfig, routingConfig, availableModelIds,
unitType, taskMeta,
);
if (routingResult.wasDowngraded) {
effectiveModelConfig = {
@ -115,8 +123,9 @@ export async function selectAndApplyModel(
fallbacks: routingResult.fallbacks,
};
if (verbose) {
const method = routingResult.selectionMethod === "capability-scored" ? "capability-scored" : "tier-only";
ctx.ui.notify(
`Dynamic routing [${tierLabel(classification.tier)}]: ${routingResult.modelId} (${classification.reason})`,
`Dynamic routing [${tierLabel(classification.tier)}]: ${routingResult.modelId} (${method}, ${classification.reason})`,
"info",
);
}


@ -26,6 +26,7 @@ import { existsSync } from "node:fs";
import { computeBudgets, resolveExecutorContextWindow, truncateAtSectionBoundary } from "./context-budget.js";
import { getPendingGates } from "./gsd-db.js";
import { formatDecisionsCompact, formatRequirementsCompact } from "./structured-data-formatter.js";
import { readPhaseAnchor, formatAnchorForPrompt } from "./phase-anchor.js";
// ─── Preamble Cap ─────────────────────────────────────────────────────────────
@ -906,6 +907,11 @@ export async function buildPlanMilestonePrompt(mid: string, midTitle: string, ba
const researchRel = relMilestoneFile(base, mid, "RESEARCH");
const inlined: string[] = [];
// Inject phase handoff anchor from research phase (if available)
const researchAnchor = readPhaseAnchor(base, mid, "research-milestone");
if (researchAnchor) inlined.push(formatAnchorForPrompt(researchAnchor));
inlined.push(await inlineFile(contextPath, contextRel, "Milestone Context"));
const researchInline = await inlineFileOptional(researchPath, researchRel, "Milestone Research");
if (researchInline) inlined.push(researchInline);
@ -1033,6 +1039,11 @@ export async function buildPlanSlicePrompt(
const researchRel = relSliceFile(base, mid, sid, "RESEARCH");
const inlined: string[] = [];
// Inject phase handoff anchor from research phase (if available)
const researchSliceAnchor = readPhaseAnchor(base, mid, "research-slice");
if (researchSliceAnchor) inlined.push(formatAnchorForPrompt(researchSliceAnchor));
inlined.push(await inlineFile(roadmapPath, roadmapRel, "Milestone Roadmap"));
const researchInline = await inlineFileOptional(researchPath, researchRel, "Slice Research");
if (researchInline) inlined.push(researchInline);
@ -1100,6 +1111,9 @@ export async function buildExecuteTaskPrompt(
: { level: level as InlineLevel | undefined };
const inlineLevel = opts.level ?? resolveInlineLevel();
// Inject phase handoff anchor from planning phase (if available)
const planAnchor = readPhaseAnchor(base, mid, "plan-slice");
const priorSummaries = opts.carryForwardPaths ?? await getPriorTaskSummaryPaths(mid, sid, tid, base);
const priorLines = priorSummaries.length > 0
? priorSummaries.map(p => `- \`${p}\``).join("\n")
@ -1190,9 +1204,12 @@ export async function buildExecuteTaskPrompt(
? `### Runtime Context\nSource: \`.gsd/RUNTIME.md\`\n\n${runtimeContent.trim()}`
: "";
const phaseAnchorSection = planAnchor ? formatAnchorForPrompt(planAnchor) : "";
return loadPrompt("execute-task", {
overridesSection,
runtimeContext,
phaseAnchorSection,
workingDirectory: base,
milestoneId: mid, sliceId: sid, sliceTitle: sTitle, taskId: tid, taskTitle: tTitle,
planPath: join(base, relSliceFile(base, mid, sid, "PLAN")),


@ -1205,6 +1205,23 @@ export async function runUnitPhase(
s.unitRecoveryCount.delete(`${unitType}/${unitId}`);
}
// Write phase handoff anchor after successful research/planning completion
const anchorPhases = new Set(["research-milestone", "research-slice", "plan-milestone", "plan-slice"]);
if (artifactVerified && mid && anchorPhases.has(unitType)) {
try {
const { writePhaseAnchor } = await import("../phase-anchor.js");
writePhaseAnchor(s.basePath, mid, {
phase: unitType,
milestoneId: mid,
generatedAt: new Date().toISOString(),
intent: `Completed ${unitType} for ${unitId}`,
decisions: [],
blockers: [],
nextSteps: [],
});
} catch { /* non-fatal — anchor is advisory */ }
}
deps.emitJournalEvent({ ts: new Date().toISOString(), flowId: ic.flowId, seq: ic.nextSeq(), eventType: "unit-end", data: { unitType, unitId, status: unitResult.status, artifactVerified, ...(unitResult.errorContext ? { errorContext: unitResult.errorContext } : {}) }, causedBy: { flowId: ic.flowId, seq: unitStartSeq } });
return { action: "next", data: { unitStartedAt: s.currentUnit?.startedAt } };


@ -263,13 +263,62 @@ export function registerHooks(pi: ExtensionAPI): void {
});
pi.on("before_provider_request", async (event) => {
const modelId = event.model?.id;
if (!modelId) return;
const { getEffectiveServiceTier, supportsServiceTier } = await import("../service-tier.js");
const tier = getEffectiveServiceTier();
if (!tier || !supportsServiceTier(modelId)) return;
const payload = event.payload as Record<string, unknown> | null;
if (!payload || typeof payload !== "object") return;
// ── Observation Masking ─────────────────────────────────────────────
// Replace old tool results with placeholders to reduce context bloat.
// Only active during auto-mode when context_management.observation_masking is enabled.
if (isAutoActive()) {
try {
const { loadEffectiveGSDPreferences } = await import("../preferences.js");
const prefs = loadEffectiveGSDPreferences();
const cmConfig = prefs?.preferences.context_management;
// Observation masking: replace old tool results with placeholders
if (cmConfig?.observation_masking !== false) {
const keepTurns = cmConfig?.observation_mask_turns ?? 8;
const { createObservationMask } = await import("../context-masker.js");
const mask = createObservationMask(keepTurns);
const messages = payload.messages;
if (Array.isArray(messages)) {
payload.messages = mask(messages);
}
}
// Tool result truncation: cap individual tool result content length.
// In pi-ai format, toolResult messages have role: "toolResult" and content: TextContent[].
// Creates new objects to avoid mutating shared conversation state.
const maxChars = cmConfig?.tool_result_max_chars ?? 800;
const msgs = payload.messages;
if (Array.isArray(msgs)) {
payload.messages = msgs.map((msg: Record<string, unknown>) => {
// Match toolResult messages (role: "toolResult", content is array of content blocks)
if (msg?.role === "toolResult" && Array.isArray(msg.content)) {
const blocks = msg.content as Array<Record<string, unknown>>;
const totalLen = blocks.reduce((sum: number, b) => sum + (typeof b.text === "string" ? b.text.length : 0), 0);
if (totalLen > maxChars) {
const truncated = blocks.map(b => {
if (typeof b.text === "string" && b.text.length > maxChars) {
return { ...b, text: b.text.slice(0, maxChars) + "\n…[truncated]" };
}
return b;
});
return { ...msg, content: truncated };
}
}
return msg;
});
}
} catch { /* non-fatal */ }
}
// ── Service Tier ────────────────────────────────────────────────────
const modelId = event.model?.id;
if (!modelId) return payload;
const { getEffectiveServiceTier, supportsServiceTier } = await import("../service-tier.js");
const tier = getEffectiveServiceTier();
if (!tier || !supportsServiceTier(modelId)) return payload;
payload.service_tier = tier;
return payload;
});


@ -15,7 +15,7 @@ import { gsdRoot } from "./paths.js";
// ─── Types ────────────────────────────────────────────────────────────────────
export type Classification = "quick-task" | "inject" | "defer" | "replan" | "note";
export type Classification = "quick-task" | "inject" | "defer" | "replan" | "note" | "stop" | "backtrack";
export interface CaptureEntry {
id: string;
@ -42,7 +42,7 @@ export interface TriageResult {
const CAPTURES_FILENAME = "CAPTURES.md";
const VALID_CLASSIFICATIONS: readonly string[] = [
"quick-task", "inject", "defer", "replan", "note",
"quick-task", "inject", "defer", "replan", "note", "stop", "backtrack",
];
// ─── Path Resolution ──────────────────────────────────────────────────────────


@ -212,7 +212,7 @@ function analyzePlanComplexity(
/**
* Extract task metadata from the task plan file on disk.
*/
function extractTaskMetadata(unitId: string, basePath: string): TaskMetadata {
export function extractTaskMetadata(unitId: string, basePath: string): TaskMetadata {
const meta: TaskMetadata = {};
const { milestone: mid, slice: sid, task: tid } = parseUnitId(unitId);
if (!mid || !sid || !tid) return meta;


@ -0,0 +1,74 @@
/**
* Observation masking for GSD auto-mode sessions.
*
* Replaces tool result content older than N turns with a placeholder.
* Reduces context bloat between compactions with zero LLM overhead.
* Preserves message ordering, roles, and all assistant/user messages.
*
* Operates on the pi-ai Message[] format (post-convertToLlm, pre-provider):
* - toolResult messages: { role: "toolResult", content: TextContent[] }
* - bash results are already converted to: { role: "user", content: [{type:"text",text:"..."}] }
* and start with "Ran `" from bashExecutionToText.
*/
interface MaskableMessage {
role: string;
content: unknown;
type?: string;
[key: string]: unknown;
}
const MASK_PLACEHOLDER = "[result masked — within summarized history]";
const MASK_CONTENT_BLOCK = [{ type: "text" as const, text: MASK_PLACEHOLDER }];
function findTurnBoundary(messages: MaskableMessage[], keepRecentTurns: number): number {
let turnsSeen = 0;
for (let i = messages.length - 1; i >= 0; i--) {
const m = messages[i];
// In the LLM payload, genuine user turns have role "user".
// Tool results have role "toolResult" and are excluded by this check.
if (m.role === "user") {
// Skip bash-result user messages (converted from bashExecution) — these aren't real user turns
if (isBashResultUserMessage(m)) continue;
turnsSeen++;
if (turnsSeen >= keepRecentTurns) return i;
}
}
return 0;
}
/**
* Detect user messages that originated from bashExecution.
* After convertToLlm, these are {role: "user", content: [{type:"text", text:"Ran `cmd`\n..."}]}.
* The bashExecutionToText format always starts with "Ran `".
*/
function isBashResultUserMessage(m: MaskableMessage): boolean {
if (m.role !== "user" || !Array.isArray(m.content)) return false;
const first = m.content[0];
return first && typeof first === "object" && "text" in first &&
typeof first.text === "string" && first.text.startsWith("Ran `");
}
function isMaskableMessage(m: MaskableMessage): boolean {
// Tool result messages (role: "toolResult" in pi-ai format)
if (m.role === "toolResult") return true;
// Bash-result user messages (converted from bashExecution by convertToLlm)
if (isBashResultUserMessage(m)) return true;
return false;
}
export function createObservationMask(keepRecentTurns: number = 8) {
return (messages: MaskableMessage[]): MaskableMessage[] => {
const boundary = findTurnBoundary(messages, keepRecentTurns);
if (boundary === 0) return messages;
return messages.map((m, i) => {
if (i >= boundary) return m;
if (isMaskableMessage(m)) {
// Content may be string or array of content blocks — always replace with array
return { ...m, content: MASK_CONTENT_BLOCK };
}
return m;
});
};
}
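The turn-boundary walk above can be sketched standalone. This is a minimal illustration under simplifying assumptions (a reduced message shape, and no bash-result special case), not the shipped module:

```typescript
// Minimal sketch of the masking walk: keep the last `keepRecentTurns`
// genuine user turns verbatim; mask tool results before that boundary.
type Msg = { role: string; text: string };

const PLACEHOLDER = "[masked]";

function maskOld(messages: Msg[], keepRecentTurns: number): Msg[] {
  // Walk backwards counting user turns to find the boundary index.
  let boundary = 0;
  let turnsSeen = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") {
      turnsSeen++;
      if (turnsSeen >= keepRecentTurns) { boundary = i; break; }
    }
  }
  if (boundary === 0) return messages;
  // Only tool results before the boundary are replaced; ordering,
  // roles, and all other messages are untouched.
  return messages.map((m, i) =>
    i < boundary && m.role === "toolResult" ? { ...m, text: PLACEHOLDER } : m,
  );
}

const history: Msg[] = [
  { role: "user", text: "turn 1" },
  { role: "toolResult", text: "old output" },
  { role: "user", text: "turn 2" },
  { role: "toolResult", text: "recent output" },
];

const masked = maskOld(history, 1);
console.log(masked[1].text); // "[masked]"
console.log(masked[3].text); // "recent output"
```

With `keepRecentTurns: 1`, the boundary lands on "turn 2", so the earlier tool result is masked while the one inside the keep window survives.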


@@ -189,6 +189,13 @@ Setting `prefer_skills: []` does **not** disable skill discovery — it just mea
- `budget_pressure`: boolean — downgrade model tier when budget is under pressure. Default: `true`.
- `cross_provider`: boolean — allow routing across different providers. Default: `true`.
- `hooks`: boolean — enable routing hooks. Default: `true`.
- `capability_routing`: boolean — enable capability-profile scoring for model selection within a tier. Requires `enabled: true`. Default: `false`.
- `context_management`: configures context hygiene for auto-mode sessions. Keys:
- `observation_masking`: boolean — mask old tool results to reduce context bloat. Default: `true`.
- `observation_mask_turns`: number — keep this many recent turns verbatim (1-50). Default: `8`.
- `compaction_threshold_percent`: number — trigger compaction at this % of context window (0.5-0.95). Lower values fire compaction earlier, reducing drift. Default: `0.70`.
- `tool_result_max_chars`: number — max chars per tool result in GSD sessions (200-10000). Default: `800`.
- `auto_visualize`: boolean — show a visualizer hint after each milestone completion in auto-mode. Default: `false`.
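
Putting the routing and context keys together, a preferences snippet enabling capability routing and tuning context hygiene might look like this (illustrative: the key names come from this section, while the surrounding file layout is assumed):

```json
{
  "dynamic_routing": {
    "enabled": true,
    "capability_routing": true
  },
  "context_management": {
    "observation_masking": true,
    "observation_mask_turns": 8,
    "compaction_threshold_percent": 0.70,
    "tool_result_max_chars": 800
  }
}
```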


@@ -10,6 +10,7 @@ import type { ResolvedModelConfig } from "./preferences.js";
export interface DynamicRoutingConfig {
enabled?: boolean;
capability_routing?: boolean; // default: false — enable capability profile scoring
tier_models?: {
light?: string;
standard?: string;
@@ -32,6 +33,12 @@ export interface RoutingDecision {
wasDowngraded: boolean;
/** Human-readable reason for this decision */
reason: string;
/** How the model was selected. */
selectionMethod?: "tier-only" | "capability-scored";
/** Capability scores per model (when capability-scored). */
capabilityScores?: Record<string, number>;
/** Task requirement vector (when capability-scored). */
taskRequirements?: Partial<Record<string, number>>;
}
// ─── Known Model Tiers ───────────────────────────────────────────────────────
@@ -114,6 +121,91 @@ const MODEL_COST_PER_1K_INPUT: Record<string, number> = {
"deepseek-chat": 0.00014,
};
// ─── Capability Profiles (ADR-004 Phase 2) ──────────────────────────────────
// 7-dimension profiles, normalized 0-100. Models without a profile
// score 50 uniformly — capability scoring is a no-op for them.
export interface ModelCapabilities {
coding: number;
debugging: number;
research: number;
reasoning: number;
speed: number;
longContext: number;
instruction: number;
}
export const MODEL_CAPABILITY_PROFILES: Record<string, ModelCapabilities> = {
"claude-opus-4-6": { coding: 95, debugging: 90, research: 85, reasoning: 95, speed: 30, longContext: 80, instruction: 90 },
"claude-sonnet-4-6": { coding: 85, debugging: 80, research: 75, reasoning: 80, speed: 60, longContext: 75, instruction: 85 },
"claude-haiku-4-5": { coding: 60, debugging: 50, research: 45, reasoning: 50, speed: 95, longContext: 50, instruction: 75 },
"gpt-4o": { coding: 80, debugging: 75, research: 70, reasoning: 75, speed: 65, longContext: 70, instruction: 80 },
"gpt-4o-mini": { coding: 55, debugging: 45, research: 40, reasoning: 45, speed: 90, longContext: 45, instruction: 70 },
"gemini-2.5-pro": { coding: 75, debugging: 70, research: 85, reasoning: 75, speed: 55, longContext: 90, instruction: 75 },
"gemini-2.0-flash": { coding: 50, debugging: 40, research: 50, reasoning: 40, speed: 95, longContext: 60, instruction: 65 },
"deepseek-chat": { coding: 75, debugging: 65, research: 55, reasoning: 70, speed: 70, longContext: 55, instruction: 65 },
"o3": { coding: 80, debugging: 85, research: 80, reasoning: 92, speed: 25, longContext: 70, instruction: 85 },
};
const BASE_REQUIREMENTS: Record<string, Partial<Record<keyof ModelCapabilities, number>>> = {
"execute-task": { coding: 0.9, instruction: 0.7, speed: 0.3 },
"research-milestone": { research: 0.9, longContext: 0.7, reasoning: 0.5 },
"research-slice": { research: 0.9, longContext: 0.7, reasoning: 0.5 },
"plan-milestone": { reasoning: 0.9, coding: 0.5 },
"plan-slice": { reasoning: 0.9, coding: 0.5 },
"replan-slice": { reasoning: 0.9, debugging: 0.6, coding: 0.5 },
"reassess-roadmap": { reasoning: 0.9, research: 0.5 },
"complete-slice": { instruction: 0.8, speed: 0.7 },
"run-uat": { instruction: 0.7, speed: 0.8 },
"discuss-milestone": { reasoning: 0.6, instruction: 0.7 },
"complete-milestone": { instruction: 0.8, reasoning: 0.5 },
};
/**
* Compute a task requirement vector from unit type and optional metadata.
*/
export function computeTaskRequirements(
unitType: string,
metadata?: { tags?: string[]; complexityKeywords?: string[]; fileCount?: number; estimatedLines?: number },
): Partial<Record<keyof ModelCapabilities, number>> {
const base = { ...(BASE_REQUIREMENTS[unitType] ?? { reasoning: 0.5 }) };
if (unitType === "execute-task" && metadata) {
if (metadata.tags?.some(t => /^(docs?|readme|comment|config|typo|rename)$/i.test(t))) {
return { ...base, instruction: 0.9, coding: 0.3, speed: 0.7 };
}
if (metadata.complexityKeywords?.some(k => k === "concurrency" || k === "compatibility")) {
return { ...base, debugging: 0.9, reasoning: 0.8 };
}
if (metadata.complexityKeywords?.some(k => k === "migration" || k === "architecture")) {
return { ...base, reasoning: 0.9, coding: 0.8 };
}
if ((metadata.fileCount ?? 0) >= 6 || (metadata.estimatedLines ?? 0) >= 500) {
return { ...base, coding: 0.9, reasoning: 0.7 };
}
}
return base;
}
/**
* Score a model against a task requirement vector.
* Returns weighted average in range 0-100. Returns 50 for empty requirements.
*/
export function scoreModel(
capabilities: ModelCapabilities,
requirements: Partial<Record<keyof ModelCapabilities, number>>,
): number {
let weightedSum = 0;
let weightSum = 0;
for (const [dim, weight] of Object.entries(requirements)) {
const capability = capabilities[dim as keyof ModelCapabilities] ?? 50;
weightedSum += weight * capability;
weightSum += weight;
}
return weightSum > 0 ? weightedSum / weightSum : 50;
}
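A quick worked example of the weighted average. This is a self-contained restatement of the same formula (the profile numbers are copied from the `claude-sonnet-4-6` entry and the `execute-task` base requirements above):

```typescript
// scoreModel restated standalone: weighted average of capability × requirement.
// Dimensions absent from the profile default to 50, as in the source.
function score(caps: Record<string, number>, reqs: Record<string, number>): number {
  let weightedSum = 0;
  let weightSum = 0;
  for (const [dim, weight] of Object.entries(reqs)) {
    weightedSum += weight * (caps[dim] ?? 50);
    weightSum += weight;
  }
  return weightSum > 0 ? weightedSum / weightSum : 50;
}

const sonnet = { coding: 85, debugging: 80, research: 75, reasoning: 80, speed: 60, longContext: 75, instruction: 85 };
const executeTask = { coding: 0.9, instruction: 0.7, speed: 0.3 };

// (0.9*85 + 0.7*85 + 0.3*60) / (0.9 + 0.7 + 0.3) = 154 / 1.9 ≈ 81.05
console.log(score(sonnet, executeTask).toFixed(2)); // "81.05"
```

Because only the requirement dimensions contribute weight, a docs-heavy requirement vector (high `instruction`, low `coding`) shifts rankings toward fast instruction-followers without touching tier selection.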
// ─── Public API ──────────────────────────────────────────────────────────────
/**
@@ -132,6 +224,8 @@ export function resolveModelForComplexity(
phaseConfig: ResolvedModelConfig | undefined,
routingConfig: DynamicRoutingConfig,
availableModelIds: string[],
unitType?: string,
metadata?: { tags?: string[]; complexityKeywords?: string[]; fileCount?: number; estimatedLines?: number },
): RoutingDecision {
// If no phase config or routing disabled, pass through
if (!phaseConfig || !routingConfig.enabled) {
@@ -175,25 +269,40 @@
}
// Find the best model for the requested tier
-const targetModelId = findModelForTier(
-requestedTier,
-routingConfig,
-availableModelIds,
-routingConfig.cross_provider !== false,
-);
+const useCapabilityScoring = routingConfig.capability_routing && unitType;
+let targetModelId: string | null;
+let capabilityScores: Record<string, number> | undefined;
+let taskRequirements: Partial<Record<string, number>> | undefined;
+let selectionMethod: "tier-only" | "capability-scored" = "tier-only";
+if (useCapabilityScoring) {
+const result = findModelForTierWithCapability(
+requestedTier, routingConfig, availableModelIds,
+routingConfig.cross_provider !== false, unitType, metadata,
+);
+targetModelId = result.modelId;
+capabilityScores = Object.keys(result.scores).length > 0 ? result.scores : undefined;
+taskRequirements = Object.keys(result.requirements).length > 0 ? result.requirements : undefined;
+selectionMethod = capabilityScores ? "capability-scored" : "tier-only";
+} else {
+targetModelId = findModelForTier(
+requestedTier, routingConfig, availableModelIds,
+routingConfig.cross_provider !== false,
+);
+}
if (!targetModelId) {
// No suitable model found — use configured primary
return {
modelId: configuredPrimary,
fallbacks: phaseConfig.fallbacks,
tier: requestedTier,
wasDowngraded: false,
reason: `no ${requestedTier}-tier model available`,
selectionMethod,
};
}
// Build fallback chain: [downgraded_model, ...configured_fallbacks, configured_primary]
const fallbacks = [
...phaseConfig.fallbacks.filter(f => f !== targetModelId),
configuredPrimary,
@@ -205,6 +314,9 @@ tier: requestedTier,
tier: requestedTier,
wasDowngraded: true,
reason: classification.reason,
selectionMethod,
capabilityScores,
taskRequirements,
};
}
@@ -226,6 +338,7 @@ export function escalateTier(currentTier: ComplexityTier): ComplexityTier | null
export function defaultRoutingConfig(): DynamicRoutingConfig {
return {
enabled: true,
capability_routing: false,
escalate_on_failure: true,
budget_pressure: true,
cross_provider: true,
@@ -298,6 +411,56 @@ function findModelForTier(
return candidates[0] ?? null;
}
function findModelForTierWithCapability(
tier: ComplexityTier,
config: DynamicRoutingConfig,
availableModelIds: string[],
crossProvider: boolean,
unitType: string,
metadata?: { tags?: string[]; complexityKeywords?: string[]; fileCount?: number; estimatedLines?: number },
): { modelId: string | null; scores: Record<string, number>; requirements: Partial<Record<string, number>> } {
const explicitModel = config.tier_models?.[tier];
if (explicitModel) {
const match = availableModelIds.find(id => {
const bareAvail = id.includes("/") ? id.split("/").pop()! : id;
const bareExplicit = explicitModel.includes("/") ? explicitModel.split("/").pop()! : explicitModel;
return bareAvail === bareExplicit || id === explicitModel;
});
if (match) return { modelId: match, scores: {}, requirements: {} };
}
const requirements = computeTaskRequirements(unitType, metadata);
const candidates = availableModelIds.filter(id => getModelTier(id) === tier);
if (candidates.length === 0) return { modelId: null, scores: {}, requirements };
const scores: Record<string, number> = {};
for (const id of candidates) {
const bareId = id.includes("/") ? id.split("/").pop()! : id;
const profile = getModelProfile(bareId);
scores[id] = scoreModel(profile, requirements);
}
candidates.sort((a, b) => {
const scoreDiff = scores[b] - scores[a];
if (Math.abs(scoreDiff) > 2) return scoreDiff;
if (crossProvider) {
const costDiff = getModelCost(a) - getModelCost(b);
if (costDiff !== 0) return costDiff;
}
return a.localeCompare(b);
});
return { modelId: candidates[0], scores, requirements };
}
function getModelProfile(bareId: string): ModelCapabilities {
if (MODEL_CAPABILITY_PROFILES[bareId]) return MODEL_CAPABILITY_PROFILES[bareId];
for (const [knownId, profile] of Object.entries(MODEL_CAPABILITY_PROFILES)) {
if (bareId.includes(knownId) || knownId.includes(bareId)) return profile;
}
return { coding: 50, debugging: 50, research: 50, reasoning: 50, speed: 50, longContext: 50, instruction: 50 };
}
function getModelCost(modelId: string): number {
const bareId = modelId.includes("/") ? modelId.split("/").pop()! : modelId;


@@ -0,0 +1,71 @@
/**
* Phase handoff anchors are compact, structured summaries written between
* GSD auto-mode phases so downstream agents inherit decisions, blockers,
* and intent without re-inferring from scratch.
*/
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { gsdRoot } from "./paths.js";
export interface PhaseAnchor {
phase: string;
milestoneId: string;
generatedAt: string;
intent: string;
decisions: string[];
blockers: string[];
nextSteps: string[];
}
function anchorsDir(basePath: string, milestoneId: string): string {
return join(gsdRoot(basePath), "milestones", milestoneId, "anchors");
}
function anchorPath(basePath: string, milestoneId: string, phase: string): string {
return join(anchorsDir(basePath, milestoneId), `${phase}.json`);
}
export function writePhaseAnchor(basePath: string, milestoneId: string, anchor: PhaseAnchor): void {
const dir = anchorsDir(basePath, milestoneId);
if (!existsSync(dir)) {
mkdirSync(dir, { recursive: true });
}
writeFileSync(anchorPath(basePath, milestoneId, anchor.phase), JSON.stringify(anchor, null, 2), "utf-8");
}
export function readPhaseAnchor(basePath: string, milestoneId: string, phase: string): PhaseAnchor | null {
const path = anchorPath(basePath, milestoneId, phase);
if (!existsSync(path)) return null;
try {
return JSON.parse(readFileSync(path, "utf-8")) as PhaseAnchor;
} catch {
return null;
}
}
export function formatAnchorForPrompt(anchor: PhaseAnchor): string {
const lines: string[] = [
`## Handoff from ${anchor.phase}`,
"",
`**Intent:** ${anchor.intent}`,
];
if (anchor.decisions.length > 0) {
lines.push("", "**Decisions:**");
for (const d of anchor.decisions) lines.push(`- ${d}`);
}
if (anchor.blockers.length > 0) {
lines.push("", "**Blockers:**");
for (const b of anchor.blockers) lines.push(`- ${b}`);
}
if (anchor.nextSteps.length > 0) {
lines.push("", "**Next steps:**");
for (const s of anchor.nextSteps) lines.push(`- ${s}`);
}
lines.push("", "---");
return lines.join("\n");
}
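As a concrete illustration, the file that `writePhaseAnchor` persists to `.gsd/milestones/M001/anchors/discuss.json` is plain JSON matching the `PhaseAnchor` interface (the field values here are hypothetical):

```json
{
  "phase": "discuss",
  "milestoneId": "M001",
  "generatedAt": "2026-04-03T00:00:00.000Z",
  "intent": "Define authentication requirements",
  "decisions": ["Use JWT tokens"],
  "blockers": [],
  "nextSteps": ["Plan the implementation slices"]
}
```

`formatAnchorForPrompt` would render this as a `## Handoff from discuss` block with **Intent**, **Decisions:**, and **Next steps:** sections; the empty `blockers` array is simply omitted.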


@@ -21,6 +21,13 @@ import type {
GateEvaluationConfig,
} from "./types.js";
import type { DynamicRoutingConfig } from "./model-router.js";
export interface ContextManagementConfig {
observation_masking?: boolean; // default: true
observation_mask_turns?: number; // default: 8, range: 1-50
compaction_threshold_percent?: number; // default: 0.70, range: 0.5-0.95
tool_result_max_chars?: number; // default: 800, range: 200-10000
}
import type { GitHubSyncConfig } from "../github-sync/types.js";
// ─── Workflow Modes ──────────────────────────────────────────────────────────
@@ -94,6 +101,7 @@ export const KNOWN_PREFERENCE_KEYS = new Set<string>([
"forensics_dedup",
"show_token_cost",
"stale_commit_threshold_minutes",
"context_management",
"experimental",
]);
@@ -227,6 +235,7 @@ export interface GSDPreferences {
post_unit_hooks?: PostUnitHookConfig[];
pre_dispatch_hooks?: PreDispatchHookConfig[];
dynamic_routing?: DynamicRoutingConfig;
context_management?: ContextManagementConfig;
token_profile?: TokenProfile;
phases?: PhaseSkipPreferences;
auto_visualize?: boolean;


@@ -428,6 +428,10 @@ export function validatePreferences(preferences: GSDPreferences): {
if (typeof dr.hooks === "boolean") validDr.hooks = dr.hooks;
else errors.push("dynamic_routing.hooks must be a boolean");
}
if (dr.capability_routing !== undefined) {
if (typeof dr.capability_routing === "boolean") validDr.capability_routing = dr.capability_routing;
else errors.push("dynamic_routing.capability_routing must be a boolean");
}
if (dr.tier_models !== undefined) {
if (typeof dr.tier_models === "object" && dr.tier_models !== null) {
const tm = dr.tier_models as Record<string, unknown>;
@@ -452,6 +456,40 @@
}
}
// ─── Context Management ──────────────────────────────────────────────
if (preferences.context_management !== undefined) {
if (typeof preferences.context_management === "object" && preferences.context_management !== null) {
const cm = preferences.context_management as unknown as Record<string, unknown>;
const validCm: Record<string, unknown> = {};
if (cm.observation_masking !== undefined) {
if (typeof cm.observation_masking === "boolean") validCm.observation_masking = cm.observation_masking;
else errors.push("context_management.observation_masking must be a boolean");
}
if (cm.observation_mask_turns !== undefined) {
const turns = cm.observation_mask_turns;
if (typeof turns === "number" && turns >= 1 && turns <= 50) validCm.observation_mask_turns = turns;
else errors.push("context_management.observation_mask_turns must be a number between 1 and 50");
}
if (cm.compaction_threshold_percent !== undefined) {
const pct = cm.compaction_threshold_percent;
if (typeof pct === "number" && pct >= 0.5 && pct <= 0.95) validCm.compaction_threshold_percent = pct;
else errors.push("context_management.compaction_threshold_percent must be a number between 0.5 and 0.95");
}
if (cm.tool_result_max_chars !== undefined) {
const chars = cm.tool_result_max_chars;
if (typeof chars === "number" && chars >= 200 && chars <= 10000) validCm.tool_result_max_chars = chars;
else errors.push("context_management.tool_result_max_chars must be a number between 200 and 10000");
}
if (Object.keys(validCm).length > 0) {
validated.context_management = validCm as any;
}
} else {
errors.push("context_management must be an object");
}
}
// ─── Parallel Config ────────────────────────────────────────────────────
if (preferences.parallel && typeof preferences.parallel === "object") {
const p = preferences.parallel as unknown as Record<string, unknown>;


@@ -12,6 +12,8 @@ A researcher explored the codebase and a planner decomposed the work — you are
{{runtimeContext}}
{{phaseAnchorSection}}
{{resumeSection}}
{{carryForwardSection}}


@@ -0,0 +1,122 @@
import test from "node:test";
import assert from "node:assert/strict";
import { createObservationMask } from "../context-masker.js";
// These helpers produce messages in the pi-ai LLM payload format
// (post-convertToLlm, pre-provider), which is what before_provider_request sees.
function userMsg(content: string) {
return { role: "user", content: [{ type: "text", text: content }] };
}
function assistantMsg(content: string) {
return { role: "assistant", content: [{ type: "text", text: content }] };
}
/** toolResult in pi-ai format: role "toolResult", content as TextContent[] */
function toolResult(text: string) {
return { role: "toolResult", content: [{ type: "text", text }], toolCallId: "toolu_test", toolName: "Read", isError: false };
}
/** bashExecution after convertToLlm: becomes a user message with "Ran `cmd`" prefix */
function bashResult(text: string) {
return { role: "user", content: [{ type: "text", text: `Ran \`echo test\`\n\`\`\`\n${text}\n\`\`\`` }] };
}
const MASK_TEXT = "[result masked — within summarized history]";
test("masks nothing when message count is within keepRecentTurns", () => {
const mask = createObservationMask(8);
const messages = [
userMsg("hello"),
assistantMsg("hi"),
toolResult("file contents"),
];
const result = mask(messages as any);
assert.equal(result.length, 3);
assert.deepEqual((result[2].content as any)[0].text, "file contents");
});
test("masks tool results older than keepRecentTurns", () => {
const mask = createObservationMask(2);
const messages = [
userMsg("turn 1"),
toolResult("old tool output"),
assistantMsg("response 1"),
userMsg("turn 2"),
toolResult("newer tool output"),
assistantMsg("response 2"),
userMsg("turn 3"),
toolResult("newest tool output"),
assistantMsg("response 3"),
];
const result = mask(messages as any);
// Old tool result (before boundary) should be masked
assert.equal((result[1].content as any)[0].text, MASK_TEXT);
// Recent tool results (within keep window) should be preserved
assert.equal((result[4].content as any)[0].text, "newer tool output");
assert.equal((result[7].content as any)[0].text, "newest tool output");
});
test("never masks assistant messages", () => {
const mask = createObservationMask(1);
const messages = [
userMsg("turn 1"),
assistantMsg("old reasoning"),
userMsg("turn 2"),
assistantMsg("new reasoning"),
];
const result = mask(messages as any);
assert.equal((result[1].content as any)[0].text, "old reasoning");
assert.equal((result[3].content as any)[0].text, "new reasoning");
});
test("never masks user messages", () => {
const mask = createObservationMask(1);
const messages = [
userMsg("old user message"),
assistantMsg("response"),
userMsg("new user message"),
assistantMsg("response"),
];
const result = mask(messages as any);
assert.equal((result[0].content as any)[0].text, "old user message");
});
test("masks bash result user messages", () => {
const mask = createObservationMask(1);
const messages = [
userMsg("turn 1"),
bashResult("huge log output"),
assistantMsg("response 1"),
userMsg("turn 2"),
assistantMsg("response 2"),
];
const result = mask(messages as any);
assert.equal((result[1].content as any)[0].text, MASK_TEXT);
});
test("returns same array length", () => {
const mask = createObservationMask(1);
const messages = [
userMsg("a"), toolResult("b"), assistantMsg("c"),
userMsg("d"), toolResult("e"), assistantMsg("f"),
];
const result = mask(messages as any);
assert.equal(result.length, messages.length);
});
test("masks toolResult by role, not by type field", () => {
const mask = createObservationMask(1);
const messages = [
userMsg("turn 1"),
// This is the actual pi-ai format: role "toolResult", no type field
{ role: "toolResult", content: [{ type: "text", text: "old result" }], toolCallId: "t1", toolName: "Read", isError: false },
assistantMsg("response 1"),
userMsg("turn 2"),
assistantMsg("response 2"),
];
const result = mask(messages as any);
assert.equal((result[1].content as any)[0].text, MASK_TEXT);
});


@@ -5,8 +5,11 @@ import {
resolveModelForComplexity,
escalateTier,
defaultRoutingConfig,
scoreModel,
computeTaskRequirements,
MODEL_CAPABILITY_PROFILES,
} from "../model-router.js";
-import type { DynamicRoutingConfig, RoutingDecision } from "../model-router.js";
+import type { DynamicRoutingConfig, RoutingDecision, ModelCapabilities } from "../model-router.js";
import type { ClassificationResult } from "../complexity-classifier.js";
// ─── Helpers ─────────────────────────────────────────────────────────────────
@@ -206,6 +209,89 @@ test("#2192: known model is still downgraded normally", () => {
assert.notEqual(result.modelId, "claude-opus-4-6");
});
// ─── Capability Scoring (ADR-004 Phase 2) ───────────────────────────────────
test("defaultRoutingConfig includes capability_routing: false", () => {
const config = defaultRoutingConfig();
assert.equal(config.capability_routing, false);
});
test("scoreModel computes weighted average of capability × requirement", () => {
const caps: ModelCapabilities = {
coding: 90, debugging: 80, research: 70,
reasoning: 85, speed: 50, longContext: 60, instruction: 75,
};
const reqs = { coding: 0.9, reasoning: 0.5 };
const score = scoreModel(caps, reqs);
// Expected: (0.9*90 + 0.5*85) / (0.9 + 0.5) = (81 + 42.5) / 1.4 = 88.21...
assert.ok(Math.abs(score - 88.21) < 0.1, `score ${score} should be ~88.21`);
});
test("scoreModel returns 50 for empty requirements", () => {
const caps: ModelCapabilities = {
coding: 90, debugging: 80, research: 70,
reasoning: 85, speed: 50, longContext: 60, instruction: 75,
};
const score = scoreModel(caps, {});
assert.equal(score, 50);
});
test("computeTaskRequirements returns base vector for known unit type", () => {
const reqs = computeTaskRequirements("execute-task");
assert.ok(reqs.coding !== undefined && reqs.coding > 0);
});
test("computeTaskRequirements boosts instruction for docs-tagged tasks", () => {
const reqs = computeTaskRequirements("execute-task", { tags: ["docs"] });
assert.ok((reqs.instruction ?? 0) >= 0.8);
assert.ok((reqs.coding ?? 1) <= 0.4);
});
test("computeTaskRequirements returns generic vector for unknown unit type", () => {
const reqs = computeTaskRequirements("unknown-unit");
assert.ok(reqs.reasoning !== undefined);
});
test("resolveModelForComplexity uses capability scoring when enabled", () => {
const config: DynamicRoutingConfig = {
...defaultRoutingConfig(),
enabled: true,
capability_routing: true,
};
const result = resolveModelForComplexity(
makeClassification("light"),
{ primary: "claude-opus-4-6", fallbacks: [] },
config,
["claude-opus-4-6", "claude-haiku-4-5", "gpt-4o-mini"],
"execute-task",
);
assert.equal(result.wasDowngraded, true);
assert.equal(result.selectionMethod, "capability-scored");
});
test("resolveModelForComplexity falls back to tier-only when capability_routing is false", () => {
const config: DynamicRoutingConfig = {
...defaultRoutingConfig(),
enabled: true,
capability_routing: false,
};
const result = resolveModelForComplexity(
makeClassification("light"),
{ primary: "claude-opus-4-6", fallbacks: [] },
config,
["claude-opus-4-6", "claude-haiku-4-5", "gpt-4o-mini"],
);
assert.equal(result.wasDowngraded, true);
assert.ok(!result.selectionMethod || result.selectionMethod === "tier-only");
});
test("MODEL_CAPABILITY_PROFILES has entries for core models", () => {
const profiledModels = Object.keys(MODEL_CAPABILITY_PROFILES);
assert.ok(profiledModels.length >= 9, `Expected ≥9 profiles, got ${profiledModels.length}`);
assert.ok(MODEL_CAPABILITY_PROFILES["claude-opus-4-6"]);
assert.ok(MODEL_CAPABILITY_PROFILES["claude-haiku-4-5"]);
});
// ─── #2885: openai-codex and modern OpenAI models in tier map ────────────────
test("#2885: openai-codex light-tier models are recognized", () => {


@@ -0,0 +1,83 @@
import test from "node:test";
import assert from "node:assert/strict";
import { mkdtempSync, mkdirSync, rmSync, existsSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";
import { writePhaseAnchor, readPhaseAnchor, formatAnchorForPrompt } from "../phase-anchor.js";
import type { PhaseAnchor } from "../phase-anchor.js";
function makeTempBase(): string {
const tmp = mkdtempSync(join(tmpdir(), "gsd-anchor-test-"));
mkdirSync(join(tmp, ".gsd", "milestones", "M001", "anchors"), { recursive: true });
return tmp;
}
test("writePhaseAnchor creates anchor file in correct location", () => {
const base = makeTempBase();
try {
const anchor: PhaseAnchor = {
phase: "discuss",
milestoneId: "M001",
generatedAt: new Date().toISOString(),
intent: "Define authentication requirements",
decisions: ["Use JWT tokens", "Session expiry 24h"],
blockers: [],
nextSteps: ["Plan the implementation slices"],
};
writePhaseAnchor(base, "M001", anchor);
assert.ok(existsSync(join(base, ".gsd", "milestones", "M001", "anchors", "discuss.json")));
} finally {
rmSync(base, { recursive: true, force: true });
}
});
test("readPhaseAnchor returns written anchor", () => {
const base = makeTempBase();
try {
const anchor: PhaseAnchor = {
phase: "plan",
milestoneId: "M001",
generatedAt: new Date().toISOString(),
intent: "Break work into slices",
decisions: ["3 slices: auth, UI, tests"],
blockers: ["Need DB schema first"],
nextSteps: ["Execute S01"],
};
writePhaseAnchor(base, "M001", anchor);
const read = readPhaseAnchor(base, "M001", "plan");
assert.ok(read);
assert.equal(read!.intent, "Break work into slices");
assert.deepEqual(read!.decisions, ["3 slices: auth, UI, tests"]);
assert.deepEqual(read!.blockers, ["Need DB schema first"]);
} finally {
rmSync(base, { recursive: true, force: true });
}
});
test("readPhaseAnchor returns null when no anchor exists", () => {
const base = makeTempBase();
try {
const read = readPhaseAnchor(base, "M001", "discuss");
assert.equal(read, null);
} finally {
rmSync(base, { recursive: true, force: true });
}
});
test("formatAnchorForPrompt produces markdown block", () => {
const anchor: PhaseAnchor = {
phase: "discuss",
milestoneId: "M001",
generatedAt: "2026-04-03T00:00:00.000Z",
intent: "Define requirements",
decisions: ["Use JWT"],
blockers: [],
nextSteps: ["Plan slices"],
};
const md = formatAnchorForPrompt(anchor);
assert.ok(md.includes("## Handoff from discuss"));
assert.ok(md.includes("Define requirements"));
assert.ok(md.includes("Use JWT"));
assert.ok(md.includes("Plan slices"));
});


@@ -49,10 +49,18 @@ const CLASSIFICATION_LABELS: Record<Classification, { label: string; description
label: "Note",
description: "Informational only — no action needed.",
},
"stop": {
label: "Stop",
description: "Halt current execution — a blocking issue requires resolution.",
},
"backtrack": {
label: "Backtrack",
description: "Undo recent steps and retry from an earlier checkpoint.",
},
};
const ALL_CLASSIFICATIONS: Classification[] = [
-"quick-task", "inject", "defer", "replan", "note",
+"quick-task", "inject", "defer", "replan", "note", "stop", "backtrack",
];
// ─── Public API ───────────────────────────────────────────────────────────────