From 946eec3bd1c5e296badf5df704ad9a288ce2488a Mon Sep 17 00:00:00 2001 From: Jeremy Date: Thu, 26 Mar 2026 17:30:19 -0500 Subject: [PATCH] docs(01-05): update dynamic-model-routing.md with capability-aware routing features - Add Capability Profiles section: 7 dimensions, 9 built-in profiles, uniform-50 cold-start - Add How Scoring Works section: pipeline order, weighted average formula, task requirements table - Add User Overrides section: modelOverrides JSON example, deep-merge semantics - Update Configuration section: document capability_routing flag - Add Verbose Output section: scoring breakdown format, selectionMethod field - Add Extension Hook section: before_model_select payload, return value, first-override-wins --- docs/dynamic-model-routing.md | 180 +++++++++++++++++++++++++++++----- 1 file changed, 154 insertions(+), 26 deletions(-) diff --git a/docs/dynamic-model-routing.md b/docs/dynamic-model-routing.md index 9bbf125fe..bc88df2bd 100644 --- a/docs/dynamic-model-routing.md +++ b/docs/dynamic-model-routing.md @@ -1,12 +1,20 @@ # Dynamic Model Routing -*Introduced in v2.19.0* +*Introduced in v2.19.0. Capability scoring introduced in v2.52.0.* Dynamic model routing automatically selects cheaper models for simple work and reserves expensive models for complex tasks. This reduces token consumption by 20-50% on capped plans without sacrificing quality where it matters. +Starting in v2.52.0, the router uses **capability-aware scoring** to select the *best fit* model for each task, not just the cheapest one in the tier. + ## How It Works -Each unit dispatched by auto-mode is classified into a complexity tier: +Each unit dispatched by auto-mode passes through a two-stage pipeline: + +**Stage 1: Complexity classification** — classifies the work into a tier (light/standard/heavy). + +**Stage 2: Capability scoring** — within the eligible tier, ranks available models by how well their capabilities match the task's requirements. 
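The two-stage pipeline above can be sketched roughly as follows. This is a minimal illustration, not the real implementation: `classifyTier`, `scoreModel`, and `routeModel` are hypothetical names, and the scoring step mirrors the weighted-average formula described under How Scoring Works.

```typescript
// Illustrative sketch only — names and numbers are not the shipped implementation.
type Tier = "light" | "standard" | "heavy";
type Capabilities = Record<string, number>; // dimension -> 0..100 score

interface ModelInfo {
  id: string;
  tier: Tier;
  capabilities: Capabilities;
}

// Stage 1: classify the unit into a complexity tier (heavily stubbed here).
function classifyTier(unitType: string): Tier {
  if (unitType === "replan-slice") return "heavy";
  if (unitType === "complete-slice") return "light";
  return "standard";
}

// Stage 2: weighted-average capability score,
// score = sum(weight * capability) / sum(weights).
function scoreModel(caps: Capabilities, weights: Capabilities): number {
  let num = 0;
  let den = 0;
  for (const [dim, w] of Object.entries(weights)) {
    num += w * (caps[dim] ?? 50); // missing dimensions fall back to 50
    den += w;
  }
  return den > 0 ? num / den : 0;
}

// Rank eligible models within the classified tier; highest score wins.
function routeModel(
  unitType: string,
  models: ModelInfo[],
  weights: Capabilities
): string | undefined {
  const tier = classifyTier(unitType);
  const eligible = models.filter((m) => m.tier === tier);
  const ranked = [...eligible].sort(
    (a, b) => scoreModel(b.capabilities, weights) - scoreModel(a.capabilities, weights)
  );
  return ranked[0]?.id;
}
```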
+ +The key rule: **downgrade-only semantics**. The user's configured model is always the ceiling — routing never upgrades beyond what you've configured. | Tier | Typical Work | Default Model Level | |------|-------------|-------------------| @@ -14,8 +22,6 @@ Each unit dispatched by auto-mode is classified into a complexity tier: | **Standard** | Research, planning, execution, milestone completion | Sonnet-class | | **Heavy** | Replanning, roadmap reassessment, complex execution | Opus-class | -The router then selects a model for that tier. The key rule: **downgrade-only semantics**. The user's configured model is always the ceiling — routing never upgrades beyond what you've configured. - ## Enabling Dynamic routing is off by default. Enable it in preferences: @@ -41,6 +47,7 @@ dynamic_routing: budget_pressure: true # auto-downgrade when approaching budget ceiling (default: true) cross_provider: true # consider models from other providers (default: true) hooks: true # apply routing to post-unit hooks (default: true) + capability_routing: true # enable capability scoring within tier (default: true) ``` ### `tier_models` @@ -70,35 +77,156 @@ When approaching the budget ceiling, the router progressively downgrades: When enabled, the router may select models from providers other than your primary. This uses the built-in cost table to find the cheapest model at each tier. Requires the target provider to be configured. -## Capability-Aware Scoring +### `capability_routing` -*Introduced in v2.59.0 (ADR-004 Phase 2)* - -When `capability_routing` is enabled, the router goes beyond tier classification and scores models against task-specific capability requirements. 
Each known model has a 7-dimension profile: - -| Dimension | What It Measures | -|-----------|-----------------| -| `coding` | Code generation, refactoring, implementation quality | -| `debugging` | Error diagnosis, fix accuracy | -| `research` | Information gathering, codebase exploration | -| `reasoning` | Multi-step logic, architectural decisions | -| `speed` | Response latency (inverse of cost) | -| `longContext` | Performance with large context windows | -| `instruction` | Adherence to structured instructions and templates | - -Each unit type maps to a weighted requirement vector. For example, `execute-task` weights `coding: 0.9, reasoning: 0.6, debugging: 0.5` while `research-slice` weights `research: 0.9, reasoning: 0.7, longContext: 0.5`. - -For `execute-task` units, the classifier also inspects task metadata (tags, description) to refine requirements. Documentation tasks boost `instruction` and lower `coding`; test tasks boost `debugging`. - -Enable capability routing: +When enabled (default: true), the router uses capability scoring to pick the best model in a tier rather than always defaulting to the cheapest. Set to `false` to revert to cheapest-in-tier behavior: ```yaml dynamic_routing: enabled: true - capability_routing: true + capability_routing: false # disable scoring, use cheapest-in-tier ``` -When enabled, models within the target tier are ranked by capability score rather than selected arbitrarily. When disabled (the default), the existing tier-only selection applies. 
+ +## Capability Profiles + +Each model has a built-in **capability profile** — a 7-dimension score (0–100) representing how well it handles different task types: + +| Dimension | What It Represents | +|-----------|-------------------| +| `coding` | Code generation and implementation accuracy | +| `debugging` | Diagnosing and fixing errors | +| `research` | Synthesizing information and exploring topics | +| `reasoning` | Multi-step logical reasoning | +| `speed` | Latency and throughput (inverse of capability depth) | +| `longContext` | Handling large codebases and long documents | +| `instruction` | Following structured instructions precisely | + +**Built-in profiles** exist for 9 models: `claude-opus-4-6`, `claude-sonnet-4-6`, `claude-haiku-4-5`, `gpt-4o`, `gpt-4o-mini`, `gemini-2.5-pro`, `gemini-2.0-flash`, `deepseek-chat`, `o3`. + +Models without a built-in profile receive **uniform scores of 50** across all dimensions. This is a cold-start policy — unknown models compete but don't have an advantage. From the user's perspective, routing for those models behaves the same as it did before capability scoring was introduced. + +**Profiles are heuristic rankings, not benchmarks.** They represent approximate relative strengths, not verified benchmark results. Use user overrides (below) to correct them for models you know well.
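The uniform-50 cold-start fallback can be sketched like this. The profile numbers below are illustrative placeholders, not the shipped values, and `profileFor` is a hypothetical helper name.

```typescript
interface Capabilities {
  coding: number;
  debugging: number;
  research: number;
  reasoning: number;
  speed: number;
  longContext: number;
  instruction: number;
}

// Example built-in profile — the values here are made up for illustration.
const BUILTIN_PROFILES: Record<string, Capabilities> = {
  "claude-haiku-4-5": {
    coding: 70, debugging: 65, research: 60, reasoning: 60,
    speed: 95, longContext: 70, instruction: 80,
  },
};

// Uniform-50 cold start: unknown models compete on equal footing,
// with neither an advantage nor a penalty.
const UNIFORM_50: Capabilities = {
  coding: 50, debugging: 50, research: 50, reasoning: 50,
  speed: 50, longContext: 50, instruction: 50,
};

function profileFor(modelId: string): Capabilities {
  return BUILTIN_PROFILES[modelId] ?? { ...UNIFORM_50 };
}
```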
+ +## How Scoring Works + +The routing pipeline within a tier: + +``` +classify complexity tier + ↓ +filter eligible models for tier + ↓ +fire before_model_select hook (optional override) + ↓ +capability score eligible models + ↓ +select winner (or cheapest eligible if scoring is disabled) +``` + +**Scoring formula:** weighted average of capability dimensions + +``` +score = Σ(weight × capability) / Σ(weights) +``` + +**Task requirements** are dynamic — different task types weight dimensions differently: + +| Unit Type | Key Dimensions | +|-----------|---------------| +| `execute-task` | coding (0.9), instruction (0.7), speed (0.3) | +| `research-*` | research (0.9), longContext (0.7), reasoning (0.5) | +| `plan-*` | reasoning (0.9), coding (0.5) | +| `replan-slice` | reasoning (0.9), debugging (0.6), coding (0.5) | +| `complete-slice`, `run-uat` | instruction (0.8), speed (0.7) | + +For `execute-task`, requirements are further refined by task metadata signals: +- Tags like `docs`, `config`, `readme` → boost instruction weight +- Keywords like `concurrency`, `compatibility` → boost debugging and reasoning +- Keywords like `migration`, `architecture` → boost reasoning and coding +- Large file counts (≥6) or large estimated line counts (≥500) → boost coding and reasoning + +**Tie-breaking:** When two models score within 2 points of each other, the cheaper model wins. If costs are equal, lexicographic model ID breaks the tie (deterministic). + +## User Overrides + +Correct built-in capability profiles for models you know well using `modelOverrides` in your models configuration: + +```json +{ + "providers": { + "anthropic": { + "modelOverrides": { + "claude-sonnet-4-6": { + "capabilities": { + "debugging": 90, + "research": 85 + } + } + } + } + } +} +``` + +Overrides are **deep-merged** with built-in defaults — only the specified dimensions are overridden; others retain their built-in values.
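The deep-merge behavior can be sketched as follows. This is a minimal illustration: `applyOverride` is a hypothetical helper name, and the baseline numbers are made up.

```typescript
type Capabilities = Record<string, number>;

interface ModelOverride {
  capabilities?: Partial<Capabilities>;
}

// Deep-merge semantics for a flat capability map: dimensions present in the
// override replace the built-in values; every other dimension keeps its default.
function applyOverride(builtin: Capabilities, override?: ModelOverride): Capabilities {
  return { ...builtin, ...(override?.capabilities ?? {}) };
}
```

With the JSON example above, `debugging` and `research` would be replaced while the remaining dimensions keep their built-in scores.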
+ +**Use case:** You've found that a model consistently outperforms its built-in profile on specific task types. Override the relevant dimensions to steer the router toward that model for those tasks. + +## Verbose Output + +When verbose mode is active, the router logs its routing decision. When capability scoring was used, the log includes a full scoring breakdown: + +``` +Dynamic routing [S]: claude-sonnet-4-6 (capability-scored) — claude-sonnet-4-6: 82.3, gpt-4o: 78.1, deepseek-chat: 72.0 +``` + +When tier-only routing was used (scoring disabled, single eligible model, or routing guards applied): + +``` +Dynamic routing [S]: claude-sonnet-4-6 (standard complexity, multiple steps) +``` + +The `selectionMethod` field in the routing decision indicates which path was taken: +- `"capability-scored"` — capability scoring selected the winner +- `"tier-only"` — cheapest in tier (or explicit pin) was used + +## Extension Hook + +Extensions can intercept and override model selection using the `before_model_select` hook. + +The hook fires **after** tier filtering (eligible models are known) and **before** capability scoring (scores have not been computed yet). A hook can override selection entirely or return `undefined` to let scoring proceed normally. 
+ +**Registering a handler:** + +```typescript +pi.on("before_model_select", async (event) => { + const { unitType, unitId, classification, taskMetadata, eligibleModels, phaseConfig } = event; + + // Custom routing strategy: always use gemini for research tasks + if (unitType.startsWith("research-")) { + const gemini = eligibleModels.find(id => id.includes("gemini")); + if (gemini) return { modelId: gemini }; + } + + // Return undefined to let capability scoring proceed + return undefined; +}); +``` + +**Event payload:** + +| Field | Type | Description | +|-------|------|-------------| +| `unitType` | `string` | The unit type being dispatched (e.g., `"execute-task"`) | +| `unitId` | `string` | Unique identifier for this unit dispatch | +| `classification` | `{ tier, reason, downgraded }` | The complexity classification result | +| `taskMetadata` | `Record \| undefined` | Task metadata extracted from the unit plan | +| `eligibleModels` | `string[]` | Models eligible for the classified tier | +| `phaseConfig` | `{ primary, fallbacks } \| undefined` | The user's configured model for this phase | + +**Return value:** `{ modelId: string }` to override selection, or `undefined` to defer to capability scoring. + +**First-override-wins:** If multiple extensions register handlers, the first one to return a non-undefined result wins. Subsequent handlers are not called. ## Complexity Classification