From 55229f660443f553340a1e27cf87cb8c143a77d5 Mon Sep 17 00:00:00 2001
From: Mikael Hugo
Date: Tue, 12 May 2026 23:55:02 +0200
Subject: [PATCH] fix(auto): split autonomous solver from executor per ADR-0079

- Lock solver model to kimi-k2.6 independent of unit-type router
- Executor prompt no longer requires checkpoint tool call
- Add dedicated solver pass that reads executor transcript and emits canonical checkpoint
- Classify executor refusals as blocker outcomes (already partially implemented)
- Classify no-op iterations (continue with zero work) as missing-checkpoint-retry
- Add tests for executor prompt block, solver pass prompt, no-op detection, and no-op assessment

Fixes sf-mp34nxb6-27zdx7
---
 ...9-autonomous-solver-executor-separation.md | 159 ++++++++
 .../extensions/sf/auto/phases-unit.js | 274 ++++++++++++-
 .../extensions/sf/autonomous-solver.js | 384 ++++++++++++++++--
 src/resources/extensions/sf/solver-model.js | 119 ++++++
 .../sf/tests/autonomous-solver.test.mjs | 369 +++++++++++++++++
 5 files changed, 1250 insertions(+), 55 deletions(-)
 create mode 100644 docs/adr/0079-autonomous-solver-executor-separation.md
 create mode 100644 src/resources/extensions/sf/solver-model.js

diff --git a/docs/adr/0079-autonomous-solver-executor-separation.md b/docs/adr/0079-autonomous-solver-executor-separation.md
new file mode 100644
index 000000000..6f7790a05
--- /dev/null
+++ b/docs/adr/0079-autonomous-solver-executor-separation.md
@@ -0,0 +1,159 @@
+# ADR-0079: Autonomous Solver / Executor Separation
+
+**Status:** Proposed
+**Date:** 2026-05-12
+**Stakeholders:** Autonomous mode, model router, checkpoint protocol, runtime safety
+**Related:** `.sf/self-feedback.jsonl` entry `sf-mp34nxb6-27zdx7` (architecture-defect:solver-executor-conflation)
+
+---
+
+## Problem Statement
+
+Today the autonomous loop conflates two distinct roles into a single LLM call:
+
+1. **Executor** — does the unit work (read files, run tests, edit code).
+2. **Autonomous solver** — observes what the executor produced and emits a canonical checkpoint to disk (`outcome`, `completedItems`, `remainingItems`, PDD, verification evidence).
+
+Both roles are filled by the same model, picked by `model-router.js:computeTaskRequirements` from the unit type (`execute-task`, `plan-slice`, …). The router optimizes for the *executor's* job — cost, coding capability, speed — and may select a small coding-tuned model (Codestral, Devstral, Gemini Flash). Those models are *not* required to be agentic, refusal-resistant, or stable at protocol reasoning.
+
+When the chosen model is incapable of the agentic role, the protocol breaks in a way the repair loop cannot fix:
+
+- **2026-05-12 M001-6377a4/S04/T02:** `mistral/codestral-latest` was routed to execute T02 (Align TUI Dashboard with Headless Status Output). It emitted:
+  > "I'm sorry, but I currently don't have the necessary tools to assist with that specific request."
+
+  No tool was called. The runtime logged `Autonomous solver checkpoint missing … repair attempt 1/4 (mentioned-checkpoint-without-tool)`, then prompted the *same* Codestral with stronger "you MUST call the checkpoint tool" wording. Codestral dutifully called `Autonomous Checkpoint` with `outcome=continue` — and produced zero file edits, zero work. The protocol layer reported success; the slice made no progress.
+
+The repair logic at `auto/phases-unit.js:720-890` only enforces **protocol shape** ("did the LLM emit a checkpoint tool call?"). It does not check **outcome** ("did the unit progress?") or **refusal** ("did the executor refuse the task?"). And because executor and solver are the same call, retrying the repair just re-asks the broken model.
+
+## Goals
+
+1. The protocol layer must remain functional even when the executor refuses or is incapable.
+2. Refusals must surface as blockers that can escalate model tier — not silently synthesize forward progress.
+3. No-op iterations (continue with zero work) must not satisfy the repair gate.
+4. Solver model choice must be stable and independent of unit-type routing.
+
+## Non-Goals
+
+- Replacing the model router for executors. Routing per `unitType` remains; cheap/specialized models are still desirable for unit work.
+- Mandating a specific solver vendor. The locked solver model is a pinned default; ops may override via preferences.
+- Reworking the checkpoint schema. The same JSON shape persists; only *who emits it* changes.
+
+## Proposed Architecture
+
+### Two-Layer Loop
+
+```
+                 ┌─────────────────────────────────────────┐
+                 │ runUnit(ctx, unitType, unitId, prompt)  │
+                 └─────────────────────┬───────────────────┘
+                                       │
+               ┌───────────────────────┴───────────────────────┐
+               │                                               │
+               ▼                                               ▼
+ ┌───────────────────────────┐                   ┌───────────────────────────┐
+ │ EXECUTOR PASS             │                   │ SOLVER PASS               │
+ │ model: routed per unit    │   transcript →    │ model: LOCKED kimi-k2.6   │
+ │ (Codestral, Gemini, ...)  │ ────────────────▶ │ reads agent_end messages, │
+ │ does the unit work        │                   │ emits canonical checkpoint│
+ │ NO checkpoint tool needed │                   │ classifies refusal/no-op  │
+ └───────────────────────────┘                   └─────────────┬─────────────┘
+                                                               │
+                                                               ▼
+                                                 ┌───────────────────────────┐
+                                                 │ appendAutonomousSolver-   │
+                                                 │ Checkpoint(basePath, …)   │
+                                                 └───────────────────────────┘
+```
+
+### Solver Model Selection
+
+A new helper `resolveSolverModel(preferences)` returns the pinned solver model. It:
+
+- Defaults to `kimi-k2.6` (provider: `kimi-coding`).
+- Allows preference override via `preferences.autonomousSolver.model` (operator escape hatch).
+- **Never** consults the unit-type router, benchmark selector, Bayesian blender, or learning aggregator. The solver's model is a runtime invariant, not an optimization target.
+- Falls back along a small explicit chain (`kimi-k2.6` → `claude-sonnet-4-6` → `claude-opus-4-7`) if the primary is unreachable. Falls back to "synthesize blocker" if none reachable, rather than silently dropping the protocol layer.
+
+### Solver Pass Contract
+
+Input: `{ unitType, unitId, executorTranscript, lastIteration, projection }`.
+
+Output (a checkpoint, written via `appendAutonomousSolverCheckpoint`):
+
+```json
+{
+  "outcome": "continue|complete|blocked",
+  "summary": "...",
+  "completedItems": [...],
+  "remainingItems": [...],
+  "verificationEvidence": [...],
+  "pdd": { "purpose": "...", "consumer": "...", ... },
+  "classification": "executor-refused|executor-noop|progress|complete|blocker-...",
+  "evidence": "string excerpts proving the classification"
+}
+```
+
+The solver's prompt is a deterministic template at `prompts/autonomous-solver.md` that:
+
+1. Embeds the executor transcript.
+2. States the schema and outcome rules.
+3. Includes the refusal/no-op classification rubric.
+4. Instructs the solver to **never** propose code edits — its job is to observe, classify, and write the checkpoint.
+
+### Refusal Classification
+
+`assessAutonomousSolverTurn` (and the new solver-pass) checks executor transcript for:
+
+| Pattern | Classification | Action |
+|---|---|---|
+| "I'm sorry", "I cannot help", "I don't have the necessary tools", "I can't assist with that" | `executor-refused` | Emit `outcome=blocked`; on retry, escalate executor model tier |
+| Zero tool calls, zero file edits, transcript < threshold | `executor-noop` | Emit `outcome=blocked` (or `continue` only if executor explicitly states a wait state); on retry, do not treat synthesized continue as progress |
+| Tool calls + edits + explicit "I'm done" / completion signal | `progress` or `complete` | Emit `outcome=continue` or `complete` as appropriate |
+
+### Model Escalation on Refusal
+
+When the solver classifies `executor-refused`, the loop records the executor's model and unit-type into a "no-fly" entry. On the next iteration of the same unit, the router consults this list and selects the next tier up (Sonnet → Opus, or via a model-tier graph). After 2 escalations on the same unit, pause the loop with a hard blocker.
+
+### Backward Compatibility
+
+- The existing checkpoint shape is preserved; downstream consumers (`auto-post-unit.js`, journal events, learning aggregator) are unchanged.
+- The "executor calls the checkpoint tool" path is retained as a **fast path**: if the executor *did* emit a valid checkpoint AND the solver agrees with its classification, the solver pass is a no-op rubber stamp. The solver only synthesizes when the executor failed to checkpoint or classified incorrectly.
+- The `mentioned-checkpoint-without-tool` repair attempts collapse to zero — the solver is now the source of truth, so a missing executor checkpoint is normal, not a defect.
+
+## Migration
+
+### Step 1 — Pin solver model
+
+Add `resolveSolverModel` to `model-router.js` (or a new `solver-model.js`). It does not participate in the router's capability scoring. Wire it into `runUnit`'s solver-pass invocation only.
+
+### Step 2 — Add solver pass
+
+After `runUnit` returns, before `assessAutonomousSolverTurn`, run the solver pass with the executor transcript. The solver pass writes the checkpoint directly. Executor checkpoint tool calls remain accepted but become advisory.
+
+### Step 3 — Refusal classifier
+
+Extend `classifyAutonomousSolverMissingCheckpointFailure` (rename to `classifyExecutorTurn`) to detect refusal patterns. Drive `outcome=blocked` from classification, not from "missing checkpoint."
+
+### Step 4 — Model escalation
+
+Add a per-(unitId, model) no-fly entry on `executor-refused`. Router consults the list during selection.
+
+### Step 5 — Tests
+
+Cover: pinned solver model invariant, refusal pattern detection, no-op detection, solver-pass checkpoint emission when executor is silent, fast-path bypass when executor emits a valid checkpoint, escalation chain.
+
+## Risks
+
+- **Solver-pass cost.** Adds one LLM call per unit. Mitigation: solver pass uses a smaller prompt (transcript summary only) and is skippable when executor emitted a valid checkpoint.
+- **Locked model availability.** If `kimi-k2.6` is unreachable, solver pass fails. Mitigation: explicit fallback chain; if all fail, synthesize a blocker and pause the loop rather than synthesize forward progress.
+- **Solver hallucination.** Solver could mis-classify and over-emit blockers. Mitigation: deterministic prompt template, classification rubric with example transcripts, and self-feedback when classification flips between iterations.
+
+## Open Questions
+
+1. Should the solver pass run *during* the executor turn (streaming observer) or *after* (post-turn observer)? Post-turn is simpler and proposed here; streaming would catch refusals earlier but adds complexity.
+2. Should the solver pass also re-evaluate the executor's verification evidence (cite tests that actually exist, etc.) — i.e. become a partial verifier — or stay narrowly focused on checkpoint emission?
+3. How does this interact with `keepSession: true` in `runUnit`? The solver pass is a separate session by definition; the executor session remains as-is.
+
+## Decision Outcome (when accepted)
+
+To be filled when the ADR is accepted. Initial cut targets steps 1–3 (pinned solver model + solver pass + refusal classifier). Steps 4–5 (escalation + tests) follow in a subsequent slice.
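The "Model Escalation on Refusal" section of the ADR describes a no-fly list that the ADR defers to a later slice (Step 4). One way that policy could look is sketched below. It is an illustration under stated assumptions, not code from this patch: `createNoFlyList`, `ASSUMED_TIER_GRAPH`, and `MAX_ESCALATIONS_PER_UNIT` are hypothetical names, the tier labels are placeholders for whatever the real model-tier graph contains, and the list is keyed by `unitId` as in Step 4.

```javascript
// Hypothetical sketch of the ADR's no-fly escalation policy.
// None of these names exist in the patch; they are assumptions for illustration.

// The ADR says: "After 2 escalations on the same unit, pause the loop".
const MAX_ESCALATIONS_PER_UNIT = 2;

// Assumed tier order, cheapest first. The ADR only says
// "Sonnet → Opus, or via a model-tier graph"; these labels are placeholders.
const ASSUMED_TIER_GRAPH = ["small-coder", "sonnet", "opus"];

function createNoFlyList() {
  // unitId -> { models: Set of refused model labels, escalations: count so far }
  const entries = new Map();

  return {
    // Called when the solver classifies a turn as `executor-refused`.
    recordRefusal(unitId, model) {
      const entry = entries.get(unitId) ?? { models: new Set(), escalations: 0 };
      entry.models.add(model);
      entries.set(unitId, entry);
    },

    // Called by the router before dispatching the next iteration of a unit.
    selectNext(unitId, routedModel) {
      const entry = entries.get(unitId);
      // The routed model has not refused this unit: use it unchanged.
      if (!entry || !entry.models.has(routedModel)) {
        return { action: "run", model: routedModel };
      }
      // Escalation budget exhausted: surface a hard blocker.
      if (entry.escalations >= MAX_ESCALATIONS_PER_UNIT) {
        return { action: "pause", reason: "escalation-limit-reached" };
      }
      // Pick the next tier above the routed model that has not refused yet.
      const start = ASSUMED_TIER_GRAPH.indexOf(routedModel);
      const next = ASSUMED_TIER_GRAPH.slice(start + 1).find(
        (candidate) => !entry.models.has(candidate),
      );
      if (!next) return { action: "pause", reason: "no-higher-tier" };
      entry.escalations += 1;
      return { action: "run", model: next };
    },
  };
}
```

Keeping the escalation counter on the unit rather than on the model means a unit that keeps drawing refusals pauses after two escalations no matter which models were tried, which matches the "hard blocker" wording in the ADR.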
diff --git a/src/resources/extensions/sf/auto/phases-unit.js b/src/resources/extensions/sf/auto/phases-unit.js index c1c84cd5a..754b50098 100644 --- a/src/resources/extensions/sf/auto/phases-unit.js +++ b/src/resources/extensions/sf/auto/phases-unit.js @@ -26,17 +26,23 @@ import { appendAutonomousSolverCheckpoint, assessAutonomousSolverTurn, beginAutonomousSolverIteration, + buildAutonomousExecutorPromptBlock, buildAutonomousSolverMissingCheckpointRepairPrompt, buildAutonomousSolverPromptBlock, buildAutonomousSolverSteeringPromptBlock, + buildSolverPassPrompt, classifyAutonomousSolverMissingCheckpointFailure, + classifyExecutorRefusal, consumePendingAutonomousSolverSteering, getConfiguredAutonomousSolverMaxIterations, + isNoOpExecutorTranscript, + readAutonomousSolverState, recordAutonomousSolverMissingCheckpointRetry, } from "../autonomous-solver.js"; import { resumeAutoAfterProviderDelay } from "../bootstrap/provider-error-resume.js"; import { debugLog } from "../debug-logger.js"; import { PROJECT_FILES } from "../detection.js"; +import { getErrorMessage } from "../error-utils.js"; import { MergeConflictError } from "../git-service.js"; import { recordLearnedOutcome } from "../learning/runtime.js"; import { sfRoot } from "../paths.js"; @@ -73,6 +79,14 @@ import { } from "../sf-db.js"; import { getEligibleSlices } from "../slice-parallel-eligibility.js"; import { startSliceParallel } from "../slice-parallel-orchestrator.js"; +import { + clearSliceRoutingForUnit, + recordSliceRouting, +} from "../slice-routing-cache.js"; +import { + resolveSolverModel, + resolveSolverModelCandidates, +} from "../solver-model.js"; import { handleProductAudit } from "../tools/product-audit-tool.js"; import { parseUnitId } from "../unit-id.js"; import { @@ -114,14 +128,17 @@ import { FINALIZE_PRE_TIMEOUT_MS, withTimeout, } from "./finalize-timeout.js"; +import { + emitCancelledUnitEnd, + recordLearningOutcomeForUnit, + shouldSkipArtifactVerification, +} from "./phases-helpers.js"; 
import { runUnit } from "./run-unit.js"; -import { getErrorMessage } from "../error-utils.js"; import { BUDGET_THRESHOLDS, MAX_FINALIZE_TIMEOUTS, MAX_RECOVERY_CHARS, } from "./types.js"; -import { emitCancelledUnitEnd, recordLearningOutcomeForUnit, shouldSkipArtifactVerification } from "./phases-helpers.js"; // ─── Session timeout scheduled resume state ──────────────────────────────────────── let consecutiveSessionTimeouts = 0; @@ -458,7 +475,7 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { if (steeringBlock) { finalPrompt = `${finalPrompt}\n\n---\n\n${steeringBlock}`; } - finalPrompt = `${finalPrompt}\n\n---\n\n${buildAutonomousSolverPromptBlock(solverState)}`; + finalPrompt = `${finalPrompt}\n\n---\n\n${buildAutonomousExecutorPromptBlock(solverState)}`; deps.emitJournalEvent({ ts: new Date().toISOString(), flowId: ic.flowId, @@ -505,8 +522,7 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { try { finalPrompt = deps.reorderForCaching(finalPrompt); } catch (reorderErr) { - const msg = - getErrorMessage(reorderErr); + const msg = getErrorMessage(reorderErr); logWarning("engine", "Prompt reorder failed", { error: msg }); } // Select and apply model (with tier escalation on retry — normal units only) @@ -706,13 +722,227 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { const unitResult = await runUnit(ctx, pi, s, unitType, unitId, finalPrompt); s.lastUnitAgentEndMessages = unitResult.event?.messages ?? null; let currentUnitResult = unitResult; + const executorMessages = unitResult.event?.messages ?? []; + const refusal = + unitResult.status !== "cancelled" + ? 
classifyExecutorRefusal(executorMessages) + : null; + // Short-circuit: if runUnit was cancelled (provider not ready, session - // failed, timeout) there is no checkpoint to repair — skip the repair loop + // failed, timeout) there is no checkpoint to repair — skip the solver pass // entirely and let the cancelled handler below surface the real cause. let solverAssessment = unitResult.status === "cancelled" ? { action: "none" } - : assessAutonomousSolverTurn(s.basePath, unitType, unitId); + : { action: "pending" }; + + // Refusal short-circuit: when the executor model returned a generic refusal, + // synthesize a blocked checkpoint immediately and skip the solver pass. + if (unitResult.status !== "cancelled" && refusal) { + const executorModel = + s.currentUnitModel?.provider && s.currentUnitModel?.id + ? `${s.currentUnitModel.provider}/${s.currentUnitModel.id}` + : (s.currentUnitModel?.id ?? "unknown"); + // Evict the sticky-routing entry for this slice — the model attached + // to it refused, so future units in the same slice should NOT re-pin + // the broken model. + try { + clearSliceRoutingForUnit(s.basePath, unitId); + } catch { + // best-effort + } + try { + appendAutonomousSolverCheckpoint(s.basePath, { + unitType, + unitId, + outcome: "blocked", + summary: `Executor (${executorModel}) refused the task. Pattern: ${refusal.pattern}. 
Repair-prompting the same model cannot produce progress; escalate the executor model or unblock this unit manually.`, + completedItems: [], + remainingItems: [ + `Re-run ${unitType} ${unitId} with a more capable executor model — current routing selected an incapable model.`, + ], + verificationEvidence: [ + `executor-refusal-pattern=${refusal.pattern}`, + `executor-model=${executorModel}`, + ], + blockerReason: `executor-refused (${refusal.pattern})`, + pdd: { + purpose: + "Surface executor refusals as protocol-level blockers instead of synthesizing fake progress.", + consumer: "autonomous loop pause-handler", + contract: + "On `executor-refused`, the loop pauses and self-feedback is filed; the operator must escalate the executor model.", + failureBoundary: + "If the operator does not escalate, the same refusal will recur on next dispatch.", + evidence: "classifyExecutorRefusal matched a refusal pattern", + nonGoals: + "This does not retry the unit automatically — capability mismatches require operator judgement (or a future automatic escalation policy).", + invariants: "Refusal never silently synthesizes a continue.", + assumptions: + "The refusal pattern set in classifyExecutorRefusal is conservative — false positives are rare and require operator review.", + }, + }); + } catch { + // If synthesis fails, continue: the solver pass is skipped on refusal (it is guarded by `!refusal`), so assessAutonomousSolverTurn below handles the missing checkpoint + } + try { + const feedback = recordSelfFeedback( + { + kind: "executor-refused", + severity: "high", + summary: `Executor ${executorModel} refused ${unitType} ${unitId} with pattern ${refusal.pattern}; loop paused to prevent fake-progress synthesis.`, + evidence: [ + `unit=${unitType} ${unitId}`, + `executor=${executorModel}`, + `refusal-pattern=${refusal.pattern}`, + "", + refusal.evidence ?? "", + ].join("\n"), + suggestedFix: + "Escalate the executor model for this unit (or unit type) — the currently routed model lacks the agentic capabilities required. 
Long-term: separate the executor and autonomous-solver roles per ADR-0079 and pin the solver to a stable agentic model.", + acceptanceCriteria: [ + "Executor model for this unit type is escalated to a model that passes the refusal-resistant tier.", + "Refusal pattern is added to classifyExecutorRefusal if a novel phrasing slipped through.", + ], + occurredIn: { unitType, unitId }, + source: "runtime", + }, + s.basePath, + ); + deps.emitJournalEvent({ + ts: new Date().toISOString(), + flowId: ic.flowId, + seq: ic.nextSeq(), + eventType: "executor-refused", + data: { + unitType, + unitId, + executorModel, + pattern: refusal.pattern, + selfFeedbackId: feedback?.entry?.id, + blocking: feedback?.blocking, + }, + }); + } catch { + // self-feedback is observability; never block loop progression on it + } + ctx.ui.notify( + `Executor ${executorModel} refused ${unitType} ${unitId} (${refusal.pattern}); autonomous loop pausing instead of synthesizing fake progress. See SELF-FEEDBACK.md for escalation guidance.`, + "error", + ); + solverAssessment = assessAutonomousSolverTurn(s.basePath, unitType, unitId); + } + + // Solver pass: the stable solver model reads the executor transcript and + // emits the canonical checkpoint. This separates the executor role (unit + // work) from the solver role (protocol checkpoint) per ADR-0079. + if (unitResult.status !== "cancelled" && !refusal) { + const executorModel = s.currentUnitModel; + const solverCandidates = resolveSolverModelCandidates(prefs); + let solverPassResult = null; + + for (const candidate of solverCandidates) { + const availableModels = ctx.modelRegistry.getAvailable?.() ?? 
[]; + const match = availableModels.find( + (m) => m.provider === candidate.provider && m.id === candidate.id, + ); + if (!match) continue; + + const ok = await pi.setModel(match, { persist: false }); + if (!ok) continue; + + s.currentUnitModel = match; + ctx.ui.notify( + `Running solver pass for ${unitType} ${unitId} with ${match.provider}/${match.id}`, + "info", + ); + + const solverState = readAutonomousSolverState(s.basePath); + const solverPrompt = buildSolverPassPrompt( + executorMessages, + solverState, + unitType, + unitId, + ); + + try { + const result = await runUnit( + ctx, + pi, + s, + unitType, + unitId, + solverPrompt, + { keepSession: false }, + ); + solverPassResult = result; + if (result.status !== "cancelled") { + currentUnitResult = result; + s.lastUnitAgentEndMessages = result.event?.messages ?? null; + break; // Solver pass succeeded + } + } catch { + // Try next fallback + } + } + + if (!solverPassResult || solverPassResult.status === "cancelled") { + ctx.ui.notify( + `Solver pass failed for ${unitType} ${unitId} — no solver model was reachable. Synthesizing blocked checkpoint.`, + "error", + ); + try { + appendAutonomousSolverCheckpoint(s.basePath, { + unitType, + unitId, + outcome: "blocked", + summary: `Solver pass failed — no solver model was reachable. 
The executor transcript could not be classified into a canonical checkpoint.`, + completedItems: [], + remainingItems: [ + `Retry ${unitType} ${unitId} after verifying solver model availability.`, + ], + verificationEvidence: ["solver-pass-failed"], + blockerReason: "solver-pass-failed", + pdd: { + purpose: + "Surface solver-pass failures as blockers rather than silently dropping the protocol layer.", + consumer: "autonomous loop pause-handler", + contract: + "On solver-pass failure, the loop pauses so the operator can fix model availability.", + failureBoundary: + "If all solver candidates are unreachable, the protocol layer cannot function.", + evidence: + "All solver candidates were unreachable or setModel failed.", + nonGoals: + "This does not retry with a different solver candidate automatically beyond the explicit fallback chain.", + invariants: + "Solver-pass failure never silently synthesizes a continue.", + assumptions: + "At least one solver candidate (kimi-k2.6 or fallback) is available in the model registry.", + }, + }); + } catch { + // best-effort + } + } + + solverAssessment = assessAutonomousSolverTurn( + s.basePath, + unitType, + unitId, + executorMessages, + ); + + // Restore executor model after solver pass and assessment + if (executorModel) { + try { + await pi.setModel(executorModel, { persist: false }); + } catch { + // best-effort restore + } + s.currentUnitModel = executorModel; + } + } while (solverAssessment.action === "missing-checkpoint-retry") { const diagnosis = classifyAutonomousSolverMissingCheckpointFailure( currentUnitResult.event?.messages ?? [], @@ -779,6 +1009,26 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { remainingCount: solverCheckpoint.remainingItems?.length ?? 0, }, }); + // Record sticky-routing on successful outcomes only. `continue` is the + // usual within-iteration progress signal; `complete` is final success. 
+ // We deliberately skip `blocked` and `decide` because attaching a model + // to a slice when it's known-stuck or known-undecided would defeat the + // fallback path. + if ( + solverCheckpoint.outcome === "continue" || + solverCheckpoint.outcome === "complete" + ) { + try { + recordSliceRouting( + s.basePath, + unitType, + unitId, + s.currentUnitModel ?? ctx.model ?? null, + ); + } catch { + // best-effort; routing cache must never break the loop + } + } } if (solverAssessment.action === "pause") { const isMissingCheckpoint = @@ -808,7 +1058,7 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { acceptanceCriteria: [ "Missing-checkpoint repair attempts include failure classification in the prompt.", "Repeated repair failures file self-feedback automatically.", - "Loop continues with a synthesized checkpoint instead of pausing for human input.", + "Loop continues with a synthesized checkpoint instead of pausing for human input — EXCEPT when classifyExecutorRefusal short-circuits with `executor-refused`, in which case the loop emits a `blocked` checkpoint and pauses (synthesizing forward progress over a refusing executor is the bug we are fixing).", ], occurredIn: { unitType, unitId }, source: "runtime", @@ -1087,8 +1337,7 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { resume: allowAutoResume ? 
() => { void resumeAutoAfterProviderDelay(pi, ctx).catch((err) => { - const message = - getErrorMessage(err); + const message = getErrorMessage(err); ctx.ui.notify( `Session timeout recovery failed: ${message}`, "error", @@ -1280,10 +1529,7 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { }); } catch (err) { /* non-fatal — anchor is advisory */ - logWarning( - "engine", - `phase anchor failed: ${getErrorMessage(err)}`, - ); + logWarning("engine", `phase anchor failed: ${getErrorMessage(err)}`); } } if (currentUnitResult.status !== "completed" || !artifactVerified) { diff --git a/src/resources/extensions/sf/autonomous-solver.js b/src/resources/extensions/sf/autonomous-solver.js index 8d0b551d0..bb94a79a8 100644 --- a/src/resources/extensions/sf/autonomous-solver.js +++ b/src/resources/extensions/sf/autonomous-solver.js @@ -281,7 +281,7 @@ export function beginAutonomousSolverIteration( * * Consumer: runUnitPhase prompt injection. */ -export function buildAutonomousSolverPromptBlock(state) { +function _buildAutonomousLoopPromptPrefix(state, header) { const phase = getSolverPhase(state.iteration, state.maxIterations); const stalled = Number(state.iterationsSinceProgress) >= STALL_THRESHOLD_ITERATIONS; @@ -306,7 +306,7 @@ export function buildAutonomousSolverPromptBlock(state) { }; const lines = [ - "## Autonomous Solver Loop Contract", + `## ${header}`, "", `You are inside /autonomous iteration ${state.iteration} of ${state.maxIterations} for ${state.unitType} ${state.unitId}.`, "", @@ -357,6 +357,25 @@ export function buildAutonomousSolverPromptBlock(state) { ); } + return lines; +} + +/** + * Build the PDD autonomous solver prompt block appended to unit prompts. + * + * Purpose: bind every autonomous unit to bounded iterations, evidence, stop + * signals, and the eight PDD fields instead of open-ended hidden retries. 
+ * Phase-aware: ORIENT (iters 1-2) focuses on reading and planning; EXECUTE + * (middle) on implementation; CLOSE (final 3) on verifying and wrapping up. + * Stall/loop signals are injected when the system detects no progress. + * + * Consumer: runUnitPhase prompt injection (solver pass). + */ +export function buildAutonomousSolverPromptBlock(state) { + const lines = _buildAutonomousLoopPromptPrefix( + state, + "Autonomous Solver Loop Contract", + ); lines.push( "", "## CHECKPOINT REQUIREMENT", @@ -390,6 +409,142 @@ export function buildAutonomousSolverPromptBlock(state) { return lines.join("\n"); } +/** + * Build the executor prompt block (no checkpoint requirement). + * + * Purpose: the executor focuses on doing the unit work. A separate solver pass + * reads the executor transcript and emits the canonical checkpoint. + * + * Consumer: runUnitPhase prompt injection (executor pass). + */ +export function buildAutonomousExecutorPromptBlock(state) { + const lines = _buildAutonomousLoopPromptPrefix( + state, + "Autonomous Executor Contract", + ); + lines.push( + "", + "## EXECUTOR ROLE", + "", + "Your job is to do the unit work: read files, run tests, edit code, and produce concrete artifacts.", + "You do NOT need to call the `checkpoint` tool. A separate solver pass will observe your work and emit the canonical checkpoint.", + "Focus entirely on making verifiable progress toward the task goal.", + "", + "If you are executing an `execute-task` unit and the task is finished, `complete_task` remains mandatory.", + "End your turn when the bounded work is done or when you have made meaningful progress and need to wait for the next iteration.", + ); + return lines.join("\n"); +} + +/** + * Build the solver pass prompt that reads an executor transcript. + * + * Purpose: give the stable solver model the executor transcript and instruct it + * to classify what happened and emit the canonical checkpoint. + * + * Consumer: runUnitPhase after the executor pass returns. 
+ */ +export function buildSolverPassPrompt( + executorTranscript, + state, + unitType, + unitId, +) { + const transcriptText = stringifyMessages(executorTranscript); + const refusal = classifyExecutorRefusal(executorTranscript); + + const lines = [ + "## Autonomous Solver Pass", + "", + `You are the protocol solver for ${unitType} ${unitId} · iteration ${state?.iteration ?? "unknown"} of ${state?.maxIterations ?? "unknown"}.`, + "", + "Your sole job is to read the executor transcript below, classify what happened, and emit a canonical checkpoint via the `checkpoint` tool.", + "Do NOT edit files, run commands, or propose code changes. Observe and classify only.", + "", + "## Classification Rubric", + "", + "- `executor-refused`: The executor emitted a generic refusal ('I'm sorry', 'I cannot help', 'I don't have the necessary tools'). → checkpoint outcome=`blocked`, blockerReason=`executor-refused`.", + "- `executor-noop`: The executor emitted prose but made zero tool calls, zero file edits, and zero measurable progress. → checkpoint outcome=`blocked` (or `continue` ONLY if the executor explicitly states it is waiting for an external event).", + "- `progress`: The executor made concrete progress (file edits, tests run, tools called). → checkpoint outcome=`continue` with accurate completedItems/remainingItems.", + "- `complete`: The executor finished the unit's required artifact AND called any mandatory completion tool. → checkpoint outcome=`complete`.", + "- `blocker-other`: The executor hit a hard blocker (missing credentials, broken environment). → checkpoint outcome=`blocked` with a precise blockerReason.", + "", + "## Executor Transcript", + "", + "```", + transcriptText, + "```", + "", + ]; + + if (refusal) { + lines.push( + `⚠️ Refusal pattern detected: ${refusal.pattern}.`, + "The executor refused the task. 
Emit outcome='blocked' with blockerReason='executor-refused'.", + "", + ); + } + + lines.push( + "Call `checkpoint` with all eight PDD fields and accurate completedItems / remainingItems.", + "Your final action MUST be the checkpoint tool call.", + ); + + return lines.join("\n"); +} + +/** + * Detect whether an executor transcript contains zero meaningful work. + * + * Purpose: no-op iterations (continue checkpoint with zero file/tool activity) + * must not satisfy the repair gate. + * + * Consumer: assessAutonomousSolverTurn to reject no-op continues. + */ +export function isNoOpExecutorTranscript(messages) { + if (!Array.isArray(messages) || messages.length === 0) return true; + + // Refusal is always a no-op + if (classifyExecutorRefusal(messages)) return true; + + for (const msg of messages) { + if (!msg || typeof msg !== "object") continue; + + // Assistant requested non-checkpoint tool calls + if (Array.isArray(msg.tool_calls)) { + for (const tc of msg.tool_calls) { + const name = tc?.function?.name ?? tc?.name ?? ""; + if (name && name !== "checkpoint") { + return false; + } + } + } + + // Tool results from non-checkpoint tools + if (msg.role === "tool" || msg.role === "tool_result") { + const name = msg.name ?? ""; + if (name && name !== "checkpoint") { + return false; + } + } + + // Content that shows concrete work was done + const content = typeof msg.content === "string" ? msg.content : ""; + if ( + content.includes("File edited") || + content.includes("File written") || + content.includes("File created") || + content.includes("```diff") || + content.includes("--- a/") || + content.includes("+++ b/") + ) { + return false; + } + } + + return true; +} + /** * Record a solver checkpoint and update the markdown projection. * @@ -541,6 +696,141 @@ export function recordAutonomousSolverMissingCheckpointRetry( return nextState; } +/** + * Detect that the executor model refused the task outright (rather than + * attempting and failing the protocol). 
+ * + * Why: when a routed executor model (e.g. a code-completion model like + * Codestral) lacks the agentic capabilities required for the unit, it emits a + * generic refusal — "I'm sorry, I currently don't have the necessary tools to + * assist with that specific request." The existing missing-checkpoint repair + * loop will dutifully re-prompt the same model until it emits a syntactically + * valid checkpoint with zero work, fabricating forward progress. Refusal must + * be caught earlier and surfaced as a `blocked` outcome so the loop pauses (or + * the executor model can be escalated on retry) rather than synthesizing a + * `continue` over no work. + * + * Returns null when no refusal pattern is detected. + * + * Consumer: runUnitPhase short-circuits the repair loop on a positive match. + */ +export function classifyExecutorRefusal(messages) { + const text = stringifyMessages(messages); + if (!text.trim()) return null; + const lower = text.toLowerCase(); + const patterns = [ + { + id: "apology-no-tools", + regex: + /i(?:'m| am)\s+sorry[^.]{0,80}(?:don't|do not|cannot|can't)\s+have\s+(?:the\s+)?(?:necessary\s+)?tools?/i, + }, + { + id: "cannot-assist", + regex: + /i\s+(?:cannot|can't|am unable to|won't be able to)\s+(?:assist|help)\s+with\s+(?:that|this)/i, + }, + { + id: "not-able-to-help", + regex: + /i\s+(?:am\s+not\s+able\s+to|do not have the ability to|don't have the ability to)\s+(?:help|assist|complete|perform)/i, + }, + { + id: "feel-free-to-ask", + // Catches the canonical "I'm sorry … feel free to ask" deflection even + // when the apology phrasing doesn't match the first two patterns. 
+ regex: + /(?:i(?:'m| am)\s+sorry|i\s+apologi[sz]e)[\s\S]{0,200}feel\s+free\s+to\s+ask/i, + }, + { + id: "outside-capabilities", + regex: + /(?:that's|that is|this is)\s+(?:outside|beyond)\s+(?:my|the)\s+(?:capabilities|abilities|scope)/i, + }, + ]; + for (const pattern of patterns) { + if (pattern.regex.test(lower) || pattern.regex.test(text)) { + return { + classification: "executor-refused", + pattern: pattern.id, + summary: + "The executor model refused the task rather than attempting it. This is a capability/routing problem, not a protocol problem — repairing the prompt will not produce progress.", + evidence: truncateEvidence(text), + }; + } + } + return null; +} + +/** + * Memoized lookup: is the `checkpoint` tool registered in the SF extension + * manifest? Used by `classifyAutonomousSolverMissingCheckpointFailure` to + * disambiguate "agent says tool is unavailable" from "agent mentioned a real + * tool but did not call it." + * + * Why memoized: the previous implementation read the manifest from disk on + * every classifier call, with CWD-sensitive path probing — surprising hidden + * I/O inside what reads like a pure function, and a test-unfriendly coupling. + * The manifest does not change while the process is running, so a single + * memoized read at first call is correct and fast. + * + * Callers that want test-time control (or are running in environments where + * the manifest can't be located, e.g. CI fixtures) pass an explicit + * `checkpointToolRegistered` override to the classifier instead — no need to + * stub the filesystem. 
+ */ +let _checkpointToolRegisteredCache = null; +function isCheckpointToolRegisteredFromManifest() { + if (_checkpointToolRegisteredCache !== null) { + return _checkpointToolRegisteredCache; + } + try { + const manifestPath = join( + process.cwd(), + "dist", + "resources", + "extensions", + "sf", + "extension-manifest.json", + ); + const srcManifestPath = join( + process.cwd(), + "src", + "resources", + "extensions", + "sf", + "extension-manifest.json", + ); + const manifestContent = existsSync(manifestPath) + ? readFileSync(manifestPath, "utf-8") + : existsSync(srcManifestPath) + ? readFileSync(srcManifestPath, "utf-8") + : null; + if (!manifestContent) { + _checkpointToolRegisteredCache = false; + return false; + } + const manifest = JSON.parse(manifestContent); + _checkpointToolRegisteredCache = + Array.isArray(manifest?.provides?.tools) && + manifest.provides.tools.includes("checkpoint"); + return _checkpointToolRegisteredCache; + } catch { + _checkpointToolRegisteredCache = false; + return false; + } +} + +/** + * Test-only escape hatch to reset the manifest-lookup memoization. Tests that + * exercise the classifier under different "is checkpoint registered" assumptions + * should prefer the explicit `options.checkpointToolRegistered` override on + * `classifyAutonomousSolverMissingCheckpointFailure` — this function exists + * only as a safety net for tests that need to clear a polluted module cache. + */ +export function _resetCheckpointToolRegisteredCacheForTests() { + _checkpointToolRegisteredCache = null; +} + /** * Classify why a solver turn omitted the checkpoint tool. * @@ -549,8 +839,18 @@ export function recordAutonomousSolverMissingCheckpointRetry( * * Consumer: runUnitPhase before repair redispatch and before missing-checkpoint * pause/self-feedback. 
+ * + * @param {Array} messages + * @param {object} [options] + * @param {boolean} [options.checkpointToolRegistered] - Explicit override for + * the "is the checkpoint tool registered in our manifest?" question. When + * omitted, falls back to the memoized manifest lookup. Tests should pass + * this explicitly so they don't depend on CWD or on-disk dist/ state. */ -export function classifyAutonomousSolverMissingCheckpointFailure(messages) { +export function classifyAutonomousSolverMissingCheckpointFailure( + messages, + options = {}, +) { const text = stringifyMessages(messages); const lower = text.toLowerCase(); if (!text.trim()) { @@ -561,43 +861,12 @@ export function classifyAutonomousSolverMissingCheckpointFailure(messages) { }; } const mentionsCheckpoint = lower.includes("checkpoint"); - // Check whether checkpoint is actually registered in the manifest. - // When the agent reports "tool unavailable" but the tool IS registered, this means - // the agent mentioned the tool without calling it — reclassify accordingly to - // break the self-referential repair loop. - const checkpointToolIsRegistered = (() => { - try { - const manifestPath = join( - process.cwd(), - "dist", - "resources", - "extensions", - "sf", - "extension-manifest.json", - ); - const srcManifestPath = join( - process.cwd(), - "src", - "resources", - "extensions", - "sf", - "extension-manifest.json", - ); - const manifestContent = existsSync(manifestPath) - ? readFileSync(manifestPath, "utf-8") - : existsSync(srcManifestPath) - ? readFileSync(srcManifestPath, "utf-8") - : null; - if (!manifestContent) return false; - const manifest = JSON.parse(manifestContent); - return ( - Array.isArray(manifest?.provides?.tools) && - manifest.provides.tools.includes("checkpoint") - ); - } catch { - return false; - } - })(); + // Resolve "is checkpoint registered" — explicit override wins, otherwise + // fall back to the memoized manifest lookup. 
+ const checkpointToolIsRegistered = + typeof options.checkpointToolRegistered === "boolean" + ? options.checkpointToolRegistered + : isCheckpointToolRegisteredFromManifest(); const mentionsToolUnavailable = /(unknown|unavailable|not available|not found|no such) tool/.test(lower) || (lower.includes("checkpoint") && @@ -676,7 +945,12 @@ export function classifyAutonomousSolverMissingCheckpointFailure(messages) { * * Consumer: runUnitPhase immediately after each unit turn. */ -export function assessAutonomousSolverTurn(basePath, unitType, unitId) { +export function assessAutonomousSolverTurn( + basePath, + unitType, + unitId, + executorMessages = null, +) { const state = readJson(statePath(basePath)); if (!sameUnit(state, unitType, unitId)) { return { @@ -730,6 +1004,34 @@ export function assessAutonomousSolverTurn(basePath, unitType, unitId) { checkpoint, }; } + // No-op detection: a continue with zero work is not real progress + if ( + (checkpoint.outcome === "continue" || checkpoint.outcome === "decide") && + executorMessages && + isNoOpExecutorTranscript(executorMessages) + ) { + const repairAttempts = getMissingCheckpointRepairAttempts(state).filter( + (attempt) => Number(attempt.iteration) === Number(state.iteration), + ).length; + if (repairAttempts >= DEFAULT_MISSING_CHECKPOINT_REPAIR_ATTEMPTS) { + return { + action: "pause", + reason: "solver-noop-continue", + state, + repairAttempts, + maxRepairAttempts: DEFAULT_MISSING_CHECKPOINT_REPAIR_ATTEMPTS, + checkpoint, + }; + } + return { + action: "missing-checkpoint-retry", + reason: "solver-noop-continue", + state, + repairAttempt: repairAttempts + 1, + maxRepairAttempts: DEFAULT_MISSING_CHECKPOINT_REPAIR_ATTEMPTS, + checkpoint, + }; + } // "decide" is treated as "continue": agent reconstructs best-effort and moves on return { action: diff --git a/src/resources/extensions/sf/solver-model.js b/src/resources/extensions/sf/solver-model.js new file mode 100644 index 000000000..849774a27 --- /dev/null +++ 
b/src/resources/extensions/sf/solver-model.js @@ -0,0 +1,119 @@ +/** + * solver-model.js — pinned model selection for the autonomous solver role. + * + * Why this exists: + * The "executor" and "autonomous solver" roles were historically conflated + * into a single LLM call selected by the unit-type router. When the router + * picked a coding-tuned or capability-limited model for the executor (e.g. + * `mistral/codestral-latest`, `google-gemini-cli/gemini-3-flash-preview`), + * the same model was expected to (a) do the unit work and (b) emit the + * canonical protocol checkpoint. Models that refuse agentic tasks or fail + * to follow tool-use contracts broke the protocol layer entirely — and the + * missing-checkpoint repair loop could only re-prompt the same broken + * model, synthesizing fake `continue` outcomes over zero progress. + * + * The solver role MUST stay on a stable, agentic, refusal-resistant model + * independent of any per-unit routing choices. This module is the single + * place that decision is made. + * + * Contract: + * - Default solver model is `kimi-k2.6` (provider: `kimi-coding`). + * - Preference override is accepted ONLY when the operator has explicitly + * opted into it via `preferences.autonomousSolver.model`. Router output, + * benchmark scoring, learning blender, and unit-type routing are NEVER + * consulted here. + * - A fallback chain is provided so a brief outage of the primary does not + * take the protocol layer with it. + * + * Consumers (forthcoming): the solver-pass invocation in auto/phases-unit.js + * once the two-layer loop lands (see ADR-0079). + */ + +/** + * Default model for the autonomous solver role. Locked. Do not change without + * an ADR update — this is a protocol invariant, not a tuning parameter. + */ +export const SOLVER_MODEL_DEFAULT = { + provider: "kimi-coding", + id: "kimi-k2.6", +}; + +/** + * Explicit fallback chain when the default is unreachable. Ordered by + * preference. 
Each entry must be a stable agentic model that follows tool-use + * contracts; nothing on this list is a code-completion-only model. + */ +export const SOLVER_MODEL_FALLBACKS = [ + { provider: "anthropic", id: "claude-sonnet-4-6" }, + { provider: "anthropic", id: "claude-opus-4-7" }, +]; + +/** + * Resolve which model should fill the solver role for the current run. + * + * @param {object} [preferences] - Operator preferences object. Only consulted + * for `preferences.autonomousSolver.model`. Anything else is ignored. + * @returns {{ provider: string, id: string }} the selected solver model + */ +export function resolveSolverModel(preferences) { + const override = preferences?.autonomousSolver?.model; + if (override && typeof override === "object" && override.id) { + return { + provider: String(override.provider ?? SOLVER_MODEL_DEFAULT.provider), + id: String(override.id), + }; + } + if (typeof override === "string" && override.trim()) { + // Allow "provider/model" short form for ergonomics; default provider when + // only a model id is supplied. + const trimmed = override.trim(); + const slash = trimmed.indexOf("/"); + if (slash > 0) { + return { + provider: trimmed.slice(0, slash), + id: trimmed.slice(slash + 1), + }; + } + return { provider: SOLVER_MODEL_DEFAULT.provider, id: trimmed }; + } + return { ...SOLVER_MODEL_DEFAULT }; +} + +/** + * Resolve the ordered candidate list for the solver role: primary first, then + * the fallback chain. Callers iterate until they find a reachable provider. 
+ * + * @param {object} [preferences] + * @returns {Array<{ provider: string, id: string }>} + */ +export function resolveSolverModelCandidates(preferences) { + const primary = resolveSolverModel(preferences); + const candidates = [primary]; + for (const fallback of SOLVER_MODEL_FALLBACKS) { + if ( + fallback.provider === primary.provider && + fallback.id === primary.id + ) { + continue; + } + candidates.push({ ...fallback }); + } + return candidates; +} + +/** + * True if the supplied model would be selected as solver for these preferences. + * Useful for invariants and tests. + * + * @param {{ provider?: string, id?: string }} model + * @param {object} [preferences] + * @returns {boolean} + */ +export function isSolverModel(model, preferences) { + if (!model?.id) return false; + const solver = resolveSolverModel(preferences); + return ( + String(model.provider ?? solver.provider) === solver.provider && + String(model.id) === solver.id + ); +} diff --git a/src/resources/extensions/sf/tests/autonomous-solver.test.mjs b/src/resources/extensions/sf/tests/autonomous-solver.test.mjs index 96448023b..e80288906 100644 --- a/src/resources/extensions/sf/tests/autonomous-solver.test.mjs +++ b/src/resources/extensions/sf/tests/autonomous-solver.test.mjs @@ -7,13 +7,17 @@ import { appendAutonomousSolverSteering, assessAutonomousSolverTurn, beginAutonomousSolverIteration, + buildAutonomousExecutorPromptBlock, buildAutonomousSolverMissingCheckpointRepairPrompt, buildAutonomousSolverPromptBlock, + buildSolverPassPrompt, classifyAutonomousSolverMissingCheckpointFailure, + classifyExecutorRefusal, consumePendingAutonomousSolverSteering, detectSolverLoop, getConfiguredAutonomousSolverMaxIterations, getSolverPhase, + isNoOpExecutorTranscript, readAutonomousSolverState, readLatestAutonomousSolverCheckpoint, recordAutonomousSolverMissingCheckpointRetry, @@ -565,3 +569,368 @@ describe("autonomous solver", () => { expect(prompt).toContain("Do not describe or narrate the checkpoint"); 
}); }); + +describe("classifyExecutorRefusal", () => { + test("detects the canonical apology-no-tools refusal verbatim from M001-6377a4/S04/T02", () => { + // Real-world refusal captured when mistral/codestral-latest was routed as + // executor for execute-task M001-6377a4/S04/T02 on 2026-05-12. The + // classifier must catch this exact phrasing or the repair loop will + // re-prompt the same broken model and synthesize fake progress. + const refusal = classifyExecutorRefusal([ + { + role: "assistant", + content: + "I'm sorry, but I currently don't have the necessary tools to assist with that specific request. If you have any other questions or need help with something else, feel free to ask!", + }, + ]); + expect(refusal).not.toBeNull(); + expect(refusal.classification).toBe("executor-refused"); + // The "apology-no-tools" pattern is the most specific; "feel-free-to-ask" + // is a fallback that may match the same string. Either is acceptable as + // long as the result is a refusal. + expect(["apology-no-tools", "feel-free-to-ask"]).toContain(refusal.pattern); + }); + + test("detects 'I cannot assist with that' phrasing", () => { + const refusal = classifyExecutorRefusal([ + { role: "assistant", content: "I cannot assist with that request." }, + ]); + expect(refusal).not.toBeNull(); + expect(refusal.pattern).toBe("cannot-assist"); + }); + + test("detects 'outside my capabilities' phrasing", () => { + const refusal = classifyExecutorRefusal([ + { + role: "assistant", + content: + "That's outside my capabilities — I am unable to perform file edits.", + }, + ]); + expect(refusal).not.toBeNull(); + }); + + test("returns null on legitimate work transcripts", () => { + expect( + classifyExecutorRefusal([ + { role: "assistant", content: "I read the file and edited line 42." 
}, + ]), + ).toBeNull(); + expect( + classifyExecutorRefusal([ + { + role: "assistant", + content: + "Checkpoint recorded: outcome=continue, completed steps 1-3.", + }, + ]), + ).toBeNull(); + }); + + test("returns null on empty or missing transcripts", () => { + expect(classifyExecutorRefusal(null)).toBeNull(); + expect(classifyExecutorRefusal([])).toBeNull(); + expect( + classifyExecutorRefusal([{ role: "assistant", content: "" }]), + ).toBeNull(); + }); + + test("does not misfire on the apology word in normal narration", () => { + // We only want to match refusals, not any sentence containing "sorry". + // A model saying "Sorry for the long output below — here's the full + // diff" should not be classified as a refusal. + const refusal = classifyExecutorRefusal([ + { + role: "assistant", + content: + "Sorry for the long output below — here is the full diff of the change. I am about to run the tests now.", + }, + ]); + expect(refusal).toBeNull(); + }); + + test("evidence is truncated for storage", () => { + const refusal = classifyExecutorRefusal([ + { + role: "assistant", + content: + "I'm sorry, I currently don't have the necessary tools. 
" + + "x".repeat(8000), + }, + ]); + expect(refusal).not.toBeNull(); + expect(refusal.evidence.length).toBeLessThanOrEqual(4200); + }); +}); + +describe("buildAutonomousExecutorPromptBlock", () => { + test("omits checkpoint requirement but keeps phase guidance", () => { + const prompt = buildAutonomousExecutorPromptBlock({ + unitType: "execute-task", + unitId: "M001/S01/T01", + iteration: 3, + maxIterations: 12, + }); + + expect(prompt).toContain("Autonomous Executor Contract"); + expect(prompt).toContain("/autonomous iteration 3 of 12"); + expect(prompt).toContain("EXECUTE PHASE"); + expect(prompt).not.toContain("CHECKPOINT REQUIREMENT"); + expect(prompt).not.toContain( + "Hard requirement: before ending the turn, call the actual `checkpoint` tool", + ); + expect(prompt).toContain("You do NOT need to call the `checkpoint` tool"); + expect(prompt).toContain("A separate solver pass will observe your work"); + }); +}); + +describe("buildSolverPassPrompt", () => { + test("includes executor transcript and classification rubric", () => { + const prompt = buildSolverPassPrompt( + [{ role: "assistant", content: "I edited the file." 
}], + { iteration: 2, maxIterations: 10 }, + "execute-task", + "M001/S01/T01", + ); + + expect(prompt).toContain("Autonomous Solver Pass"); + expect(prompt).toContain("protocol solver for execute-task M001/S01/T01"); + expect(prompt).toContain("Classification Rubric"); + expect(prompt).toContain("executor-refused"); + expect(prompt).toContain("executor-noop"); + expect(prompt).toContain("progress"); + expect(prompt).toContain("I edited the file."); + expect(prompt).toContain( + "Your final action MUST be the checkpoint tool call", + ); + }); + + test("injects refusal warning when refusal is detected", () => { + const prompt = buildSolverPassPrompt( + [ + { + role: "assistant", + content: + "I'm sorry, but I currently don't have the necessary tools to assist with that specific request.", + }, + ], + { iteration: 1, maxIterations: 10 }, + "execute-task", + "M001/S01/T01", + ); + + expect(prompt).toContain("Refusal pattern detected"); + expect(prompt).toContain("Emit outcome='blocked'"); + }); +}); + +describe("isNoOpExecutorTranscript", () => { + test("returns true for empty transcripts", () => { + expect(isNoOpExecutorTranscript([])).toBe(true); + expect(isNoOpExecutorTranscript(null)).toBe(true); + expect(isNoOpExecutorTranscript(undefined)).toBe(true); + }); + + test("returns true for refusal transcripts", () => { + expect( + isNoOpExecutorTranscript([ + { + role: "assistant", + content: + "I'm sorry, but I currently don't have the necessary tools to assist with that specific request.", + }, + ]), + ).toBe(true); + }); + + test("returns false for transcripts with tool calls", () => { + expect( + isNoOpExecutorTranscript([ + { + role: "assistant", + content: "I'll edit the file now.", + tool_calls: [ + { + id: "tc_1", + function: { name: "edit", arguments: "{}" }, + }, + ], + }, + ]), + ).toBe(false); + }); + + test("returns false for tool result messages", () => { + expect( + isNoOpExecutorTranscript([ + { + role: "tool", + name: "bash", + content: "done", + }, + 
]), + ).toBe(false); + }); + + test("returns true for prose-only transcripts", () => { + expect( + isNoOpExecutorTranscript([ + { + role: "assistant", + content: "I think I understand the problem now.", + }, + ]), + ).toBe(true); + }); + + test("returns true when only checkpoint tool was called", () => { + expect( + isNoOpExecutorTranscript([ + { + role: "assistant", + content: "Let me checkpoint.", + tool_calls: [ + { + id: "tc_1", + function: { name: "checkpoint", arguments: "{}" }, + }, + ], + }, + ]), + ).toBe(true); + }); +}); + +describe("assessAutonomousSolverTurn no-op detection", () => { + test("continue_with_no_op_executor_messages_returns_missing_checkpoint_retry", () => { + const project = makeProject(); + beginAutonomousSolverIteration(project, "execute-task", "M001/S01/T01"); + appendAutonomousSolverCheckpoint(project, { + unitType: "execute-task", + unitId: "M001/S01/T01", + outcome: "continue", + summary: "More work remains.", + completedItems: ["First pass"], + remainingItems: ["Second pass"], + verificationEvidence: ["npx vitest run focused.test.mjs"], + pdd: pdd(), + }); + + const result = assessAutonomousSolverTurn( + project, + "execute-task", + "M001/S01/T01", + [ + { + role: "assistant", + content: "I think I understand the problem now.", + }, + ], + ); + expect(result.action).toBe("missing-checkpoint-retry"); + expect(result.reason).toBe("solver-noop-continue"); + }); + + test("continue_with_real_work_executor_messages_returns_continue", () => { + const project = makeProject(); + beginAutonomousSolverIteration(project, "execute-task", "M001/S01/T01"); + appendAutonomousSolverCheckpoint(project, { + unitType: "execute-task", + unitId: "M001/S01/T01", + outcome: "continue", + summary: "More work remains.", + completedItems: ["First pass"], + remainingItems: ["Second pass"], + verificationEvidence: ["npx vitest run focused.test.mjs"], + pdd: pdd(), + }); + + const result = assessAutonomousSolverTurn( + project, + "execute-task", + 
"M001/S01/T01", + [ + { + role: "assistant", + content: "I'll edit the file now.", + tool_calls: [ + { + id: "tc_1", + function: { name: "edit", arguments: "{}" }, + }, + ], + }, + ], + ); + expect(result.action).toBe("continue"); + expect(result.reason).toBe("solver-continue"); + }); + + test("no_op_continue_after_max_repairs_returns_pause", () => { + const project = makeProject(); + beginAutonomousSolverIteration(project, "execute-task", "M001/S01/T01"); + appendAutonomousSolverCheckpoint(project, { + unitType: "execute-task", + unitId: "M001/S01/T01", + outcome: "continue", + summary: "More work remains.", + completedItems: ["First pass"], + remainingItems: ["Second pass"], + verificationEvidence: ["npx vitest run focused.test.mjs"], + pdd: pdd(), + }); + + for (let i = 0; i < 4; i++) { + recordAutonomousSolverMissingCheckpointRetry( + project, + "execute-task", + "M001/S01/T01", + ); + } + + const result = assessAutonomousSolverTurn( + project, + "execute-task", + "M001/S01/T01", + [ + { + role: "assistant", + content: "I think I understand the problem now.", + }, + ], + ); + expect(result.action).toBe("pause"); + expect(result.reason).toBe("solver-noop-continue"); + expect(result.repairAttempts).toBe(4); + }); + + test("refusal_transcript_returns_missing_checkpoint_retry_even_with_continue_checkpoint", () => { + const project = makeProject(); + beginAutonomousSolverIteration(project, "execute-task", "M001/S01/T01"); + appendAutonomousSolverCheckpoint(project, { + unitType: "execute-task", + unitId: "M001/S01/T01", + outcome: "continue", + summary: "More work remains.", + completedItems: ["First pass"], + remainingItems: ["Second pass"], + verificationEvidence: ["npx vitest run focused.test.mjs"], + pdd: pdd(), + }); + + const result = assessAutonomousSolverTurn( + project, + "execute-task", + "M001/S01/T01", + [ + { + role: "assistant", + content: + "I'm sorry, but I currently don't have the necessary tools to assist with that specific request.", + }, + ], + 
); + expect(result.action).toBe("missing-checkpoint-retry"); + expect(result.reason).toBe("solver-noop-continue"); + }); +});
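
Reviewer note: the doc comment on `resolveSolverModelCandidates` in `solver-model.js` says callers iterate the ordered list until they find a reachable provider, but the patch does not yet include such a caller. The block below is a minimal sketch of what that loop might look like, not the planned implementation. `pickReachableSolverModel` and the `isReachable` probe are hypothetical names that do not appear in the patch, and the `candidates` array is a hand-copied stand-in for what `resolveSolverModelCandidates()` returns with no preference override. A real probe would most likely be asynchronous; the sketch keeps it synchronous for brevity.

```javascript
// Stand-in for resolveSolverModelCandidates(): primary first, then fallbacks.
// Values are copied by hand from solver-model.js in this patch.
const candidates = [
  { provider: "kimi-coding", id: "kimi-k2.6" },
  { provider: "anthropic", id: "claude-sonnet-4-6" },
  { provider: "anthropic", id: "claude-opus-4-7" },
];

/**
 * Hypothetical helper (not part of the patch): return the first candidate that
 * the caller-supplied probe reports as reachable, or null when none is.
 */
function pickReachableSolverModel(candidateList, isReachable) {
  for (const candidate of candidateList) {
    try {
      if (isReachable(candidate)) return candidate;
    } catch {
      // A probe that throws is treated as "unreachable"; try the next entry.
    }
  }
  return null;
}

// Usage: simulate an outage of the primary provider.
const unreachableProviders = new Set(["kimi-coding"]);
const chosen = pickReachableSolverModel(
  candidates,
  (model) => !unreachableProviders.has(model.provider),
);
console.log(chosen); // the first anthropic fallback in this simulated outage
```

Returning `null` instead of throwing leaves the pause-or-escalate decision to the caller, which seems to match how the assessment functions in this patch report an `action` rather than raising. That design choice is an assumption of the sketch.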