sf snapshot: uncommitted changes after 56m inactivity

2026-05-16 14:59:40 +02:00 · 2026-05-16 14:59:40 +02:00 · da0c41d375
commit da0c41d375
parent 6071a9207c
8 changed files with 188 additions and 74 deletions
--- a/.sf/REQUIREMENTS.md
+++ b/.sf/REQUIREMENTS.md
@@ -136,6 +136,39 @@ This file is the explicit capability and coverage contract for the project.
 - Validation: unmapped
 - Notes: Requires (1) new prompt template `prompts/fill-milestone-vision.md`, (2) new dispatchable unit wired in `auto-dispatch.js` + `state-transition-matrix.js`, (3) an exception in `buildRegistryAndFindActive` for one-shot `status=complete && vision=""` repair, (4) inline-fixer handler that converts the R011 self-feedback entry into a dispatch. Must satisfy R006 (fail-open) — recovery-unit failure halts with notification, never crashes the loop.

+### R013 — Unified Dispatch v2: `inline` Scope for `full` Isolation
+- Class: core-capability
+- Status: active
+- Description: Implement the `inline` scope row of `UNIFIED_DISPATCH_V2_PLAN.md`'s parameter matrix (line 152: `full | managed | inline | single`) so the autonomous loop can execute units in-process without spawning a subprocess/worktree. A new `src/resources/extensions/sf/dispatch-layer.js` exposes `DispatchLayer.dispatch(opts)` per the plan's API spec (lines 51-138). When `scope: 'inline'` and `isolation: 'full'`, the unit's executor runs in the calling process against the project DB directly — no `child_process.spawn`, no session-status-io files, no worktree.
+- Why it matters: The current spawn-based path silently fails on `validate-milestone` and likely other unit types (self-feedback `sf-mp8bhp5s-cmgt8d`, critical, blocking) — worker session IDs are issued and tracked in `.sf/runtime/units/*.json` but the worker never writes its session JSONL and `recoveryAttempts` stays at 0 across runaway-final-warning phases. Universal across providers (kimi-k2.6 and minimax both produce 0 tool calls with heartbeats only). Adding an inline path naturally retires this whole class of bug for units that don't need worktree isolation. Also reduces process-start latency and removes the file-based-IPC pressure point that has accumulated multiple historical issues.
+- Source: spec
+- Primary owning slice: unmapped
+- Supporting slices: none
+- Validation: unmapped
+- Notes: Aligned with `docs/plans/UNIFIED_DISPATCH_V2_PLAN.md` (Qwen Plan, 2026-05-08). Scope of R013 is the **minimum slice** of that plan: just `full + managed + inline + single`. Other rows of the matrix (parallel/debate/chain inline, slice/milestone scope with worktrees) are out of scope for R013 and stay on their current implementations. Resolves `sf-mp8bhp5s-cmgt8d` and likely the 56+ historical `runaway-loop:idle-halt` entries on M005.
+
+### R014 — Inline Worker Bootstrap Without Spawned `sf` CLI
+- Class: core-capability
+- Status: active
+- Description: Extract the unit-execution code path that `sf headless autonomous` currently invokes after spawn into a callable function (`runUnitInline(unitType, unitId, ctx)`) usable from the same process. UOK kernel calls it directly when dispatching with `scope: 'inline'`. Must respect the single-writer invariant on `.sf/sf.db` (`sf-db.js`); the in-process call shares the kernel's existing WAL connection rather than opening a new one.
+- Why it matters: Today the unit executor is reachable only via subprocess argv parsing in the headless CLI surface. Without this extraction, R013's inline scope cannot wire a real executor — the dispatcher would have nothing to call. This is the prerequisite for R013.
+- Source: spec
+- Primary owning slice: unmapped
+- Supporting slices: none
+- Validation: unmapped
+- Notes: Reuses existing unit-context-manifest, prompt builders, and tool registries. The only change is execution surface: function call instead of process boundary. Session JSONL is still written for audit but to a path keyed off the in-process session ID, not a worker subprocess.
+
+### R015 — Spawn-Failure Loud Failure (Defensive)
+- Class: failure-visibility
+- Status: active
+- Description: Until R013/R014 land for every unit type, the existing spawn path must fail loudly. If a dispatched worker fails to write its session JSONL within a configurable timeout (default 30s) AND has zero `progressCount`, the runtime must (a) transition the unit to `status: failed`, (b) capture any stderr from the spawn into `lineage.events`, (c) emit a doctor-visible signal, and (d) trigger the retry path up to `maxRetries`. Today the runaway watchdog only fires a warning and never retries — `recoveryAttempts` stays at 0.
+- Why it matters: Even after inline scope retires the spawn path for the common cases, spawn-based dispatch will persist for milestone/slice-scope workers and parallel modes. Silent failure is the worst possible behavior — operator sees a "running" unit that's a ghost. This requirement keeps the spawn path observable for as long as it exists.
+- Source: spec
+- Primary owning slice: unmapped
+- Supporting slices: none
+- Validation: unmapped
+- Notes: Touches the runaway-recovery / unit-ownership / parallel-orchestrator surfaces. Distinct from R013 — R013 removes the bug for inline scope; R015 contains the bug for non-inline scope.
+
 ## Traceability

 | ID | Class | Status | Primary owner | Supporting | Proof |
@@ -152,10 +185,13 @@ This file is the explicit capability and coverage contract for the project.
 | R010 | quality-attribute | active | M005/S02 | none | unmapped |
 | R011 | failure-visibility | active | unmapped | none | unmapped |
 | R012 | differentiator | active | unmapped | none | unmapped |
+| R013 | core-capability | active | unmapped | none | unmapped |
+| R014 | core-capability | active | unmapped | none | unmapped |
+| R015 | failure-visibility | active | unmapped | none | unmapped |

 ## Coverage Summary

-- Active requirements: 12
+- Active requirements: 15
 - Mapped to slices: 10
 - Validated: 0
-- Unmapped active requirements: 2 (R011, R012 — pending planning into a new self-heal extension slice or M003 follow-on)
+- Unmapped active requirements: 5 (R011, R012 — self-heal extension; R013, R014, R015 — UNIFIED_DISPATCH_V2 inline scope, anchored to docs/plans/UNIFIED_DISPATCH_V2_PLAN.md)
--- a/packages/coding-agent/src/core/session-manager.ts
+++ b/packages/coding-agent/src/core/session-manager.ts
@@ -1086,15 +1086,17 @@ export class SessionManager {
 	_persist(entry: SessionEntry): void {
 		if (!this.persist || !this.sessionFile) return;

-		const hasAssistant = this.fileEntries.some(
-			(e) => e.type === "message" && e.message.role === "assistant",
-		);
-		if (!hasAssistant) {
-			// Mark as not flushed so when assistant arrives, all entries get written
-			this.flushed = false;
-			return;
-		}
-
+		// #R015-remediation (sf-mp8c0arc-vgw8io): previously this method
+		// deferred file creation until the first assistant message arrived
+		// (silent return on !hasAssistant). The intent was to avoid empty
+		// files for cancelled/never-started sessions, but the cost was
+		// silent invisibility when the LLM never produced an assistant
+		// message — failed sessions left zero forensic trail and the SF
+		// autonomous loop's watchdog couldn't tell a live session from a
+		// dead one. The eventual cost (debugging M005's chronic stuck
+		// state) far exceeded the saved disk space. We now write entries
+		// as soon as they're added, so the session JSONL exists with at
+		// least the session header + user prompt from the very first turn.
 		let release: (() => void) | undefined;
 		try {
 			release = tryAcquireLockSync(this.sessionFile);
--- a/src/resources/extensions/sf/auto-prompts.js
+++ b/src/resources/extensions/sf/auto-prompts.js
@@ -1150,10 +1150,9 @@ export async function buildDiscussProjectPrompt(
 	});
 	const parts = [];
 	if (composed) parts.push(composed);
-	const knowledgeBlockDP = await inlineKnowledgeScoped(base, []);
-	if (knowledgeBlockDP) parts.push(knowledgeBlockDP);
-	const graphBlockDP = await inlineGraphSubgraph(base, "project setup", { budget: 3000 });
-	if (graphBlockDP) parts.push(graphBlockDP);
+	// #M005-remediation: knowledge/graph are computed artifacts already
+	// included in `composed` via the computed registry above. Manual
+	// re-injection here caused duplicate sections in the prompt output.
 	const inlinedContext = capPreamble(
 		`## Inlined Context (preloaded — do not re-read these files)\n\n${parts.join("\n\n---\n\n")}`,
 	);
@@ -1208,12 +1207,7 @@ export async function buildDiscussRequirementsPrompt(
 	});
 	const parts = [];
 	if (composed) parts.push(composed);
-	const knowledgeBlockDR = await inlineKnowledgeScoped(base, []);
-	if (knowledgeBlockDR) parts.push(knowledgeBlockDR);
-	const graphBlockDR = await inlineGraphSubgraph(base, "project requirements", {
-		budget: 3000,
-	});
-	if (graphBlockDR) parts.push(graphBlockDR);
+	// #M005-remediation: knowledge/graph included via composed (computed registry).
 	const inlinedContext = capPreamble(
 		`## Inlined Context (preloaded — do not re-read these files)\n\n${parts.join("\n\n---\n\n")}`,
 	);
@@ -1276,12 +1270,7 @@ export async function buildResearchProjectPrompt(
 	});
 	const parts = [];
 	if (composed) parts.push(composed);
-	const knowledgeBlockRP = await inlineKnowledgeScoped(base, []);
-	if (knowledgeBlockRP) parts.push(knowledgeBlockRP);
-	const graphBlockRP = await inlineGraphSubgraph(base, "project research", {
-		budget: 3000,
-	});
-	if (graphBlockRP) parts.push(graphBlockRP);
+	// #M005-remediation: knowledge/graph included via composed (computed registry).
 	const inlinedContext = capPreamble(
 		`## Inlined Context (preloaded — do not re-read these files)\n\n${parts.join("\n\n---\n\n")}`,
 	);
@@ -1346,12 +1335,7 @@ export async function buildDiscussMilestonePrompt(
 	});
 	const parts = [];
 	if (composed) parts.push(composed);
-	const knowledgeBlockDM = await inlineKnowledgeScoped(base, []);
-	if (knowledgeBlockDM) parts.push(knowledgeBlockDM);
-	const graphBlockDM = await inlineGraphSubgraph(base, `${mid} ${midTitle}`, {
-		budget: 3000,
-	});
-	if (graphBlockDM) parts.push(graphBlockDM);
+	// #M005-remediation: knowledge/graph included via composed (computed registry).
 	const inlinedContext = capPreamble(
 		`## Inlined Context (preloaded — do not re-read these files)\n\n${parts.join("\n\n---\n\n")}`,
 	);
@@ -1373,11 +1357,12 @@ export async function buildDiscussMilestonePrompt(
 	return basePrompt;
 }
 export async function buildResearchMilestonePrompt(mid, midTitle, base) {
-	// #4782 phase 3: research-milestone migrated through the composer.
-	// Declared inline order: milestone-context, project, requirements,
-	// decisions, templates. Knowledge stays outside the composer
-	// (budget-driven, scoped by keyword extraction — future phase folds
-	// policy-driven blocks in).
+	// #M005-remediation: research-milestone now fully delegates ordering
+	// to the v2 composer. The manifest declares knowledge as an inline
+	// artifact (positioned between decisions and templates) so its
+	// keyword-budgeted resolver runs in the correct slot. Graph is a
+	// computed artifact appended after templates. Eliminates the manual
+	// splice + duplicate-injection pattern previously inlined here.
 	const resolveArtifact = async (key) => {
 		switch (key) {
 			case "milestone-context": {
@@ -1391,6 +1376,8 @@ export async function buildResearchMilestonePrompt(mid, midTitle, base) {
 				return await inlineRequirementsFromDb(base, mid);
 			case "decisions":
 				return await inlineDecisionsFromDb(base, mid);
+			case "knowledge":
+				return await inlineKnowledgeBudgeted(base, extractKeywords(midTitle));
 			case "templates":
 				return inlineTemplate("research", "Research");
 			default:
@@ -1398,37 +1385,18 @@ export async function buildResearchMilestonePrompt(mid, midTitle, base) {
 		}
 	};
 	const { inline: composed } = await composeUnitContext("research-milestone", {
-		resolveArtifact,
-	});
-	// Knowledge block stays outside the composer — budgeted, scoped via
-	// keyword extraction (#4719). Inserted between decisions and the
-	// templates block to match the pre-migration output order. We split
-	// the composer output around the templates section to preserve that
-	// ordering.
-	const knowledgeInlineRM = await inlineKnowledgeBudgeted(
 		base,
-		extractKeywords(midTitle),
-	);
-	const graphBlockRM = await inlineGraphSubgraph(base, `${mid} ${midTitle}`, {
-		budget: 3000,
+		resolveArtifact,
+		computed: {
+			graph: {
+				build: async (_, b) =>
+					inlineGraphSubgraph(b, `${mid} ${midTitle}`, { budget: 3000 }),
+				inputs: {},
+			},
+		},
 	});
 	const parts = [];
-	if (knowledgeInlineRM && composed) {
-		// Insert knowledge before the template block so the overall order is:
-		//   milestone-context → project → requirements → decisions → KNOWLEDGE → research template
-		const idx = composed.lastIndexOf("### Output Template:");
-		if (idx > 0) {
-			const before = composed.slice(0, idx).replace(/\n\n---\n\n$/, "");
-			const after = composed.slice(idx);
-			parts.push(before, knowledgeInlineRM, after);
-		} else {
-			parts.push(composed, knowledgeInlineRM);
-		}
-	} else if (composed) {
-		parts.push(composed);
-		if (knowledgeInlineRM) parts.push(knowledgeInlineRM);
-	}
-	if (graphBlockRM) parts.push(graphBlockRM);
+	if (composed) parts.push(composed);
 	const inlinedContext = capPreamble(
 		`## Inlined Context (preloaded — do not re-read these files)\n\n${parts.join("\n\n---\n\n")}`,
 	);
--- a/src/resources/extensions/sf/auto-timers.js
+++ b/src/resources/extensions/sf/auto-timers.js
@@ -271,6 +271,51 @@ export function startUnitSupervision(sctx) {
 					);
 					return;
 				}
+				if (decision.action === "fail") {
+					if (getInFlightToolCount() > 0) return;
+					await closeoutUnit(
+						ctx,
+						s.basePath,
+						s.currentUnit.type,
+						s.currentUnit.id,
+						s.currentUnit.startedAt,
+						buildSnapshotOpts(),
+					);
+					writeUnitRuntimeRecord(
+						s.basePath,
+						unitType,
+						unitId,
+						s.currentUnit.startedAt,
+						{
+							phase: "failed-silent-worker",
+							status: "failed",
+							lastProgressAt: Date.now(),
+							lastProgressKind: "runaway-guard-fail",
+							runawayGuardFail: decision.metadata,
+						},
+					);
+					const unitParts = unitId.split("/");
+					recordSelfFeedback(
+						{
+							kind: "runaway-loop:silent-worker-failure",
+							severity: "high",
+							summary: decision.reason,
+							evidence: JSON.stringify(decision.metadata, null, 2),
+							suggestedFix:
+								"LLM session never produced an assistant message — check session-manager.ts:1086-1096 (silent _persist skip) and verify the model/provider is responding. The dispatcher will attempt retry within maxRetries; if persistent, transitions to blocked.",
+							occurredIn: {
+								unitType,
+								milestone: unitParts[0],
+								slice: unitParts[1],
+								task: unitParts.slice(2).join("/") || undefined,
+							},
+							source: "detector",
+						},
+						s.basePath,
+					);
+					ctx.ui.notify(decision.reason, "error");
+					return;
+				}
 				if (decision.action === "pause") {
 					if (getInFlightToolCount() > 0) return;
 					await closeoutUnit(
--- a/src/resources/extensions/sf/prompts/research-milestone.md
+++ b/src/resources/extensions/sf/prompts/research-milestone.md
@@ -24,13 +24,13 @@ Write for the roadmap planner. It needs to understand: what exists in the codeba

 ## Calibrate Depth

-Read the milestone title, the user's stated intent, and any inlined context above. Ask: does this milestone introduce new technology, span multiple unfamiliar subsystems, or have ambiguous scope? Or is it a focused feature in well-understood territory?
+**Default to deep research.** Read the milestone title, the user's stated intent, and any inlined context above. Use deep mode unless you can give a concrete one-line justification for downscoping. The cost of light research on a genuinely uncertain milestone (wrong slice boundaries, missed pitfalls, fabricated risk story by the planner downstream) is far greater than the cost of a thorough exploration on a milestone that turned out simple.

-- **Deep research** — new technology, novel architecture, multiple risky integrations, or genuinely ambiguous scope. Explore broadly, look up docs, investigate alternatives. Write the full strategic frame including risks, boundaries, and slice-ordering rationale. This is the default when the milestone is genuinely uncertain.
-- **Targeted research** — known technology but new to this codebase, or moderate complexity. Explore the relevant areas, check one or two libraries, identify constraints. Skip Comparable Systems if nothing applies.
-- **Light research** — well-scoped milestone using established patterns already in the codebase. Read the relevant files to confirm the pattern, note constraints, write Summary + Recommendation + Implementation Landscape. A light milestone-research doc can be 30-50 lines. Don't manufacture risks or comparable-systems analysis for work that doesn't have them.
+- **Deep research (DEFAULT)** — explore broadly, look up docs via DeepWiki/Context7, run multiple web searches for comparable systems, investigate alternatives. Write the full strategic frame including risks, boundaries, comparable systems, and slice-ordering rationale. This is the assumption.
+- **Targeted research** — choose this only when the milestone is a well-defined feature in a familiar subsystem AND no novel technology is involved. Explore the relevant areas, check 1-2 libraries, identify constraints. Comparable Systems section is still required.
+- **Light research** — choose this ONLY when the milestone is trivial repetition of a pattern already in the codebase, with no external dependencies and no architectural decisions. State the explicit downscope reason in the Summary section so the planner sees why depth was reduced. A light milestone-research doc can be 30-50 lines.

-An honest "this milestone is straightforward, here's the pattern and slice boundaries" beats a fabricated multi-page exploration for work that doesn't need it.
+The previous calibration advice "an honest 'straightforward' beats a fabricated multi-page exploration" still holds — but the bar for declaring straightforward is high. If in doubt, go deep. Comparable Systems is MANDATORY for deep and targeted; only light research may omit it (and only with an explicit reason).

 ## Steps

@@ -40,7 +40,7 @@ Research the codebase and relevant technologies. Narrate key findings and surpri
 3. Explore relevant code. Use native `lsp` first for symbol lookup, references, and cross-file navigation. For small/familiar codebases, use `rg`, `find`, and targeted reads. For large or unfamiliar codebases, use `scout` to build a broad map efficiently before diving in.
 3a. Use research swarms when the questions fan out cleanly. If the milestone spans 2-3 independent subsystems, dispatch parallel `scout`/`researcher` subagents with separate lenses, then synthesize their findings into one research artifact. Do not swarm one tightly-coupled question; do it inline.
 4. **Documentation lookup — prefer DeepWiki first.** Use `ask_question` / `read_wiki_structure` / `read_wiki_contents` (DeepWiki) as the default for any GitHub-hosted library or framework — AI-indexed, no free-tier cap. Fall back to `resolve_library` → `get_library_docs` (Context7) for npm/pypi/crates packages DeepWiki doesn't have. **Context7 free tier is capped at 1000 requests/month — spend those on cases DeepWiki can't cover.** Skip both for libraries already used in this codebase.
-5. **Web search budget:** You have a limited budget of web searches (max ~15 per session). Use them strategically — try DeepWiki → Context7 → web search in that order. Do NOT repeat the same or similar queries. If a search didn't find what you need, rephrase once or move on. Target 3-5 total web searches for a typical research unit.
+5. **Web search budget:** You have a budget of up to ~25 web searches per session. Use them strategically — try DeepWiki → Context7 → web search in that order. For deep research, target 8-12 web searches (comparable systems, prior art, library tradeoffs, common pitfalls); for targeted research, target 4-6; for light research, 0-2 is fine. Do NOT repeat the same or similar queries. If a search didn't find what you need, rephrase once or move on. Spend the budget on real questions, not safety nets.
 6. Use the **Research** output template from the inlined context above — include only sections that have real content
 7. If `.sf/REQUIREMENTS.md` exists, research against it. Identify which Active requirements are table stakes, likely omissions, overbuilt risks, or domain-standard behaviors the user may or may not want.
 8. Call `save_summary` with `milestone_id: {{milestoneId}}`, `artifact_type: "RESEARCH"`, and the full research markdown as `content` — the tool computes the file path and persists to both DB and disk.
--- a/src/resources/extensions/sf/templates/research.md
+++ b/src/resources/extensions/sf/templates/research.md
@@ -64,6 +64,22 @@

 - {{riskThatCouldSurfaceDuringExecution}}

+## Comparable Systems
+
+<!-- MANDATORY for deep and targeted research. Document 2-3 real systems
+(open-source projects, papers, or published architectures) that solved a
+similar problem. Look them up via DeepWiki/Context7/web search. For each:
+what approach did they take, what tradeoffs did they accept, and what
+should we steal vs. avoid. Light research may skip this section ONLY if
+the milestone is trivial repetition of an existing in-repo pattern. Do
+not fabricate — if you can't find genuine comparables, state that
+explicitly and explain why none apply. -->
+
+| System | Approach | Tradeoffs | What to steal / avoid |
+|--------|----------|-----------|------------------------|
+| {{system1}} | {{approach1}} | {{tradeoffs1}} | {{stealOrAvoid1}} |
+| {{system2}} | {{approach2}} | {{tradeoffs2}} | {{stealOrAvoid2}} |
+
 ## Skills Discovered

 <!-- Include when skill discovery found relevant skills. -->
--- a/src/resources/extensions/sf/unit-context-manifest.js
+++ b/src/resources/extensions/sf/unit-context-manifest.js
@@ -147,17 +147,23 @@ export const UNIT_MANIFESTS = {
 		preferences: "active-only",
 		tools: TOOLS_PLANNING,
 		artifacts: {
-			// Phase 3 migration (#4782): matches today's actual
-			// buildResearchMilestonePrompt inlining order.
+			// #M005-remediation: knowledge resolved as an inline artifact so its
+			// position (between decisions and templates) is preserved by the
+			// composer's declared-order traversal. Graph stays as a computed
+			// artifact (always appended after templates), matching the prior
+			// builder's behavior. Eliminates the manual splice that previously
+			// existed in buildResearchMilestonePrompt.
 			inline: [
 				"milestone-context",
 				"project",
 				"requirements",
 				"decisions",
+				"knowledge",
 				"templates",
 			],
 			excerpt: [],
 			onDemand: [],
+			computed: ["graph"],
 		},
 		maxSystemPromptChars: COMMON_BUDGET_MEDIUM,
 	},
--- a/src/resources/extensions/sf/uok/auto-runaway-guard.js
+++ b/src/resources/extensions/sf/uok/auto-runaway-guard.js
@@ -223,6 +223,47 @@ export function evaluateRunawayGuard(
 	) {
 		return { action: "none" };
 	}
+	// Silent-worker-failure detection (#sf-mp8c0arc-vgw8io):
+	// When the final warning has already been sent and the unit has produced
+	// zero tool calls past the elapsed threshold, the worker is not stuck in a
+	// busy loop: its LLM session never produced an assistant message. Before
+	// the #R015-remediation change to SessionManager._persist, that also meant
+	// the session JSONL was never written. hasMeaningfulGrowth is false because
+	// no tokens are flowing, so the existing pause branch never fires. Escalate
+	// to fail so the dispatcher can retry or transition to blocked instead of
+	// staying in runaway-final-warning-sent indefinitely.
+	if (
+		s.finalWarningSent &&
+		(unitMetrics.toolCalls ?? 0) === 0 &&
+		unitMetrics.elapsedMs > config.elapsedMs
+	) {
+		const reason =
+			`Runaway guard fail ${unitType} ${unitId}: zero tool calls in ` +
+			`${Math.round(unitMetrics.elapsedMs / 1000)}s after final warning — ` +
+			`silent worker failure suspected (LLM session never produced an assistant message).`;
+		return {
+			action: "fail",
+			reason,
+			metadata: {
+				reason,
+				failedAt: now,
+				unitType,
+				unitId,
+				diagnosticTurns: config.diagnosticTurns,
+				warningsSent: s.warningsSent,
+				thresholdReasons: reasons,
+				metrics: unitMetrics,
+				silentFailure: true,
+				thresholds: {
+					toolCallWarning: config.toolCallWarning,
+					tokenWarning: config.tokenWarning,
+					elapsedMs: config.elapsedMs,
+					changedFilesWarning: config.changedFilesWarning,
+					minIntervalMs: config.minIntervalMs,
+				},
+			},
+		};
+	}
 	if (
 		config.hardPause &&
 		s.finalWarningSent &&