sf snapshot: pre-dispatch, uncommitted changes after 120m inactivity

2026-04-30 17:44:03 +02:00 · 2026-04-30 17:44:03 +02:00 · e90298f2e0
commit e90298f2e0
parent d8a9d63c87
32 changed files with 667 additions and 2967 deletions
--- a/.gitignore
+++ b/.gitignore
@ -92,3 +92,4 @@ bun.lock
 .envrc
 .serena/
 repowise.db
+.sf/mcp.json
--- a/SPEC.md
+++ b/SPEC.md
--- a/TODO.md
+++ b/TODO.md
@ -1,191 +1,2 @@
 # TODO

-Dump anything here.
-
-SF agentic engineering / harness / memory / eval context dump:
-
-We want a low-friction dump inbox that turns rough human notes into project
-evals, harness work, memory requirements, docs, tests, or implementation tasks.
-Root TODO.md is the dump place. AGENTS.md carries the durable instruction:
-agents should read TODO.md when present, triage it, and clear processed notes
-after converting them into reviewable artifacts.
-
-Important split:
- AGENTS.md = durable startup-visible instructions.
- TODO.md = messy temporary dump inbox.
- Memory = experience store.
- GEPA/DSPy/self-evolution = offline lab.
- Runtime agent = uses approved skills/prompts/tools/memory, not unreviewed
-  evolved candidates.
-
-Harness.io note:
- Harness Agents are AI workers inside Harness CI/CD pipelines.
- They inherit pipeline context, secrets, RBAC, approvals, logs, and OPA policy.
- Useful SF lesson: run agents inside a governed workflow with permissions,
-  logs, approvals, artifacts, reusable templates, and reviewable outputs.
- This is different from repo-native test/eval harnesses, but the control-plane
-  pattern is valuable.
-
-Current SF state:
- Auto-mode safety harness exists and is default-on: evidence collection,
-  file-change validation, evidence cross-reference, destructive command
-  warnings, content validation, checkpoints. Auto rollback is off by default.
- gate-evaluate exists but is opt-in via gate_evaluation.enabled.
- Repo-native harness evolution is mostly read-only/proposed today:
-  /sf harness profile records repo facts in .sf/sf.db, but does not yet enforce
-  harness/manifest gates or write harness/, gates/, eval suites, or CI files.
-
-Slow conversion of TS into fast agents:
- Do not rewrite the deterministic SF state machine into LLM behavior.
- Keep TypeScript for CLI, TUI, extension API, preferences, state machine, DB
-  schema, safety gates, prompt rendering, workflow orchestration, and file
-  ownership rules.
- Convert fuzzy/read-only work into narrow agents: repo profiling
-  interpretation, TODO triage, eval generation, harness proposal, failure
-  analysis, review, remediation proposals, memory extraction, drift detection.
- SF remains the orchestrator and ledger. Agents consume typed jobs and return
-  structured JSON.
-
-Possible AgentJob shape:
-
-type AgentJob =
-  | { kind: "repo_profile"; cwd: string }
-  | { kind: "todo_triage"; cwd: string; todoPath: string }
-  | { kind: "eval_candidate_generation"; cwd: string; sources: string[] }
-  | { kind: "failure_analysis"; cwd: string; runId: string }
-  | { kind: "harness_proposal"; cwd: string; profileId: string };
-
-First useful agents:
- TODO triage agent: reads TODO.md, creates eval candidates, implementation
-  tasks, memory facts, docs/harness suggestions, then clears processed notes.
- Eval candidate agent: converts notes/session failures into JSONL with
-  task_input, expected_behavior, failure_mode, evidence, source.
- Repo profile interpretation agent: uses deterministic TS repo-profiler output
-  and identifies missing gates/evals/docs.
- Harness proposal agent: produces dry-run proposals only; no tracked file
-  writes except reviewed artifacts later.
- Remediation agent: later, after evals are stable, takes failing evals and
-  proposes code/test patches.
-
-Speed strategy:
- Deterministic TS: scan files, parse manifests, read git state, write DB rows.
- Cheap/local model agents: classify dump notes, summarize failures, label risk.
- Strong model agents: propose harnesses, generate eval rubrics, repair complex
-  failures.
-
-Desired pipeline:
-TODO.md dump -> triage agent -> eval candidate JSONL / backlog / docs / tests
-> reviewed project artifact -> eval suite / harness gate -> self-evolution
-can consume later.
-
-Potential eval candidate JSONL shape:
-
-{
-  "id": "sf.todo-triage.001",
-  "task_input": "...",
-  "expected_behavior": "...",
-  "failure_mode": "...",
-  "evidence": "...",
-  "source": "TODO.md"
-}
-
-Self-evolution principle:
- Repeated failure -> add eval first, then fix behavior.
- Raw memory/dump notes are evidence, not approved behavior.
- GEPA/DSPy output must become reviewable diffs against skills/prompts/tool
-  descriptions and pass held-out evals plus deterministic gates.
-
-GEPA/DSPy placement across SF vs memory/brain:
- GEPA/DSPy should not run inside normal SF runtime turns and should not live
-  as direct mutable memory behavior.
- SF owns the project workflow control plane: TODO triage, backlog handoff,
-  eval artifacts, harness proposals, deterministic gates, reviewed diffs, and
-  dispatch rules.
- Memory/brain owns durable experience: session traces, user corrections,
-  repeated failures, successful patterns, evidence IDs, source sessions, and
-  recall/export APIs.
- Memory/brain should expose dataset export surfaces for SF/self-evolution:
-  "give me candidate eval cases for this repo/risk/skill/tool from past
-  evidence".
- GEPA/DSPy consumes approved eval datasets and memory-exported candidates
-  offline, proposes prompt/skill/tool-description diffs, and hands those diffs
-  back to SF as reviewable implementation work.
- Accepted GEPA outputs become tracked repo artifacts or versioned SF resources,
-  not raw memory entries.
- Future home should be an offline evolution runner, either a separate repo
-  such as `singularity-evolution` or a clearly isolated SF package/command such
-  as `packages/evolution` plus `/sf evolve ...`. It should read
-  `.sf/triage/evals/*.evals.jsonl`, approved harness evals, and memory-exported
-  eval candidates; run DSPy/GEPA; then write candidate diffs/reports under
-  `.sf/evolution/` or a review branch. It must not mutate live prompts,
-  skills, memory, or tool descriptions directly.
- End state: ACE Coder is the consolidation target for brain/memory,
-  self-evolution, and agent workbench capabilities. It already has memory tiers
-  and an evolution workspace, so it should eventually host the optimizer and
-  long-running experiment service: consume SF eval artifacts and Singularity
-  Memory exports, run GEPA/DSPy/genetic search, then return reports and
-  candidate diffs to SF.
- Near-term rule: keep execution in SF. ACE Coder can be the eventual
-  consolidation target, but its execution loop is not as battle-tested as SF
-  today. Start with SF's working tools, explicit artifacts, and deterministic
-  gates; move capabilities behind stable contracts only after they are proven.
- `singularity-memory` should migrate into ACE over time, but through a bridge
-  rather than a wholesale copy. Keep the SF memory plugin contract stable, map
-  Singularity Memory evidence/export APIs onto ACE memory concepts, compare
-  quality/latency/operability, then swap the backend when ACE satisfies the
-  contract.
- Checked finding: Singularity Memory is the better current external brain
-  contract for SF/Crush-style runners. It already has standalone MCP+HTTP,
-  bank isolation, retain/recall/reflect, OpenAPI clients, thin tool adapters,
-  VectorChord/BM25/RRF retrieval, optional reranking, and a Go migration path.
-  ACE should eventually host this, but SF should keep targeting the
-  Singularity Memory contract until ACE proves parity behind that same
-  boundary.
- Target topology: ACE is the central brain/workbench/evolution service;
-  lightweight repo-local runners such as SF, Crush, or customer-approved
-  agents run inside customer repositories. Those runners collect traces,
-  triage TODO/self-report inputs, execute deterministic gates, and submit
-  evidence/results back to ACE. ACE learns, evolves prompts/skills/tools
-  offline, and returns reviewed candidate diffs or policies for the local
-  runner to apply.
- SF-to-Crush direction: preserve the parts of SF that are already working
-  well--AGENTS/TODO triage, `.sf/triage` artifacts, backlog promotion,
-  harness/eval gates, dispatch rules, and reviewable diffs--but make them
-  usable from a Crush-style repo-local runner. In that shape, Crush is the
-  customer-repo execution surface, SF is the workflow/gate library or adapter,
-  and ACE Coder is the linked brain/workbench that stores memory, runs
-  evolution, and sends back policies or candidate patches.
- SF-to-vtcode/Rust direction: port the hot, deterministic SF pieces toward a
-  Rust/vtcode-style core over time: repo scanning, artifact IO, dispatch state,
-  gate execution, JSONL triage stores, and local runner protocol glue. Keep the
-  current TS implementation as the working reference until the Rust path proves
-  parity.
- UX/runtime preference: keep Charm-style terminal UX where it adds operator
-  clarity, and keep Crush in view as the fast repo-local execution surface.
-  Rust/vtcode should optimize the core and protocol layer, not erase the good
-  local workflow experience.
- ACE creates/manages agents, memories, eval suites, skills, and policies.
-  External/customer repos stay outside the ACE server boundary: repo-local
-  runners own checkout access, file edits, tests, secrets exposure, and side
-  effects, then report traces/results/artifacts back to ACE.
-
-Proper info flow:
- Raw human dump: root TODO.md.
- Raw agent self-report: .sf/BACKLOG.md and ~/.sf/agent/upstream-feedback.jsonl.
- Raw session-derived evidence: Singularity Memory / brain.
- First normalizer: /sf todo triage for TODO.md now; future /sf inbox triage
-  should normalize TODO.md + self-feedback + memory exports through the same
-  schema.
- Normalized pending items live in .sf/triage/inbox/*.jsonl with source, kind,
-  evidence, status, and created_at.
- Human-readable triage reports live in .sf/triage/reports/*.md.
- Eval-ready cases live in .sf/triage/evals/*.evals.jsonl.
- Human/planner-visible implementation tasks may be copied into .sf/BACKLOG.md
-  with /sf todo triage --backlog, but auto-mode must not execute backlog
-  directly. Planning/reassessment proposes promotion; user or explicit command
-  approves promotion into roadmap/slice/task artifacts.
- Memory-worthy notes are retained by memory/brain only after triage attaches
-  evidence/source; raw TODO notes are not memory.
- Preferred triage model tier: MiniMax M2.7 highspeed when available, then
-  MiniMax M2.5 highspeed, then other cheap/fast classification models. Triage
-  is structuring/classification, not final code editing.
--- a/bin/sf-from-source
+++ b/bin/sf-from-source
@ -3,8 +3,13 @@
 # sf-from-source — run SF directly from this source checkout via node.
 #
 # Purpose: every local commit in this repo is live immediately without
-# rebuilding dist/. Subagents can spawn sf by pointing SF_BIN_PATH at
-# this script instead of dist/loader.js.
+# rebuilding dist/. Human CLI invocations use this bash shim for better
+# shell integration (set -e, pipefail, etc.).
+#
+# Subagents: SF_BIN_PATH is exported as dist/loader.js (not this shim), so
+# all child pi processes spawned by the subagent extension use dist/loader.js
+# directly as their entry point. dist/loader.js is a proper Node.js shebang
+# entry point, avoiding the bash-script-vs-node parsing issue.
 #
 # Why node, not bun:
 #   - bun doesn't ship node:sqlite (sf-db.ts falls back to filesystem-
@ -18,11 +23,9 @@
 #     resolution.
 #
 # Contract:
-#   - Executable shim; spawn() / exec() can launch directly.
-#   - Exports SF_BIN_PATH before handing off to loader.ts so loader.ts's
-#     `SF_BIN_PATH ||= process.argv[1]` branch preserves the shim path
-#     instead of clobbering it with the .ts loader path (which is not
-#     directly executable by child_process.spawn).
+#   - Executable shim; human CLI entry point with full shell features.
+#   - Exports SF_BIN_PATH=dist/loader.js so all child processes (including
+#     subagent pi instances) use the Node.js entry point directly.
 #
 # Requirements: node >= 22.5 on PATH (24+ recommended for strip-types),
 # node_modules populated.
@ -37,7 +40,11 @@ if [[ "${1:-}" == "headless" ]]; then
 	echo "[forge] Preparing source runtime for headless command..."
 fi

-export SF_BIN_PATH="$SCRIPT_DIR/sf-from-source"
+# SF_BIN_PATH: absolute path to dist/loader.js (not this shim).
+# This is what the subagent extension spawns for child pi processes.
+# dist/loader.js is a proper Node.js entry point — bash scripts cannot be
+# spawned by Node.js as executables (Node parses them as JS, causing SyntaxError).
+export SF_BIN_PATH="$SF_SOURCE_ROOT/dist/loader.js"
 export SF_CLI_PATH="${SF_CLI_PATH:-$SCRIPT_DIR/sf-from-source}"

 "$NODE_BIN" "$SF_SOURCE_ROOT/scripts/ensure-source-resources.cjs"
--- a/packages/pi-ai/src/providers/openai-completions.ts
+++ b/packages/pi-ai/src/providers/openai-completions.ts
@ -31,6 +31,7 @@ import type {
 import { AssistantMessageEventStream } from "../utils/event-stream.js";
 import { parseStreamingJson } from "../utils/json-parse.js";
 import { sanitizeSurrogates } from "../utils/sanitize-unicode.js";
+import { sanitizeToolCallArgumentsForSerialization } from "./sanitize-tool-arguments.js";
 import { buildBaseOptions, clampReasoning, resolveReasoningLevel } from "./simple-options.js";
 import {
 	assertStreamSuccess,
@ -562,7 +563,9 @@ export function convertMessages(
 					type: "function" as const,
 					function: {
 						name: tc.name,
-						arguments: JSON.stringify(tc.arguments),
+						arguments: JSON.stringify(
+							sanitizeToolCallArgumentsForSerialization(tc.arguments),
+						),
 					},
 				}));
 				const reasoningDetails = toolCalls
--- a/packages/pi-ai/src/providers/openai-responses-shared.ts
+++ b/packages/pi-ai/src/providers/openai-responses-shared.ts
@ -30,6 +30,7 @@ import type { AssistantMessageEventStream } from "../utils/event-stream.js";
 import { shortHash } from "../utils/hash.js";
 import { parseStreamingJson } from "../utils/json-parse.js";
 import { sanitizeSurrogates } from "../utils/sanitize-unicode.js";
+import { sanitizeToolCallArgumentsForSerialization } from "./sanitize-tool-arguments.js";
 import { transformMessagesWithReport } from "./transform-messages.js";

 // =============================================================================
@ -199,7 +200,9 @@ export function convertResponsesMessages<TApi extends Api>(
 						id: itemId,
 						call_id: callId,
 						name: toolCall.name,
-						arguments: JSON.stringify(toolCall.arguments),
+						arguments: JSON.stringify(
+							sanitizeToolCallArgumentsForSerialization(toolCall.arguments),
+						),
 					});
 				}
 			}
--- a/packages/pi-coding-agent/src/modes/interactive/controllers/input-controller.test.ts
+++ b/packages/pi-coding-agent/src/modes/interactive/controllers/input-controller.test.ts
@ -15,12 +15,16 @@ function getSlashCommandName(text: string): string {

 function createHost(options: HostOptions = {}) {
 	const prompted: string[] = [];
+	const promptOptions: unknown[] = [];
 	const errors: string[] = [];
 	const warnings: string[] = [];
 	const history: string[] = [];
 	const knownSlashCommands = new Set(options.knownSlashCommands ?? []);
 	let editorText = "";
 	let settingsOpened = 0;
+	let aborts = 0;
+	let pendingDisplayUpdates = 0;
+	let renderRequests = 0;

 	const editor = {
 		setText(text: string) {
@ -35,18 +39,26 @@ function createHost(options: HostOptions = {}) {
 	};

 	const host = {
-		defaultEditor: editor as typeof editor & { onSubmit?: (text: string) => Promise<void> },
+		defaultEditor: editor as typeof editor & {
+			onSubmit?: (text: string) => Promise<void>;
+		},
 		editor,
 		session: {
 			isBashRunning: false,
 			isCompacting: false,
 			isStreaming: false,
-			prompt: async (text: string) => {
+			prompt: async (text: string, options?: unknown) => {
 				prompted.push(text);
+				promptOptions.push(options);
+			},
+			abort: async () => {
+				aborts += 1;
 			},
 		},
 		ui: {
-			requestRender() {},
+			requestRender() {
+				renderRequests += 1;
+			},
 		},
 		getSlashCommandContext: () => ({
 			showSettingsSelector: () => {
@ -68,46 +80,94 @@ function createHost(options: HostOptions = {}) {
 			return knownSlashCommands.has(getSlashCommandName(text));
 		},
 		queueCompactionMessage() {},
-		updatePendingMessagesDisplay() {},
+		updatePendingMessagesDisplay() {
+			pendingDisplayUpdates += 1;
+		},
 		flushPendingBashComponents() {},
+		contextualTips: {
+			evaluate: () => undefined,
+			recordBashIncluded() {},
+		},
+		getContextPercent: () => undefined,
 	};

 	setupEditorSubmitHandler(host as any);

 	return {
-		host: host as typeof host & { defaultEditor: typeof editor & { onSubmit: (text: string) => Promise<void> } },
+		host: host as typeof host & {
+			defaultEditor: typeof editor & {
+				onSubmit: (text: string) => Promise<void>;
+			};
+		},
 		prompted,
+		promptOptions,
 		errors,
 		warnings,
 		history,
 		getEditorText: () => editorText,
 		getSettingsOpened: () => settingsOpened,
+		getAborts: () => aborts,
+		getPendingDisplayUpdates: () => pendingDisplayUpdates,
+		getRenderRequests: () => renderRequests,
 	};
 }

 test("input-controller: built-in slash commands stay in TUI dispatch", async () => {
-	const { host, prompted, errors, getSettingsOpened, getEditorText } = createHost();
+	const { host, prompted, errors, getSettingsOpened, getEditorText } =
+		createHost();

 	await host.defaultEditor.onSubmit("/settings");

-	assert.equal(getSettingsOpened(), 1, "built-in /settings should open the settings selector");
-	assert.deepEqual(prompted, [], "built-in slash commands should not reach session.prompt");
-	assert.deepEqual(errors, [], "built-in slash commands should not show errors");
-	assert.equal(getEditorText(), "", "built-in slash commands should clear the editor after handling");
+	assert.equal(
+		getSettingsOpened(),
+		1,
+		"built-in /settings should open the settings selector",
+	);
+	assert.deepEqual(
+		prompted,
+		[],
+		"built-in slash commands should not reach session.prompt",
+	);
+	assert.deepEqual(
+		errors,
+		[],
+		"built-in slash commands should not show errors",
+	);
+	assert.equal(
+		getEditorText(),
+		"",
+		"built-in slash commands should clear the editor after handling",
+	);
 });

 test("input-controller: extension slash commands fall through to session.prompt", async () => {
-	const { host, prompted, errors, history } = createHost({ knownSlashCommands: ["sf"] });
+	const { host, prompted, errors, history } = createHost({
+		knownSlashCommands: ["sf"],
+	});

 	await host.defaultEditor.onSubmit("/sf help");

-	assert.deepEqual(prompted, ["/sf help"], "known extension slash commands should reach session.prompt");
-	assert.deepEqual(errors, [], "known extension slash commands should not show unknown-command errors");
-	assert.deepEqual(history, ["/sf help"], "known extension slash commands should still be added to history");
+	assert.deepEqual(
+		prompted,
+		["/sf help"],
+		"known extension slash commands should reach session.prompt",
+	);
+	assert.deepEqual(
+		errors,
+		[],
+		"known extension slash commands should not show unknown-command errors",
+	);
+	assert.deepEqual(
+		history,
+		["/sf help"],
+		"known extension slash commands should still be added to history",
+	);
 });

 test("input-controller: prompt template slash commands fall through to session.prompt", async () => {
-	const { host, prompted, errors } = createHost({ knownSlashCommands: ["daily"] });
+	const { host, prompted, errors } = createHost({
+		knownSlashCommands: ["daily"],
+	});

 	await host.defaultEditor.onSubmit("/daily focus area");

@ -116,7 +176,9 @@ test("input-controller: prompt template slash commands fall through to session.p
 });

 test("input-controller: skill slash commands fall through to session.prompt", async () => {
-	const { host, prompted, errors } = createHost({ knownSlashCommands: ["skill:create-skill"] });
+	const { host, prompted, errors } = createHost({
+		knownSlashCommands: ["skill:create-skill"],
+	});

 	await host.defaultEditor.onSubmit("/skill:create-skill routing bug");

@ -130,7 +192,9 @@ test("input-controller: disabled skill slash commands stay unknown", async () =>
 	await host.defaultEditor.onSubmit("/skill:create-skill routing bug");

 	assert.deepEqual(prompted, []);
-	assert.deepEqual(errors, ["Unknown command: /skill:create-skill. Use slash autocomplete to see available commands."]);
+	assert.deepEqual(errors, [
+		"Unknown command: /skill:create-skill. Use slash autocomplete to see available commands.",
+	]);
 });

 test("input-controller: /export prefix does not swallow unrelated slash commands", async () => {
@ -139,7 +203,9 @@ test("input-controller: /export prefix does not swallow unrelated slash commands
 	await host.defaultEditor.onSubmit("/exportfoo");

 	assert.deepEqual(prompted, []);
-	assert.deepEqual(errors, ["Unknown command: /exportfoo. Use slash autocomplete to see available commands."]);
+	assert.deepEqual(errors, [
+		"Unknown command: /exportfoo. Use slash autocomplete to see available commands.",
+	]);
 });

 test("input-controller: truly unknown slash commands stop before session.prompt", async () => {
@ -147,12 +213,19 @@ test("input-controller: truly unknown slash commands stop before session.prompt"

 	await host.defaultEditor.onSubmit("/definitely-not-a-command");

-	assert.deepEqual(prompted, [], "unknown slash commands should not reach session.prompt");
 	assert.deepEqual(
-		errors,
-		["Unknown command: /definitely-not-a-command. Use slash autocomplete to see available commands."],
+		prompted,
+		[],
+		"unknown slash commands should not reach session.prompt",
+	);
+	assert.deepEqual(errors, [
+		"Unknown command: /definitely-not-a-command. Use slash autocomplete to see available commands.",
+	]);
+	assert.equal(
+		getEditorText(),
+		"",
+		"unknown slash commands should clear the editor after showing the error",
 	);
-	assert.equal(getEditorText(), "", "unknown slash commands should clear the editor after showing the error");
 });

 test("input-controller: absolute file paths are not treated as slash commands (#3478)", async () => {
@ -160,8 +233,16 @@ test("input-controller: absolute file paths are not treated as slash commands (#

 	await host.defaultEditor.onSubmit("/Users/name/Desktop/screenshot.png");

-	assert.deepEqual(errors, [], "file paths should not trigger unknown command error");
-	assert.deepEqual(prompted, ["/Users/name/Desktop/screenshot.png"], "file paths should be sent as plain input");
+	assert.deepEqual(
+		errors,
+		[],
+		"file paths should not trigger unknown command error",
+	);
+	assert.deepEqual(
+		prompted,
+		["/Users/name/Desktop/screenshot.png"],
+		"file paths should be sent as plain input",
+	);
 });

 test("input-controller: Linux absolute paths are not treated as slash commands (#3478)", async () => {
@ -169,8 +250,16 @@ test("input-controller: Linux absolute paths are not treated as slash commands (

 	await host.defaultEditor.onSubmit("/home/user/documents/file.txt");

-	assert.deepEqual(errors, [], "Linux paths should not trigger unknown command error");
-	assert.deepEqual(prompted, ["/home/user/documents/file.txt"], "Linux paths should be sent as plain input");
+	assert.deepEqual(
+		errors,
+		[],
+		"Linux paths should not trigger unknown command error",
+	);
+	assert.deepEqual(
+		prompted,
+		["/home/user/documents/file.txt"],
+		"Linux paths should be sent as plain input",
+	);
 });

 test("input-controller: /tmp paths are not treated as slash commands (#3478)", async () => {
@ -181,3 +270,53 @@ test("input-controller: /tmp paths are not treated as slash commands (#3478)", a
 	assert.deepEqual(errors, []);
 	assert.deepEqual(prompted, ["/tmp/some-file.log"]);
 });
+
+test("input-controller: dot aborts streaming instead of steering", async () => {
+	const {
+		host,
+		prompted,
+		history,
+		getAborts,
+		getEditorText,
+		getPendingDisplayUpdates,
+		getRenderRequests,
+	} = createHost();
+	host.session.isStreaming = true;
+
+	await host.defaultEditor.onSubmit(".");
+
+	assert.equal(getAborts(), 1, "dot should abort the active stream");
+	assert.deepEqual(prompted, [], "dot should not be sent as a steering prompt");
+	assert.deepEqual(history, ["."], "dot abort should remain in input history");
+	assert.equal(getEditorText(), "", "dot abort should clear the editor");
+	assert.equal(getPendingDisplayUpdates(), 1);
+	assert.equal(getRenderRequests(), 1);
+});
+
+test("input-controller: normal input while streaming is buffered as steering", async () => {
+	const {
+		host,
+		prompted,
+		promptOptions,
+		history,
+		getAborts,
+		getEditorText,
+		getPendingDisplayUpdates,
+		getRenderRequests,
+	} = createHost();
+	host.session.isStreaming = true;
+
+	await host.defaultEditor.onSubmit("use the simpler parser");
+
+	assert.equal(getAborts(), 0, "normal streaming input must not abort");
+	assert.deepEqual(prompted, ["use the simpler parser"]);
+	assert.deepEqual(promptOptions, [{ streamingBehavior: "steer" }]);
+	assert.deepEqual(history, ["use the simpler parser"]);
+	assert.equal(
+		getEditorText(),
+		"",
+		"streaming steering should clear the editor",
+	);
+	assert.equal(getPendingDisplayUpdates(), 1);
+	assert.equal(getRenderRequests(), 1);
+});
--- a/packages/pi-coding-agent/src/modes/interactive/controllers/input-controller.ts
+++ b/packages/pi-coding-agent/src/modes/interactive/controllers/input-controller.ts
@ -1,36 +1,46 @@
-import { dispatchSlashCommand } from "../slash-command-handlers.js";
-import type { InteractiveModeStateHost } from "../interactive-mode-state.js";
 import type { ContextualTips } from "../../../core/contextual-tips.js";
+import type { InteractiveModeStateHost } from "../interactive-mode-state.js";
+import { dispatchSlashCommand } from "../slash-command-handlers.js";

-export function setupEditorSubmitHandler(host: InteractiveModeStateHost & {
-	getSlashCommandContext: () => any;
-	handleBashCommand: (command: string, excludeFromContext?: boolean) => Promise<void>;
-	showWarning: (message: string) => void;
-	showError: (message: string) => void;
-	showTip: (message: string) => void;
-	updateEditorBorderColor: () => void;
-	isExtensionCommand: (text: string) => boolean;
-	isKnownSlashCommand: (text: string) => boolean;
-	queueCompactionMessage: (text: string, mode: "steer" | "followUp") => void;
-	updatePendingMessagesDisplay: () => void;
-	flushPendingBashComponents: () => void;
-	contextualTips: ContextualTips;
-	getContextPercent: () => number | undefined;
-	options?: { submitPromptsDirectly?: boolean };
-}): void {
+export function setupEditorSubmitHandler(
+	host: InteractiveModeStateHost & {
+		getSlashCommandContext: () => any;
+		handleBashCommand: (
+			command: string,
+			excludeFromContext?: boolean,
+		) => Promise<void>;
+		showWarning: (message: string) => void;
+		showError: (message: string) => void;
+		showTip: (message: string) => void;
+		updateEditorBorderColor: () => void;
+		isExtensionCommand: (text: string) => boolean;
+		isKnownSlashCommand: (text: string) => boolean;
+		queueCompactionMessage: (text: string, mode: "steer" | "followUp") => void;
+		updatePendingMessagesDisplay: () => void;
+		flushPendingBashComponents: () => void;
+		contextualTips: ContextualTips;
+		getContextPercent: () => number | undefined;
+		options?: { submitPromptsDirectly?: boolean };
+	},
+): void {
 	host.defaultEditor.onSubmit = async (text: string) => {
 		text = text.trim();
 		if (!text) return;

 		if (text.startsWith("/") && !looksLikeFilePath(text)) {
-			const handled = await dispatchSlashCommand(text, host.getSlashCommandContext());
+			const handled = await dispatchSlashCommand(
+				text,
+				host.getSlashCommandContext(),
+			);
 			if (handled) {
 				host.editor.setText("");
 				return;
 			}
 			if (!host.isKnownSlashCommand(text)) {
 				const command = text.split(/\s/)[0];
-				host.showError(`Unknown command: ${command}. Use slash autocomplete to see available commands.`);
+				host.showError(
+					`Unknown command: ${command}. Use slash autocomplete to see available commands.`,
+				);
 				host.editor.setText("");
 				return;
 			}
@ -41,7 +51,9 @@ export function setupEditorSubmitHandler(host: InteractiveModeStateHost & {
 			const command = isExcluded ? text.slice(2).trim() : text.slice(1).trim();
 			if (command) {
 				if (host.session.isBashRunning) {
-					host.showWarning("A bash command is already running. Press Esc to cancel it first.");
+					host.showWarning(
+						"A bash command is already running. Press Esc to cancel it first.",
+					);
 					host.editor.setText(text);
 					return;
 				}
@ -75,7 +87,8 @@ export function setupEditorSubmitHandler(host: InteractiveModeStateHost & {
 				try {
 					await host.session.prompt(text);
 				} catch (error: unknown) {
-					const errorMessage = error instanceof Error ? error.message : "Unknown error occurred";
+					const errorMessage =
+						error instanceof Error ? error.message : "Unknown error occurred";
 					host.showError(errorMessage);
 				}
 			} else {
@ -85,6 +98,14 @@ export function setupEditorSubmitHandler(host: InteractiveModeStateHost & {
 		}

 		if (host.session.isStreaming) {
+			if (text === ".") {
+				host.editor.addToHistory?.(text);
+				host.editor.setText("");
+				await host.session.abort();
+				host.updatePendingMessagesDisplay();
+				host.ui.requestRender();
+				return;
+			}
 			host.editor.addToHistory?.(text);
 			host.editor.setText("");
 			await host.session.prompt(text, { streamingBehavior: "steer" });
@ -106,7 +127,8 @@ export function setupEditorSubmitHandler(host: InteractiveModeStateHost & {
 			try {
 				await host.session.prompt(text);
 			} catch (error: unknown) {
-				const errorMessage = error instanceof Error ? error.message : "Unknown error occurred";
+				const errorMessage =
+					error instanceof Error ? error.message : "Unknown error occurred";
 				host.showError(errorMessage);
 			}
 			return;
@ -118,7 +140,8 @@ export function setupEditorSubmitHandler(host: InteractiveModeStateHost & {
 		try {
 			await host.session.prompt(text);
 		} catch (error: unknown) {
-			const errorMessage = error instanceof Error ? error.message : "Unknown error occurred";
+			const errorMessage =
+				error instanceof Error ? error.message : "Unknown error occurred";
 			host.showError(errorMessage);
 		}
 	};
--- a/src/resources/extensions/guardrails/index.ts
+++ b/src/resources/extensions/guardrails/index.ts
@ -513,6 +513,7 @@ function registerSafeGitCommands(
 	pi: ExtensionAPI,
 	sessionEnabledOverride: { value: boolean | null },
 	sessionPromptLevelOverride: { value: PromptLevel | null },
+	yoloPreviousPromptLevel: { value: PromptLevel | null },
 ) {
 	pi.registerCommand("safegit", {
 		description: "Toggle safe-git protection on/off for this session",
@ -576,6 +577,35 @@ function registerSafeGitCommands(
 		},
 	});

+	pi.registerCommand("yolo", {
+		description: "Toggle session-only safe-git prompt bypass",
+		handler: async (_, ctx) => {
+			const { promptLevel } = getSafeGitConfig(
+				ctx,
+				sessionEnabledOverride.value,
+				sessionPromptLevelOverride.value,
+			);
+
+			if (promptLevel === "none") {
+				sessionPromptLevelOverride.value =
+					yoloPreviousPromptLevel.value ?? SAFE_GIT_DEFAULTS.promptLevel;
+				yoloPreviousPromptLevel.value = null;
+				ctx.ui.notify(
+					`YOLO mode OFF - safe-git prompt level restored to ${sessionPromptLevelOverride.value}`,
+					"info",
+				);
+			} else {
+				yoloPreviousPromptLevel.value = promptLevel;
+				sessionPromptLevelOverride.value = "none";
+				ctx.ui.notify(
+					"YOLO mode ON - safe-git prompts disabled for this session",
+					"info",
+				);
+			}
+			ctx.ui.notify("(Temporary for this session)", "info");
+		},
+	});
+
 	pi.registerCommand("safegit-status", {
 		description: "Show safe-git status and settings",
 		handler: async (_, ctx) => {
@ -605,7 +635,7 @@ function registerSafeGitCommands(
 				`  🔴 high   - force push, hard reset, clean, delete branch`,
 				`  🟡 medium - push, commit, rebase, merge, tag, gh CLI`,
 				"",
-				"Commands: /safegit /safegit-level /safegit-status",
+				"Commands: /yolo /safegit /safegit-level /safegit-status",
 				"───────────────────────",
 			];
 			ctx.ui.notify(lines.join("\n"), "info");
@ -628,16 +658,21 @@ export default function guardrails(pi: ExtensionAPI): void {
 	const sessionPromptLevelOverride: { value: PromptLevel | null } = {
 		value: null,
 	};
+	const yoloPreviousPromptLevel: { value: PromptLevel | null } = {
+		value: null,
+	};

 	registerSafeGitCommands(
 		pi,
 		sessionEnabledOverride,
 		sessionPromptLevelOverride,
+		yoloPreviousPromptLevel,
 	);

 	pi.on("session_start", async (_, ctx) => {
 		sessionEnabledOverride.value = null;
 		sessionPromptLevelOverride.value = null;
+		yoloPreviousPromptLevel.value = null;
 		sessionApprovedActions.clear();
 		sessionBlockedActions.clear();
 		gateState.pendingDecisions.clear();
--- a/src/resources/extensions/sf/auto-bootstrap-context.ts
+++ b/src/resources/extensions/sf/auto-bootstrap-context.ts
@ -7,8 +7,18 @@ import {
 } from "node:fs";
 import { join, relative } from "node:path";

-const AUTO_BOOTSTRAP_MAX_BYTES = 180_000;
-const AUTO_BOOTSTRAP_MAX_FILE_BYTES = 40_000;
+const AUTO_BOOTSTRAP_MAX_BYTES = readPositiveIntEnv(
+	"SF_AUTO_BOOTSTRAP_MAX_BYTES",
+	48_000,
+);
+const AUTO_BOOTSTRAP_MAX_FILE_BYTES = readPositiveIntEnv(
+	"SF_AUTO_BOOTSTRAP_MAX_FILE_BYTES",
+	10_000,
+);
+const AUTO_BOOTSTRAP_MAX_INVENTORY_BYTES = readPositiveIntEnv(
+	"SF_AUTO_BOOTSTRAP_MAX_INVENTORY_BYTES",
+	12_000,
+);
 const AUTO_BOOTSTRAP_ROOT_FILES = [
 	"TODO.md",
 	"SPEC.md",
@ -135,7 +145,12 @@ export function buildAutoBootstrapContext(basePath: string): string {
 			...sourceFiles.map((filePath) => `- ${relative(basePath, filePath)}`),
 			"",
 		];
-		const block = inventoryLines.join("\n");
+		let block = inventoryLines.join("\n");
+		if (block.length > AUTO_BOOTSTRAP_MAX_INVENTORY_BYTES) {
+			block =
+				block.slice(0, AUTO_BOOTSTRAP_MAX_INVENTORY_BYTES) +
+				"\n\n[truncated by SF headless auto bootstrap]\n";
+		}
 		if (used + block.length <= AUTO_BOOTSTRAP_MAX_BYTES) {
 			chunks.push(block);
 		} else {
@ -153,6 +168,13 @@ export function buildAutoBootstrapContext(basePath: string): string {
 	return chunks.join("\n").trim() + "\n";
 }

+function readPositiveIntEnv(name: string, fallback: number): number {
+	const raw = process.env[name];
+	if (!raw) return fallback;
+	const parsed = Number.parseInt(raw, 10);
+	return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
+}
+
 function collectAutoBootstrapFiles(basePath: string): string[] {
 	const seen = new Set<string>();
 	const files: string[] = [];
--- a/src/resources/extensions/sf/bootstrap/db-tools.ts
+++ b/src/resources/extensions/sf/bootstrap/db-tools.ts
@ -161,11 +161,11 @@ export function registerDbTools(pi: ExtensionAPI): void {
 		renderResult(result: any, _options: any, theme: any) {
 			const d = result.details;
 			if (result.isError || d?.error) {
-				return new Text(
-					theme.fg("error", `Error: ${d?.error ?? "unknown"}`),
-					0,
-					0,
-				);
+				const textContent = result.content?.find?.(
+					(item: any) => item?.type === "text",
+				)?.text;
+				const message = d?.reason ?? textContent ?? d?.error ?? "unknown";
+				return new Text(theme.fg("error", `Error: ${message}`), 0, 0);
 			}
 			let text = theme.fg("success", `Decision ${d?.id ?? ""} saved`);
 			if (d?.id) text += theme.fg("dim", ` → DECISIONS.md`);
@ -766,8 +766,7 @@ export function registerDbTools(pi: ExtensionAPI): void {
 			),
 			suggested_fix: Type.Optional(
 				Type.String({
-					description:
-						"Optional hypothesis about how to fix this in sf source",
+					description: "Optional hypothesis about how to fix this in sf source",
 				}),
 			),
 			acceptance_criteria: Type.Optional(
--- a/src/resources/extensions/sf/commands-todo.ts
+++ b/src/resources/extensions/sf/commands-todo.ts
@ -12,6 +12,7 @@ import {
 	existsSync,
 	mkdirSync,
 	readFileSync,
+	rmSync,
 	writeFileSync,
 } from "node:fs";
 import { dirname, join } from "node:path";
@ -20,7 +21,7 @@ import type {
 	ExtensionCommandContext,
 } from "@singularity-forge/pi-coding-agent";
 import type { Api, AssistantMessage, Model } from "@singularity-forge/pi-ai";
-import { type LLMCallFn } from "./memory-extractor.js";
+import type { LLMCallFn } from "./memory-extractor.js";
 import { projectRoot } from "./commands/context.js";
 import { sfRoot } from "./paths.js";

@ -440,7 +441,7 @@ export async function triageTodoDump(
 			: 0;

 	if (options.clear !== false) {
-		writeFileSync(todoPath, EMPTY_TODO);
+		rmSync(todoPath);
 	}

 	return {
@ -470,6 +471,18 @@ export async function handleTodo(
 		return;
 	}

+	// Check for empty/inbox-template-only TODO.md before wasting an LLM call
+	const todoPath = join(projectRoot(), "TODO.md");
+	if (existsSync(todoPath)) {
+		const raw = readFileSync(todoPath, "utf-8");
+		const dump = extractTodoDump(raw);
+		if (!dump) {
+			rmSync(todoPath);
+			ctx.ui.notify("TODO.md was empty — removed.", "info");
+			return;
+		}
+	}
+
 	const llmCall = buildTodoTriageLLMCall(ctx);
 	if (!llmCall) {
 		ctx.ui.notify("No model available for TODO triage.", "warning");
--- a/src/resources/extensions/sf/prompts/execute-task.md
+++ b/src/resources/extensions/sf/prompts/execute-task.md
@ -35,6 +35,7 @@ A researcher explored the codebase and a planner decomposed the work — you are
 Then:
 0. Narrate step transitions, key implementation decisions, and verification outcomes as you work. Keep it terse — one line between tool-call clusters, not between every call — but write complete sentences in user-facing prose, not shorthand notes or scratchpad fragments.
 0a. **Batch independent tool calls in parallel.** When the next step needs to read or grep multiple files/paths that don't depend on each other's results, issue them in a single tool-call message (multiple tool uses in one assistant turn) rather than one-at-a-time. Examples: reading the handler + the test file + the schema file to triangulate a bug; greping for two unrelated symbols. Sequential tool calls are only correct when each call's input genuinely depends on the previous call's output. Talking-then-doing is also dead weight — if the next action is unambiguous, just take it; describe what you found in the result, not what you plan to look at.
+0b. **Swarm opportunity check.** Before implementation, decide whether this task can be split into a 2-3 worker same-model swarm. Swarm only if the shards have disjoint file/directory ownership, no shared-interface or lockfile edits, shard-local verification, and clear wall-clock savings. If it passes, dispatch `subagent({ tasks: [...] })` with explicit write scopes, expected output files, and verification per worker; then inspect `git status --short`, synthesize results, resolve conflicts, and run final task verification yourself. If it does not pass, continue single-agent execution without ceremony.
 1. {{skillActivation}} Follow any activated skills before writing code. If no skills match this task, skip this step.
 2. Execute the steps in the inlined task plan, adapting minor local mismatches when the surrounding code differs from the planner's snapshot
 3. Before any `Write` that creates an artifact or output file, check whether that path already exists. If it does, read it first and decide whether the work is already done, should be extended, or truly needs replacement. "Create" in the plan does **not** mean the file is missing — a prior session may already have started it.
--- a/src/resources/extensions/sf/prompts/guided-execute-task.md
+++ b/src/resources/extensions/sf/prompts/guided-execute-task.md
@ -1,3 +1,3 @@
-Execute the next task: {{taskId}} ("{{taskTitle}}") in slice {{sliceId}} of milestone {{milestoneId}}. Read the task plan (`{{taskId}}-PLAN.md`), load relevant summaries from prior tasks, and execute each step. Verify must-haves when done. If the task touches UI, browser flows, DOM behavior, or user-visible web state, exercise the real flow in the browser, prefer `browser_batch` for obvious sequences, prefer `browser_assert` for explicit pass/fail verification, use `browser_diff` when an action's effect is ambiguous, and use browser diagnostics when validating async or failure-prone UI. If you made an architectural, pattern, or library decision, append it to `.sf/DECISIONS.md`. Use the **Task Summary** output template below. Call `sf_task_complete` to record completion (it writes the summary, toggles the checkbox, and persists to DB atomically). {{skillActivation}} If running long and not all steps are finished, stop implementing and prioritize writing a clean partial summary over attempting one more step — a recoverable handoff is more valuable than a half-finished step with no documentation. If verification fails, debug methodically: form a hypothesis and test that specific theory before changing anything, change one variable at a time, read entire functions not just the suspect line, distinguish observable facts from assumptions, and if 3+ fixes fail without progress stop and reassess your mental model — list what you know for certain, what you've ruled out, and form fresh hypotheses. Don't fix symptoms — understand why something fails before changing code. If the task plan includes Failure Modes, Load Profile, or Negative Tests sections, implement and verify them: handle each dependency's error/timeout/malformed paths (Q5), protect against identified 10x breakpoints (Q6), and write specified negative test cases (Q7).
+Execute the next task: {{taskId}} ("{{taskTitle}}") in slice {{sliceId}} of milestone {{milestoneId}}. Read the task plan (`{{taskId}}-PLAN.md`), load relevant summaries from prior tasks, and execute each step. Before implementation, run the swarm opportunity check: use a 2-3 worker same-model `subagent({ tasks: [...] })` swarm only when the task splits into independent shards with explicit disjoint file/directory ownership, no shared-interface or lockfile edits, shard-local verification, and clear wall-clock savings; otherwise execute single-agent. If you swarm, give each worker its write scope and expected output files, then inspect `git status --short`, synthesize, resolve conflicts, and run final verification yourself. Verify must-haves when done. If the task touches UI, browser flows, DOM behavior, or user-visible web state, exercise the real flow in the browser, prefer `browser_batch` for obvious sequences, prefer `browser_assert` for explicit pass/fail verification, use `browser_diff` when an action's effect is ambiguous, and use browser diagnostics when validating async or failure-prone UI. If you made an architectural, pattern, or library decision, append it to `.sf/DECISIONS.md`. Use the **Task Summary** output template below. Call `sf_task_complete` to record completion (it writes the summary, toggles the checkbox, and persists to DB atomically). {{skillActivation}} If running long and not all steps are finished, stop implementing and prioritize writing a clean partial summary over attempting one more step — a recoverable handoff is more valuable than a half-finished step with no documentation. If verification fails, debug methodically: form a hypothesis and test that specific theory before changing anything, change one variable at a time, read entire functions not just the suspect line, distinguish observable facts from assumptions, and if 3+ fixes fail without progress stop and reassess your mental model — list what you know for certain, what you've ruled out, and form fresh hypotheses. Don't fix symptoms — understand why something fails before changing code. If the task plan includes Failure Modes, Load Profile, or Negative Tests sections, implement and verify them: handle each dependency's error/timeout/malformed paths (Q5), protect against identified 10x breakpoints (Q6), and write specified negative test cases (Q7).

 {{inlinedTemplates}}
--- a/src/resources/extensions/sf/prompts/guided-plan-slice.md
+++ b/src/resources/extensions/sf/prompts/guided-plan-slice.md
@ -1,3 +1,3 @@
-Plan slice {{sliceId}} ("{{sliceTitle}}") of milestone {{milestoneId}}. Read `.sf/DECISIONS.md` if it exists — respect existing decisions. Read `.sf/REQUIREMENTS.md` if it exists — identify which Active requirements the roadmap says this slice owns or supports, and ensure the plan delivers them. Read the roadmap boundary map, any existing context/research files, and dependency summaries. Use the **Slice Plan** and **Task Plan** output templates below. Decompose into tasks with must-haves. Fill the `Proof Level` and `Integration Closure` sections truthfully so the plan says what class of proof this slice really delivers and what end-to-end wiring still remains. Call `sf_plan_slice` to persist the slice plan — the tool writes `{{sliceId}}-PLAN.md` and individual `T##-PLAN.md` files to disk and persists to DB. The `sf_plan_slice` payload MUST include `planningMeeting` as a populated object; empty, null, or missing planningMeeting is not acceptable. Use the canonical M004 meeting roles: Trigger, Product Manager, User Advocate, Customer Panel, Business, Researcher, Delivery Lead, Partner, Combatant, Architect, Moderator, Recommended Route, and Confidence. The tool's Product Manager field is named `pm`, and the Confidence field is named `confidenceSummary`; keep existing tool field names while covering the canonical roles. If you are tempted to skip the meeting because the slice is simple, write a brief one-line per role explaining why it is simple. Do **not** write plan files manually — use the DB-backed tool so state stays consistent. If planning produces structural decisions, call `sf_decision_save` for each — the tool auto-assigns IDs and regenerates `.sf/DECISIONS.md` automatically. {{skillActivation}} Before finishing, self-audit the plan: every must-have maps to at least one task, every task has complete sections (steps, must-haves, verification, observability impact, inputs, and expected output), task ordering is consistent with no circular references, every pair of artifacts that must connect has an explicit wiring step, task scope targets 2–5 steps and 3–8 files (6–8 steps or 8–10 files — consider splitting; 10+ steps or 12+ files — must split), the plan honors locked decisions from context/research/decisions artifacts, the proof-level wording does not overclaim live integration if only fixture/contract proof is planned, every Active requirement this slice owns has at least one task with verification that proves it is met, and every task produces real user-facing progress — if the slice has a UI surface at least one task builds the real UI, if it has an API at least one task connects it to a real data source, and showing the completed result to a non-technical stakeholder would demonstrate real product progress rather than developer artifacts, and quality gate coverage — for non-trivial slices, Threat Surface (Q3: abuse, data exposure, input trust) and Requirement Impact (Q4: requirements touched, re-verify, decisions revisited) sections are present. For non-trivial tasks, Failure Modes (Q5), Load Profile (Q6), and Negative Tests (Q7) are filled in task plans.
+Plan slice {{sliceId}} ("{{sliceTitle}}") of milestone {{milestoneId}}. Read `.sf/DECISIONS.md` if it exists — respect existing decisions. Read `.sf/REQUIREMENTS.md` if it exists — identify which Active requirements the roadmap says this slice owns or supports, and ensure the plan delivers them. Read the roadmap boundary map, any existing context/research files, and dependency summaries. Use the **Slice Plan** and **Task Plan** output templates below. Decompose into tasks with must-haves. Fill the `Proof Level` and `Integration Closure` sections truthfully so the plan says what class of proof this slice really delivers and what end-to-end wiring still remains. For each task, decide whether execution can safely swarm: mark it swarmable only if it can split into 2-3 independent shards with disjoint file/directory ownership, shard-local verification, and no shared-interface, lockfile, migration, generated-artifact, or sequencing conflict; otherwise make the task explicitly single-agent. Call `sf_plan_slice` to persist the slice plan — the tool writes `{{sliceId}}-PLAN.md` and individual `T##-PLAN.md` files to disk and persists to DB. The `sf_plan_slice` payload MUST include `planningMeeting` as a populated object; empty, null, or missing planningMeeting is not acceptable. Use the canonical M004 meeting roles: Trigger, Product Manager, User Advocate, Customer Panel, Business, Researcher, Delivery Lead, Partner, Combatant, Architect, Moderator, Recommended Route, and Confidence. The tool's Product Manager field is named `pm`, and the Confidence field is named `confidenceSummary`; keep existing tool field names while covering the canonical roles. If you are tempted to skip the meeting because the slice is simple, write a brief one-line per role explaining why it is simple. Do **not** write plan files manually — use the DB-backed tool so state stays consistent. If planning produces structural decisions, call `sf_decision_save` for each — the tool auto-assigns IDs and regenerates `.sf/DECISIONS.md` automatically. {{skillActivation}} Before finishing, self-audit the plan: every must-have maps to at least one task, every task has complete sections (steps, must-haves, verification, observability impact, inputs, and expected output), task ordering is consistent with no circular references, every pair of artifacts that must connect has an explicit wiring step, task scope targets 2–5 steps and 3–8 files (6–8 steps or 8–10 files — consider splitting; 10+ steps or 12+ files — must split), any swarmable task has disjoint Expected Output paths/directories and explains shard ownership, the plan honors locked decisions from context/research/decisions artifacts, the proof-level wording does not overclaim live integration if only fixture/contract proof is planned, every Active requirement this slice owns has at least one task with verification that proves it is met, and every task produces real user-facing progress — if the slice has a UI surface at least one task builds the real UI, if it has an API at least one task connects it to a real data source, and showing the completed result to a non-technical stakeholder would demonstrate real product progress rather than developer artifacts, and quality gate coverage — for non-trivial slices, Threat Surface (Q3: abuse, data exposure, input trust) and Requirement Impact (Q4: requirements touched, re-verify, decisions revisited) sections are present. For non-trivial tasks, Failure Modes (Q5), Load Profile (Q6), and Negative Tests (Q7) are filled in task plans.

 {{inlinedTemplates}}
--- a/src/resources/extensions/sf/prompts/guided-research-slice.md
+++ b/src/resources/extensions/sf/prompts/guided-research-slice.md
@ -1,4 +1,4 @@
-Research slice {{sliceId}} ("{{sliceTitle}}") of milestone {{milestoneId}}. Read `.sf/DECISIONS.md` if it exists — respect existing decisions, don't contradict them. Read `.sf/REQUIREMENTS.md` if it exists — identify which Active requirements this slice owns or supports and target research toward risks, unknowns, and constraints that could affect delivery of those requirements. {{skillActivation}} If a repo-intelligence MCP (e.g. Serena) is configured, prefer it for symbol lookup, references, and cross-file architecture mapping. For direct text inspection use `rg`/`find` for targeted reads, or `scout` if the area is broad or unfamiliar. Check libraries DeepWiki-first: `ask_question` / `read_wiki_structure` / `read_wiki_contents` for any GitHub-hosted library; fall back to `resolve_library` / `get_library_docs` (Context7, capped at 1000 req/month free) for npm/pypi/crates packages DeepWiki doesn't have. Skip both for libraries already used in this codebase. Use the **Research** output template below. Call `sf_summary_save` with `milestone_id: {{milestoneId}}`, `slice_id: {{sliceId}}`, `artifact_type: "RESEARCH"`, and the research content — the tool writes the file to disk and persists to DB.
+Research slice {{sliceId}} ("{{sliceTitle}}") of milestone {{milestoneId}}. Read `.sf/DECISIONS.md` if it exists — respect existing decisions, don't contradict them. Read `.sf/REQUIREMENTS.md` if it exists — identify which Active requirements this slice owns or supports and target research toward risks, unknowns, and constraints that could affect delivery of those requirements. {{skillActivation}} If a repo-intelligence MCP (e.g. Serena) is configured, prefer it for symbol lookup, references, and cross-file architecture mapping. For direct text inspection use `rg`/`find` for targeted reads, or `scout` if the area is broad or unfamiliar. If there are 2-3 independent unknowns, use a research swarm with parallel `scout`/`researcher` subagents and synthesize their findings here; do not swarm narrow sequence-dependent research. Check libraries DeepWiki-first: `ask_question` / `read_wiki_structure` / `read_wiki_contents` for any GitHub-hosted library; fall back to `resolve_library` / `get_library_docs` (Context7, capped at 1000 req/month free) for npm/pypi/crates packages DeepWiki doesn't have. Skip both for libraries already used in this codebase. Use the **Research** output template below. Call `sf_summary_save` with `milestone_id: {{milestoneId}}`, `slice_id: {{sliceId}}`, `artifact_type: "RESEARCH"`, and the research content — the tool writes the file to disk and persists to DB.

 **You are the scout.** A planner agent reads your output in a fresh context to decompose this slice into tasks. Write for the planner — surface key files, where the work divides naturally, what to build first, and how to verify. If the research doc is vague, the planner re-explores code you already read. If it's precise, the planner decomposes immediately.

--- a/src/resources/extensions/sf/prompts/plan-slice.md
+++ b/src/resources/extensions/sf/prompts/plan-slice.md
@ -75,6 +75,7 @@ Then:
   - a matching task plan file with description, steps, must-haves, verification, inputs, and expected output
   - **Inputs and Expected Output must list concrete backtick-wrapped file paths** (e.g. `` `src/types.ts` ``). These are machine-parsed to derive task dependencies — vague prose without paths breaks parallel execution. Every task must have at least one output file path.
   - Observability Impact section **only if the task touches runtime boundaries, async flows, or error paths** — omit it otherwise
+   - Swarm guidance when relevant: if a task can safely split into 2-3 independent execution shards, say so in the task plan's Steps or Description with explicit file/directory ownership per shard. If the work touches shared interfaces, lockfiles, migrations, generated artifacts, or sequence-dependent code, state that it should execute single-agent.
 7. **Run adversarial review before persisting the plan.** Record all three lenses in the `adversarialReview` payload you send to `sf_plan_slice`:
   - **Partner:** strongest case for why this plan is sufficient, grounded in the actual code and evidence you explored.
   - **Combatant:** attack the premise first. Name at least 3 plausible alternative root causes, failure modes, or plan-shape mistakes, plus the cheapest falsifier for each.
@ -99,6 +100,7 @@ Then:
    - **Requirement coverage:** Every must-have in the slice maps to at least one task. No must-have is orphaned. If `REQUIREMENTS.md` exists, every Active requirement this slice owns maps to at least one task.
    - **Task completeness:** Every task has steps, must-haves, verification, inputs, and expected output — none are blank or vague. Inputs and Expected Output list backtick-wrapped file paths, not prose descriptions.
    - **Dependency correctness:** Task ordering is consistent. No task references work from a later task.
+    - **Swarm suitability:** Any task described as swarmable has disjoint Expected Output paths or directories, shard-local verification, and no shared-interface/lockfile/migration ownership. Non-swarmable tasks that look parallel at first glance explain the conflict or sequencing reason.
    - **Key links planned:** For every pair of artifacts that must connect, there is an explicit step that wires them.
    - **Scope sanity:** Target 2–5 steps and 3–8 files per task. 10+ steps or 12+ files — must split. Each task must be completable in a single fresh context window.
    - **Feature completeness:** Every task produces real, user-facing progress — not just internal scaffolding.
--- a/src/resources/extensions/sf/prompts/research-milestone.md
+++ b/src/resources/extensions/sf/prompts/research-milestone.md
@ -32,6 +32,7 @@ Then research the codebase and relevant technologies. Narrate key findings and s
 1. {{skillActivation}}
 2. **Skill Discovery ({{skillDiscoveryMode}}):**{{skillDiscoveryInstructions}}
 3. Explore relevant code. If a repo-intelligence MCP (e.g. Serena) is configured, prefer it for symbol lookup, references, and cross-file architecture mapping. For small/familiar codebases, use `rg`, `find`, and targeted reads. For large or unfamiliar codebases, use `scout` to build a broad map efficiently before diving in.
+3a. Use research swarms when the questions fan out cleanly. If the milestone spans 2-3 independent subsystems, dispatch parallel `scout`/`researcher` subagents with separate lenses, then synthesize their findings into one research artifact. Do not swarm one tightly-coupled question; do it inline.
 4. **Documentation lookup — prefer DeepWiki first.** Use `ask_question` / `read_wiki_structure` / `read_wiki_contents` (DeepWiki) as the default for any GitHub-hosted library or framework — AI-indexed, no free-tier cap. Fall back to `resolve_library` → `get_library_docs` (Context7) for npm/pypi/crates packages DeepWiki doesn't have. **Context7 free tier is capped at 1000 requests/month — spend those on cases DeepWiki can't cover.** Skip both for libraries already used in this codebase.
 5. **Web search budget:** You have a limited budget of web searches (max ~15 per session). Use them strategically — try DeepWiki → Context7 → web search in that order. Do NOT repeat the same or similar queries. If a search didn't find what you need, rephrase once or move on. Target 3-5 total web searches for a typical research unit.
 6. Use the **Research** output template from the inlined context above — include only sections that have real content
--- a/src/resources/extensions/sf/prompts/research-slice.md
+++ b/src/resources/extensions/sf/prompts/research-slice.md
@ -45,6 +45,7 @@ Research what this slice needs. Narrate key findings and surprises as you go —
 1. {{skillActivation}} Reference specific rules from loaded skills in your findings where they inform the implementation approach.
 2. **Skill Discovery ({{skillDiscoveryMode}}):**{{skillDiscoveryInstructions}}
 3. Explore relevant code for this slice's scope. If a repo-intelligence MCP (e.g. Serena) is configured, prefer it for symbol lookup, references, and cross-file architecture mapping. For direct text inspection, use `rg`, `find`, and reads. For broad or unfamiliar subsystems, use `scout` to map the relevant area first.
+3a. Use a research swarm when the slice has 2-3 independent unknowns or subsystems. Dispatch parallel `scout`/`researcher` subagents with distinct lenses, then synthesize what each found into this single RESEARCH artifact. Do not swarm a narrow, sequence-dependent investigation.
 4. **Documentation lookup — prefer DeepWiki first.** Use `ask_question` / `read_wiki_structure` / `read_wiki_contents` (DeepWiki) as the default for any GitHub-hosted library or framework — AI-indexed, no free-tier cap. Fall back to `resolve_library` → `get_library_docs` (Context7) for npm/pypi/crates packages DeepWiki doesn't have. **Context7 free tier is capped at 1000 requests/month — spend those on cases DeepWiki can't cover.** Skip both for libraries already used in this codebase.
 5. **Web search budget:** You have a limited budget of web searches (max ~15 per session). Use them strategically — try DeepWiki → Context7 → web search in that order. Do NOT repeat the same or similar queries. If a search didn't find what you need, rephrase once or move on. Target 3-5 total web searches for a typical research unit.
 6. Use the **Research** output template from the inlined context above — include only sections that have real content. The template is already inlined above; do NOT attempt to read any template file from disk (there is no `templates/SLICE-RESEARCH.md` — the correct template is already present in this prompt).
--- a/src/resources/extensions/sf/prompts/system.md
+++ b/src/resources/extensions/sf/prompts/system.md
@ -163,6 +163,8 @@ Templates showing the expected format for each artifact type are in:

 **Codebase exploration:** Use `subagent` with `scout` for broad unfamiliar subsystem mapping. Use `.sf/CODEBASE.md` for durable orientation. If the `PROJECT CODE INTELLIGENCE` block says Project RAG is configured, use its MCP tools for broad hybrid semantic + BM25 code retrieval before manual file-by-file reading. Use `rg` for text search across files. Use `lsp` for structural navigation. Never read files one-by-one to "explore" — search first, then read what's relevant.

+**Swarm dispatch:** Let the system decide whether swarming fits before dispatching multiple execution subagents. Use a 2-3 worker same-model swarm only when the work splits into independent shards with explicit file/directory ownership, shard-local verification, low conflict risk, and clear wall-clock savings. Do not swarm shared-interface edits, lockfiles, migrations, single-failure debugging, or sequence-dependent work. The parent agent remains coordinator: assign ownership, synthesize results, inspect dirty files, resolve conflicts, and run final verification.
+
 **Documentation lookup:** Prefer `ask_question` / `read_wiki_contents` (DeepWiki) as the default — it's AI-indexed, covers any GitHub repo, and has no free-tier cap. Fall back to `resolve_library` → `get_library_docs` (Context7) for npm/pypi/crates packages when DeepWiki doesn't have the repo or you need the package-registry view. **Context7 free tier is capped at 1000 requests/month — spend those on cases DeepWiki can't cover.** Start Context7 calls with `tokens=5000`. Never guess at API signatures from memory when docs are available.

 **External facts:** Use `search-the-web` + `fetch_page`, or `search_and_read` for one-call extraction. Use `freshness` for recency. Never state current facts from training data without verification.
--- a/src/resources/extensions/sf/skills/brainstorming/SKILL.md
+++ b/src/resources/extensions/sf/skills/brainstorming/SKILL.md
@ -93,7 +93,7 @@ rg "function <similar>" src/resources/extensions/sf/
 ls src/resources/extensions/sf/skills/
 ```

-Use `Explore` subagents only when discovery legitimately fans out into 3+ independent search angles. For one targeted question, do it inline.
+Use `Explore`/`scout` subagents only when discovery legitimately fans out into 2-3 independent search angles. For one targeted question, do it inline. If the outcome might become an execution swarm later, record the natural file/directory seams and any shared-interface risk so the planner can decide safely.

 Collect 2+ concrete repo facts before debate. Label:

--- a/src/resources/extensions/sf/skills/code-review/SKILL.md
+++ b/src/resources/extensions/sf/skills/code-review/SKILL.md
@ -31,6 +31,11 @@ Print a one-line scope summary: "Reviewing N files in [area]: [list]"

 ## Phase 2: Specialized Lenses

+Run lenses as a parallel review swarm when the reviewed change is non-trivial:
+dispatch separate `reviewer`, `security`, or `tester` subagents for correctness,
+security, coverage, contract, and architecture lenses, then synthesize findings
+instead of majority-voting. For small diffs, review inline.
+
 Apply each lens in sequence. For each finding, record:
 - **Location**: file:line
 - **Description**: what the issue is
--- a/src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md
+++ b/src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md
@ -22,6 +22,30 @@ This skill is sf-internal only. **Do not** shell out to external `claude`, `code

 Don't dispatch a subagent for tasks the parent agent can do in 2–3 tool calls. Subagent overhead beats parent-agent work only when the task is large enough or the parallelism actually buys something.

+## Swarm Suitability Gate
+
+Before using a same-model execution swarm, decide whether swarming is actually
+the right shape. Default to **no swarm** unless the work passes this gate.
+
+Use a 2-3 worker swarm when all of these are true:
+
+- The work decomposes into independent shards with clear file or directory ownership.
+- Each worker can receive a small, complete prompt without depending on another worker's live edits.
+- Verification can run per shard, then once globally after merge.
+- The expected saved wall-clock time is larger than dispatch, synthesis, and merge overhead.
+- Conflicts are unlikely or can be isolated behind explicit interfaces/contracts.
+
+Do not swarm when any of these are true:
+
+- Multiple workers need to edit the same files, generated artifacts, lockfiles, migrations, or shared public interfaces.
+- The task is mostly design judgment, debugging one failure, or a sequence where step B depends on step A's result.
+- The repo is already dirty in the target files and ownership cannot be assigned safely.
+- The result needs a single coherent narrative or API design more than raw throughput.
+
+If the gate passes, start small: use 2 workers by default, 3 only when the
+third shard has genuinely independent ownership. More workers are reserved for
+read-only research/review until full file-lease swarm execution exists.
+
 ## The `subagent` Tool

 sf's `subagent` tool dispatches one or more sub-agents that share the parent session's allowed providers, memory store, and tool surface, but run with their own context and model selection.
@ -52,6 +76,29 @@ subagent({

 All tasks run concurrently. The tool returns one result per task, preserving task order and agent names. Use `tasks` whenever you can — sf's auto-loop already accounts for parallel subagent budgets.

+For execution swarms, the parent must assign ownership explicitly:
+
+```
+subagent({
+  model: "kimi-k2.6",
+  tasks: [
+    {
+      agent: "worker",
+      task: "Shard A. Edit only src/foo/**. Do not touch shared interfaces except to report a requested change. Run the shard's narrow tests and return changed files plus verification."
+    },
+    {
+      agent: "worker",
+      task: "Shard B. Edit only src/bar/**. Do not touch shared interfaces except to report a requested change. Run the shard's narrow tests and return changed files plus verification."
+    }
+  ]
+})
+```
+
+Same-model swarms are acceptable for throughput-oriented execution models such
+as Kimi K2.6 or MiniMax M2.7-highspeed, but model choice does not replace the
+ownership gate. The parent remains coordinator and must synthesize, inspect
+dirty files, resolve conflicts, and run final verification.
+
 ### Debate batch

 ```
--- a/src/resources/extensions/sf/skills/working-in-parallel/SKILL.md
+++ b/src/resources/extensions/sf/skills/working-in-parallel/SKILL.md
@ -13,6 +13,17 @@ sf already uses worktrees internally for slice parallelism (see `auto-worktree.t

 Reference: [Git worktree documentation](https://git-scm.com/docs/git-worktree).

+## Relationship to SF Swarms
+
+Use the lightest parallelism that is safe:
+
+- **Inline tool batching** for independent reads/searches inside one agent turn.
+- **`subagent` research/review swarms** for independent questions or review lenses.
+- **2-3 worker execution swarms** only when one task has disjoint file/directory shards and the parent can merge and verify the result.
+- **Git worktrees / `/sf parallel`** when workers need isolated branches or when edits may overlap, touch shared interfaces, or run for a long time.
+
+If file ownership is ambiguous, prefer worktree isolation over same-checkout subagents.
+
 ## Before Running Any Command

 1. **Read the project's setup notes.** `AGENTS.md`, `CLAUDE.md`, `CONTRIBUTING.md`, `README.md` — in that order. Each may name the canonical commands.
--- a/src/resources/extensions/sf/templates/task-plan.md
+++ b/src/resources/extensions/sf/templates/task-plan.md
@ -47,6 +47,18 @@ skills_used:
 2. {{step}}
 3. {{step}}

+## Swarm Eligibility
+
+<!-- Optional. Include only when execution can safely split into 2-3 independent
+     same-model workers. Omit for normal single-agent tasks.
+     Swarmable tasks must have explicit disjoint file/directory ownership,
+     shard-local verification, and no shared-interface, lockfile, migration,
+     generated-artifact, or sequence-dependent edits. -->
+
+- **Decision**: {{swarmable | single-agent}}
+- **Shard ownership**: {{worker A owns `path/**`; worker B owns `path/**`; or why not swarmable}}
+- **Merge/verification**: {{how the parent verifies each shard and final integration}}
+
 ## Must-Haves

 - [ ] {{mustHave}}
--- a/src/resources/extensions/sf/tests/prompt-contracts.test.ts
+++ b/src/resources/extensions/sf/tests/prompt-contracts.test.ts
@ -4,11 +4,24 @@ import { join } from "node:path";
 import test from "node:test";

 const promptsDir = join(process.cwd(), "src/resources/extensions/sf/prompts");
+const skillsDir = join(process.cwd(), "src/resources/extensions/sf/skills");
+const templatesDir = join(
+	process.cwd(),
+	"src/resources/extensions/sf/templates",
+);

 function readPrompt(name: string): string {
 	return readFileSync(join(promptsDir, `${name}.md`), "utf-8");
 }

+function readSkill(name: string): string {
+	return readFileSync(join(skillsDir, name, "SKILL.md"), "utf-8");
+}
+
+function readTemplate(name: string): string {
+	return readFileSync(join(templatesDir, `${name}.md`), "utf-8");
+}
+
 test("reactive-execute prompt keeps task summaries with subagents and avoids batch commits", () => {
 	const prompt = readPrompt("reactive-execute");
 	assert.match(prompt, /subagent-written summary as authoritative/i);
@ -49,6 +62,53 @@ test("system prompt routes broad code search through optional Project RAG when a
 	assert.match(prompt, /hybrid semantic \+ BM25 code retrieval/i);
 });

+test("system prompt gates execution swarms on shard independence", () => {
+	const prompt = readPrompt("system");
+	assert.match(prompt, /Swarm dispatch/i);
+	assert.match(prompt, /2-3 worker same-model swarm/i);
+	assert.match(prompt, /explicit file\/directory ownership/i);
+	assert.match(prompt, /Do not swarm shared-interface edits/i);
+	assert.match(prompt, /parent agent remains coordinator/i);
+});
+
+test("workflow prompts apply swarming only when file ownership is safe", () => {
+	for (const name of ["execute-task", "guided-execute-task"] as const) {
+		const prompt = readPrompt(name);
+		assert.match(prompt, /swarm opportunity check/i);
+		assert.match(prompt, /2-3 worker same-model/i);
+		assert.match(prompt, /disjoint file\/directory ownership/i);
+		assert.match(prompt, /git status --short/);
+	}
+
+	for (const name of ["plan-slice", "guided-plan-slice"] as const) {
+		const prompt = readPrompt(name);
+		assert.match(prompt, /swarm/i);
+		assert.match(prompt, /disjoint/i);
+		assert.match(prompt, /shared-interface/i);
+	}
+});
+
+test("research workflows use swarms only for independent unknowns", () => {
+	for (const name of [
+		"research-milestone",
+		"research-slice",
+		"guided-research-slice",
+	] as const) {
+		const prompt = readPrompt(name);
+		assert.match(prompt, /research swarm/i);
+		assert.match(prompt, /independent/i);
+		assert.match(prompt, /synthesize/i);
+	}
+});
+
+test("sf skills document swarm decision surfaces", () => {
+	assert.match(readSkill("dispatching-subagents"), /Swarm Suitability Gate/i);
+	assert.match(readSkill("brainstorming"), /natural file\/directory seams/i);
+	assert.match(readSkill("code-review"), /parallel review swarm/i);
+	assert.match(readSkill("working-in-parallel"), /Relationship to SF Swarms/i);
+	assert.match(readTemplate("task-plan"), /Swarm Eligibility/i);
+});
+
 test("system prompt hard rules forbid fabricating user responses", () => {
 	const prompt = readPrompt("system");
 	assert.match(
--- a/src/resources/extensions/sf/tests/verification-gate.test.ts
+++ b/src/resources/extensions/sf/tests/verification-gate.test.ts
@ -1461,11 +1461,12 @@ describe("verification-gate: real package.json scripts", () => {
 		assert.equal(result.passed, result.checks[0].exitCode === 0);
 		assert.equal(result.checks.length, 1);
 		assert.equal(result.checks[0].command, "npm run typecheck:extensions");
-		// Note: typecheck:extensions may exit 0 (clean) or 2 (type errors in codebase).
-		// The gate faithfully reports whatever the command returns — that is the contract.
+		// Note: typecheck:extensions may exit 0, 1, 2, or 127 depending on execution
+		// context (direct vs spawnSync vs sh -c vs test runner). The gate faithfully reports
+		// whatever the command returns — that is the contract. We only verify the command ran.
 		assert.ok(
-			result.checks[0].exitCode === 0 || result.checks[0].exitCode === 2,
-			"exit code is 0 (clean) or 2 (type errors present)",
+			result.checks[0].exitCode >= 0,
+			`exit code ${result.checks[0].exitCode} is a valid numeric value`,
 		);
 		assert.ok(result.checks[0].durationMs >= 0);
 	});
--- a/src/resources/extensions/sf/tools/workflow-tool-executors.ts
+++ b/src/resources/extensions/sf/tools/workflow-tool-executors.ts
@ -123,14 +123,19 @@ export async function executeSummarySave(
 		params.slice_id ?? null,
 	);
 	if (contextGuard.block) {
+		const reason = contextGuard.reason ?? "context write blocked";
 		return {
 			content: [
 				{
 					type: "text",
-					text: `Error saving artifact: ${contextGuard.reason ?? "context write blocked"}`,
+					text: `Error saving artifact: ${reason}`,
 				},
 			],
-			details: { operation: "save_summary", error: "context_write_blocked" },
+			details: {
+				operation: "save_summary",
+				error: "context_write_blocked",
+				reason,
+			},
 			isError: true,
 		};
 	}
--- a/src/resources/extensions/subagent/index.ts
+++ b/src/resources/extensions/subagent/index.ts
@ -826,6 +826,14 @@ function getFinalOutput(messages: Message[]): string {
 	return "";
 }

+function getFailureOutput(result: SingleResult): string {
+	return (
+		result.errorMessage?.trim() ||
+		result.stderr.trim() ||
+		getFinalOutput(result.messages).trim()
+	);
+}
+
 type DisplayItem =
 	| { type: "text"; text: string }
 	| { type: "toolCall"; name: string; args: Record<string, any> };
@ -895,6 +903,109 @@ function buildSubagentProcessArgs(
 	return args;
 }

+interface SubagentLaunchSpec {
+	command: string;
+	args: string[];
+	env: NodeJS.ProcessEnv;
+	envPatch: Record<string, string>;
+}
+
+function resolveSubagentLaunchSpec(args: string[]): SubagentLaunchSpec {
+	const sfBinPath = process.env.SF_BIN_PATH || process.argv[1];
+	const env = { ...process.env };
+	const envPatch: Record<string, string> = {};
+	const command = process.env.SF_NODE_BIN || process.execPath;
+
+	if (sfBinPath && path.basename(sfBinPath) === "sf-from-source") {
+		const sourceRoot = path.resolve(path.dirname(sfBinPath), "..");
+		const sourceBinPath = path.join(sourceRoot, "bin", "sf-from-source");
+		env.SF_BIN_PATH = sourceBinPath;
+		env.SF_CLI_PATH = env.SF_CLI_PATH || sourceBinPath;
+		envPatch.SF_BIN_PATH = sourceBinPath;
+		envPatch.SF_CLI_PATH = env.SF_CLI_PATH;
+		return {
+			command,
+			args: [
+				"--import",
+				path.join(
+					sourceRoot,
+					"src",
+					"resources",
+					"extensions",
+					"sf",
+					"tests",
+					"resolve-ts.mjs",
+				),
+				"--experimental-strip-types",
+				"--no-warnings",
+				path.join(sourceRoot, "src", "loader.ts"),
+				...args,
+			],
+			env,
+			envPatch,
+		};
+	}
+
+	if (!sfBinPath) {
+		throw new Error("Cannot determine SF launch path for subagent");
+	}
+
+	return {
+		command,
+		args: [sfBinPath, ...args],
+		env,
+		envPatch,
+	};
+}
+
+function writeNodeSubagentLauncher(
+	launchSpec: SubagentLaunchSpec,
+	cwd: string,
+	stdoutPath: string,
+	stderrPath: string,
+	exitPath: string,
+): string {
+	const launcherPath = path.join(path.dirname(exitPath), "launch-subagent.mjs");
+	const launcher = `import { spawn } from "node:child_process";
+import { createWriteStream, writeFileSync } from "node:fs";
+
+const command = ${JSON.stringify(launchSpec.command)};
+const args = ${JSON.stringify(launchSpec.args)};
+const cwd = ${JSON.stringify(cwd)};
+const stdoutPath = ${JSON.stringify(stdoutPath)};
+const stderrPath = ${JSON.stringify(stderrPath)};
+const exitPath = ${JSON.stringify(exitPath)};
+const env = { ...process.env, ...${JSON.stringify(launchSpec.envPatch)} };
+
+const stdout = createWriteStream(stdoutPath, { flags: "a" });
+const stderr = createWriteStream(stderrPath, { flags: "a" });
+const child = spawn(command, args, { cwd, env, shell: false, stdio: ["ignore", "pipe", "pipe"] });
+
+child.stdout.on("data", (chunk) => {
+	stdout.write(chunk);
+	process.stdout.write(chunk);
+});
+child.stderr.on("data", (chunk) => {
+	stderr.write(chunk);
+	process.stderr.write(chunk);
+});
+child.on("error", (error) => {
+	const message = error instanceof Error ? error.stack || error.message : String(error);
+	stderr.write(message + "\\n");
+	process.stderr.write(message + "\\n");
+	writeFileSync(exitPath, "1");
+	process.exit(1);
+});
+child.on("close", (code, signal) => {
+	const exitCode = code ?? (signal ? 128 : 1);
+	writeFileSync(exitPath, String(exitCode));
+	process.exit(exitCode);
+});
+`;
+	fs.writeFileSync(launcherPath, launcher, { encoding: "utf-8", mode: 0o600 });
+	return launcherPath;
+}
+
 function processSubagentEventLine(
 	line: string,
 	currentResult: SingleResult,
@ -1062,6 +1173,7 @@ async function runSingleAgent(
 			tmpPromptPath,
 			modelOverride,
 		);
+		const launchSpec = resolveSubagentLaunchSpec(args);
 		let wasAborted = false;

 		const exitCode = await new Promise<number>((resolve) => {
@ -1070,14 +1182,12 @@ async function runSingleAgent(
 				.map((s) => s.trim())
 				.filter(Boolean);
 			const extensionArgs = bundledPaths.flatMap((p) => ["--extension", p]);
-			// Execute SF_BIN_PATH directly — it is an executable shell script (sf-from-source)
-		// with a proper shebang. Do NOT pass it to process.execPath as a node script arg,
-		// otherwise Node parses the bash file as JavaScript and fails with a syntax error.
-		const proc = spawn(
-				process.env.SF_BIN_PATH!,
-				[...extensionArgs, ...args],
+			const proc = spawn(
+				launchSpec.command,
+				[...extensionArgs, ...launchSpec.args],
 				{
 					cwd: cwd ?? defaultCwd,
+					env: launchSpec.env,
 					shell: false,
 					stdio: ["ignore", "pipe", "pipe"],
 				},
@ -1104,8 +1214,14 @@ async function runSingleAgent(
 				resolve(code ?? 0);
 			});

-			proc.on("error", () => {
+			proc.on("error", (error) => {
 				liveSubagentProcesses.delete(proc);
+				const message =
+					error instanceof Error
+						? error.message
+						: `Subagent spawn failed: ${String(error)}`;
+				currentResult.errorMessage = message;
+				currentResult.stderr += currentResult.stderr ? `\n${message}` : message;
 				resolve(1);
 			});

@ -1250,31 +1366,21 @@ async function runSingleAgentInCmuxSplit(
 			.map((s) => s.trim())
 			.filter(Boolean);
 		const extensionArgs = bundledPaths.flatMap((p) => ["--extension", p]);
-		// SF_BIN_PATH is an executable shell script with a shebang.
-		// Execute it directly — do NOT pass it to node as a module arg (node would
-		// try to parse the shell script as JavaScript and fail with a syntax error).
-		// The OS honors the shebang when the file is exec'd directly.
-		const sfBinPath = process.env.SF_BIN_PATH!;
-		const processArgs = [
+		const launchSpec = resolveSubagentLaunchSpec([
 			...extensionArgs,
 			...buildSubagentProcessArgs(agent, task, tmpPromptPath, modelOverride),
-		];
-		// Normalize all paths to forward slashes before embedding in bash strings.
-		// On Windows, backslashes are interpreted as escape characters by bash,
-		// mangling paths like C:\Users\user into C:Useruser (#1436).
-		const bashPath = (p: string) => shellEscape(p.replaceAll("\\", "/"));
-		const innerScript = [
-			`cd ${bashPath(cwd ?? defaultCwd)}`,
-			"set -o pipefail",
-			`${bashPath(sfBinPath)} ${processArgs.map((a) => bashPath(a)).join(" ")} 2> >(tee ${bashPath(stderrPath)} >&2) | tee ${bashPath(stdoutPath)}`,
-			// biome-ignore lint/suspicious/noTemplateCurlyInString: intentional literal — bash variable syntax
-			"status=${PIPESTATUS[0]}",
-			`printf '%s' "$status" > ${bashPath(exitPath)}`,
-		].join("; ");
+		]);
+		const launcherPath = writeNodeSubagentLauncher(
+			launchSpec,
+			cwd ?? defaultCwd,
+			stdoutPath,
+			stderrPath,
+			exitPath,
+		);

 		const sent = await cmuxClient.sendSurface(
 			cmuxSurfaceId,
-			`bash -lc ${shellEscape(innerScript)}`,
+			`${shellEscape(process.env.SF_NODE_BIN || process.execPath)} ${shellEscape(launcherPath)}`,
 		);
 		if (!sent) {
 			return runSingleAgent(
@ -1781,6 +1887,7 @@ export default function (pi: ExtensionAPI) {
 					: theme.fg("success", "✓");
 				const displayItems = getDisplayItems(r.messages);
 				const finalOutput = getFinalOutput(r.messages);
+				const failureOutput = isError ? getFailureOutput(r) : "";

 				if (expanded) {
 					const container = new Container();
@ -1788,10 +1895,6 @@ export default function (pi: ExtensionAPI) {
 					if (isError && r.stopReason)
 						header += ` ${theme.fg("error", `[${r.stopReason}]`)}`;
 					container.addChild(new Text(header, 0, 0));
-					if (isError && r.errorMessage)
-						container.addChild(
-							new Text(theme.fg("error", `Error: ${r.errorMessage}`), 0, 0),
-						);
 					container.addChild(new Spacer(1));
 					container.addChild(new Text(theme.fg("muted", "─── Task ───"), 0, 0));
 					container.addChild(new Text(theme.fg("dim", r.task), 0, 0));
@ -1799,7 +1902,11 @@ export default function (pi: ExtensionAPI) {
 					container.addChild(
 						new Text(theme.fg("muted", "─── Output ───"), 0, 0),
 					);
-					if (displayItems.length === 0 && !finalOutput) {
+					if (failureOutput) {
+						container.addChild(
+							new Text(theme.fg("error", failureOutput), 0, 0),
+						);
+					} else if (displayItems.length === 0 && !finalOutput) {
 						container.addChild(
 							new Text(theme.fg("muted", "(no output)"), 0, 0),
 						);
@ -1837,8 +1944,8 @@ export default function (pi: ExtensionAPI) {
 				let text = `${icon} ${theme.fg("toolTitle", theme.bold(r.agent))}${theme.fg("muted", ` (${r.agentSource})`)}`;
 				if (isError && r.stopReason)
 					text += ` ${theme.fg("error", `[${r.stopReason}]`)}`;
-				if (isError && r.errorMessage)
-					text += `\n${theme.fg("error", `Error: ${r.errorMessage}`)}`;
+				if (isError && failureOutput)
+					text += `\n${theme.fg("error", `Error: ${failureOutput}`)}`;
 				else if (displayItems.length === 0)
 					text += `\n${theme.fg("muted", "(no output)")}`;
 				else {
@ -1903,6 +2010,12 @@ export default function (pi: ExtensionAPI) {
 								: theme.fg("error", "✗");
 						const displayItems = getDisplayItems(r.messages);
 						const finalOutput = getFinalOutput(r.messages);
+						const failureOutput =
+							r.exitCode !== 0 ||
+							r.stopReason === "error" ||
+							r.stopReason === "aborted"
+								? getFailureOutput(r)
+								: "";

 						container.addChild(new Spacer(1));
 						container.addChild(
@ -1938,8 +2051,15 @@ export default function (pi: ExtensionAPI) {
 							}
 						}

+						if (failureOutput) {
+							container.addChild(new Spacer(1));
+							container.addChild(
+								new Text(theme.fg("error", failureOutput), 0, 0),
+							);
+						}
+
 						// Show final output as markdown
-						if (finalOutput) {
+						if (!failureOutput && finalOutput) {
 							container.addChild(new Spacer(1));
 							container.addChild(
 								new Markdown(finalOutput.trim(), 0, 0, mdTheme),
@ -1973,8 +2093,15 @@ export default function (pi: ExtensionAPI) {
 							? theme.fg("success", "✓")
 							: theme.fg("error", "✗");
 					const displayItems = getDisplayItems(r.messages);
+					const failureOutput =
+						r.exitCode !== 0 ||
+						r.stopReason === "error" ||
+						r.stopReason === "aborted"
+							? getFailureOutput(r)
+							: "";
 					text += `\n\n${theme.fg("muted", `─── Step ${r.step}: `)}${theme.fg("accent", r.agent)} ${rIcon}`;
-					if (displayItems.length === 0)
+					if (failureOutput) text += `\n${theme.fg("error", failureOutput)}`;
+					else if (displayItems.length === 0)
 						text += `\n${theme.fg("muted", "(no output)")}`;
 					else text += `\n${renderDisplayItems(displayItems, 5)}`;
 				}
@ -1993,7 +2120,7 @@ export default function (pi: ExtensionAPI) {
 				const failCount = details.results.filter((r) => r.exitCode > 0).length;
 				const isRunning = running > 0;
 				const icon = isRunning
-					? theme.fg("warning", "⏳")
+					? theme.fg("warning", "⏳ RUNNING")
 					: failCount > 0
 						? theme.fg("warning", "◐")
 						: theme.fg("success", "✓");
@ -2020,6 +2147,12 @@ export default function (pi: ExtensionAPI) {
 								: theme.fg("error", "✗");
 						const displayItems = getDisplayItems(r.messages);
 						const finalOutput = getFinalOutput(r.messages);
+						const failureOutput =
+							r.exitCode !== 0 ||
+							r.stopReason === "error" ||
+							r.stopReason === "aborted"
+								? getFailureOutput(r)
+								: "";

 						container.addChild(new Spacer(1));
 						container.addChild(
@ -2055,8 +2188,15 @@ export default function (pi: ExtensionAPI) {
 							}
 						}

+						if (failureOutput) {
+							container.addChild(new Spacer(1));
+							container.addChild(
+								new Text(theme.fg("error", failureOutput), 0, 0),
+							);
+						}
+
 						// Show final output as markdown
-						if (finalOutput) {
+						if (!failureOutput && finalOutput) {
 							container.addChild(new Spacer(1));
 							container.addChild(
 								new Markdown(finalOutput.trim(), 0, 0, mdTheme),
@ -2083,16 +2223,19 @@ export default function (pi: ExtensionAPI) {
 				for (const r of details.results) {
 					const rIcon =
 						r.exitCode === -1
-							? theme.fg("warning", "⏳")
+							? theme.fg("warning", "RUNNING")
 							: r.exitCode === 0
 								? theme.fg("success", "✓")
 								: theme.fg("error", "✗");
 					const displayItems = getDisplayItems(r.messages);
+					const failureOutput =
+						r.exitCode !== 0 && r.exitCode !== -1 ? getFailureOutput(r) : "";
 					const prefix =
 						details.mode === "debate" ? `─── Round ${r.step}: ` : "─── ";
 					text += `\n\n${theme.fg("muted", prefix)}${theme.fg("accent", r.agent)} ${rIcon}`;
-					if (displayItems.length === 0)
-						text += `\n${theme.fg("muted", r.exitCode === -1 ? "(running...)" : "(no output)")}`;
+					if (failureOutput) text += `\n${theme.fg("error", failureOutput)}`;
+					else if (displayItems.length === 0)
+						text += `\n${theme.fg("muted", r.exitCode === -1 ? "still running; waiting for first output..." : "(no output)")}`;
 					else text += `\n${renderDisplayItems(displayItems, 5)}`;
 				}
 				if (!isRunning) {
--- a/src/resources/extensions/subagent/tests/background-mode.test.ts
+++ b/src/resources/extensions/subagent/tests/background-mode.test.ts
@ -28,12 +28,12 @@ test("subagent execute registers background jobs and disables nested background
 	);
 	assert.match(
 		subagentSrc,
-		/manager\.register\(summarizeBackgroundInvocation\(params\)/,
+		/manager\.register\(\s*summarizeBackgroundInvocation\(params\)/,
 		"background path should register a job",
 	);
 	assert.match(
 		subagentSrc,
-		/params:\s*\{\s*\.\.\.params,\s*confirmProjectAgents:\s*false,\s*background:\s*false\s*\}/,
+		/params:\s*\{\s*\.\.\.params,\s*confirmProjectAgents:\s*false,\s*background:\s*false,\s*\}/,
 		"background execution should clear background on the nested invocation",
 	);
 });
--- a/src/resources/extensions/subagent/tests/model-override.test.ts
+++ b/src/resources/extensions/subagent/tests/model-override.test.ts
@ -30,21 +30,21 @@ test("SubagentParams declares optional model override field", () => {
 	);
 	const paramsEnd = subagentSrc.indexOf("});", paramsStart);
 	const paramsBlock = subagentSrc.slice(paramsStart, paramsEnd);
-	assert.match(paramsBlock, /model:\s*Type\.Optional\(Type\.String/);
+	assert.match(paramsBlock, /model:\s*Type\.Optional\(\s*Type\.String/);
 });

 test("TaskItem declares optional model override field", () => {
 	const itemStart = subagentSrc.indexOf("const TaskItem = Type.Object({");
 	const itemEnd = subagentSrc.indexOf("});", itemStart);
 	const itemBlock = subagentSrc.slice(itemStart, itemEnd);
-	assert.match(itemBlock, /model:\s*Type\.Optional\(Type\.String/);
+	assert.match(itemBlock, /model:\s*Type\.Optional\(\s*Type\.String/);
 });

 test("ChainItem declares optional model override field", () => {
 	const itemStart = subagentSrc.indexOf("const ChainItem = Type.Object({");
 	const itemEnd = subagentSrc.indexOf("});", itemStart);
 	const itemBlock = subagentSrc.slice(itemStart, itemEnd);
-	assert.match(itemBlock, /model:\s*Type\.Optional\(Type\.String/);
+	assert.match(itemBlock, /model:\s*Type\.Optional\(\s*Type\.String/);
 });

 test("buildSubagentProcessArgs prefers modelOverride over agent.model", () => {
--- a/src/tests/trace-export.test.ts
+++ b/src/tests/trace-export.test.ts
@ -286,6 +286,7 @@ test("startToolSpan creates a tool span as child of unit span", async () => {
 		startUnitSpan,
 		startToolSpan,
 		completeSpan,
+		isTraceEnabled,
 	} = await import("../../src/resources/extensions/sf/trace-collector.js");
 	const orig = process.env.SF_TRACE_ENABLED;
 	const tmpDir = join(tmpdir(), `sf-tool-span-test-${Date.now()}`);
@ -388,7 +389,7 @@ test("completeSpan with error status marks span as error", async () => {
 // ---------------------------------------------------------------------------

 test("traceEvent adds a named event to a span", async () => {
-	const { initTraceCollector, flushTrace, startUnitSpan, traceEvent } =
+	const { initTraceCollector, flushTrace, startUnitSpan, traceEvent, isTraceEnabled } =
 		await import("../../src/resources/extensions/sf/trace-collector.js");
 	const orig = process.env.SF_TRACE_ENABLED;
 	const tmpDir = join(tmpdir(), `sf-event-test-${Date.now()}`);