Proposal: Unified AgentRuntime Abstraction
Status: Draft
Date: 2026-05-15
Author: Mikael Hugo
Related ADRs: ADR-0000 (purpose-to-software compiler), ADR-0075 (UOK gate architecture), ADR-0079 (solver/executor separation)
1. Problem Statement
1.1 Five Parallel Dispatch Paths
SF currently dispatches agent work through five entirely separate code paths. Each path independently manages cancellation, timeouts, event collection, error representation, and (where relevant) swarm routing. They share no interface, no contract, and no test fixture.
| # | Name | File | Line | Mechanism |
|---|---|---|---|---|
| 1 | defaultAgentRunner | src/headless-triage.ts | ~363 | runSubagent directly: composes the system prompt via subagent/prompt-parts, then fires and awaits one runSubagent call per triage agent |
| 2 | runHeadlessPrompt / runAgentTurn | src/resources/extensions/sf/uok/agent-runner.js | ~73 / ~129 | runSubagent through a two-layer wrapper: runAgentTurn reads the PersistentAgent inbox, builds an assembled prompt, calls runHeadlessPrompt, then writes a reply back to the MessageBus |
| 3 | runSingleAgent / runSingleAgentViaSwarm | src/resources/extensions/sf/subagent/index.js | ~1266 / ~1140 | runSubagent directly (default path) or swarmDispatchAndWait (flag SF_SUBAGENT_VIA_SWARM=1); used by /delegate, /rubber-duck, and all interactive subagent commands |
| 4 | runUnitViaSwarm | src/resources/extensions/sf/auto/run-unit.js | ~210 | swarmDispatchAndWait, activated by SF_AUTONOMOUS_VIA_SWARM=1; packages the unit prompt as a DispatchEnvelope and blocks until the swarm worker replies |
| 5 | startSliceParallel (spawn) | src/resources/extensions/sf/slice-parallel-orchestrator.js | ~62 | child_process.spawn: spawns sf --mode json --print "/autonomous" per slice, each in its own git worktree with SF_SLICE_LOCK / SF_MILESTONE_LOCK set |
Path 5 is architecturally distinct from 1–4: it requires a separate process and a separate working directory per worker. This distinction is preserved throughout the proposal.
1.2 Concrete Pain Experienced
Spawn-based dispatch hung intermittently (8 rounds of convergence)
The original runSingleAgent and defaultAgentRunner used child_process.spawn to call sf -p. Every session startup paid full init cost. Spawned processes lost stderr on Windows, produced unpredictable exit timing, and required polling to detect completion. The conversion to in-process runSubagent (paths 1–3) required eight convergence iterations because there was no shared infrastructure: each site had to independently grow timeout logic, cancellation, partial-output collection, and event forwarding.
If there had been a single AgentRuntime interface from the start, those eight iterations would have been one.
Inbox-state cache bug
dispatchAndWait in swarm-dispatch.js delivers a message via SwarmDispatchLayer._busDispatch then immediately calls runAgentTurn. But runAgentTurn reads the agent's inbox from an in-memory cache with a 30-second refresh window. Messages delivered by a different MessageBus instance — which happens whenever dispatchAndWait is called more than once in quick succession — were silently invisible to the agent because the in-memory view hadn't refreshed.
The workaround is the onlyMessageId path (see agent-runner.js:144): when dispatchAndWait supplies onlyMessageId, runAgentTurn forces an inbox refresh before processing. This is a surgical patch that lives in path 2. Path 3 (runSingleAgentViaSwarm) calls swarmDispatchAndWait but does not supply onlyMessageId — it is susceptible to the same bug under concurrent dispatch. Path 4 (runUnitViaSwarm) also calls swarmDispatchAndWait and does not supply onlyMessageId directly; it relies on the onlyMessageId flow inside dispatchAndWait's internal call to runAgentTurn.
A runtime layer that owns the inbox-drain lifecycle would fix this once.
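The failure mode and the onlyMessageId workaround can be pictured with a small sketch. Everything named here (InboxCache, InboxMessage, the load callback) is hypothetical and is not the real PersistentAgent API; only the 30-second refresh window and the idea of forcing a refresh for a known message id come from the description above.

```typescript
// Hypothetical sketch; InboxCache and InboxMessage are illustration names only.
interface InboxMessage {
  id: string;
  body: string;
}

class InboxCache {
  private cached: InboxMessage[] = [];
  private loadedAt = Number.NEGATIVE_INFINITY;
  private readonly load: () => InboxMessage[];
  private readonly now: () => number;
  private readonly ttlMs: number;

  constructor(
    load: () => InboxMessage[], // reads the backing store
    now: () => number = () => Date.now(),
    ttlMs = 30_000,
  ) {
    this.load = load;
    this.now = now;
    this.ttlMs = ttlMs;
  }

  /**
   * Return inbox messages. When the caller names the message it expects
   * (onlyMessageId), the TTL is bypassed so a message written through another
   * bus instance is not hidden by a stale snapshot.
   */
  read(opts: { onlyMessageId?: string } = {}): InboxMessage[] {
    const stale = this.now() - this.loadedAt >= this.ttlMs;
    if (stale || opts.onlyMessageId !== undefined) {
      this.cached = this.load();
      this.loadedAt = this.now();
    }
    return opts.onlyMessageId === undefined
      ? this.cached
      : this.cached.filter((m) => m.id === opts.onlyMessageId);
  }
}

// Usage: a second writer adds a message inside the refresh window.
const store: InboxMessage[] = [{ id: "m1", body: "first" }];
let clock = 0;
const inbox = new InboxCache(() => [...store], () => clock);

inbox.read(); // first read loads the snapshot
store.push({ id: "m2", body: "second" }); // delivered by another bus instance
clock = 1_000; // still inside the 30 s window
const staleView = inbox.read(); // snapshot is reused, m2 is missing
const forcedView = inbox.read({ onlyMessageId: "m2" }); // refresh is forced
```

If the runtime owned this read path, every swarm-backed dispatch would take the forced-refresh branch, rather than only the callers that remember to pass the id.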
Checkpoint synthesis
When runUnitViaSwarm (path 4) returns, it must synthesize a checkpoint from the agent's reply, because the worker agent ran in an isolated runSubagent session whose tool calls were never written to the parent session's message ledger. The synthesis logic (run-unit.js:459–498) is ~40 lines of heuristics that exist only in path 4. Path 2's runAgentTurn has analogous reply-reconstruction logic. Path 1 has neither. There is no shared "extract structured output from a subagent result" utility.
Parent-ledger isolation
Paths 1–4 all run agents in isolated in-memory sessions (via SessionManager.inMemory()). None of their tool calls appear in the parent session's message history. This is correct, but it means every caller independently re-implements the message extraction pattern (extractFinalOutput in subagent-runner.ts, getFinalOutput in subagent/index.js, reply field on the DispatchResult in path 4). Inconsistencies in these extractors have caused output to be silently dropped when agents produced multi-block assistant messages.
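A shared extractor could look roughly like the sketch below. The message and block shapes are assumptions made for illustration (the real session types are not shown in this document), and extractFinalText is a hypothetical name, not one of the three existing extractors. The point it illustrates is joining every text block of the final assistant message instead of reading only the first.

```typescript
// Hypothetical message shapes; the real session types may differ.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string };

interface Message {
  role: "user" | "assistant";
  content: ContentBlock[];
}

/**
 * Return the text of the last assistant message that contains any text,
 * joining all of its text blocks in order.
 */
function extractFinalText(messages: Message[]): string {
  for (let i = messages.length - 1; i >= 0; i--) {
    const msg = messages[i];
    if (msg === undefined || msg.role !== "assistant") continue;
    const parts: string[] = [];
    for (const block of msg.content) {
      if (block.type === "text") parts.push(block.text);
    }
    if (parts.length > 0) return parts.join("\n");
  }
  return "";
}

const transcript: Message[] = [
  { role: "user", content: [{ type: "text", text: "review this" }] },
  {
    role: "assistant",
    content: [
      { type: "text", text: "Finding A." },
      { type: "tool_use", name: "read_file" },
      { type: "text", text: "Finding B." },
    ],
  },
];

const finalText = extractFinalText(transcript);
```

An extractor that returned only the first text block would report "Finding A." here and silently drop the second finding.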
Swarm singleton per process
SwarmDispatchLayer.getOrCreate maintains a module-level cache keyed by ${basePath}:${swarmName}. Paths 3 and 4 both call swarmDispatchAndWait, which calls SwarmDispatchLayer.getOrCreate. There is no mechanism to invalidate this cache, coordinate swarm topology reloads, or observe swarm state from outside paths 3 and 4. Diagnostic tooling has to reach into the internal cache directly.
1.3 The Cost: One Bug Class, Five Surfaces
Every systemic issue in agent dispatch — timeout inconsistency, cancellation leakage, event forwarding gaps, output extraction bugs, inbox refresh misses, checkpoint synthesis failure — surfaces independently in each of the five paths. Fixing path 1 does not fix path 3. The repair loop and the gate runner have no visibility into which path triggered a failure.
2. Proposed Shape
2.1 Three Orthogonal Axes
Agent dispatch has three independently varying dimensions that are currently conflated with the dispatch mechanism itself:
persistence: "ephemeral" | "session" | "persistent"
isolation: "in-process" | "worker-thread" | "subprocess"
routing: "direct" | "broadcast" | "capability-match"
persistence describes the agent's state lifetime:
- "ephemeral": a fresh in-memory session per dispatch; no state carries over. This is the current runSubagent model.
- "session": the agent session persists across multiple prompts within one process lifetime (e.g., the interactive session's parent agent). Not managed by this runtime directly, but callers can hold a session reference.
- "persistent": the agent's identity, inbox, memory, and context blocks survive process restarts, stored in SQLite via PersistentAgent. This is the current swarm model.
isolation describes the execution boundary:
- "in-process": the agent runs on the same Node.js event loop as its caller, sharing the process's module cache and memory heap. Correct for the vast majority of cases.
- "worker-thread": the agent runs in a Node.js Worker thread. Shares the process's lifespan but gets its own V8 heap, so a memory-intensive agent or a crash does not corrupt the parent. Slower to start than in-process; faster than subprocess.
- "subprocess": the agent runs as a completely separate OS process. Required when the agent needs a different working directory (slice-parallel), a different environment, or complete crash isolation with no shared state.
routing describes how a target agent is selected:
- "direct": caller provides an explicit AgentSpec (name, system prompt, tools, model). No topology lookup.
- "broadcast": envelope is delivered to all registered agents matching a tag predicate.
- "capability-match": envelope is delivered to the agent whose registered capabilities best match the envelope's workMode and unitType. This is the current AgentSwarm.route() behavior.
2.2 AgentRuntime TypeScript Interface
// packages/coding-agent/src/core/agent-runtime.ts
import type { AgentSessionEvent } from "./agent-session.js";

// ── Agent specification ──────────────────────────────────────────────────────
/** Explicit agent description for "direct" routing. */
export interface AgentSpec {
  name: string;
  systemPrompt: string;
  model?: string;
  tools?: string[];
  cwd?: string;
}

/** Envelope for "capability-match" or "broadcast" routing. */
export interface DispatchEnvelope {
  unitId: string;
  unitType: string;
  workMode: string;
  payload: string;
  priority?: number;
  scope?: string;
  /** Override the matched agent's system prompt. */
  executorSystemPrompt?: string;
  /** Override the matched agent's tool filter. */
  executorTools?: string[];
}

// ── Dispatch options ─────────────────────────────────────────────────────────
export type PersistenceMode = "ephemeral" | "session" | "persistent";
export type IsolationMode = "in-process" | "worker-thread" | "subprocess";
export type RoutingMode = "direct" | "broadcast" | "capability-match";

export interface DispatchOptions {
  persistence?: PersistenceMode;
  isolation?: IsolationMode;
  routing?: RoutingMode;
  timeoutMs?: number;
  signal?: AbortSignal;
  onEvent?: (event: AgentSessionEvent) => void;
}

// ── Results ──────────────────────────────────────────────────────────────────
export interface AgentDispatchResult {
  ok: boolean;
  output: string;
  stderr?: string;
  exitCode: number;
  /** For persistent agents: the message id the reply was written under. */
  replyMessageId?: string;
  /** For swarm routing: which agent handled the request. */
  targetAgent?: string;
}

// ── Registration ─────────────────────────────────────────────────────────────
/** Persistent-agent topology entry, used for capability-match routing. */
export interface RegisteredAgent {
  name: string;
  role: string;
  tags: string[];
  capabilities: string[];
  basePath: string;
}

// ── Runtime interface ────────────────────────────────────────────────────────
export interface AgentRuntime {
  /**
   * Dispatch a task to one agent and await its result.
   *
   * When routing is "direct", `target` must be an AgentSpec.
   * When routing is "capability-match", `target` must be a DispatchEnvelope.
   * When routing is "broadcast", `target` must be a DispatchEnvelope; returns
   * the first non-error result and cancels the rest.
   */
  dispatch(
    target: AgentSpec | DispatchEnvelope,
    message: string,
    options?: DispatchOptions,
  ): Promise<AgentDispatchResult>;

  /**
   * Dispatch multiple tasks concurrently and await all results.
   * The same options apply to all tasks in the batch.
   */
  dispatchBatch(
    tasks: Array<{ target: AgentSpec | DispatchEnvelope; message: string }>,
    options?: DispatchOptions,
  ): Promise<AgentDispatchResult[]>;

  /**
   * Subscribe to all events emitted by agents dispatched through this runtime.
   * Returns an unsubscribe function.
   */
  subscribe(callback: (event: AgentSessionEvent) => void): () => void;

  /**
   * Register an agent in the persistent topology so capability-match routing
   * can discover it.
   */
  registerAgent(agent: RegisteredAgent): void;

  /**
   * Return all registered agents. Used by diagnostic tooling and /sf swarm status.
   */
  getRegisteredAgents(): RegisteredAgent[];
}
2.3 Default Option Values
When options are omitted, the runtime applies the following defaults:
| Field | Default | Rationale |
|---|---|---|
| persistence | "ephemeral" | Matches current runSubagent behavior; safe default for all existing callers |
| isolation | "in-process" | Fastest, fewest moving parts; subprocess spawning only when explicitly requested |
| routing | "direct" | Existing callers always have an explicit agent spec; capability-match is opt-in |
| timeoutMs | 480_000 (8 min) | Matches current dispatchAndWait default |
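As a minimal sketch of how these defaults might be applied, the helper below fills in any omitted field. resolveOptions, ResolvedOptions, and DEFAULT_OPTIONS are hypothetical names; the type aliases are repeated from Section 2.2 so the block stands alone, and signal and onEvent are left out because they have no default.

```typescript
type PersistenceMode = "ephemeral" | "session" | "persistent";
type IsolationMode = "in-process" | "worker-thread" | "subprocess";
type RoutingMode = "direct" | "broadcast" | "capability-match";

interface DispatchOptions {
  persistence?: PersistenceMode;
  isolation?: IsolationMode;
  routing?: RoutingMode;
  timeoutMs?: number;
}

type ResolvedOptions = Required<DispatchOptions>;

// Values taken from the defaults table; the constant name is an assumption.
const DEFAULT_OPTIONS: ResolvedOptions = {
  persistence: "ephemeral",
  isolation: "in-process",
  routing: "direct",
  timeoutMs: 480_000, // 8 minutes
};

/** Fill omitted fields. An explicit undefined is treated the same as omitted. */
function resolveOptions(options: DispatchOptions = {}): ResolvedOptions {
  return {
    persistence: options.persistence ?? DEFAULT_OPTIONS.persistence,
    isolation: options.isolation ?? DEFAULT_OPTIONS.isolation,
    routing: options.routing ?? DEFAULT_OPTIONS.routing,
    timeoutMs: options.timeoutMs ?? DEFAULT_OPTIONS.timeoutMs,
  };
}

// A swarm caller overrides two axes and inherits the rest.
const swarmOptions = resolveOptions({
  persistence: "persistent",
  routing: "capability-match",
});
```

Using `??` rather than object spread keeps a caller that passes `{ timeoutMs: undefined }` on the 8-minute default instead of ending up with no timeout.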
2.4 Existing Dispatch Sites Mapped to Runtime Options
| Site | persistence | isolation | routing | Notes |
|---|---|---|---|---|
| defaultAgentRunner (headless-triage.ts) | "ephemeral" | "in-process" | "direct" | Triage agents are stateless; direct spec with prompt-parts composition |
| runHeadlessPrompt / runAgentTurn (agent-runner.js) | "persistent" | "in-process" | "capability-match" | PersistentAgent inbox drain; runtime owns the bus write-back |
| runSingleAgent (subagent/index.js) | "ephemeral" | "in-process" | "direct" | Interactive subagents; no state retention across calls |
| runUnitViaSwarm (run-unit.js) | "persistent" | "in-process" | "capability-match" | Autonomous unit dispatch; runtime owns checkpoint synthesis |
| startSliceParallel (slice-parallel-orchestrator.js) | "ephemeral" | "subprocess" | "direct" | Each worker needs its own cwd and git worktree; subprocess is load-bearing |
2.5 What Stays the Same vs. What Changes
Stays the same:
- runSubagent in packages/coding-agent/src/core/subagent-runner.ts: the runtime uses it as the in-process execution engine. Its interface and behavior are unchanged.
- swarmDispatchAndWait / SwarmDispatchLayer: the runtime uses these as the persistent+capability-match backend. They remain as implementation details, not public API for dispatch callers.
- AgentSwarm, PersistentAgent, MessageBus: unchanged. The runtime coordinates through them, not around them.
- The DispatchEnvelope shape: the runtime accepts it as-is for capability-match routing.
- Checkpoint tool protocol: unchanged. The runtime collects events and synthesizes checkpoints exactly as runUnitViaSwarm does today, but in a shared utility.
Changes for callers:
- Callers import createAgentRuntime() instead of importing runSubagent or swarmDispatchAndWait directly.
- Error representation becomes uniform: AgentDispatchResult.ok is always a boolean; stderr carries the reason; exitCode follows the existing convention (0/1/124).
- Event subscription uses runtime.subscribe() rather than threading an onEvent callback through every call.
- The swarm singleton lifecycle is owned by the runtime instance, not by the SwarmDispatchLayer module-level cache.
3. Isolation Tiers
3.1 In-Process ("in-process")
The default tier. The agent session is created via createAgentSession with SessionManager.inMemory(), same as runSubagent today. Startup cost is ~50 ms (resource loader + session init). Memory is shared with the parent but the session state is isolated to the subagent.
Right for: triage agents, interactive subagents, swarm workers executing unit tasks, solver pass.
Wrong for: agents that crash unpredictably (they take down the parent process), agents that need a different cwd per call (process.cwd is global), agents that must run concurrently with heavy memory footprint.
3.2 Worker Thread ("worker-thread")
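Node.js worker_threads.Worker gives each agent its own V8 heap and its own module scope while staying inside the parent process, so there is no OS process fork to pay for. Communication is via MessageChannel. Worker threads share the parent's process.cwd(), and process.chdir() is not available inside a worker, so a per-agent cwd can only be passed as data (for example through workerData) and applied explicitly by the code running in the worker. An agent that needs the real process working directory to differ still requires the subprocess tier.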
Right for:
- Agents that might OOM (large context windows, image processing).
- Agents loaded from untrusted extension paths where a crash must not kill the parent.
- Concurrent agent execution where memory isolation matters but subprocess overhead is unacceptable.
- Proving the isolation boundary in tests: a worker-thread agent can be killed cleanly without tearing down the test process.
Wrong for: agents that must share module-level singletons with the parent (e.g., the interactive session's AgentSwarm singleton). Worker threads cannot share JS objects across the MessageChannel boundary — only structured-clone-serializable data.
Implementation sketch:
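// In agent-runtime.ts, isolation: "worker-thread" branch:
const { Worker, MessageChannel } = await import("node:worker_threads");
const { port1, port2 } = new MessageChannel();
// onEvent is a function and cannot be structured-cloned (nor can a plain
// AbortSignal), so both stay on this side; only cloneable options are sent.
const { onEvent, signal, ...cloneableOptions } = options ?? {};
const worker = new Worker(
  new URL("./agent-runtime-worker.js", import.meta.url),
  {
    workerData: { spec, message, options: cloneableOptions, port: port2 },
    transferList: [port2],
  },
);
// agent-runtime-worker.js calls runSubagent and posts the result on port2;
// the runtime receives it on port1.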
The runtime thread drives the worker via port1 events; the worker runs runSubagent in its own thread.
3.3 Subprocess ("subprocess")
This tier uses child_process.spawn to run sf --mode json --print "/autonomous" (or a future sf headless agent-turn) with the caller-supplied cwd, environment, and lock variables. This is the current slice-parallel-orchestrator.js approach, which is correct and must be preserved.
Right for: slice-parallel execution where each worker needs process.cwd() anchored to a different git worktree. The working directory is not just a parameter — it determines which git index, which .sf/sf.db, and which lock context the agent operates in. In-process and worker-thread isolation cannot replicate this.
Wrong for: anything that doesn't need a distinct filesystem root. Subprocess startup takes 2–5 seconds (full Node.js + SF init), not 50 ms.
Note on the nesting guard (SF_PARALLEL_WORKER): The slice orchestrator already guards against nested subprocess workers via process.env.SF_PARALLEL_WORKER. The "subprocess" isolation tier will preserve this guard.
3.4 Decision Tree
Does the agent need a different git worktree / cwd?
    YES → "subprocess"
Could a crash or OOM kill the parent?
    YES (untrusted extension, large context) → "worker-thread"
Default → "in-process"
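The same decision order can be written as a small function. This is only a sketch: IsolationNeeds and chooseIsolation are hypothetical names, and the two booleans stand in for judgments the caller would have to make itself.

```typescript
type IsolationMode = "in-process" | "worker-thread" | "subprocess";

// Hypothetical input shape; the proposal does not define such a type.
interface IsolationNeeds {
  /** The agent must run with its own git worktree / process cwd. */
  needsOwnWorktree: boolean;
  /** A crash or OOM in the agent could otherwise kill the parent. */
  crashCouldKillParent: boolean;
}

/** Checks are ordered from strongest requirement to weakest, as in the tree. */
function chooseIsolation(needs: IsolationNeeds): IsolationMode {
  if (needs.needsOwnWorktree) return "subprocess";
  if (needs.crashCouldKillParent) return "worker-thread";
  return "in-process";
}

const sliceWorker = chooseIsolation({ needsOwnWorktree: true, crashCouldKillParent: true });
const untrustedExtension = chooseIsolation({ needsOwnWorktree: false, crashCouldKillParent: true });
const triageAgent = chooseIsolation({ needsOwnWorktree: false, crashCouldKillParent: false });
```

The worktree check comes first because it is a hard requirement: a slice worker that is also crash-prone still needs a subprocess, not a worker thread.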
4. Migration Plan
Each step is self-contained: at the end of the step, the named dispatch site works correctly through the runtime and the old code path is removable. Steps are ordered by blast radius (smallest first).
Step 1: Define AgentRuntime in packages/coding-agent/src/core/agent-runtime.ts
What: Create the interface (Section 2.2), create a DefaultAgentRuntime class that wraps runSubagent (direct, in-process, ephemeral path) and SwarmDispatchLayer.dispatchAndWait (capability-match, in-process, persistent path). No existing call sites are changed.
Effort: ~1 day. The two existing engines are already well-factored; this is wiring.
Blast radius: Zero. The file is new; no imports change.
What proves it works:
- Unit test: runtime.dispatch(spec, "hello") → result.ok === true, result.output is non-empty.
- Unit test: runtime.dispatch(envelope, "hello", { routing: "capability-match" }) → result.targetAgent is set.
- Unit test: runtime.dispatchBatch([...]) → array of results, length matches input.
Landmines: None at this step. The class is additive.
Step 2: Migrate defaultAgentRunner (headless-triage.ts)
What: Replace the inline runSubagent call at headless-triage.ts:~363 with runtime.dispatch(spec, task, { persistence: "ephemeral", isolation: "in-process", routing: "direct" }). The runtime instance is created at the top of handleTriageRun (or injected via HandleTriageOptions.agentRunner which already exists as an escape hatch).
Effort: ~2 hours.
Blast radius: Smallest of all sites. headless-triage.ts is called only from sf headless triage, not from the autonomous loop. A regression here affects only the triage command, which is manually invocable.
What proves it works:
- Existing triage integration test: runTriageApply produces a valid ParseTriagePlanResult.
- The agentRunner injection point in HandleTriageOptions remains so tests can still inject a mock runner.
Landmines:
- defaultAgentRunner is also passed as agentRunner to runTriageApply by some callers; the new runtime must expose the same (agent, task, options) => Promise<AgentRunResult> call signature (or adapt the RunTriageApplyResult type). Consider keeping the AgentRunner function type as a thin shim over runtime.dispatch during transition.
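A transition shim of the kind suggested in the landmine might look like the sketch below. The AgentRunResult shape, the options bag on AgentRunner, and the names DispatchOnly and agentRunnerFromRuntime are assumptions; only the (agent, task, options) call signature and the three option values for this site come from the text.

```typescript
interface AgentSpec {
  name: string;
  systemPrompt: string;
  model?: string;
  tools?: string[];
}

interface DispatchOptions {
  persistence?: "ephemeral" | "session" | "persistent";
  isolation?: "in-process" | "worker-thread" | "subprocess";
  routing?: "direct" | "broadcast" | "capability-match";
  timeoutMs?: number;
}

interface AgentDispatchResult {
  ok: boolean;
  output: string;
  stderr?: string;
  exitCode: number;
}

/** The one runtime method the shim needs (narrowed from AgentRuntime). */
interface DispatchOnly {
  dispatch(
    target: AgentSpec,
    message: string,
    options?: DispatchOptions,
  ): Promise<AgentDispatchResult>;
}

// Assumed legacy types; the real AgentRunResult may carry other fields.
type AgentRunResult = AgentDispatchResult;
type AgentRunner = (
  agent: AgentSpec,
  task: string,
  options?: { timeoutMs?: number },
) => Promise<AgentRunResult>;

/** Adapt a runtime to the legacy runner signature with the triage options pinned. */
function agentRunnerFromRuntime(runtime: DispatchOnly): AgentRunner {
  return (agent, task, options) =>
    runtime.dispatch(agent, task, {
      ...options,
      persistence: "ephemeral",
      isolation: "in-process",
      routing: "direct",
    });
}

// Usage with a recording stub in place of a real runtime.
const calls: Array<{ name: string; routing: string | undefined }> = [];
const stubRuntime: DispatchOnly = {
  dispatch(target, _message, options) {
    calls.push({ name: target.name, routing: options?.routing });
    return Promise.resolve({ ok: true, output: "stub", exitCode: 0 });
  },
};
const runner = agentRunnerFromRuntime(stubRuntime);
void runner({ name: "triage", systemPrompt: "classify issues" }, "classify", { timeoutMs: 1_000 });
```

Because tests already inject a runner through HandleTriageOptions.agentRunner, a shim like this would leave that injection point untouched while the default implementation moves onto the runtime.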
Step 3: Migrate runUnitViaSwarm (run-unit.js)
What: runUnitViaSwarm at run-unit.js:~210 already calls swarmDispatchAndWait. Replace it with runtime.dispatch(envelope, prompt, { persistence: "persistent", isolation: "in-process", routing: "capability-match", timeoutMs, onEvent }). The event collector (onEvent) and checkpoint synthesis stay in run-unit.js — they are unit-orchestration concerns, not runtime concerns.
Effort: ~2 hours.
Blast radius: Low. This path is only active when SF_AUTONOMOUS_VIA_SWARM=1. The default autonomous path (pi.sendMessage) is unaffected.
What proves it works:
- The existing auto-dispatch-canonical-plan.test.mjs (currently modified, per git status) should pass.
- result.targetAgent and result.replyMessageId remain populated so run-unit.js's outcome extraction works.
Landmines:
- runUnitViaSwarm reads envelope.executorSystemPrompt and envelope.executorTools and passes them through dispatchAndWait → runAgentTurn → runHeadlessPrompt. The runtime must either thread these through the DispatchEnvelope unchanged (preferred, since the envelope is already the contract) or expose them as explicit dispatch options.
Step 4: Migrate runSingleAgent (subagent/index.js)
What: runSingleAgent at subagent/index.js:~1266 calls runSubagent directly. runSingleAgentViaSwarm at ~1140 calls swarmDispatchAndWait. Both are replaced by runtime.dispatch(spec, task, { persistence: "ephemeral", isolation: "in-process", routing: "direct" }) with the swarm variant using routing: "capability-match" and persistence: "persistent".
The SF_SUBAGENT_VIA_SWARM flag becomes a runtime option instead of a code branch: callers that want swarm routing pass { routing: "capability-match" }.
Effort: ~4 hours. This is the most complex call site because of runSingleAgentInCmuxSplit, CMUX-split coordination, and the liveSubagentControllers abort set.
Blast radius: Medium. This path handles all interactive /delegate and /rubber-duck commands. A regression is user-visible immediately.
What proves it works:
- Interactive test: /delegate rubber-duck "explain X" in an interactive SF session produces output.
- Unit test: runSingleAgent with an unknown agent name returns exitCode: 1 with the correct stderr message.
- liveSubagentControllers must still be populated on dispatch so abortAllSubagents() works.
Landmines:
- processSubagentEventLine processes raw event JSON and updates currentResult in place. The runtime's onEvent callback receives AgentSessionEvent objects (not raw JSON strings). The event processing logic must be updated to accept typed events, or the shim must re-serialize them.
- CMUX split creates two concurrent runSingleAgent calls. The runtime must be safe for concurrent dispatch from the same instance.
Step 5: Migrate runAgentTurn / runHeadlessPrompt (agent-runner.js)
What: runHeadlessPrompt at agent-runner.js:~73 calls runSubagent. Replace with runtime.dispatch(spec, prompt, { persistence: "ephemeral", isolation: "in-process", routing: "direct" }). The runAgentTurn orchestration (inbox read, context assembly, bus write-back) stays in agent-runner.js — those are PersistentAgent lifecycle concerns.
Effort: ~3 hours.
Blast radius: Medium-high. This path drives the swarm-agent LLM turns for all dispatchAndWait calls (paths 3 and 4 after Steps 3 and 4 complete). A regression here silently produces empty swarm replies.
What proves it works:
- runAgentTurn with a mock agent that has one unread inbox message → turnsProcessed === 1, response is non-empty, bus contains a reply message.
- onlyMessageId path still forces inbox refresh.
Landmines:
- systemPromptOverride and toolsOverride are currently threaded through opts all the way to runSubagent. The runtime must expose these either via AgentSpec fields (where they belong) or via dispatch options. Prefer AgentSpec: the spec is the per-call agent definition.
Step 6: Add worker-thread Isolation Option
What: Implement the "worker-thread" isolation path in DefaultAgentRuntime.dispatch. Create packages/coding-agent/src/core/agent-runtime-worker.ts as the worker entry point. It imports runSubagent and posts results back via MessageChannel.
Effort: ~1 day. Worker thread communication is straightforward for structured-clone-safe data; AgentDispatchResult is trivially serializable.
Blast radius: Zero to existing callers. No existing site uses isolation: "worker-thread". This step is purely additive.
What proves it works:
- Synthetic test: dispatch with isolation: "worker-thread" → result.ok === true, result matches an equivalent in-process dispatch.
- Crash-isolation test: worker that throws after returning a partial result → parent process continues, result.ok === false, result.stderr contains the error message.
- Memory-isolation test: worker allocates a large buffer → parent heap is unaffected.
Landmines:
- AgentSessionEvent objects may contain non-serializable fields (functions, class instances). The worker must emit only the serializable parts via postMessage. Define an AgentSessionEventSerializable subset if needed.
- getAgentDir() uses process.env which is inherited by worker threads. Verify that the worker correctly resolves the agent directory.
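One simple way to produce such a serializable subset is a JSON round trip before postMessage, sketched below. EventLike, toSerializableEvent, the Usage class, and the "message_end" event name are all hypothetical; the real AgentSessionEvent fields are not listed in this document. The round trip drops functions and undefined values and flattens class instances to plain objects, but it throws on circular references and BigInt values and turns Date objects into strings, so it is only suitable if events avoid those.

```typescript
// Hypothetical event shape for illustration.
interface EventLike {
  type: string;
  [key: string]: unknown;
}

/**
 * Reduce an event to plain JSON data so it can cross a MessagePort.
 * Functions and undefined values are dropped; class instances lose their
 * prototype and keep only their own enumerable fields.
 */
function toSerializableEvent(event: EventLike): EventLike {
  return JSON.parse(JSON.stringify(event)) as EventLike;
}

class Usage {
  inputTokens: number;
  constructor(inputTokens: number) {
    this.inputTokens = inputTokens;
  }
  total(): number {
    return this.inputTokens;
  }
}

const rawEvent: EventLike = {
  type: "message_end",
  text: "done",
  usage: new Usage(12), // class instance with a method on its prototype
  onDone: () => {}, // a function field would make postMessage throw
};

const wireEvent = toSerializableEvent(rawEvent);
```

A hand-written mapper per event type would give tighter typing; the round trip is the lowest-effort option for a first worker-thread prototype.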
Step 7: Migrate Slice-Parallel to Runtime with isolation: "subprocess"
What: Replace the raw child_process.spawn in slice-parallel-orchestrator.js:~131 (spawnSliceWorker) with runtime.dispatch(spec, "/autonomous", { isolation: "subprocess", persistence: "ephemeral", routing: "direct", ... }) where spec.cwd is the slice's worktree path.
The subprocess isolation tier wraps child_process.spawn of sf --mode json --print "/autonomous" with the same env vars (SF_SLICE_LOCK, SF_MILESTONE_LOCK, SF_PARALLEL_WORKER=1) and worktree cwd.
Effort: ~1 day. The subprocess mechanics are already correct in spawnSliceWorker; this step is wrapping them in the runtime interface so cancellation, budget tracking, and event forwarding use the same path as other tiers.
Blast radius: Medium. Slice-parallel is an infrequently exercised path (requires two or more conflict-free slices and the parallel flag). A regression is visible in parallel autonomous runs.
What proves it works:
- startSliceParallel integration test: spawns two workers, both reach a completed state, orchestrator state reflects both.
- SF_PARALLEL_WORKER guard still fires: calling runtime.dispatch(... isolation: "subprocess") from within a subprocess worker returns an error rather than nesting.
Landmines:
- sliceState.workers currently stores raw ChildProcess references for stopSliceParallel to call .kill() on. The runtime must expose an abort mechanism (the existing AbortController/AbortSignal pattern) so stopSliceParallel can cancel via signal rather than raw pid management.
- Budget tracking (worker.cost) currently requires parsing NDJSON from the subprocess stdout. The runtime should expose a cost-update event so the orchestrator can update sliceState.totalCost without parsing raw output.
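Inside the subprocess tier, the NDJSON parsing could be confined to one small reader that turns stdout chunks into cost updates. The sketch below is a guess at the shape: the "cost_update" event type and its cost field are assumptions (the real schema of sf --mode json is not given here), and it assumes each event carries an incremental cost rather than a running total.

```typescript
// Assumed event shape; the real NDJSON schema may differ.
interface CostEvent {
  type: "cost_update";
  cost: number;
}

function isCostEvent(value: unknown): value is CostEvent {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return record["type"] === "cost_update" && typeof record["cost"] === "number";
}

/** Accumulates cost from NDJSON chunks that may split lines arbitrarily. */
class NdjsonCostReader {
  private buffer = "";
  totalCost = 0;

  /** Feed one stdout chunk; returns how many cost events it completed. */
  push(chunk: string): number {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line
    let seen = 0;
    for (const line of lines) {
      if (line.trim() === "") continue;
      let parsed: unknown;
      try {
        parsed = JSON.parse(line);
      } catch {
        continue; // skip non-JSON noise on stdout
      }
      if (isCostEvent(parsed)) {
        this.totalCost += parsed.cost;
        seen++;
      }
    }
    return seen;
  }
}

// Usage: the second event is split across two chunks.
const reader = new NdjsonCostReader();
const firstCount = reader.push('{"type":"cost_update","cost":0.25}\n{"type":"cost_up');
const secondCount = reader.push('date","cost":0.5}\nnot json\n');
```

With a reader like this owned by the runtime, the orchestrator would receive cost updates as events and would no longer touch raw stdout.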
Step 8: Decommission Old Dispatch Sites
What: After Steps 2–7 are verified in production, remove:
- The direct runSubagent import from headless-triage.ts (Step 2).
- The direct swarmDispatchAndWait import from run-unit.js (Step 3).
- The direct runSubagent and swarmDispatchAndWait imports from subagent/index.js (Step 4).
- runHeadlessPrompt in agent-runner.js (Step 5).
- The SF_SUBAGENT_VIA_SWARM and SF_AUTONOMOUS_VIA_SWARM feature flags (both paths now use the runtime with the appropriate options).
- spawnSliceWorker raw spawn logic (Step 7).
Effort: ~2 hours (deleting code and updating imports).
Blast radius: Zero at this point — the sites have already been migrated and tests are green.
Landmines:
- SF_SUBAGENT_VIA_SWARM and SF_AUTONOMOUS_VIA_SWARM may be set in operator configs or CI environments. Add a deprecation log before removing them. The new routing behavior should be configurable via AgentRuntime options set at construction time, not via environment variable branches.
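During the deprecation window, the flags could be translated into runtime options in one place, with a warning, instead of branching at each call site. The helper below is a sketch under that assumption: optionsFromLegacyFlag and SiteDefaults are hypothetical names, and the environment is passed in as a plain record so the function stays testable.

```typescript
type RoutingMode = "direct" | "broadcast" | "capability-match";
type PersistenceMode = "ephemeral" | "session" | "persistent";

interface SiteDefaults {
  routing: RoutingMode;
  persistence: PersistenceMode;
}

type LegacyFlag = "SF_SUBAGENT_VIA_SWARM" | "SF_AUTONOMOUS_VIA_SWARM";

/**
 * Map one legacy flag to the options its call site should use.
 * Emits a deprecation warning whenever the flag is still set to "1".
 */
function optionsFromLegacyFlag(
  flag: LegacyFlag,
  env: Record<string, string | undefined>,
  warn: (message: string) => void,
): SiteDefaults {
  if (env[flag] === "1") {
    warn(`${flag} is deprecated; configure routing on the AgentRuntime instead.`);
    return { routing: "capability-match", persistence: "persistent" };
  }
  return { routing: "direct", persistence: "ephemeral" };
}

// Usage: one flag set, one unset.
const warnings: string[] = [];
const fakeEnv = { SF_SUBAGENT_VIA_SWARM: "1" };
const subagentDefaults = optionsFromLegacyFlag("SF_SUBAGENT_VIA_SWARM", fakeEnv, (m) => warnings.push(m));
const autonomousDefaults = optionsFromLegacyFlag("SF_AUTONOMOUS_VIA_SWARM", fakeEnv, (m) => warnings.push(m));
```

Calling this once at runtime construction keeps the environment lookup out of the dispatch path, which matches the goal of configuring routing at construction time.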
Migration Summary
| Step | Description | Effort | Blast Radius |
|---|---|---|---|
| 1 | Define AgentRuntime interface + DefaultAgentRuntime | 1 day | Zero |
| 2 | Migrate defaultAgentRunner (headless-triage) | 2 h | Lowest |
| 3 | Migrate runUnitViaSwarm (run-unit) | 2 h | Low |
| 4 | Migrate runSingleAgent (subagent/index) | 4 h | Medium |
| 5 | Migrate runAgentTurn / runHeadlessPrompt (agent-runner) | 3 h | Medium-high |
| 6 | Add worker-thread isolation tier | 1 day | Zero (additive) |
| 7 | Migrate slice-parallel to isolation: "subprocess" | 1 day | Medium |
| 8 | Decommission old dispatch sites | 2 h | Zero (deletions) |
Total: 8 steps, ~4–5 days of focused work.
5. Non-Goals
5.1 AgentRuntime Is Not a Model Router
Model selection for a given unit type is the job of model-router.js. The runtime passes the caller's AgentSpec.model field through to runSubagent's SubagentConfig.model field unchanged. If the caller does not specify a model, the runtime uses the session default, exactly as runSubagent does today.
The runtime does not invoke computeTaskRequirements, selectModelForUnit, or the Bayesian blender. Model routing is a concern of the dispatch caller, not the dispatch mechanism.
5.2 AgentRuntime Is Not a Tool Registry
Tool availability is determined by the resource loader and session.setActiveToolsByName. The runtime passes AgentSpec.tools through to SubagentConfig.tools unchanged. It does not discover, load, or manage tools.
5.3 AgentRuntime Is Not a Scheduler
The autonomous orchestrator (auto/loop.js) owns the decision of when to dispatch units, in what order, and with what retry policy. The runtime provides the mechanism of dispatch: given a target and a message, execute and return a result. Scheduling, gate running, phase management, and repair logic remain in the orchestrator.
5.4 AgentRuntime Does Not Replace PersistentAgent Lifecycle
PersistentAgent, AgentSwarm, and MessageBus manage the persistent agent topology. The runtime uses these structures; it does not replace them. Agents are still registered via AgentSwarm.register (and exposed through runtime.registerAgent as a thin delegation). The runtime is a dispatch coordinator, not a topology manager.
5.5 AgentRuntime Does Not Wrap the Interactive Session
The user's interactive session (the parent AgentSession) is not managed by AgentRuntime. The runtime is for dispatching subordinate agents, not for managing the top-level conversation. pi.sendMessage in runUnit's non-swarm path is out of scope.
6. Open Questions
6.1 Should persistence: "session" be a separate axis or coupled to agent identity?
The current proposal defines "session" as a persistence mode meaning "the session persists within a single process lifetime but is not written to disk." This is distinct from both "ephemeral" (fresh session per dispatch) and "persistent" (SQLite-backed PersistentAgent).
The question is whether "session persistence" is a property of the dispatch call (as proposed) or a property of the agent's registered identity (i.e., only agents explicitly registered as session-persistent can be dispatched with persistence: "session").
Arguments for coupling to identity: An agent that accumulates session context needs to be found again across calls — which requires registration. An anonymous AgentSpec with persistence: "session" would have nowhere to store its context.
Arguments for keeping on the call: Session persistence is currently implicit in the parent session (which runUnit's non-swarm path uses). Making it an explicit dispatch option would let the runtime manage session caching keyed by AgentSpec.name, which is cleaner than the current implicit coupling.
Proposed resolution for v1: Defer persistence: "session" from the initial implementation. Ship "ephemeral" and "persistent" only. Session-persistent agents can be revisited when the interactive-session parent hand-off is better understood.
6.2 How Does routing: "capability-match" Interact with Capability-Based Agent Selection?
swarm-roles.js defines agent roles (coordinator, worker, scout, reviewer, planner, verifier, scribe, adversary) and tags (role:X, tier:Y, workMode:X). AgentSwarm.route(envelope) selects the target by matching envelope.workMode against agent tags.
The AgentRuntime.registerAgent API proposed in Section 2.2 accepts a capabilities: string[] field. This is forward-looking: future agents may declare fine-grained capabilities ("can-write-tests", "can-run-shell") beyond workMode tags. The routing algorithm would then do a capability intersection rather than a tag prefix match.
Question: Should registerAgent's capabilities replace the existing tag-based routing in AgentSwarm.route, or run in parallel with it?
Proposed resolution: In v1, registerAgent delegates to AgentSwarm.register and capabilities is stored as additional tags. The routing algorithm is unchanged. A future step (after decommission) can replace tag-prefix matching with a proper capability-intersection scorer.
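The v1 resolution above can be sketched as a small mapping from a registration request to swarm tags. This is a minimal illustration, not the real API: the SwarmRegistration and RegisterAgentInput shapes, the toSwarmRegistration and routeByWorkMode helpers, and the cap: tag prefix are all assumptions made here, since the actual AgentSwarm.register signature is not shown in this document.

```typescript
// Hypothetical shapes; the real AgentSwarm.register signature may differ.
interface SwarmRegistration {
  name: string;
  tags: string[];
}

interface RegisterAgentInput {
  name: string;
  role: string;
  workMode: string;
  capabilities?: string[];
}

// Assumed convention: capabilities become "cap:"-prefixed tags so they
// cannot collide with the existing role:/tier:/workMode: tags.
function toSwarmRegistration(input: RegisterAgentInput): SwarmRegistration {
  const tags = [`role:${input.role}`, `workMode:${input.workMode}`];
  for (const cap of input.capabilities ?? []) {
    const tag = `cap:${cap}`;
    if (!tags.includes(tag)) tags.push(tag); // de-duplicate repeated capabilities
  }
  return { name: input.name, tags };
}

// Simplified stand-in for the unchanged v1 routing: only the workMode tag
// is consulted, so the extra cap: tags have no effect on selection yet.
function routeByWorkMode(
  agents: SwarmRegistration[],
  workMode: string,
): SwarmRegistration | undefined {
  return agents.find((agent) => agent.tags.includes(`workMode:${workMode}`));
}
```

Because the capability tags ride along without influencing routing, a later capability-intersection scorer could read them without requiring agents to re-register.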
6.3 Should Errors Propagate or Be Returned as Results?
Current dispatch sites are inconsistent:
- runSubagent never throws; it returns { ok: false, exitCode, stderr }.
- swarmDispatchAndWait can throw (if the underlying _busDispatch throws a routing error) or return { reply: null, error: string }.
- runAgentTurn can return { error: string } or throw from runHeadlessPrompt.
- startSliceParallel returns { started, errors } and never throws.
The proposed AgentDispatchResult follows runSubagent's model: never throw, always return. All error conditions produce { ok: false, exitCode: 1, stderr: reason }.
Rationale: The runtime is called inside long-running orchestrator loops where an unhandled rejection would kill the entire autonomous run. Return-as-result forces callers to inspect the outcome rather than assuming success.
Exception: Construction-time errors (e.g., DefaultAgentRuntime cannot load the swarm because .sf/sf.db is corrupt) may throw, because they indicate a broken environment that the loop cannot recover from.
Open sub-question: Should dispatchBatch return a PromiseSettledResult[]-style array (with per-item status: "fulfilled" | "rejected") or AgentDispatchResult[] (where per-item errors are represented as ok: false)? The current proposal uses AgentDispatchResult[] for uniformity; PromiseSettledResult leaks the promise layer into the runtime contract and is harder to test.
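The return-as-result model and the uniform batch shape can be sketched as follows. This is an illustration under assumptions: the Engine type, the safeDispatch helper, and the output field are hypothetical names introduced here, and the real AgentDispatchResult may carry more fields.

```typescript
interface AgentDispatchResult {
  ok: boolean;
  exitCode: number;
  stderr: string;
  output?: string; // assumed field for the agent's reply text
}

// Hypothetical engine signature standing in for any existing dispatch site.
type Engine = (prompt: string) => Promise<string>;

// Converts every failure (rejection or synchronous throw) into ok: false.
async function safeDispatch(engine: Engine, prompt: string): Promise<AgentDispatchResult> {
  try {
    const output = await engine(prompt);
    return { ok: true, exitCode: 0, stderr: "", output };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, exitCode: 1, stderr: reason };
  }
}

// safeDispatch never rejects, so Promise.all cannot short-circuit here:
// one failing item does not discard the results of the others.
async function dispatchBatch(engine: Engine, prompts: string[]): Promise<AgentDispatchResult[]> {
  return Promise.all(prompts.map((prompt) => safeDispatch(engine, prompt)));
}
```

A caller in an orchestrator loop then inspects results[i].ok for each item and never needs a try/catch around the batch.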
6.4 Should onEvent Be a First-Class Subscription or a Per-Call Callback?
Currently onEvent is threaded as a callback through every level of the call stack: runSubagent → dispatchAndWait → runAgentTurn → runHeadlessPrompt → runSubagent options. This makes event forwarding brittle — if any layer forgets to forward it, events are silently dropped.
The proposed runtime.subscribe(callback) is a first-class subscription on the runtime instance. Any dispatch call on that runtime instance automatically forwards events to all subscribers. Per-call onEvent remains available in DispatchOptions for callers that want call-scoped filtering (e.g., runUnitViaSwarm's checkpoint-detection collector).
Open question: If multiple dispatch calls are in flight concurrently, should the subscriber receive events tagged with the originating unitId / targetAgent? Without tagging, a subscriber cannot distinguish which of two concurrent dispatches emitted a given toolcall_end. The AgentSessionEvent does not currently carry a correlation id.
Proposed resolution: Add a correlationId field to the events emitted by DefaultAgentRuntime.dispatch. The correlation id is the unitId from the envelope or a generated UUID for direct-spec dispatches. Existing subscribers that don't care about correlation ignore the field.
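One way the subscription and correlation tagging could fit together is sketched below. The EventHub class and its forDispatch method are hypothetical names used only for illustration, and the AgentSessionEvent shape is reduced to the fields needed here.

```typescript
import { randomUUID } from "node:crypto";

// Reduced event shape; the real AgentSessionEvent carries more fields.
interface AgentSessionEvent {
  type: string;
  correlationId?: string;
}

type Subscriber = (event: AgentSessionEvent) => void;

// Hypothetical helper that a runtime implementation could own internally.
class EventHub {
  private subscribers = new Set<Subscriber>();

  // Runtime-wide subscription; returns an unsubscribe function.
  subscribe(callback: Subscriber): () => void {
    this.subscribers.add(callback);
    return () => {
      this.subscribers.delete(callback);
    };
  }

  // Creates an emitter bound to one dispatch. The correlation id is the
  // unitId when one exists, otherwise a generated UUID.
  forDispatch(unitId?: string, onEvent?: Subscriber): { correlationId: string; emit: Subscriber } {
    const correlationId = unitId ?? randomUUID();
    const emit: Subscriber = (event) => {
      const tagged = { ...event, correlationId };
      onEvent?.(tagged); // call-scoped callback, if the caller passed one
      for (const subscriber of this.subscribers) subscriber(tagged);
    };
    return { correlationId, emit };
  }
}
```

With this shape, two concurrent dispatches emit through separate bound emitters, so a subscriber can group toolcall_end events by correlationId without any layer having to forward a callback.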
7. Relationship to Existing ADRs
ADR-0000 (Purpose-to-Software Compiler)
The AgentRuntime is an implementation detail of step 5 of the compiler pipeline ("generate milestone, slice, task, and artifact contracts from structured state"). It does not change what agents do; it standardizes how they are dispatched. The compiler's product contract — PDD fields, run-control policy, gate runners — is unchanged.
ADR-0075 (UOK Gate Architecture)
Gate runners currently call runAgentTurn indirectly through swarmDispatchAndWait. After migration, they will call runtime.dispatch with persistence: "persistent" and routing: "capability-match". The gate contract (execute(ctx, attempt) → GateResult) is unchanged; only the internal dispatch mechanism changes.
ADR-0079 (Solver / Executor Separation)
The solver pass proposed in ADR-0079 is exactly runtime.dispatch(solverSpec, executorTranscript, { persistence: "ephemeral", isolation: "in-process", routing: "direct" }). The solver model selection (resolveSolverModel) remains in model-router.js; the runtime receives the resolved model via AgentSpec.model. The two-pass architecture (executor dispatch → solver dispatch) maps cleanly onto runtime.dispatch called twice from runUnit.
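The two-pass shape can be sketched against a reduced runtime interface. Several details are assumptions made for this illustration rather than statements from ADR-0079: the runTwoPass helper name, the output field, reusing the solver's options for the executor pass, and skipping the solver when the executor fails.

```typescript
interface AgentSpec {
  name: string;
  model: string;
}

interface DispatchOptions {
  persistence: "ephemeral" | "persistent";
  isolation: "in-process" | "subprocess" | "worker-thread";
  routing: "direct" | "capability-match";
}

interface AgentDispatchResult {
  ok: boolean;
  exitCode: number;
  stderr: string;
  output?: string; // assumed field holding the transcript or reply
}

// Reduced view of the runtime: only the method this sketch needs.
interface AgentRuntime {
  dispatch(spec: AgentSpec, input: string, options: DispatchOptions): Promise<AgentDispatchResult>;
}

async function runTwoPass(
  runtime: AgentRuntime,
  executorSpec: AgentSpec,
  solverSpec: AgentSpec,
  task: string,
): Promise<AgentDispatchResult> {
  const options: DispatchOptions = {
    persistence: "ephemeral",
    isolation: "in-process",
    routing: "direct",
  };
  // Pass 1: the executor works on the task.
  const executor = await runtime.dispatch(executorSpec, task, options);
  if (!executor.ok) return executor; // assumption: no solver pass after a failed executor
  // Pass 2: the solver receives the executor transcript as its input.
  return runtime.dispatch(solverSpec, executor.output ?? "", options);
}
```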
Appendix A: File Locations After Migration
| New File | Purpose |
|---|---|
| packages/coding-agent/src/core/agent-runtime.ts | AgentRuntime interface + DefaultAgentRuntime class |
| packages/coding-agent/src/core/agent-runtime-worker.ts | Worker thread entry point for isolation: "worker-thread" |

| Modified File | Change |
|---|---|
| src/headless-triage.ts | Import createAgentRuntime instead of runSubagent |
| src/resources/extensions/sf/auto/run-unit.js | runUnitViaSwarm uses runtime.dispatch |
| src/resources/extensions/sf/subagent/index.js | runSingleAgent + runSingleAgentViaSwarm use runtime.dispatch |
| src/resources/extensions/sf/uok/agent-runner.js | runHeadlessPrompt uses runtime.dispatch |
| src/resources/extensions/sf/slice-parallel-orchestrator.js | spawnSliceWorker uses runtime.dispatch with isolation: "subprocess" |
No existing exported symbols from subagent-runner.ts, swarm-dispatch.js, or agent-runner.js are removed in the migration phase. They become internal implementation details behind the runtime abstraction and are removed only in Step 8.
Appendix B: Invariants the Runtime Must Preserve
- Never throw from dispatch or dispatchBatch under normal failure conditions (agent timeout, LLM error, routing miss). Return ok: false.
- runSubagent is the sole in-process LLM execution engine. The runtime wraps it; it does not duplicate its session-management logic.
- Inbox refresh contract: When routing is "capability-match" and persistence is "persistent", the runtime forces an inbox refresh before driving the agent turn. This fixes the inbox-state cache bug (Section 1.2) for all callers.
- Subprocess isolation guard: When isolation: "subprocess" and SF_PARALLEL_WORKER is set in the current process, dispatch returns ok: false with stderr: "cannot nest subprocess dispatch from within a parallel worker".
- Cancellation propagates to the execution engine. When the caller's AbortSignal fires, the runtime cancels the in-flight runSubagent call (or the subprocess, or the worker thread) and returns { ok: false, exitCode: 1, stderr: "cancelled" }. It does not leak the in-flight execution.
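The subprocess guard and the cancellation invariant can be illustrated together. This is a sketch under assumptions: the guardedDispatch function and the Engine type are hypothetical, the environment is passed in as a plain record so the guard can be exercised without touching the real process environment, and exitCode: 1 for the guard follows the error convention in Section 6.3.

```typescript
interface AgentDispatchResult {
  ok: boolean;
  exitCode: number;
  stderr: string;
}

type Isolation = "in-process" | "subprocess" | "worker-thread";

// Hypothetical engine: it receives the signal and is expected to stop
// (typically by rejecting) once the signal aborts.
type Engine = (signal: AbortSignal) => Promise<void>;

const CANCELLED: AgentDispatchResult = { ok: false, exitCode: 1, stderr: "cancelled" };

async function guardedDispatch(
  engine: Engine,
  isolation: Isolation,
  signal: AbortSignal,
  env: Record<string, string | undefined>,
): Promise<AgentDispatchResult> {
  // Subprocess isolation guard: a parallel worker must not spawn nested workers.
  if (isolation === "subprocess" && env.SF_PARALLEL_WORKER !== undefined) {
    return {
      ok: false,
      exitCode: 1,
      stderr: "cannot nest subprocess dispatch from within a parallel worker",
    };
  }
  if (signal.aborted) return CANCELLED; // cancelled before the engine started
  try {
    await engine(signal); // the signal is handed down so the engine can stop
    return signal.aborted ? CANCELLED : { ok: true, exitCode: 0, stderr: "" };
  } catch (err) {
    // An abort is reported as "cancelled" regardless of how the engine failed.
    if (signal.aborted) return CANCELLED;
    return { ok: false, exitCode: 1, stderr: err instanceof Error ? err.message : String(err) };
  }
}
```

Both invariants resolve to an ordinary result value, so they compose with the never-throw rule at the top of this appendix.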