docs: AgentRuntime unification proposal

Design doc for collapsing the five parallel agent-dispatch sites
(defaultAgentRunner, runHeadlessPrompt, runSingleAgent, runUnitViaSwarm,
slice-parallel-orchestrator) onto one runtime with three orthogonal
axes — persistence, isolation, routing.

590 lines, ~5200 words. Includes:
- Problem statement with five concrete pain points from this session's
  swarm convergence rounds (spawn hangs, inbox cache, checkpoint
  synthesis, ledger isolation, etc.)
- Worked-out TypeScript interface
- Mapping of each existing site to runtime options (table)
- 8-step migration plan in blast-radius order (~4-5 days focused work)
- Open questions

No source-code changes. Implementation comes later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mikael Hugo 2026-05-15 06:32:28 +02:00
parent 1e99bd669e
commit 1478579069


@@ -0,0 +1,590 @@
# Proposal: Unified `AgentRuntime` Abstraction
**Status:** Draft
**Date:** 2026-05-15
**Author:** Mikael Hugo
**Related ADRs:** ADR-0000 (purpose-to-software compiler), ADR-0075 (UOK gate architecture), ADR-0079 (solver/executor separation)
---
## 1. Problem Statement
### 1.1 Five Parallel Dispatch Paths
SF currently dispatches agent work through five entirely separate code paths. Each path independently manages cancellation, timeouts, event collection, error representation, and (where relevant) swarm routing. They share no interface, no contract, and no test fixture.
| # | Name | File | Line | Mechanism |
|---|------|------|------|-----------|
| 1 | `defaultAgentRunner` | `src/headless-triage.ts` | ~363 | `runSubagent` directly — composes system prompt via `subagent/prompt-parts`, fires and awaits one `runSubagent` call per triage agent |
| 2 | `runHeadlessPrompt` / `runAgentTurn` | `src/resources/extensions/sf/uok/agent-runner.js` | ~73 / ~129 | `runSubagent` through a two-layer wrapper — `runAgentTurn` reads the PersistentAgent inbox, builds an assembled prompt, calls `runHeadlessPrompt`, then writes a reply back to the `MessageBus` |
| 3 | `runSingleAgent` / `runSingleAgentViaSwarm` | `src/resources/extensions/sf/subagent/index.js` | ~1266 / ~1140 | `runSubagent` directly (default path) or `swarmDispatchAndWait` (flag `SF_SUBAGENT_VIA_SWARM=1`) — used by `/delegate`, `/rubber-duck`, and all interactive subagent commands |
| 4 | `runUnitViaSwarm` | `src/resources/extensions/sf/auto/run-unit.js` | ~210 | `swarmDispatchAndWait` — activated by `SF_AUTONOMOUS_VIA_SWARM=1`; packages the unit prompt as a `DispatchEnvelope` and blocks until the swarm worker replies |
| 5 | `startSliceParallel` (spawn) | `src/resources/extensions/sf/slice-parallel-orchestrator.js` | ~62 | `child_process.spawn` — spawns `sf --mode json --print "/autonomous"` per slice, each in its own git worktree with `SF_SLICE_LOCK` / `SF_MILESTONE_LOCK` set |
Path 5 is architecturally distinct from paths 1-4: it requires a separate process *and* a separate working directory per worker. This distinction is preserved throughout the proposal.
### 1.2 Concrete Pain Experienced
#### Spawn-based dispatch hung intermittently (8 rounds of convergence)
The original `runSingleAgent` and `defaultAgentRunner` used `child_process.spawn` to call `sf -p`. Every session startup paid full init cost. Spawned processes lost stderr on Windows, produced unpredictable exit timing, and required polling to detect completion. The conversion to in-process `runSubagent` (paths 1-3) required eight convergence iterations because there was no shared infrastructure: each site had to independently grow timeout logic, cancellation, partial-output collection, and event forwarding.
If there had been a single `AgentRuntime` interface from the start, those eight iterations would have been one.
#### Inbox-state cache bug
`dispatchAndWait` in `swarm-dispatch.js` delivers a message via `SwarmDispatchLayer._busDispatch` then immediately calls `runAgentTurn`. But `runAgentTurn` reads the agent's inbox from an in-memory cache with a 30-second refresh window. Messages delivered by a *different* `MessageBus` instance — which happens whenever `dispatchAndWait` is called more than once in quick succession — were silently invisible to the agent because the in-memory view hadn't refreshed.
The workaround is the `onlyMessageId` path (see `agent-runner.js:144`): when `dispatchAndWait` supplies `onlyMessageId`, `runAgentTurn` forces an inbox refresh before processing. This is a surgical patch that lives in path 2. Path 3 (`runSingleAgentViaSwarm`) calls `swarmDispatchAndWait` but does not supply `onlyMessageId` — it is susceptible to the same bug under concurrent dispatch. Path 4 (`runUnitViaSwarm`) also calls `swarmDispatchAndWait` and does not supply `onlyMessageId` directly; it relies on the `onlyMessageId` flow inside `dispatchAndWait`'s internal call to `runAgentTurn`.
A runtime layer that owns the inbox-drain lifecycle would fix this once.
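The failure mode and the forced-refresh workaround can be illustrated with a small sketch. `CachedInbox`, its `load` callback, and the injected clock are hypothetical names for illustration only; the real `PersistentAgent` inbox is SQLite-backed and is not shown here.

```typescript
// Hypothetical sketch of a time-windowed inbox cache; not the actual PersistentAgent code.
class CachedInbox<T> {
  private cached: T[] = [];
  private lastRefresh = Number.NEGATIVE_INFINITY;
  private readonly load: () => T[];
  private readonly now: () => number;
  private readonly refreshWindowMs: number;

  constructor(load: () => T[], now: () => number, refreshWindowMs = 30_000) {
    this.load = load;
    this.now = now;
    this.refreshWindowMs = refreshWindowMs;
  }

  /** Return the cached view, reloading when it is stale or when the caller forces it. */
  read(opts: { forceRefresh?: boolean } = {}): T[] {
    const stale = this.now() - this.lastRefresh >= this.refreshWindowMs;
    if (opts.forceRefresh || stale) {
      this.cached = this.load();
      this.lastRefresh = this.now();
    }
    return this.cached;
  }
}

// Simulated backing store and clock.
const backingStore: string[] = [];
let fakeClockMs = 0;
const inbox = new CachedInbox(() => backingStore.slice(), () => fakeClockMs);

inbox.read();                 // first read loads an empty inbox
backingStore.push("msg-1");   // another bus instance delivers a message
fakeClockMs = 1_000;          // one second later, still inside the 30 s window
const missed = inbox.read();  // stale view: the new message is invisible
const seen = inbox.read({ forceRefresh: true }); // roughly what the onlyMessageId path does
```

A runtime that owns the drain lifecycle could force the refresh on every dispatch instead of relying on each caller to pass the right flag.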
#### Checkpoint synthesis
When `runUnitViaSwarm` (path 4) returns, it must synthesize a checkpoint from the agent's reply, because the worker agent ran in an isolated `runSubagent` session whose tool calls were never written to the parent session's message ledger. The synthesis logic (`run-unit.js:459-498`) is ~40 lines of heuristics that exist only in path 4. Path 2's `runAgentTurn` has analogous reply-reconstruction logic. Path 1 has neither. There is no shared "extract structured output from a subagent result" utility.
#### Parent-ledger isolation
Paths 1-4 all run agents in isolated in-memory sessions (via `SessionManager.inMemory()`). None of their tool calls appear in the parent session's message history. This is correct, but it means every caller independently re-implements the message extraction pattern (`extractFinalOutput` in `subagent-runner.ts`, `getFinalOutput` in `subagent/index.js`, `reply` field on the `DispatchResult` in path 4). Inconsistencies in these extractors have caused output to be silently dropped when agents produced multi-block assistant messages.
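A single shared extractor might look like the sketch below. The message and block shapes are assumptions made for illustration and may not match the real session types; the point is only that every text block of the final assistant message is joined, rather than just the first one.

```typescript
// Hypothetical message shapes for illustration; the real session types may differ.
type TextBlock = { type: "text"; text: string };
type ContentBlock = TextBlock | { type: "tool_use"; name: string };

interface SessionMessage {
  role: "user" | "assistant";
  content: string | ContentBlock[];
}

/**
 * Return the text of the most recent assistant message that contains any text,
 * joining all of its text blocks so multi-block replies are not truncated.
 */
function extractFinalText(messages: SessionMessage[]): string {
  for (const msg of messages.slice().reverse()) {
    if (msg.role !== "assistant") continue;
    const text =
      typeof msg.content === "string"
        ? msg.content
        : msg.content
            .filter((block): block is TextBlock => block.type === "text")
            .map((block) => block.text)
            .join("\n");
    if (text.length > 0) return text;
  }
  return "";
}
```

Skipping assistant messages that contain only tool calls is a design choice of this sketch, not something the existing extractors are known to do.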
#### Swarm singleton per process
`SwarmDispatchLayer.getOrCreate` maintains a module-level cache keyed by `${basePath}:${swarmName}`. Paths 3 and 4 both call `swarmDispatchAndWait`, which calls `SwarmDispatchLayer.getOrCreate`. There is no mechanism to invalidate this cache, coordinate swarm topology reloads, or observe swarm state from outside paths 3 and 4. Diagnostic tooling has to reach into the internal cache directly.
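One way a runtime instance could own this cache is sketched below. `SwarmRegistry` and `SwarmHandle` are hypothetical names; `SwarmHandle` merely stands in for whatever `SwarmDispatchLayer.getOrCreate` returns today.

```typescript
// Hypothetical sketch: an instance-owned cache replacing the module-level one.
interface SwarmHandle {
  basePath: string;
  swarmName: string;
  createdAt: number;
}

class SwarmRegistry {
  private readonly layers = new Map<string, SwarmHandle>();

  // Same key format the current module-level cache is described as using.
  private key(basePath: string, swarmName: string): string {
    return `${basePath}:${swarmName}`;
  }

  getOrCreate(basePath: string, swarmName: string): SwarmHandle {
    const k = this.key(basePath, swarmName);
    let handle = this.layers.get(k);
    if (!handle) {
      handle = { basePath, swarmName, createdAt: Date.now() };
      this.layers.set(k, handle);
    }
    return handle;
  }

  /** Drop one entry so the next getOrCreate rebuilds it, e.g. after a topology reload. */
  invalidate(basePath: string, swarmName: string): boolean {
    return this.layers.delete(this.key(basePath, swarmName));
  }

  /** Read-only snapshot for diagnostic tooling. */
  list(): SwarmHandle[] {
    return Array.from(this.layers.values());
  }
}
```

With the cache held by an instance, diagnostics can call `list()` and tests can create a fresh registry per case instead of reaching into module state.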
### 1.3 The Cost: One Bug Class, Five Surfaces
Every systemic issue in agent dispatch — timeout inconsistency, cancellation leakage, event forwarding gaps, output extraction bugs, inbox refresh misses, checkpoint synthesis failure — surfaces independently in each of the five paths. Fixing path 1 does not fix path 3. The repair loop and the gate runner have no visibility into which path triggered a failure.
---
## 2. Proposed Shape
### 2.1 Three Orthogonal Axes
Agent dispatch has three independently varying dimensions that are currently conflated with the dispatch mechanism itself:
```
persistence: "ephemeral" | "session" | "persistent"
isolation: "in-process" | "worker-thread" | "subprocess"
routing: "direct" | "broadcast" | "capability-match"
```
**persistence** describes the agent's state lifetime:
- `"ephemeral"` — a fresh in-memory session per dispatch; no state carries over. This is the current `runSubagent` model.
- `"session"` — the agent session persists across multiple prompts within one process lifetime (e.g., the interactive session's parent agent). Not managed by this runtime directly, but callers can hold a session reference.
- `"persistent"` — the agent's identity, inbox, memory, and context blocks survive process restarts, stored in SQLite via `PersistentAgent`. This is the current swarm model.
**isolation** describes the execution boundary:
- `"in-process"` — the agent runs synchronously in the same Node.js event loop, sharing the process's module cache and memory heap. Correct for the vast majority of cases.
- `"worker-thread"` — the agent runs in a Node.js `Worker` thread. Shares the process's lifespan but gets its own V8 heap, so a memory-intensive agent or a crash does not corrupt the parent. Slower to start than in-process; faster than subprocess.
- `"subprocess"` — the agent runs as a completely separate OS process. Required when the agent needs a different working directory (slice-parallel), a different environment, or complete crash isolation with no shared state.
**routing** describes how a target agent is selected:
- `"direct"` — caller provides an explicit `AgentSpec` (name, system prompt, tools, model). No topology lookup.
- `"broadcast"` — envelope is delivered to all registered agents matching a tag predicate.
- `"capability-match"` — envelope is delivered to the agent whose registered capabilities best match the envelope's `workMode` and `unitType`. This is the current `AgentSwarm.route()` behavior.
### 2.2 `AgentRuntime` TypeScript Interface
```typescript
// packages/coding-agent/src/core/agent-runtime.ts
import type { AgentSessionEvent } from "./agent-session.js";

// ── Agent specification ──────────────────────────────────────────────────────
/** Explicit agent description for "direct" routing. */
export interface AgentSpec {
  name: string;
  systemPrompt: string;
  model?: string;
  tools?: string[];
  cwd?: string;
}

/** Envelope for "capability-match" or "broadcast" routing. */
export interface DispatchEnvelope {
  unitId: string;
  unitType: string;
  workMode: string;
  payload: string;
  priority?: number;
  scope?: string;
  /** Override the matched agent's system prompt. */
  executorSystemPrompt?: string;
  /** Override the matched agent's tool filter. */
  executorTools?: string[];
}

// ── Dispatch options ─────────────────────────────────────────────────────────
export type PersistenceMode = "ephemeral" | "session" | "persistent";
export type IsolationMode = "in-process" | "worker-thread" | "subprocess";
export type RoutingMode = "direct" | "broadcast" | "capability-match";

export interface DispatchOptions {
  persistence?: PersistenceMode;
  isolation?: IsolationMode;
  routing?: RoutingMode;
  timeoutMs?: number;
  signal?: AbortSignal;
  onEvent?: (event: AgentSessionEvent) => void;
}

// ── Results ──────────────────────────────────────────────────────────────────
export interface AgentDispatchResult {
  ok: boolean;
  output: string;
  stderr?: string;
  exitCode: number;
  /** For persistent agents: the message id the reply was written under. */
  replyMessageId?: string;
  /** For swarm routing: which agent handled the request. */
  targetAgent?: string;
}

// ── Registration ─────────────────────────────────────────────────────────────
/** Persistent-agent topology entry, used for capability-match routing. */
export interface RegisteredAgent {
  name: string;
  role: string;
  tags: string[];
  capabilities: string[];
  basePath: string;
}

// ── Runtime interface ────────────────────────────────────────────────────────
export interface AgentRuntime {
  /**
   * Dispatch a task to one agent and await its result.
   *
   * When routing is "direct", `target` must be an AgentSpec.
   * When routing is "capability-match", `target` must be a DispatchEnvelope.
   * When routing is "broadcast", `target` must be a DispatchEnvelope; returns
   * the first non-error result and cancels the rest.
   */
  dispatch(
    target: AgentSpec | DispatchEnvelope,
    message: string,
    options?: DispatchOptions,
  ): Promise<AgentDispatchResult>;

  /**
   * Dispatch multiple tasks concurrently and await all results.
   * The same options apply to all tasks in the batch.
   */
  dispatchBatch(
    tasks: Array<{ target: AgentSpec | DispatchEnvelope; message: string }>,
    options?: DispatchOptions,
  ): Promise<AgentDispatchResult[]>;

  /**
   * Subscribe to all events emitted by agents dispatched through this runtime.
   * Returns an unsubscribe function.
   */
  subscribe(callback: (event: AgentSessionEvent) => void): () => void;

  /**
   * Register an agent in the persistent topology so capability-match routing
   * can discover it.
   */
  registerAgent(agent: RegisteredAgent): void;

  /**
   * Return all registered agents. Used by diagnostic tooling and /sf swarm status.
   */
  getRegisteredAgents(): RegisteredAgent[];
}
```
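Because `dispatch` accepts a union for `target`, an implementation needs some runtime check that the target shape agrees with the routing mode. The sketch below shows one possible guard; `isAgentSpec` and `assertTargetMatchesRouting` are hypothetical helpers, and the two interfaces are trimmed local copies so the example stands alone.

```typescript
// Local, trimmed copies of the two target shapes so the sketch is self-contained.
interface AgentSpec {
  name: string;
  systemPrompt: string;
  model?: string;
  tools?: string[];
  cwd?: string;
}
interface DispatchEnvelope {
  unitId: string;
  unitType: string;
  workMode: string;
  payload: string;
}
type RoutingMode = "direct" | "broadcast" | "capability-match";

/** An AgentSpec always carries `systemPrompt`; a DispatchEnvelope has no such field. */
function isAgentSpec(target: AgentSpec | DispatchEnvelope): target is AgentSpec {
  return "systemPrompt" in target;
}

/** Throw early when the target shape and the routing mode disagree. */
function assertTargetMatchesRouting(
  target: AgentSpec | DispatchEnvelope,
  routing: RoutingMode,
): void {
  const spec = isAgentSpec(target);
  if (routing === "direct" && !spec) {
    throw new TypeError('routing "direct" requires an AgentSpec');
  }
  if (routing !== "direct" && spec) {
    throw new TypeError(`routing "${routing}" requires a DispatchEnvelope`);
  }
}
```

Failing fast here keeps a mismatched call from reaching the swarm layer and surfacing later as an empty reply.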
### 2.3 Default Option Values
When options are omitted, the runtime applies context-sensitive defaults:
| Field | Default | Rationale |
|-------|---------|-----------|
| `persistence` | `"ephemeral"` | Matches current `runSubagent` behavior; safe default for all existing callers |
| `isolation` | `"in-process"` | Fastest, fewest moving parts; subprocess spawning only when explicitly requested |
| `routing` | `"direct"` | Existing callers always have an explicit agent spec; capability-match is opt-in |
| `timeoutMs` | `480_000` (8 min) | Matches current `dispatchAndWait` default |
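The defaults in the table could be applied with a small resolver like the one below. `resolveDispatchOptions` and `DEFAULT_DISPATCH_OPTIONS` are hypothetical names, and the option type is a trimmed local copy without `signal` and `onEvent`.

```typescript
type PersistenceMode = "ephemeral" | "session" | "persistent";
type IsolationMode = "in-process" | "worker-thread" | "subprocess";
type RoutingMode = "direct" | "broadcast" | "capability-match";

interface DispatchOptions {
  persistence?: PersistenceMode;
  isolation?: IsolationMode;
  routing?: RoutingMode;
  timeoutMs?: number;
}
type ResolvedDispatchOptions = Required<DispatchOptions>;

// Values taken from the defaults table.
const DEFAULT_DISPATCH_OPTIONS: ResolvedDispatchOptions = {
  persistence: "ephemeral",
  isolation: "in-process",
  routing: "direct",
  timeoutMs: 480_000, // 8 minutes
};

/** Fill in every omitted field; `??` also covers fields explicitly set to undefined. */
function resolveDispatchOptions(options: DispatchOptions = {}): ResolvedDispatchOptions {
  return {
    persistence: options.persistence ?? DEFAULT_DISPATCH_OPTIONS.persistence,
    isolation: options.isolation ?? DEFAULT_DISPATCH_OPTIONS.isolation,
    routing: options.routing ?? DEFAULT_DISPATCH_OPTIONS.routing,
    timeoutMs: options.timeoutMs ?? DEFAULT_DISPATCH_OPTIONS.timeoutMs,
  };
}
```

Using `??` per field rather than object spread avoids a caller-supplied `undefined` silently overwriting a default.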
### 2.4 Existing Dispatch Sites Mapped to Runtime Options
| Site | `persistence` | `isolation` | `routing` | Notes |
|------|--------------|-------------|-----------|-------|
| `defaultAgentRunner` (headless-triage.ts) | `"ephemeral"` | `"in-process"` | `"direct"` | Triage agents are stateless; direct spec with prompt-parts composition |
| `runHeadlessPrompt` / `runAgentTurn` (agent-runner.js) | `"persistent"` | `"in-process"` | `"capability-match"` | PersistentAgent inbox drain; runtime owns the bus write-back |
| `runSingleAgent` (subagent/index.js) | `"ephemeral"` | `"in-process"` | `"direct"` | Interactive subagents; no state retention across calls |
| `runUnitViaSwarm` (run-unit.js) | `"persistent"` | `"in-process"` | `"capability-match"` | Autonomous unit dispatch; runtime owns checkpoint synthesis |
| `startSliceParallel` (slice-parallel-orchestrator.js) | `"ephemeral"` | `"subprocess"` | `"direct"` | Each worker needs its own cwd and git worktree; subprocess is load-bearing |
### 2.5 What Stays the Same vs. What Changes
**Stays the same:**
- `runSubagent` in `packages/coding-agent/src/core/subagent-runner.ts` — the runtime uses it as the in-process execution engine. Its interface and behavior are unchanged.
- `swarmDispatchAndWait` / `SwarmDispatchLayer` — the runtime uses these as the persistent+capability-match backend. They remain as implementation details, not public API for dispatch callers.
- `AgentSwarm`, `PersistentAgent`, `MessageBus` — unchanged. The runtime coordinates through them, not around them.
- The `DispatchEnvelope` shape — the runtime accepts it as-is for capability-match routing.
- Checkpoint tool protocol — unchanged. The runtime collects events and synthesizes checkpoints exactly as `runUnitViaSwarm` does today, but in a shared utility.
**Changes for callers:**
- Callers import `createAgentRuntime()` instead of importing `runSubagent` or `swarmDispatchAndWait` directly.
- Error representation becomes uniform: `AgentDispatchResult.ok` is always a boolean; `stderr` carries the reason; `exitCode` follows the existing convention (0/1/124).
- Event subscription uses `runtime.subscribe()` rather than threading an `onEvent` callback through every call.
- The swarm singleton lifecycle is owned by the runtime instance, not by the `SwarmDispatchLayer` module-level cache.
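The uniform error representation could be produced by one mapping function shared by all tiers, as in the sketch below. `RawOutcome` is a hypothetical internal type, and reading exit code 124 as "timed out" is an assumption borrowed from the GNU `timeout` convention; the proposal only states that the existing 0/1/124 convention is kept.

```typescript
interface AgentDispatchResult {
  ok: boolean;
  output: string;
  stderr?: string;
  exitCode: number;
}

// Hypothetical internal outcome type; not part of the proposed public interface.
type RawOutcome =
  | { kind: "completed"; output: string }
  | { kind: "failed"; error: unknown; partialOutput?: string }
  | { kind: "timed-out"; timeoutMs: number; partialOutput?: string };

/** Map every way a dispatch can end onto the single result shape. */
function toDispatchResult(outcome: RawOutcome): AgentDispatchResult {
  switch (outcome.kind) {
    case "completed":
      return { ok: true, output: outcome.output, exitCode: 0 };
    case "failed":
      return {
        ok: false,
        output: outcome.partialOutput ?? "",
        stderr: outcome.error instanceof Error ? outcome.error.message : String(outcome.error),
        exitCode: 1,
      };
    case "timed-out":
      return {
        ok: false,
        output: outcome.partialOutput ?? "",
        stderr: `timed out after ${outcome.timeoutMs} ms`,
        exitCode: 124, // assumed meaning: timeout
      };
  }
}
```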
---
## 3. Isolation Tiers
### 3.1 In-Process (`"in-process"`)
The default tier. The agent session is created via `createAgentSession` with `SessionManager.inMemory()`, same as `runSubagent` today. Startup cost is ~50 ms (resource loader + session init). Memory is shared with the parent but the session state is isolated to the subagent.
**Right for:** triage agents, interactive subagents, swarm workers executing unit tasks, solver pass.
**Wrong for:** agents that crash unpredictably (they take down the parent process), agents that need a different `cwd` per call (process.cwd is global), agents that must run concurrently with heavy memory footprint.
### 3.2 Worker Thread (`"worker-thread"`)
Node.js `worker_threads.Worker` gives each agent its own V8 heap and a clean module scope while still sharing the process lifespan and avoiding OS process startup overhead (no fork). Communication is via `MessageChannel`. Worker threads share the process-wide working directory, and `process.chdir()` is not available inside a worker, so a per-agent cwd has to be passed explicitly (for example through `workerData`) and honored by the agent code rather than read from `process.cwd()`.
**Right for:**
- Agents that might OOM (large context windows, image processing).
- Agents loaded from untrusted extension paths where a crash must not kill the parent.
- Concurrent agent execution where memory isolation matters but subprocess overhead is unacceptable.
- Proving the isolation boundary in tests: a worker-thread agent can be killed cleanly without tearing down the test process.
**Wrong for:** agents that must share module-level singletons with the parent (e.g., the interactive session's `AgentSwarm` singleton). Worker threads cannot share JS objects across the `MessageChannel` boundary — only structured-clone-serializable data.
**Implementation sketch:**
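```typescript
// In agent-runtime.ts, isolation: "worker-thread" branch:
const { Worker, MessageChannel } = await import("node:worker_threads");
const { port1, port2 } = new MessageChannel();
// AbortSignal and callbacks are not structured-cloneable, so they stay on this side.
const { signal, onEvent, ...plainOptions } = options ?? {};
const worker = new Worker(
  new URL("./agent-runtime-worker.js", import.meta.url),
  {
    workerData: { spec, message, options: plainOptions, port: port2 },
    transferList: [port2],
  },
);
// Cancellation terminates the worker thread.
signal?.addEventListener("abort", () => void worker.terminate(), { once: true });
// agent-runtime-worker.js calls runSubagent and posts events and the result through
// its end of the channel (port2); this thread listens on port1 and forwards to onEvent.
```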
The runtime thread drives the worker via `port1` events; the worker runs `runSubagent` in its own thread.
### 3.3 Subprocess (`"subprocess"`)
`child_process.spawn` of `sf --mode json --print "/autonomous"` (or a future `sf headless agent-turn`) with the caller-supplied `cwd`, environment, and lock variables. This is the current `slice-parallel-orchestrator.js` approach, which is correct and must be preserved.
**Right for:** slice-parallel execution where each worker needs `process.cwd()` anchored to a different git worktree. The working directory is not just a parameter — it determines which git index, which `.sf/sf.db`, and which lock context the agent operates in. In-process and worker-thread isolation cannot replicate this.
**Wrong for:** anything that doesn't need a distinct filesystem root. Subprocess startup takes 2-5 seconds (full Node.js + SF init), not 50 ms.
**Note on the nesting guard (`SF_PARALLEL_WORKER`):** The slice orchestrator already guards against nesting via `process.env.SF_PARALLEL_WORKER`. The `"subprocess"` isolation tier will preserve this guard.
### 3.4 Decision Tree
```
Does the agent need a different git worktree / cwd?
  YES → "subprocess"
  Could a crash or OOM kill the parent?
    YES (untrusted extension, large context) → "worker-thread"
    Default → "in-process"
```
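The same decision tree can be written as a function. `IsolationNeeds` and `chooseIsolation` are hypothetical names; the two flags correspond to the two questions in the tree.

```typescript
type IsolationMode = "in-process" | "worker-thread" | "subprocess";

// Hypothetical input describing what the caller knows about the agent.
interface IsolationNeeds {
  /** The agent must run against a different git worktree / cwd. */
  needsOwnCwd: boolean;
  /** A crash or OOM is plausible (untrusted extension, large context). */
  crashOrOomRisk: boolean;
}

/** Questions are checked in order; the first YES wins. */
function chooseIsolation(needs: IsolationNeeds): IsolationMode {
  if (needs.needsOwnCwd) return "subprocess";
  if (needs.crashOrOomRisk) return "worker-thread";
  return "in-process";
}
```

Checking the cwd question first matters: an agent that needs its own worktree gets a subprocess even if it is also a crash risk.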
---
## 4. Migration Plan
Each step is self-contained: at the end of the step, the named dispatch site works correctly through the runtime and the old code path is removable. Steps are ordered by blast radius (smallest first).
### Step 1: Define `AgentRuntime` in `packages/coding-agent/src/core/agent-runtime.ts`
**What:** Create the interface (Section 2.2), create a `DefaultAgentRuntime` class that wraps `runSubagent` (direct, in-process, ephemeral path) and `SwarmDispatchLayer.dispatchAndWait` (capability-match, in-process, persistent path). No existing call sites are changed.
**Effort:** ~1 day. The two existing engines are already well-factored; this is wiring.
**Blast radius:** Zero. The file is new; no imports change.
**What proves it works:**
- Unit test: `runtime.dispatch(spec, "hello")` → `result.ok === true`, `result.output` is non-empty.
- Unit test: `runtime.dispatch(envelope, "hello", { routing: "capability-match" })` → `result.targetAgent` is set.
- Unit test: `runtime.dispatchBatch([...])` → array of results, length matches input.
**Landmines:** None at this step. The class is additive.
---
### Step 2: Migrate `defaultAgentRunner` (headless-triage.ts)
**What:** Replace the inline `runSubagent` call at `headless-triage.ts:~363` with `runtime.dispatch(spec, task, { persistence: "ephemeral", isolation: "in-process", routing: "direct" })`. The runtime instance is created at the top of `handleTriageRun` (or injected via `HandleTriageOptions.agentRunner` which already exists as an escape hatch).
**Effort:** ~2 hours.
**Blast radius:** Smallest of all sites. `headless-triage.ts` is called only from `sf headless triage`, not from the autonomous loop. A regression here affects only the triage command, which is manually invocable.
**What proves it works:**
- Existing triage integration test: `runTriageApply` produces a valid `ParseTriagePlanResult`.
- The `agentRunner` injection point in `HandleTriageOptions` remains so tests can still inject a mock runner.
**Landmines:**
- `defaultAgentRunner` is also passed as `agentRunner` to `runTriageApply` by some callers; the new runtime must expose the same `(agent, task, options) => Promise<AgentRunResult>` call signature (or adapt the `RunTriageApplyResult` type). Consider keeping the `AgentRunner` function type as a thin shim over `runtime.dispatch` during transition.
---
### Step 3: Migrate `runUnitViaSwarm` (run-unit.js)
**What:** `runUnitViaSwarm` at `run-unit.js:~210` already calls `swarmDispatchAndWait`. Replace it with `runtime.dispatch(envelope, prompt, { persistence: "persistent", isolation: "in-process", routing: "capability-match", timeoutMs, onEvent })`. The event collector (`onEvent`) and checkpoint synthesis stay in `run-unit.js` — they are unit-orchestration concerns, not runtime concerns.
**Effort:** ~2 hours.
**Blast radius:** Low. This path is only active when `SF_AUTONOMOUS_VIA_SWARM=1`. The default autonomous path (`pi.sendMessage`) is unaffected.
**What proves it works:**
- The existing `auto-dispatch-canonical-plan.test.mjs` (currently modified, per git status) should pass.
- `result.targetAgent` and `result.replyMessageId` remain populated so `run-unit.js`'s outcome extraction works.
**Landmines:**
- `runUnitViaSwarm` reads `envelope.executorSystemPrompt` and `envelope.executorTools` and passes them through `dispatchAndWait → runAgentTurn → runHeadlessPrompt`. The runtime must either thread these through the `DispatchEnvelope` unchanged (preferred — the envelope is already the contract) or expose them as explicit dispatch options.
---
### Step 4: Migrate `runSingleAgent` (subagent/index.js)
**What:** `runSingleAgent` at `subagent/index.js:~1266` calls `runSubagent` directly. `runSingleAgentViaSwarm` at `~1140` calls `swarmDispatchAndWait`. Both are replaced by `runtime.dispatch(spec, task, { persistence: "ephemeral", isolation: "in-process", routing: "direct" })` with the swarm variant using `routing: "capability-match"` and `persistence: "persistent"`.
The `SF_SUBAGENT_VIA_SWARM` flag becomes a runtime option instead of a code branch: callers that want swarm routing pass `{ routing: "capability-match" }`.
**Effort:** ~4 hours. This is the most complex call site because of `runSingleAgentInCmuxSplit`, CMUX-split coordination, and the `liveSubagentControllers` abort set.
**Blast radius:** Medium. This path handles all interactive `/delegate` and `/rubber-duck` commands. A regression is user-visible immediately.
**What proves it works:**
- Interactive test: `/delegate rubber-duck "explain X"` in an interactive SF session produces output.
- Unit test: `runSingleAgent` with an unknown agent name returns `exitCode: 1` with the correct `stderr` message.
- `liveSubagentControllers` must still be populated on dispatch so `abortAllSubagents()` works.
**Landmines:**
- `processSubagentEventLine` processes raw event JSON and updates `currentResult` in place. The runtime's `onEvent` callback receives `AgentSessionEvent` objects (not raw JSON strings). The event processing logic must be updated to accept typed events, or the shim must re-serialize them.
- CMUX split creates two concurrent `runSingleAgent` calls. The runtime must be safe for concurrent dispatch from the same instance.
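For the first landmine, the re-serializing shim option might look like the sketch below. `SessionEventLike` is a stand-in for `AgentSessionEvent`, and `LineProcessor` stands in for the existing `processSubagentEventLine` signature, which is assumed here to take one JSON string per event.

```typescript
// Hypothetical stand-ins; the real AgentSessionEvent and line-processor types may differ.
interface SessionEventLike {
  type: string;
  [key: string]: unknown;
}
type LineProcessor = (line: string) => void;

/**
 * Adapt a consumer of raw JSON event lines into a typed onEvent callback.
 * Events that cannot be serialized (e.g. circular references) are skipped.
 */
function asTypedEventHandler(processLine: LineProcessor): (event: SessionEventLike) => void {
  return (event) => {
    let line: string;
    try {
      line = JSON.stringify(event);
    } catch {
      return; // JSON.stringify throws on circular structures
    }
    processLine(line);
  };
}
```

This keeps the existing line-based logic untouched during the transition, at the cost of a serialize/parse round trip per event; updating the processor to accept typed events removes that cost later.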
---
### Step 5: Migrate `runAgentTurn` / `runHeadlessPrompt` (agent-runner.js)
**What:** `runHeadlessPrompt` at `agent-runner.js:~73` calls `runSubagent`. Replace with `runtime.dispatch(spec, prompt, { persistence: "ephemeral", isolation: "in-process", routing: "direct" })`. The `runAgentTurn` orchestration (inbox read, context assembly, bus write-back) stays in `agent-runner.js` — those are `PersistentAgent` lifecycle concerns.
**Effort:** ~3 hours.
**Blast radius:** Medium-high. This path drives the swarm-agent LLM turns for all `dispatchAndWait` calls (paths 3 and 4 after Steps 3 and 4 complete). A regression here silently produces empty swarm replies.
**What proves it works:**
- `runAgentTurn` with a mock agent that has one unread inbox message → `turnsProcessed === 1`, `response` is non-empty, bus contains a reply message.
- `onlyMessageId` path still forces inbox refresh.
**Landmines:**
- `systemPromptOverride` and `toolsOverride` are currently threaded through `opts` all the way to `runSubagent`. The runtime must expose these either via `AgentSpec` fields (where they belong) or via dispatch options. Prefer `AgentSpec` — the spec is the per-call agent definition.
---
### Step 6: Add `worker-thread` Isolation Option
**What:** Implement the `"worker-thread"` isolation path in `DefaultAgentRuntime.dispatch`. Create `packages/coding-agent/src/core/agent-runtime-worker.ts` as the worker entry point. It imports `runSubagent` and posts results back via `MessageChannel`.
**Effort:** ~1 day. Worker thread communication is straightforward for structured-clone-safe data; `AgentDispatchResult` is trivially serializable.
**Blast radius:** Zero to existing callers. No existing site uses `isolation: "worker-thread"`. This step is purely additive.
**What proves it works:**
- Synthetic test: dispatch with `isolation: "worker-thread"` → `result.ok === true`, result matches an equivalent `in-process` dispatch.
- Crash-isolation test: worker that throws after returning a partial result → parent process continues, `result.ok === false`, `result.stderr` contains the error message.
- Memory-isolation test: worker allocates a large buffer → parent heap is unaffected.
**Landmines:**
- `AgentSessionEvent` objects may contain non-serializable fields (functions, class instances). The worker must emit only the serializable parts via `postMessage`. Define an `AgentSessionEventSerializable` subset if needed.
- `getAgentDir()` uses `process.env` which is inherited by worker threads. Verify that the worker correctly resolves the agent directory.
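For the serialization landmine, one conservative approach is to reduce each event to JSON-safe data before `postMessage`. The helper below is a hypothetical sketch; it is stricter than the structured clone algorithm (which can also carry `Date`, `Map`, and `bigint`), trading fidelity for predictability.

```typescript
/**
 * Reduce an arbitrary value to JSON-safe data. Functions, symbols, bigint and
 * undefined are dropped; class instances become plain objects of their own
 * enumerable properties; cyclic back-references are cut.
 */
function toSerializable(value: unknown, ancestors: Set<object> = new Set()): unknown {
  if (value === null) return null;
  const kind = typeof value;
  if (kind === "string" || kind === "number" || kind === "boolean") return value;
  if (kind !== "object") return undefined;
  const obj = value as object;
  if (obj instanceof Date) return obj.toISOString();
  if (obj instanceof Error) return { name: obj.name, message: obj.message };
  if (ancestors.has(obj)) return undefined; // cycle: drop the back-reference
  ancestors.add(obj);
  let result: unknown;
  if (Array.isArray(obj)) {
    // Keep array positions stable by substituting null for dropped entries.
    result = obj.map((item) => toSerializable(item, ancestors) ?? null);
  } else {
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(obj)) {
      const converted = toSerializable((obj as Record<string, unknown>)[key], ancestors);
      if (converted !== undefined) out[key] = converted;
    }
    result = out;
  }
  ancestors.delete(obj);
  return result;
}
```

A typed `AgentSessionEventSerializable` subset would be tighter than this generic filter, but the filter is a reasonable safety net while the event types are audited.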
---
### Step 7: Migrate Slice-Parallel to Runtime with `isolation: "subprocess"`
**What:** Replace the raw `child_process.spawn` in `slice-parallel-orchestrator.js:~131` (`spawnSliceWorker`) with `runtime.dispatch(spec, "/autonomous", { isolation: "subprocess", persistence: "ephemeral", routing: "direct", ... })` where `spec.cwd` is the slice's worktree path.
The subprocess isolation tier wraps `child_process.spawn` of `sf --mode json --print "/autonomous"` with the same env vars (`SF_SLICE_LOCK`, `SF_MILESTONE_LOCK`, `SF_PARALLEL_WORKER=1`) and worktree cwd.
**Effort:** ~1 day. The subprocess mechanics are already correct in `spawnSliceWorker`; this step is wrapping them in the runtime interface so cancellation, budget tracking, and event forwarding use the same path as other tiers.
**Blast radius:** Medium. Slice-parallel is an infrequently exercised path (requires two or more conflict-free slices and the parallel flag). A regression is visible in parallel autonomous runs.
**What proves it works:**
- `startSliceParallel` integration test: spawns two workers, both reach a `completed` state, orchestrator state reflects both.
- `SF_PARALLEL_WORKER` guard still fires: calling `runtime.dispatch(... isolation: "subprocess")` from within a subprocess worker returns an error rather than nesting.
**Landmines:**
- `sliceState.workers` currently stores raw `ChildProcess` references for `stopSliceParallel` to call `.kill()` on. The runtime must expose an abort mechanism (the existing `AbortController` / `AbortSignal` pattern) so `stopSliceParallel` can cancel via signal rather than raw pid management.
- Budget tracking (`worker.cost`) currently requires parsing NDJSON from the subprocess stdout. The runtime should expose a cost-update event so the orchestrator can update `sliceState.totalCost` without parsing raw output.
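For the budget-tracking landmine, the NDJSON parsing could move into the subprocess tier and surface as typed callbacks. The sketch below buffers partial lines across stdout chunks. The `cost_update` event shape is purely hypothetical, since the worker's actual NDJSON schema is not described here.

```typescript
/** Hypothetical event shape; the real worker NDJSON schema is not specified in this proposal. */
interface CostUpdate {
  type: "cost_update";
  cost: number;
}

function isCostUpdate(value: unknown): value is CostUpdate {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { type?: unknown; cost?: unknown };
  return candidate.type === "cost_update" && typeof candidate.cost === "number";
}

class NdjsonCostReader {
  private buffer = "";
  private readonly onCost: (update: CostUpdate) => void;

  constructor(onCost: (update: CostUpdate) => void) {
    this.onCost = onCost;
  }

  /** Feed one stdout chunk; invokes the callback once per complete cost_update line. */
  push(chunk: string): void {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
    for (const line of lines) {
      const trimmed = line.trim();
      if (!trimmed) continue;
      let parsed: unknown;
      try {
        parsed = JSON.parse(trimmed);
      } catch {
        continue; // ignore non-JSON noise on stdout
      }
      if (isCostUpdate(parsed)) this.onCost(parsed);
    }
  }
}
```

The orchestrator would then subscribe to cost updates and adjust `sliceState.totalCost` without knowing how the subprocess frames its output.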
---
### Step 8: Decommission Old Dispatch Sites
**What:** After Steps 2-7 are verified in production, remove:
- The direct `runSubagent` import from `headless-triage.ts` (Step 2).
- The direct `swarmDispatchAndWait` import from `run-unit.js` (Step 3).
- The direct `runSubagent` and `swarmDispatchAndWait` imports from `subagent/index.js` (Step 4).
- `runHeadlessPrompt` in `agent-runner.js` (Step 5).
- The `SF_SUBAGENT_VIA_SWARM` and `SF_AUTONOMOUS_VIA_SWARM` feature flags (both paths now use the runtime with the appropriate options).
- `spawnSliceWorker` raw spawn logic (Step 7).
**Effort:** ~2 hours (deleting code and updating imports).
**Blast radius:** Zero at this point — the sites have already been migrated and tests are green.
**Landmines:**
- `SF_SUBAGENT_VIA_SWARM` and `SF_AUTONOMOUS_VIA_SWARM` may be set in operator configs or CI environments. Add a deprecation log before removing them. The new routing behavior should be configurable via `AgentRuntime` options set at construction time, not via environment variable branches.
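A deprecation shim for the two flags might look like the sketch below. `readLegacySwarmFlags` is a hypothetical helper; it only reports which flags are set and collects warnings, leaving the decision about routing options to whoever constructs the runtime.

```typescript
interface LegacySwarmFlags {
  subagentViaSwarm: boolean;
  autonomousViaSwarm: boolean;
  warnings: string[];
}

/**
 * Inspect an environment map for the two legacy flags. A flag counts as set
 * only when its value is exactly "1", matching how the flags are described.
 */
function readLegacySwarmFlags(env: Record<string, string | undefined>): LegacySwarmFlags {
  const warnings: string[] = [];
  const isSet = (flag: string): boolean => {
    if (env[flag] !== "1") return false;
    warnings.push(`${flag} is deprecated; configure routing on the AgentRuntime instead.`);
    return true;
  };
  return {
    subagentViaSwarm: isSet("SF_SUBAGENT_VIA_SWARM"),
    autonomousViaSwarm: isSet("SF_AUTONOMOUS_VIA_SWARM"),
    warnings,
  };
}
```

At startup the caller would pass `process.env`, log each warning once, and translate the booleans into construction-time runtime options.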
---
### Migration Summary
| Step | Description | Effort | Blast Radius |
|------|-------------|--------|--------------|
| 1 | Define `AgentRuntime` interface + `DefaultAgentRuntime` | 1 day | Zero |
| 2 | Migrate `defaultAgentRunner` (headless-triage) | 2 h | Lowest |
| 3 | Migrate `runUnitViaSwarm` (run-unit) | 2 h | Low |
| 4 | Migrate `runSingleAgent` (subagent/index) | 4 h | Medium |
| 5 | Migrate `runAgentTurn` / `runHeadlessPrompt` (agent-runner) | 3 h | Medium-high |
| 6 | Add `worker-thread` isolation tier | 1 day | Zero (additive) |
| 7 | Migrate slice-parallel to `isolation: "subprocess"` | 1 day | Medium |
| 8 | Decommission old dispatch sites | 2 h | Zero (deletions) |
Total: 8 steps, ~4-5 days of focused work.
---
## 5. Non-Goals
### 5.1 `AgentRuntime` Is Not a Model Router
Model selection for a given unit type is the job of `model-router.js`. The runtime passes the caller's `AgentSpec.model` field through to `runSubagent`'s `SubagentConfig.model` field unchanged. If the caller does not specify a model, the runtime uses the session default, exactly as `runSubagent` does today.
The runtime does not invoke `computeTaskRequirements`, `selectModelForUnit`, or the Bayesian blender. Model routing is a concern of the dispatch caller, not the dispatch mechanism.
### 5.2 `AgentRuntime` Is Not a Tool Registry
Tool availability is determined by the resource loader and `session.setActiveToolsByName`. The runtime passes `AgentSpec.tools` through to `SubagentConfig.tools` unchanged. It does not discover, load, or manage tools.
### 5.3 `AgentRuntime` Is Not a Scheduler
The autonomous orchestrator (`auto/loop.js`) owns the decision of *when* to dispatch units, in what order, and with what retry policy. The runtime provides the *mechanism* of dispatch: given a target and a message, execute and return a result. Scheduling, gate running, phase management, and repair logic remain in the orchestrator.
### 5.4 `AgentRuntime` Does Not Replace `PersistentAgent` Lifecycle
`PersistentAgent`, `AgentSwarm`, and `MessageBus` manage the persistent agent topology. The runtime *uses* these structures; it does not replace them. Agents are still registered via `AgentSwarm.register` (and exposed through `runtime.registerAgent` as a thin delegation). The runtime is a dispatch coordinator, not a topology manager.
### 5.5 `AgentRuntime` Does Not Wrap the Interactive Session
The user's interactive session (the parent `AgentSession`) is not managed by `AgentRuntime`. The runtime is for dispatching subordinate agents, not for managing the top-level conversation. `pi.sendMessage` in `runUnit`'s non-swarm path is out of scope.
---
## 6. Open Questions
### 6.1 Should `persistence: "session"` be a separate axis or coupled to agent identity?
The current proposal defines `"session"` as a persistence mode meaning "the session persists within a single process lifetime but is not written to disk." This is distinct from both `"ephemeral"` (fresh session per dispatch) and `"persistent"` (SQLite-backed PersistentAgent).
The question is whether "session persistence" is a property of the *dispatch call* (as proposed) or a property of the *agent's registered identity* (i.e., only agents explicitly registered as session-persistent can be dispatched with `persistence: "session"`).
**Arguments for coupling to identity:** An agent that accumulates session context needs to be found again across calls — which requires registration. An anonymous `AgentSpec` with `persistence: "session"` would have nowhere to store its context.
**Arguments for keeping it on the call:** Session persistence is currently implicit in the parent session (which `runUnit`'s non-swarm path uses). Making it an explicit dispatch option would let the runtime manage session caching keyed by `AgentSpec.name`, which is cleaner than the current implicit coupling.
**Proposed resolution for v1:** Defer `persistence: "session"` from the initial implementation. Ship `"ephemeral"` and `"persistent"` only. Session-persistent agents can be revisited when the interactive-session parent hand-off is better understood.
### 6.2 How Does `routing: "capability-match"` Interact with Capability-Based Agent Selection?
`swarm-roles.js` defines agent roles (coordinator, worker, scout, reviewer, planner, verifier, scribe, adversary) and tags (`role:X`, `tier:Y`, `workMode:X`). `AgentSwarm.route(envelope)` selects the target by matching `envelope.workMode` against agent tags.
The `AgentRuntime.registerAgent` API proposed in Section 2.2 accepts a `capabilities: string[]` field. This is forward-looking: future agents may declare fine-grained capabilities (`"can-write-tests"`, `"can-run-shell"`) beyond workMode tags. The routing algorithm would then do a capability intersection rather than a tag prefix match.
**Question:** Should `registerAgent`'s `capabilities` replace the existing tag-based routing in `AgentSwarm.route`, or run in parallel with it?
**Proposed resolution:** In v1, `registerAgent` delegates to `AgentSwarm.register` and `capabilities` is stored as additional tags. The routing algorithm is unchanged. A future step (after decommission) can replace tag-prefix matching with a proper capability-intersection scorer.
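A minimal sketch of the v1 resolution, assuming `AgentSwarm.register` can be modeled as taking a name and a tag list. `SwarmLike`, `RegisterAgentInput`, and the in-memory fake are hypothetical and exist only for illustration; the real `AgentSwarm` signature may differ.

```typescript
// Assumed subset of AgentSwarm needed for this sketch.
interface SwarmLike {
  register(name: string, tags: string[]): void;
}

// Hypothetical input shape for runtime.registerAgent.
interface RegisterAgentInput {
  name: string;
  tags: string[];         // existing tags, e.g. "role:worker"
  capabilities: string[]; // forward-looking capabilities, e.g. "can-write-tests"
}

// v1 behavior as proposed: capabilities are stored as additional tags and the
// swarm's existing tag-based routing is left untouched.
function registerAgent(swarm: SwarmLike, input: RegisterAgentInput): string[] {
  // De-duplicate while preserving first-seen order.
  const merged = Array.from(new Set([...input.tags, ...input.capabilities]));
  swarm.register(input.name, merged);
  return merged;
}

// In-memory fake that records what the swarm would receive.
const registered = new Map<string, string[]>();
const fakeSwarm: SwarmLike = {
  register: (name, tags) => {
    registered.set(name, tags);
  },
};

const mergedTags = registerAgent(fakeSwarm, {
  name: "tester",
  tags: ["role:worker"],
  capabilities: ["can-write-tests", "can-run-shell"],
});
```

Because the merge happens before delegation, a later capability-intersection scorer could read the same stored tags without a data migration.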
### 6.3 Should Errors Propagate or Be Returned as Results?
Current dispatch sites are inconsistent:
- `runSubagent` never throws; it returns `{ ok: false, exitCode, stderr }`.
- `swarmDispatchAndWait` can throw (if the underlying `_busDispatch` throws a routing error) or return `{ reply: null, error: string }`.
- `runAgentTurn` can return `{ error: string }` or throw from `runHeadlessPrompt`.
- `startSliceParallel` returns `{ started, errors }` and never throws.
The proposed `AgentDispatchResult` follows `runSubagent`'s model: **never throw, always return**. All error conditions produce `{ ok: false, exitCode: 1, stderr: reason }`.
**Rationale:** The runtime is called inside long-running orchestrator loops where an unhandled rejection would kill the entire autonomous run. Return-as-result forces callers to inspect the outcome rather than assuming success.
**Exception:** Construction-time errors (e.g., `DefaultAgentRuntime` cannot load the swarm because `.sf/sf.db` is corrupt) may throw, because they indicate a broken environment that the loop cannot recover from.
**Open sub-question:** Should `dispatchBatch` return a `PromiseSettledResult[]`-style array (with per-item `status: "fulfilled" | "rejected"`) or `AgentDispatchResult[]` (where per-item errors are represented as `ok: false`)? The current proposal uses `AgentDispatchResult[]` for uniformity; `PromiseSettledResult` leaks the promise layer into the runtime contract and is harder to test.
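The never-throw contract and the `AgentDispatchResult[]` batch shape can be sketched as follows. The result type is reduced to the three fields named in this section, and `dispatchSafely` / `dispatchBatchSafely` are hypothetical helper names rather than part of the proposed interface.

```typescript
// Reduced stand-in: only the fields named in Section 6.3.
interface AgentDispatchResult {
  ok: boolean;
  exitCode: number;
  stderr: string;
}

// Converts any throw or rejection from the underlying call into ok: false.
async function dispatchSafely(
  run: () => Promise<AgentDispatchResult>,
): Promise<AgentDispatchResult> {
  try {
    return await run();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, exitCode: 1, stderr: reason };
  }
}

// Batch variant: returns one AgentDispatchResult per input, in input order.
// Promise.all cannot reject here because every item is already wrapped.
async function dispatchBatchSafely(
  runs: Array<() => Promise<AgentDispatchResult>>,
): Promise<AgentDispatchResult[]> {
  return Promise.all(runs.map((run) => dispatchSafely(run)));
}
```

With this shape a caller checks `result.ok` on every item and never needs a `try`/`catch` around the batch, which is the uniformity argument made above.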
### 6.4 Should `onEvent` Be a First-Class Subscription or a Per-Call Callback?
Currently `onEvent` is threaded as a callback through every level of the call stack: `runSubagent` → `dispatchAndWait` → `runAgentTurn` → `runHeadlessPrompt` → `runSubagent options`. This makes event forwarding brittle: if any layer forgets to forward it, events are silently dropped.
The proposed `runtime.subscribe(callback)` is a first-class subscription on the runtime instance. Any dispatch call on that runtime instance automatically forwards events to all subscribers. Per-call `onEvent` remains available in `DispatchOptions` for callers that want call-scoped filtering (e.g., `runUnitViaSwarm`'s checkpoint-detection collector).
**Open question:** If multiple dispatch calls are in flight concurrently, should the subscriber receive events tagged with the originating `unitId` / `targetAgent`? Without tagging, a subscriber cannot distinguish which of two concurrent dispatches emitted a given `toolcall_end`. The `AgentSessionEvent` does not currently carry a correlation id.
**Proposed resolution:** Add a `correlationId` field to the events emitted by `DefaultAgentRuntime.dispatch`. The correlation id is the `unitId` from the envelope or a generated UUID for direct-spec dispatches. Existing subscribers that don't care about correlation ignore the field.
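One way the subscription and correlation tagging could fit together is sketched below. `EventHub`, `resolveCorrelationId`, and the unsubscribe return value are assumptions made for illustration; the event type is a reduced stand-in for `AgentSessionEvent`.

```typescript
// Reduced stand-in; the real AgentSessionEvent carries more fields.
interface AgentSessionEvent {
  type: string;
}

type CorrelatedEvent = AgentSessionEvent & { correlationId: string };
type Subscriber = (event: CorrelatedEvent) => void;

// Uses the envelope's unitId when present, otherwise a generated id.
// The generator is injected so the sketch does not depend on a UUID library.
function resolveCorrelationId(
  unitId: string | undefined,
  generateId: () => string,
): string {
  return unitId ?? generateId();
}

class EventHub {
  private subscribers = new Set<Subscriber>();

  // Returns an unsubscribe function (an assumption, not stated in the proposal).
  subscribe(callback: Subscriber): () => void {
    this.subscribers.add(callback);
    return () => {
      this.subscribers.delete(callback);
    };
  }

  // Called once per event; delivers to the per-call onEvent first,
  // then to every runtime-level subscriber.
  emit(event: AgentSessionEvent, correlationId: string, onEvent?: Subscriber): void {
    const tagged: CorrelatedEvent = { ...event, correlationId };
    onEvent?.(tagged);
    this.subscribers.forEach((subscriber) => subscriber(tagged));
  }
}

const hub = new EventHub();
const seen: string[] = [];
const unsubscribe = hub.subscribe((e) => {
  seen.push(`${e.correlationId}:${e.type}`);
});

// Two dispatches emit the same event type; the tag tells them apart.
hub.emit({ type: "toolcall_end" }, resolveCorrelationId("unit-1", () => "generated-id"));
hub.emit({ type: "toolcall_end" }, resolveCorrelationId(undefined, () => "generated-id"));
unsubscribe();
hub.emit({ type: "toolcall_end" }, "unit-3"); // no longer delivered
```

Subscribers that ignore `correlationId` keep working, since the field is purely additive on the event.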
---
## 7. Relationship to Existing ADRs
### ADR-0000 (Purpose-to-Software Compiler)
The `AgentRuntime` is an implementation detail of step 5 of the compiler pipeline ("generate milestone, slice, task, and artifact contracts from structured state"). It does not change what agents do; it standardizes how they are dispatched. The compiler's product contract — PDD fields, run-control policy, gate runners — is unchanged.
### ADR-0075 (UOK Gate Architecture)
Gate runners currently call `runAgentTurn` indirectly through `swarmDispatchAndWait`. After migration, they will call `runtime.dispatch` with `persistence: "persistent"` and `routing: "capability-match"`. The gate contract (`execute(ctx, attempt) → GateResult`) is unchanged; only the internal dispatch mechanism changes.
### ADR-0079 (Solver / Executor Separation)
The solver pass proposed in ADR-0079 is exactly `runtime.dispatch(solverSpec, executorTranscript, { persistence: "ephemeral", isolation: "in-process", routing: "direct" })`. The solver model selection (`resolveSolverModel`) remains in `model-router.js`; the runtime receives the resolved model via `AgentSpec.model`. The two-pass architecture (executor dispatch → solver dispatch) maps cleanly onto `runtime.dispatch` called twice from `runUnit`.
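The two-pass shape could look roughly like this. The interfaces are reduced stand-ins, the `output` field on the result is an assumption (the proposal does not name the transcript field here), and skipping the solver when the executor fails is an illustrative choice rather than something ADR-0079 specifies. The executor's options are taken as a parameter because they depend on the calling site.

```typescript
// Reduced stand-ins for the proposal's types.
interface AgentSpec {
  name: string;
  model?: string;
}

interface DispatchOptions {
  persistence: "ephemeral" | "persistent";
  isolation: "in-process" | "subprocess" | "worker-thread";
  routing: "direct" | "capability-match";
}

interface AgentDispatchResult {
  ok: boolean;
  exitCode: number;
  stderr: string;
  output: string; // hypothetical field holding the agent's transcript
}

interface AgentRuntime {
  dispatch(
    target: AgentSpec,
    message: string,
    options: DispatchOptions,
  ): Promise<AgentDispatchResult>;
}

// Executor dispatch followed by solver dispatch, both through the same runtime.
async function runTwoPass(
  runtime: AgentRuntime,
  executorSpec: AgentSpec,
  solverSpec: AgentSpec, // model already resolved by the caller
  task: string,
  executorOptions: DispatchOptions,
): Promise<AgentDispatchResult> {
  const executorResult = await runtime.dispatch(executorSpec, task, executorOptions);
  if (!executorResult.ok) {
    // Assumption: no solver pass when the executor failed.
    return executorResult;
  }
  const executorTranscript = executorResult.output;
  return runtime.dispatch(solverSpec, executorTranscript, {
    persistence: "ephemeral",
    isolation: "in-process",
    routing: "direct",
  });
}
```

Neither pass needs to know about model routing; `solverSpec.model` arrives already resolved, consistent with Section 5.1.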
---
## Appendix A: File Locations After Migration
| New File | Purpose |
|----------|---------|
| `packages/coding-agent/src/core/agent-runtime.ts` | `AgentRuntime` interface + `DefaultAgentRuntime` class |
| `packages/coding-agent/src/core/agent-runtime-worker.ts` | Worker thread entry point for `isolation: "worker-thread"` |

| Modified File | Change |
|---------------|--------|
| `src/headless-triage.ts` | Import `createAgentRuntime` instead of `runSubagent` |
| `src/resources/extensions/sf/auto/run-unit.js` | `runUnitViaSwarm` uses `runtime.dispatch` |
| `src/resources/extensions/sf/subagent/index.js` | `runSingleAgent` + `runSingleAgentViaSwarm` use `runtime.dispatch` |
| `src/resources/extensions/sf/uok/agent-runner.js` | `runHeadlessPrompt` uses `runtime.dispatch` |
| `src/resources/extensions/sf/slice-parallel-orchestrator.js` | `spawnSliceWorker` uses `runtime.dispatch` with `isolation: "subprocess"` |
No existing exported symbols from `subagent-runner.ts`, `swarm-dispatch.js`, or `agent-runner.js` are removed in the migration phase. They become internal implementation details behind the runtime abstraction and are removed only in Step 8.
---
## Appendix B: Invariants the Runtime Must Preserve
1. **Never throw from `dispatch` or `dispatchBatch`** under normal failure conditions (agent timeout, LLM error, routing miss). Return `ok: false`.
2. **`runSubagent` is the sole in-process LLM execution engine.** The runtime wraps it; it does not duplicate its session-management logic.
3. **Inbox refresh contract:** When routing is `"capability-match"` and persistence is `"persistent"`, the runtime forces an inbox refresh before driving the agent turn. This fixes the inbox-state cache bug (Section 1.2) for all callers.
4. **Subprocess isolation guard:** When `isolation: "subprocess"` and `SF_PARALLEL_WORKER` is set in the current process, `dispatch` returns `ok: false` with `stderr: "cannot nest subprocess dispatch from within a parallel worker"`.
5. **Cancellation propagates to the execution engine.** When the caller's `AbortSignal` fires, the runtime cancels the in-flight `runSubagent` call (or the subprocess, or the worker thread) and returns `{ ok: false, exitCode: 1, stderr: "cancelled" }`. It does not leak the in-flight execution.
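Invariants 4 and 5 can be sketched as two small functions. Both names are hypothetical, the environment is passed in as a plain record so the guard is easy to test, and treating any defined value of `SF_PARALLEL_WORKER` as "set" is an assumption. The cancellation sketch also assumes the execution engine accepts an `AbortSignal` and settles once it fires.

```typescript
// Reduced stand-in: only the fields named in the invariants.
interface AgentDispatchResult {
  ok: boolean;
  exitCode: number;
  stderr: string;
}

type Env = Record<string, string | undefined>;

// Invariant 4: refuse nested subprocess dispatch inside a parallel worker.
// Returns a failure result to short-circuit dispatch, or null to proceed.
function subprocessGuard(isolation: string, env: Env): AgentDispatchResult | null {
  if (isolation === "subprocess" && env.SF_PARALLEL_WORKER !== undefined) {
    return {
      ok: false,
      exitCode: 1,
      stderr: "cannot nest subprocess dispatch from within a parallel worker",
    };
  }
  return null;
}

// Invariant 5: forward the caller's signal to the engine and report "cancelled".
// Awaiting the engine until it settles is what avoids leaking the execution.
async function dispatchWithCancel(
  engine: (signal: AbortSignal) => Promise<AgentDispatchResult>,
  signal: AbortSignal,
): Promise<AgentDispatchResult> {
  const cancelled: AgentDispatchResult = { ok: false, exitCode: 1, stderr: "cancelled" };
  if (signal.aborted) return cancelled;
  try {
    const result = await engine(signal);
    return signal.aborted ? cancelled : result;
  } catch (err) {
    if (signal.aborted) return cancelled;
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, exitCode: 1, stderr: reason };
  }
}
```

Both functions return results instead of throwing, so they compose with invariant 1 without extra wrapping.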