Mikael Hugo 1478579069 docs: AgentRuntime unification proposal

Design doc for collapsing the five parallel agent-dispatch sites
(defaultAgentRunner, runHeadlessPrompt, runSingleAgent, runUnitViaSwarm,
slice-parallel-orchestrator) onto one runtime with three orthogonal
axes — persistence, isolation, routing.

590 lines, ~5200 words. Includes:
- Problem statement with five concrete pain points from this session's
  swarm convergence rounds (spawn hangs, inbox cache, checkpoint
  synthesis, ledger isolation, etc.)
- Worked-out TypeScript interface
- Mapping of each existing site to runtime options (table)
- 8-step migration plan in blast-radius order (~4-5 days focused work)
- Open questions

No source-code changes. Implementation comes later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-15 06:32:28 +02:00


Proposal: Unified `AgentRuntime` Abstraction

Status: Draft
Date: 2026-05-15
Author: Mikael Hugo
Related ADRs: ADR-0000 (purpose-to-software compiler), ADR-0075 (UOK gate architecture), ADR-0079 (solver/executor separation)

1. Problem Statement

1.1 Five Parallel Dispatch Paths

SF currently dispatches agent work through five entirely separate code paths. Each path independently manages cancellation, timeouts, event collection, error representation, and (where relevant) swarm routing. They share no interface, no contract, and no test fixture.

#	Name	File	Line	Mechanism
1	`defaultAgentRunner`	`src/headless-triage.ts`	~363	`runSubagent` directly — composes system prompt via `subagent/prompt-parts`, fires and awaits one `runSubagent` call per triage agent
2	`runHeadlessPrompt` / `runAgentTurn`	`src/resources/extensions/sf/uok/agent-runner.js`	~73 / ~129	`runSubagent` through a two-layer wrapper — `runAgentTurn` reads the PersistentAgent inbox, builds an assembled prompt, calls `runHeadlessPrompt`, then writes a reply back to the `MessageBus`
3	`runSingleAgent` / `runSingleAgentViaSwarm`	`src/resources/extensions/sf/subagent/index.js`	~1266 / ~1140	`runSubagent` directly (default path) or `swarmDispatchAndWait` (flag `SF_SUBAGENT_VIA_SWARM=1`) — used by `/delegate`, `/rubber-duck`, and all interactive subagent commands
4	`runUnitViaSwarm`	`src/resources/extensions/sf/auto/run-unit.js`	~210	`swarmDispatchAndWait` — activated by `SF_AUTONOMOUS_VIA_SWARM=1`; packages the unit prompt as a `DispatchEnvelope` and blocks until the swarm worker replies
5	`startSliceParallel` (spawn)	`src/resources/extensions/sf/slice-parallel-orchestrator.js`	~62	`child_process.spawn` — spawns `sf --mode json --print "/autonomous"` per slice, each in its own git worktree with `SF_SLICE_LOCK` / `SF_MILESTONE_LOCK` set

Path 5 is architecturally distinct from 1–4: it requires a separate process and a separate working directory per worker. This distinction is preserved throughout the proposal.

1.2 Concrete Pain Experienced

Spawn-based dispatch hung intermittently (8 rounds of convergence)

The original runSingleAgent and defaultAgentRunner used child_process.spawn to call sf -p, so every spawned session paid the full initialization cost. Spawned processes lost stderr on Windows, exited at unpredictable times, and had to be polled for completion. Converting paths 1–3 to in-process runSubagent took eight convergence iterations because there was no shared infrastructure: each site had to grow its own timeout logic, cancellation, partial-output collection, and event forwarding.

If there had been a single AgentRuntime interface from the start, those eight iterations would have been one.

Inbox-state cache bug

dispatchAndWait in swarm-dispatch.js delivers a message via SwarmDispatchLayer._busDispatch then immediately calls runAgentTurn. But runAgentTurn reads the agent's inbox from an in-memory cache with a 30-second refresh window. Messages delivered by a different MessageBus instance — which happens whenever dispatchAndWait is called more than once in quick succession — were silently invisible to the agent because the in-memory view hadn't refreshed.

The workaround is the onlyMessageId path (see agent-runner.js:144): when dispatchAndWait supplies onlyMessageId, runAgentTurn forces an inbox refresh before processing. This is a surgical patch that lives in path 2. Path 3 (runSingleAgentViaSwarm) calls swarmDispatchAndWait but does not supply onlyMessageId — it is susceptible to the same bug under concurrent dispatch. Path 4 (runUnitViaSwarm) also calls swarmDispatchAndWait and does not supply onlyMessageId directly; it relies on the onlyMessageId flow inside dispatchAndWait's internal call to runAgentTurn.

A runtime layer that owns the inbox-drain lifecycle would fix this once.
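
A minimal sketch of the lifecycle described above, assuming a cached inbox view with a refresh window and a forced refresh when the caller names a specific message. InboxView, loadInbox, and the injected clock are hypothetical names for illustration, not existing SF APIs.

```typescript
// Hypothetical sketch: an inbox view with a refresh window and a forced-refresh path.
// None of these names exist in SF; they only illustrate the lifecycle.
interface InboxMessage {
  id: string;
  body: string;
}

class InboxView {
  private cache: InboxMessage[] = [];
  private lastRefresh = Number.NEGATIVE_INFINITY;
  private readonly loadInbox: () => InboxMessage[]; // reads the durable store
  private readonly now: () => number;               // injected clock, for testability
  private readonly refreshWindowMs: number;

  constructor(
    loadInbox: () => InboxMessage[],
    now: () => number,
    refreshWindowMs = 30_000,
  ) {
    this.loadInbox = loadInbox;
    this.now = now;
    this.refreshWindowMs = refreshWindowMs;
  }

  /** Return messages; reload first if the cache is stale or a specific id is requested. */
  read(onlyMessageId?: string): InboxMessage[] {
    const stale = this.now() - this.lastRefresh >= this.refreshWindowMs;
    if (stale || onlyMessageId !== undefined) {
      this.cache = this.loadInbox();
      this.lastRefresh = this.now();
    }
    return onlyMessageId === undefined
      ? this.cache
      : this.cache.filter((m) => m.id === onlyMessageId);
  }
}
```

If the runtime owned a view like this, every swarm-routed dispatch would take the forced-refresh branch by construction, rather than depending on each call site to pass onlyMessageId.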

Checkpoint synthesis

When runUnitViaSwarm (path 4) returns, it must synthesize a checkpoint from the agent's reply, because the worker agent ran in an isolated runSubagent session whose tool calls were never written to the parent session's message ledger. The synthesis logic (run-unit.js:459–498) is ~40 lines of heuristics that exist only in path 4. Path 2's runAgentTurn has analogous reply-reconstruction logic. Path 1 has neither. There is no shared "extract structured output from a subagent result" utility.

Parent-ledger isolation

Paths 1–4 all run agents in isolated in-memory sessions (via SessionManager.inMemory()). None of their tool calls appear in the parent session's message history. This is correct, but it means every caller independently re-implements the message extraction pattern (extractFinalOutput in subagent-runner.ts, getFinalOutput in subagent/index.js, reply field on the DispatchResult in path 4). Inconsistencies in these extractors have caused output to be silently dropped when agents produced multi-block assistant messages.
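
One possible shape for a shared extractor is sketched below. The SessionMessage and ContentBlock types are assumptions made for the example; the real session message types may differ. The point it illustrates is joining every text block of the final assistant message instead of taking only the first.

```typescript
// Hypothetical message shape; the real session message types may differ.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string };

interface SessionMessage {
  role: "user" | "assistant";
  content: ContentBlock[];
}

/**
 * Return the text of the most recent assistant message that contains text,
 * joining all of its text blocks in order. Returns "" if there is none.
 */
function extractFinalOutput(messages: SessionMessage[]): string {
  for (const msg of [...messages].reverse()) {
    if (msg.role !== "assistant") continue;
    const texts: string[] = [];
    for (const block of msg.content) {
      if (block.type === "text") texts.push(block.text);
    }
    if (texts.length > 0) return texts.join("\n");
  }
  return "";
}
```

Whether blocks should be joined with a newline, and whether a tool-only final message should fall back to an earlier one, are choices the shared utility would need to settle once for all callers.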

Swarm singleton per process

SwarmDispatchLayer.getOrCreate maintains a module-level cache keyed by ${basePath}:${swarmName}. Paths 3 and 4 both call swarmDispatchAndWait, which calls SwarmDispatchLayer.getOrCreate. There is no mechanism to invalidate this cache, coordinate swarm topology reloads, or observe swarm state from outside paths 3 and 4. Diagnostic tooling has to reach into the internal cache directly.
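
As an illustration of the alternative, the sketch below keeps the same basePath:swarmName keying but holds the cache on an instance, with explicit invalidation and a read-only view for diagnostics. SwarmCache is a hypothetical name; it is not how SwarmDispatchLayer is implemented today.

```typescript
// Hypothetical sketch: a cache owned by an instance rather than by a module-level variable.
class SwarmCache<T> {
  private readonly entries = new Map<string, T>();

  private key(basePath: string, swarmName: string): string {
    return `${basePath}:${swarmName}`;
  }

  /** Return the cached entry for this key, creating it on first use. */
  getOrCreate(basePath: string, swarmName: string, create: () => T): T {
    const k = this.key(basePath, swarmName);
    let entry = this.entries.get(k);
    if (entry === undefined) {
      entry = create();
      this.entries.set(k, entry);
    }
    return entry;
  }

  /** Drop one entry so the next getOrCreate rebuilds it. Returns true if it existed. */
  invalidate(basePath: string, swarmName: string): boolean {
    return this.entries.delete(this.key(basePath, swarmName));
  }

  /** Read-only list of cached keys, for diagnostic tooling. */
  keys(): string[] {
    return [...this.entries.keys()];
  }
}
```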

1.3 The Cost: One Bug Class, Five Surfaces

Every systemic issue in agent dispatch — timeout inconsistency, cancellation leakage, event forwarding gaps, output extraction bugs, inbox refresh misses, checkpoint synthesis failure — surfaces independently in each of the five paths. Fixing path 1 does not fix path 3. The repair loop and the gate runner have no visibility into which path triggered a failure.

2. Proposed Shape

2.1 Three Orthogonal Axes

Agent dispatch has three independently varying dimensions that are currently conflated with the dispatch mechanism itself:

persistence:  "ephemeral" | "session" | "persistent"
isolation:    "in-process" | "worker-thread" | "subprocess"
routing:      "direct" | "broadcast" | "capability-match"

persistence describes the agent's state lifetime:

"ephemeral" — a fresh in-memory session per dispatch; no state carries over. This is the current runSubagent model.
"session" — the agent session persists across multiple prompts within one process lifetime (e.g., the interactive session's parent agent). Not managed by this runtime directly, but callers can hold a session reference.
"persistent" — the agent's identity, inbox, memory, and context blocks survive process restarts, stored in SQLite via PersistentAgent. This is the current swarm model.

isolation describes the execution boundary:

"in-process": the agent runs asynchronously on the same Node.js event loop as the caller, sharing the process's module cache and memory heap. Correct for the vast majority of cases.
"worker-thread": the agent runs in a Node.js Worker thread. Shares the process's lifespan but gets its own V8 heap, so a memory-intensive agent or a crash does not corrupt the parent. Slower to start than in-process; faster than subprocess.
"subprocess": the agent runs as a completely separate OS process. Required when the agent needs a different working directory (slice-parallel), a different environment, or complete crash isolation with no shared state.

routing describes how a target agent is selected:

"direct" — caller provides an explicit AgentSpec (name, system prompt, tools, model). No topology lookup.
"broadcast" — envelope is delivered to all registered agents matching a tag predicate.
"capability-match" — envelope is delivered to the agent whose registered capabilities best match the envelope's workMode and unitType. This is the current AgentSwarm.route() behavior.

2.2 `AgentRuntime` TypeScript Interface

// packages/coding-agent/src/core/agent-runtime.ts

import type { AgentSessionEvent } from "./agent-session.js";

// ── Agent specification ──────────────────────────────────────────────────────

/** Explicit agent description for "direct" routing. */
export interface AgentSpec {
  name: string;
  systemPrompt: string;
  model?: string;
  tools?: string[];
  cwd?: string;
}

/** Envelope for "capability-match" or "broadcast" routing. */
export interface DispatchEnvelope {
  unitId: string;
  unitType: string;
  workMode: string;
  payload: string;
  priority?: number;
  scope?: string;
  /** Override the matched agent's system prompt. */
  executorSystemPrompt?: string;
  /** Override the matched agent's tool filter. */
  executorTools?: string[];
}

// ── Dispatch options ─────────────────────────────────────────────────────────

export type PersistenceMode = "ephemeral" | "session" | "persistent";
export type IsolationMode   = "in-process" | "worker-thread" | "subprocess";
export type RoutingMode     = "direct" | "broadcast" | "capability-match";

export interface DispatchOptions {
  persistence?: PersistenceMode;
  isolation?:   IsolationMode;
  routing?:     RoutingMode;
  timeoutMs?:   number;
  signal?:      AbortSignal;
  onEvent?:     (event: AgentSessionEvent) => void;
}

// ── Results ──────────────────────────────────────────────────────────────────

export interface AgentDispatchResult {
  ok: boolean;
  output: string;
  stderr?: string;
  exitCode: number;
  /** For persistent agents: the message id the reply was written under. */
  replyMessageId?: string;
  /** For swarm routing: which agent handled the request. */
  targetAgent?: string;
}

// ── Registration ─────────────────────────────────────────────────────────────

/** Persistent-agent topology entry, used for capability-match routing. */
export interface RegisteredAgent {
  name: string;
  role: string;
  tags: string[];
  capabilities: string[];
  basePath: string;
}

// ── Runtime interface ────────────────────────────────────────────────────────

export interface AgentRuntime {
  /**
   * Dispatch a task to one agent and await its result.
   *
   * When routing is "direct", `target` must be an AgentSpec.
   * When routing is "capability-match", `target` must be a DispatchEnvelope.
   * When routing is "broadcast", `target` must be a DispatchEnvelope; returns
   * the first non-error result and cancels the rest.
   */
  dispatch(
    target: AgentSpec | DispatchEnvelope,
    message: string,
    options?: DispatchOptions,
  ): Promise<AgentDispatchResult>;

  /**
   * Dispatch multiple tasks concurrently and await all results.
   * The same options apply to all tasks in the batch.
   */
  dispatchBatch(
    tasks: Array<{ target: AgentSpec | DispatchEnvelope; message: string }>,
    options?: DispatchOptions,
  ): Promise<AgentDispatchResult[]>;

  /**
   * Subscribe to all events emitted by agents dispatched through this runtime.
   * Returns an unsubscribe function.
   */
  subscribe(callback: (event: AgentSessionEvent) => void): () => void;

  /**
   * Register an agent in the persistent topology so capability-match routing
   * can discover it.
   */
  registerAgent(agent: RegisteredAgent): void;

  /**
   * Return all registered agents. Used by diagnostic tooling and /sf swarm status.
   */
  getRegisteredAgents(): RegisteredAgent[];
}
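
The dispatch contract ties the type of target to the routing mode. A sketch of how an implementation might check that up front follows; isAgentSpec and checkTarget are hypothetical helpers, and the two interfaces are trimmed copies of the ones above so the example stands alone.

```typescript
// Trimmed copies of the proposal's types so the sketch is self-contained.
interface AgentSpec {
  name: string;
  systemPrompt: string;
}
interface DispatchEnvelope {
  unitId: string;
  unitType: string;
  workMode: string;
  payload: string;
}
type RoutingMode = "direct" | "broadcast" | "capability-match";

/** Hypothetical type guard: only AgentSpec carries a systemPrompt field. */
function isAgentSpec(target: AgentSpec | DispatchEnvelope): target is AgentSpec {
  return "systemPrompt" in target;
}

/** Return an error message when target and routing disagree, otherwise null. */
function checkTarget(
  target: AgentSpec | DispatchEnvelope,
  routing: RoutingMode = "direct",
): string | null {
  const spec = isAgentSpec(target);
  if (routing === "direct" && !spec) return 'routing "direct" requires an AgentSpec';
  if (routing !== "direct" && spec) return `routing "${routing}" requires a DispatchEnvelope`;
  return null;
}
```

Under the never-throw convention proposed later in this document, a mismatch would most naturally surface as a failed AgentDispatchResult whose stderr carries this message.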

2.3 Default Option Values

When options are omitted, the runtime applies the following defaults:

Field	Default	Rationale
`persistence`	`"ephemeral"`	Matches current `runSubagent` behavior; safe default for all existing callers
`isolation`	`"in-process"`	Fastest, fewest moving parts; subprocess spawning only when explicitly requested
`routing`	`"direct"`	Existing callers always have an explicit agent spec; capability-match is opt-in
`timeoutMs`	`480_000` (8 min)	Matches current `dispatchAndWait` default
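
The defaults in the table could be applied with a small resolver like the one below. This is a sketch: resolveOptions and DEFAULTS are hypothetical names, and signal and onEvent are left out because they have no default.

```typescript
type PersistenceMode = "ephemeral" | "session" | "persistent";
type IsolationMode = "in-process" | "worker-thread" | "subprocess";
type RoutingMode = "direct" | "broadcast" | "capability-match";

interface DispatchOptions {
  persistence?: PersistenceMode;
  isolation?: IsolationMode;
  routing?: RoutingMode;
  timeoutMs?: number;
}

// Values taken from the defaults table above.
const DEFAULTS: Required<DispatchOptions> = {
  persistence: "ephemeral",
  isolation: "in-process",
  routing: "direct",
  timeoutMs: 480_000, // 8 minutes
};

/** Fill in every omitted (or undefined) field with its default. */
function resolveOptions(options: DispatchOptions = {}): Required<DispatchOptions> {
  return {
    persistence: options.persistence ?? DEFAULTS.persistence,
    isolation: options.isolation ?? DEFAULTS.isolation,
    routing: options.routing ?? DEFAULTS.routing,
    timeoutMs: options.timeoutMs ?? DEFAULTS.timeoutMs,
  };
}
```

Using `??` per field, rather than an object spread, means a caller that passes an explicit undefined still gets the default.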

2.4 Existing Dispatch Sites Mapped to Runtime Options

Site	`persistence`	`isolation`	`routing`	Notes
`defaultAgentRunner` (headless-triage.ts)	`"ephemeral"`	`"in-process"`	`"direct"`	Triage agents are stateless; direct spec with prompt-parts composition
`runHeadlessPrompt` / `runAgentTurn` (agent-runner.js)	`"persistent"`	`"in-process"`	`"capability-match"`	PersistentAgent inbox drain; runtime owns the bus write-back
`runSingleAgent` (subagent/index.js)	`"ephemeral"`	`"in-process"`	`"direct"`	Interactive subagents; no state retention across calls
`runUnitViaSwarm` (run-unit.js)	`"persistent"`	`"in-process"`	`"capability-match"`	Autonomous unit dispatch; runtime owns checkpoint synthesis
`startSliceParallel` (slice-parallel-orchestrator.js)	`"ephemeral"`	`"subprocess"`	`"direct"`	Each worker needs its own cwd and git worktree; subprocess is load-bearing

2.5 What Stays the Same vs. What Changes

Stays the same:

runSubagent in packages/coding-agent/src/core/subagent-runner.ts — the runtime uses it as the in-process execution engine. Its interface and behavior are unchanged.
swarmDispatchAndWait / SwarmDispatchLayer — the runtime uses these as the persistent+capability-match backend. They remain as implementation details, not public API for dispatch callers.
AgentSwarm, PersistentAgent, MessageBus — unchanged. The runtime coordinates through them, not around them.
The DispatchEnvelope shape — the runtime accepts it as-is for capability-match routing.
Checkpoint tool protocol — unchanged. The runtime collects events and synthesizes checkpoints exactly as runUnitViaSwarm does today, but in a shared utility.

Changes for callers:

Callers import createAgentRuntime() instead of importing runSubagent or swarmDispatchAndWait directly.
Error representation becomes uniform: AgentDispatchResult.ok is always a boolean; stderr carries the reason; exitCode follows the existing convention (0/1/124).
Event subscription uses runtime.subscribe() rather than threading an onEvent callback through every call.
The swarm singleton lifecycle is owned by the runtime instance, not by the SwarmDispatchLayer module-level cache.

3. Isolation Tiers

3.1 In-Process (`"in-process"`)

The default tier. The agent session is created via createAgentSession with SessionManager.inMemory(), same as runSubagent today. Startup cost is ~50 ms (resource loader + session init). Memory is shared with the parent but the session state is isolated to the subagent.

Right for: triage agents, interactive subagents, swarm workers executing unit tasks, solver pass.

Wrong for: agents that crash unpredictably (they take down the parent process), agents that need a different cwd per call (the working directory is process-wide), and agents that run concurrently with a heavy memory footprint.

3.2 Worker Thread (`"worker-thread"`)

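Node.js worker_threads.Worker gives each agent its own V8 heap and a clean module scope while staying inside the parent process, so there is no OS process fork and startup is cheaper than a subprocess. Communication is via MessageChannel. Worker threads share the parent's working directory: process.chdir() is not available inside a Worker, so a worker cannot be given its own cwd. A path passed through workerData is only a value that the agent code must apply to each filesystem call itself.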

Right for:

Agents that might OOM (large context windows, image processing).
Agents loaded from untrusted extension paths where a crash must not kill the parent.
Concurrent agent execution where memory isolation matters but subprocess overhead is unacceptable.
Proving the isolation boundary in tests: a worker-thread agent can be killed cleanly without tearing down the test process.

Wrong for: agents that must share module-level singletons with the parent (e.g., the interactive session's AgentSwarm singleton). Worker threads cannot share JS objects across the MessageChannel boundary — only structured-clone-serializable data.

Implementation sketch:

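// In agent-runtime.ts, isolation: "worker-thread" branch:
const { Worker, MessageChannel } = await import("node:worker_threads");
const { port1, port2 } = new MessageChannel();
// workerData is structured-cloned, and functions cannot be cloned, so pass only
// plain data. onEvent and signal stay on this side and are bridged over port1.
const workerOptions = { timeoutMs: options?.timeoutMs };
const worker = new Worker(
  new URL("./agent-runtime-worker.js", import.meta.url),
  { workerData: { spec, message, options: workerOptions, port: port2 },
    transferList: [port2] },
);
// agent-runtime-worker.js calls runSubagent and posts events and the final result
// over the transferred port; this thread receives them on port1.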

The runtime thread drives the worker via port1 events; the worker runs runSubagent in its own thread.

3.3 Subprocess (`"subprocess"`)

child_process.spawn of sf --mode json --print "/autonomous" (or a future sf headless agent-turn) with the caller-supplied cwd, environment, and lock variables. This is the current slice-parallel-orchestrator.js approach, which is correct and must be preserved.

Right for: slice-parallel execution where each worker needs process.cwd() anchored to a different git worktree. The working directory is not just a parameter — it determines which git index, which .sf/sf.db, and which lock context the agent operates in. In-process and worker-thread isolation cannot replicate this.

Wrong for: anything that doesn't need a distinct filesystem root. Subprocess startup takes 2–5 seconds (full Node.js + SF init), not 50 ms.

Note on the nesting guard (SF_PARALLEL_WORKER): The slice orchestrator already guards against nested parallel workers via process.env.SF_PARALLEL_WORKER. The "subprocess" isolation tier will preserve this guard.

3.4 Decision Tree

Does the agent need a different git worktree / cwd?
  YES → "subprocess"

Could a crash or OOM kill the parent?
  YES (untrusted extension, large context) → "worker-thread"

Default → "in-process"
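
The decision tree above translates directly into a small function. The IsolationNeeds shape and chooseIsolation name are hypothetical; the ordering of the checks follows the tree.

```typescript
type IsolationMode = "in-process" | "worker-thread" | "subprocess";

// Hypothetical input describing what the caller knows about the agent.
interface IsolationNeeds {
  needsOwnCwd: boolean;        // different git worktree or working directory
  crashMayKillParent: boolean; // e.g. untrusted extension, large context
}

/** Pick the cheapest isolation tier that satisfies the stated needs. */
function chooseIsolation(needs: IsolationNeeds): IsolationMode {
  if (needs.needsOwnCwd) return "subprocess";
  if (needs.crashMayKillParent) return "worker-thread";
  return "in-process";
}
```

The cwd check comes first because only a subprocess can satisfy it, so it overrides the crash-isolation question.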

4. Migration Plan

Each step is self-contained: at the end of the step, the named dispatch site works correctly through the runtime and the old code path is removable. Steps are ordered by blast radius (smallest first).

Step 1: Define `AgentRuntime` in `packages/coding-agent/src/core/agent-runtime.ts`

What: Create the interface (Section 2.2), create a DefaultAgentRuntime class that wraps runSubagent (direct, in-process, ephemeral path) and SwarmDispatchLayer.dispatchAndWait (capability-match, in-process, persistent path). No existing call sites are changed.

Effort: ~1 day. The two existing engines are already well-factored; this is wiring.

Blast radius: Zero. The file is new; no imports change.

What proves it works:

Unit test: runtime.dispatch(spec, "hello") → result.ok === true, result.output is non-empty.
Unit test: runtime.dispatch(envelope, "hello", { routing: "capability-match" }) → result.targetAgent is set.
Unit test: runtime.dispatchBatch([...]) → array of results, length matches input.

Landmines: None at this step. The class is additive.

Step 2: Migrate `defaultAgentRunner` (headless-triage.ts)

What: Replace the inline runSubagent call at headless-triage.ts:~363 with runtime.dispatch(spec, task, { persistence: "ephemeral", isolation: "in-process", routing: "direct" }). The runtime instance is created at the top of handleTriageRun (or injected via HandleTriageOptions.agentRunner which already exists as an escape hatch).

Effort: ~2 hours.

Blast radius: Smallest of all sites. headless-triage.ts is called only from sf headless triage, not from the autonomous loop. A regression here affects only the triage command, which is manually invocable.

What proves it works:

Existing triage integration test: runTriageApply produces a valid ParseTriagePlanResult.
The agentRunner injection point in HandleTriageOptions remains so tests can still inject a mock runner.

Landmines:

defaultAgentRunner is also passed as agentRunner to runTriageApply by some callers; the new runtime must expose the same (agent, task, options) => Promise<AgentRunResult> call signature (or adapt the RunTriageApplyResult type). Consider keeping the AgentRunner function type as a thin shim over runtime.dispatch during transition.
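
A transition shim along those lines might look like the sketch below. The AgentRunResult fields shown here are an assumption (the real type in headless-triage.ts may differ), and RuntimeLike is a trimmed stand-in for AgentRuntime.

```typescript
interface AgentSpec {
  name: string;
  systemPrompt: string;
}
interface AgentDispatchResult {
  ok: boolean;
  output: string;
  stderr?: string;
  exitCode: number;
}
interface RunnerOptions {
  timeoutMs?: number;
}
// Trimmed stand-in for the proposed AgentRuntime interface.
interface RuntimeLike {
  dispatch(target: AgentSpec, message: string, options?: RunnerOptions): Promise<AgentDispatchResult>;
}
// Assumed result shape; the real AgentRunResult may have different fields.
interface AgentRunResult {
  ok: boolean;
  output: string;
  error?: string;
}
type AgentRunner = (agent: AgentSpec, task: string, options?: RunnerOptions) => Promise<AgentRunResult>;

/** Adapt a runtime to the existing (agent, task, options) runner signature. */
function makeAgentRunner(runtime: RuntimeLike): AgentRunner {
  return async (agent, task, options) => {
    const result = await runtime.dispatch(agent, task, options);
    if (result.ok) return { ok: true, output: result.output };
    const error = result.stderr ?? `exit code ${result.exitCode}`;
    return { ok: false, output: result.output, error };
  };
}
```

With a shim like this, tests that inject a mock through HandleTriageOptions.agentRunner keep working unchanged while the default runner moves onto the runtime.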

Step 3: Migrate `runUnitViaSwarm` (run-unit.js)

What: runUnitViaSwarm at run-unit.js:~210 already calls swarmDispatchAndWait. Replace it with runtime.dispatch(envelope, prompt, { persistence: "persistent", isolation: "in-process", routing: "capability-match", timeoutMs, onEvent }). The event collector (onEvent) and checkpoint synthesis stay in run-unit.js — they are unit-orchestration concerns, not runtime concerns.

Effort: ~2 hours.

Blast radius: Low. This path is only active when SF_AUTONOMOUS_VIA_SWARM=1. The default autonomous path (pi.sendMessage) is unaffected.

What proves it works:

The existing auto-dispatch-canonical-plan.test.mjs (currently modified, per git status) should pass.
result.targetAgent and result.replyMessageId remain populated so run-unit.js's outcome extraction works.

Landmines:

runUnitViaSwarm reads envelope.executorSystemPrompt and envelope.executorTools and passes them through dispatchAndWait → runAgentTurn → runHeadlessPrompt. The runtime must either thread these through the DispatchEnvelope unchanged (preferred — the envelope is already the contract) or expose them as explicit dispatch options.

Step 4: Migrate `runSingleAgent` (subagent/index.js)

What: runSingleAgent at subagent/index.js:~1266 calls runSubagent directly. runSingleAgentViaSwarm at ~1140 calls swarmDispatchAndWait. The direct variant becomes runtime.dispatch(spec, task, { persistence: "ephemeral", isolation: "in-process", routing: "direct" }). The swarm variant becomes the same call with a DispatchEnvelope as the target (as capability-match routing requires), routing: "capability-match", and persistence: "persistent".

The SF_SUBAGENT_VIA_SWARM flag becomes a runtime option instead of a code branch: callers that want swarm routing pass { routing: "capability-match" }.

Effort: ~4 hours. This is the most complex call site because of runSingleAgentInCmuxSplit, CMUX-split coordination, and the liveSubagentControllers abort set.

Blast radius: Medium. This path handles all interactive /delegate and /rubber-duck commands. A regression is user-visible immediately.

What proves it works:

Interactive test: /delegate rubber-duck "explain X" in an interactive SF session produces output.
Unit test: runSingleAgent with an unknown agent name returns exitCode: 1 with the correct stderr message.
liveSubagentControllers must still be populated on dispatch so abortAllSubagents() works.

Landmines:

processSubagentEventLine processes raw event JSON and updates currentResult in place. The runtime's onEvent callback receives AgentSessionEvent objects (not raw JSON strings). The event processing logic must be updated to accept typed events, or the shim must re-serialize them.
CMUX split creates two concurrent runSingleAgent calls. The runtime must be safe for concurrent dispatch from the same instance.
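
The re-serializing shim mentioned in the first landmine is small. In the sketch below, AgentSessionEventLike and LineProcessor are hypothetical stand-ins for the real AgentSessionEvent type and the processSubagentEventLine signature.

```typescript
// Hypothetical stand-ins for the real event type and line processor signature.
interface AgentSessionEventLike {
  type: string;
  [key: string]: unknown;
}
type LineProcessor = (line: string) => void;

/** Wrap a line-oriented processor so it can be passed as a typed onEvent callback. */
function asOnEvent(processLine: LineProcessor): (event: AgentSessionEventLike) => void {
  return (event) => processLine(JSON.stringify(event));
}
```

This costs one stringify and one parse per event, so it suits the transition period; updating the processor to accept typed events removes that overhead.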

Step 5: Migrate `runAgentTurn` / `runHeadlessPrompt` (agent-runner.js)

What: runHeadlessPrompt at agent-runner.js:~73 calls runSubagent. Replace with runtime.dispatch(spec, prompt, { persistence: "ephemeral", isolation: "in-process", routing: "direct" }). The runAgentTurn orchestration (inbox read, context assembly, bus write-back) stays in agent-runner.js — those are PersistentAgent lifecycle concerns.

Effort: ~3 hours.

Blast radius: Medium-high. This path drives the swarm-agent LLM turns for all dispatchAndWait calls (paths 3 and 4 after Steps 3 and 4 complete). A regression here silently produces empty swarm replies.

What proves it works:

runAgentTurn with a mock agent that has one unread inbox message → turnsProcessed === 1, response is non-empty, bus contains a reply message.
onlyMessageId path still forces inbox refresh.

Landmines:

systemPromptOverride and toolsOverride are currently threaded through opts all the way to runSubagent. The runtime must expose these either via AgentSpec fields (where they belong) or via dispatch options. Prefer AgentSpec — the spec is the per-call agent definition.

Step 6: Add `worker-thread` Isolation Option

What: Implement the "worker-thread" isolation path in DefaultAgentRuntime.dispatch. Create packages/coding-agent/src/core/agent-runtime-worker.ts as the worker entry point. It imports runSubagent and posts results back via MessageChannel.

Effort: ~1 day. Worker thread communication is straightforward for structured-clone-safe data; AgentDispatchResult is trivially serializable.

Blast radius: Zero to existing callers. No existing site uses isolation: "worker-thread". This step is purely additive.

What proves it works:

Synthetic test: dispatch with isolation: "worker-thread" → result.ok === true, result matches an equivalent in-process dispatch.
Crash-isolation test: worker that throws after returning a partial result → parent process continues, result.ok === false, result.stderr contains the error message.
Memory-isolation test: worker allocates a large buffer → parent heap is unaffected.

Landmines:

AgentSessionEvent objects may contain non-serializable fields (functions, class instances). The worker must emit only the serializable parts via postMessage. Define an AgentSessionEventSerializable subset if needed.
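getAgentDir() reads process.env. By default a worker thread receives a copy of the parent's process.env taken when the Worker is created, so later changes in the parent are not visible to it. Verify that the worker correctly resolves the agent directory.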
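
One way to derive the serializable subset is a recursive copy that drops functions and symbols before postMessage. The sketch below is an assumption about how that could be done, not an existing SF helper; it assumes events are acyclic and flattens class instances to their own enumerable fields.

```typescript
/**
 * Recursively copy plain data, dropping functions and symbols.
 * Assumes the input is acyclic. Class instances become plain objects
 * holding only their own enumerable fields.
 */
function toSerializable(value: unknown): unknown {
  if (typeof value === "function" || typeof value === "symbol") return undefined;
  if (value === null || typeof value !== "object") return value;
  if (value instanceof Date) return new Date(value.getTime());
  if (Array.isArray(value)) {
    // Keep array positions stable: dropped elements become null, as in JSON.
    return value.map((item) => toSerializable(item) ?? null);
  }
  const out: Record<string, unknown> = {};
  for (const [key, item] of Object.entries(value)) {
    const copy = toSerializable(item);
    if (copy !== undefined) out[key] = copy;
  }
  return out;
}
```

A worker would call this on each event just before posting it, so a callback attached to an event cannot cause a clone error at the thread boundary.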

Step 7: Migrate Slice-Parallel to Runtime with `isolation: "subprocess"`

What: Replace the raw child_process.spawn in slice-parallel-orchestrator.js:~131 (spawnSliceWorker) with runtime.dispatch(spec, "/autonomous", { isolation: "subprocess", persistence: "ephemeral", routing: "direct", ... }) where spec.cwd is the slice's worktree path.

The subprocess isolation tier wraps child_process.spawn of sf --mode json --print "/autonomous" with the same env vars (SF_SLICE_LOCK, SF_MILESTONE_LOCK, SF_PARALLEL_WORKER=1) and worktree cwd.

Effort: ~1 day. The subprocess mechanics are already correct in spawnSliceWorker; this step is wrapping them in the runtime interface so cancellation, budget tracking, and event forwarding use the same path as other tiers.

Blast radius: Medium. Slice-parallel is an infrequently exercised path (requires two or more conflict-free slices and the parallel flag). A regression is visible in parallel autonomous runs.

What proves it works:

startSliceParallel integration test: spawns two workers, both reach a completed state, orchestrator state reflects both.
SF_PARALLEL_WORKER guard still fires: calling runtime.dispatch(... isolation: "subprocess") from within a subprocess worker returns an error rather than nesting.

Landmines:

sliceState.workers currently stores raw ChildProcess references for stopSliceParallel to call .kill() on. The runtime must expose an abort mechanism (the existing AbortController / AbortSignal pattern) so stopSliceParallel can cancel via signal rather than raw pid management.
Budget tracking (worker.cost) currently requires parsing NDJSON from the subprocess stdout. The runtime should expose a cost-update event so the orchestrator can update sliceState.totalCost without parsing raw output.
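
Inside the subprocess tier, the NDJSON parsing would still exist but live in one place. The sketch below assumes a hypothetical event shape of { type: "cost_update", cost: number } carrying incremental cost; the actual schema emitted by sf --mode json is not specified in this document.

```typescript
// Assumed event shape: one JSON object per line, { type: "cost_update", cost: number },
// where cost is an increment. The real NDJSON schema may differ.
interface CostTracker {
  totalCost: number;
}

/**
 * Consume complete lines from a stdout buffer and add any cost updates to the tracker.
 * Returns the trailing partial line, to be prepended to the next chunk.
 */
function consumeNdjson(buffer: string, tracker: CostTracker): string {
  const lines = buffer.split("\n");
  const tail = lines.pop() ?? "";
  for (const line of lines) {
    if (line.trim() === "") continue;
    try {
      const event: unknown = JSON.parse(line);
      if (typeof event === "object" && event !== null) {
        const e = event as { type?: unknown; cost?: unknown };
        if (e.type === "cost_update" && typeof e.cost === "number") {
          tracker.totalCost += e.cost;
        }
      }
    } catch {
      // Ignore lines that are not valid JSON (for example, stray log output).
    }
  }
  return tail;
}
```

The runtime could then surface each parsed update through onEvent, so the orchestrator updates sliceState.totalCost without touching raw stdout.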

Step 8: Decommission Old Dispatch Sites

What: After Steps 2–7 are verified in production, remove:

The direct runSubagent import from headless-triage.ts (Step 2).
The direct swarmDispatchAndWait import from run-unit.js (Step 3).
The direct runSubagent and swarmDispatchAndWait imports from subagent/index.js (Step 4).
runHeadlessPrompt in agent-runner.js (Step 5).
The SF_SUBAGENT_VIA_SWARM and SF_AUTONOMOUS_VIA_SWARM feature flags (both paths now use the runtime with the appropriate options).
spawnSliceWorker raw spawn logic (Step 7).

Effort: ~2 hours (deleting code and updating imports).

Blast radius: Zero at this point — the sites have already been migrated and tests are green.

Landmines:

SF_SUBAGENT_VIA_SWARM and SF_AUTONOMOUS_VIA_SWARM may be set in operator configs or CI environments. Add a deprecation log before removing them. The new routing behavior should be configurable via AgentRuntime options set at construction time, not via environment variable branches.
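
A deprecation check for the two flags could be as small as the sketch below. The function name and the warning wording are illustrative only.

```typescript
const DEPRECATED_FLAGS = ["SF_SUBAGENT_VIA_SWARM", "SF_AUTONOMOUS_VIA_SWARM"] as const;

/** Return one warning per deprecated flag that is present in the environment. */
function deprecatedFlagWarnings(env: Record<string, string | undefined>): string[] {
  return DEPRECATED_FLAGS.filter((flag) => env[flag] !== undefined).map(
    (flag) => `${flag} is deprecated; configure routing through AgentRuntime options instead.`,
  );
}
```

At startup the caller would pass process.env and log each returned string once, for at least one release before the flags are removed.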

Migration Summary

Step	Description	Effort	Blast Radius
1	Define `AgentRuntime` interface + `DefaultAgentRuntime`	1 day	Zero
2	Migrate `defaultAgentRunner` (headless-triage)	2 h	Lowest
3	Migrate `runUnitViaSwarm` (run-unit)	2 h	Low
4	Migrate `runSingleAgent` (subagent/index)	4 h	Medium
5	Migrate `runAgentTurn` / `runHeadlessPrompt` (agent-runner)	3 h	Medium-high
6	Add `worker-thread` isolation tier	1 day	Zero (additive)
7	Migrate slice-parallel to `isolation: "subprocess"`	1 day	Medium
8	Decommission old dispatch sites	2 h	Zero (deletions)

Total: 8 steps, ~4–5 days of focused work.

5. Non-Goals

5.1 `AgentRuntime` Is Not a Model Router

Model selection for a given unit type is the job of model-router.js. The runtime passes the caller's AgentSpec.model field through to runSubagent's SubagentConfig.model field unchanged. If the caller does not specify a model, the runtime uses the session default, exactly as runSubagent does today.

The runtime does not invoke computeTaskRequirements, selectModelForUnit, or the Bayesian blender. Model routing is a concern of the dispatch caller, not the dispatch mechanism.

5.2 `AgentRuntime` Is Not a Tool Registry

Tool availability is determined by the resource loader and session.setActiveToolsByName. The runtime passes AgentSpec.tools through to SubagentConfig.tools unchanged. It does not discover, load, or manage tools.

5.3 `AgentRuntime` Is Not a Scheduler

The autonomous orchestrator (auto/loop.js) owns the decision of when to dispatch units, in what order, and with what retry policy. The runtime provides the mechanism of dispatch: given a target and a message, execute and return a result. Scheduling, gate running, phase management, and repair logic remain in the orchestrator.

5.4 `AgentRuntime` Does Not Replace `PersistentAgent` Lifecycle

PersistentAgent, AgentSwarm, and MessageBus manage the persistent agent topology. The runtime uses these structures; it does not replace them. Agents are still registered via AgentSwarm.register (and exposed through runtime.registerAgent as a thin delegation). The runtime is a dispatch coordinator, not a topology manager.

5.5 `AgentRuntime` Does Not Wrap the Interactive Session

The user's interactive session (the parent AgentSession) is not managed by AgentRuntime. The runtime is for dispatching subordinate agents, not for managing the top-level conversation. pi.sendMessage in runUnit's non-swarm path is out of scope.

6. Open Questions

6.1 Should `persistence: "session"` Be a Separate Axis or Coupled to Agent Identity?

The current proposal defines "session" as a persistence mode meaning "the session persists within a single process lifetime but is not written to disk." This is distinct from both "ephemeral" (fresh session per dispatch) and "persistent" (SQLite-backed PersistentAgent).

The question is whether "session persistence" is a property of the dispatch call (as proposed) or a property of the agent's registered identity (i.e., only agents explicitly registered as session-persistent can be dispatched with persistence: "session").

Arguments for coupling to identity: An agent that accumulates session context needs to be found again across calls — which requires registration. An anonymous AgentSpec with persistence: "session" would have nowhere to store its context.

Arguments for keeping on the call: Session persistence is currently implicit in the parent session (which runUnit's non-swarm path uses). Making it an explicit dispatch option would let the runtime manage session caching keyed by AgentSpec.name, which is cleaner than the current implicit coupling.

Proposed resolution for v1: Defer persistence: "session" from the initial implementation. Ship "ephemeral" and "persistent" only. Session-persistent agents can be revisited when the interactive-session parent hand-off is better understood.

6.2 How Does `routing: "capability-match"` Interact with Capability-Based Agent Selection?

swarm-roles.js defines agent roles (coordinator, worker, scout, reviewer, planner, verifier, scribe, adversary) and tags (role:X, tier:Y, workMode:X). AgentSwarm.route(envelope) selects the target by matching envelope.workMode against agent tags.

The AgentRuntime.registerAgent API proposed in Section 2.2 accepts a capabilities: string[] field. This is forward-looking: future agents may declare fine-grained capabilities ("can-write-tests", "can-run-shell") beyond workMode tags. The routing algorithm would then do a capability intersection rather than a tag prefix match.

Question: Should registerAgent's capabilities replace the existing tag-based routing in AgentSwarm.route, or run in parallel with it?

Proposed resolution: In v1, registerAgent delegates to AgentSwarm.register and stores capabilities as additional tags. The routing algorithm is unchanged. A later step (after the Step 8 decommission) can replace tag-prefix matching with a proper capability-intersection scorer.
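The v1 resolution above can be sketched as follows. This is a minimal illustration, not the actual API: the `SwarmLike` shape, the `AgentRegistration` fields, and the `cap:` tag prefix are all assumptions made for the example, since the proposal names `AgentSwarm.register` and `registerAgent` but does not fix their signatures here.

```typescript
// Hypothetical shapes for illustration only.
interface AgentRegistration {
  name: string;
  tags: string[]; // existing tags, e.g. "role:worker", "workMode:build"
  capabilities?: string[]; // forward-looking, e.g. "can-write-tests"
}

// Stand-in for the part of AgentSwarm that registerAgent delegates to.
interface SwarmLike {
  register(name: string, tags: string[]): void;
}

// Assumed convention: prefix capabilities with "cap:" so they cannot collide
// with role:/tier:/workMode: tags that the existing router already matches on.
function capabilitiesToTags(capabilities: string[] = []): string[] {
  return capabilities.map((c) => `cap:${c}`);
}

// Thin delegation: merge capability tags into the tag list, de-duplicate,
// and hand everything to the swarm. Routing itself is untouched.
function registerAgent(swarm: SwarmLike, reg: AgentRegistration): string[] {
  const merged = Array.from(
    new Set(reg.tags.concat(capabilitiesToTags(reg.capabilities))),
  );
  swarm.register(reg.name, merged);
  return merged;
}
```

Because capabilities land in the same tag list, a later capability-intersection scorer could read them back without a schema change.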

6.3 Should Errors Propagate or Be Returned as Results?

Current dispatch sites are inconsistent:

- runSubagent never throws; it returns { ok: false, exitCode, stderr }.
- swarmDispatchAndWait can throw (if the underlying _busDispatch throws a routing error) or return { reply: null, error: string }.
- runAgentTurn can return { error: string } or throw from runHeadlessPrompt.
- startSliceParallel returns { started, errors } and never throws.

The proposed AgentDispatchResult follows runSubagent's model: never throw, always return. All error conditions produce { ok: false, exitCode: 1, stderr: reason }.

Rationale: The runtime is called inside long-running orchestrator loops where an unhandled rejection would kill the entire autonomous run. Return-as-result forces callers to inspect the outcome rather than assuming success.

Exception: Construction-time errors (e.g., DefaultAgentRuntime cannot load the swarm because .sf/sf.db is corrupt) may throw, because they indicate a broken environment that the loop cannot recover from.

Open sub-question: Should dispatchBatch return a PromiseSettledResult[]-style array (with per-item status: "fulfilled" | "rejected") or AgentDispatchResult[] (where per-item errors are represented as ok: false)? The current proposal uses AgentDispatchResult[] for uniformity; PromiseSettledResult leaks the promise layer into the runtime contract and is harder to test.
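A minimal sketch of the return-as-result model and the uniform `AgentDispatchResult[]` batch shape discussed above. The `Engine` type and the `output` field are assumptions for illustration; only the `{ ok: false, exitCode: 1, stderr: reason }` failure shape comes from this section.

```typescript
// Failure shape follows Section 6.3; `output` is an assumed success field.
interface AgentDispatchResult {
  ok: boolean;
  exitCode: number;
  stderr: string;
  output?: string;
}

// Hypothetical stand-in for whatever actually runs the agent and may throw.
type Engine = (prompt: string) => Promise<AgentDispatchResult>;

// Converts any thrown error into an ok: false result, so a caller inside a
// long-running loop never sees a rejection from a single dispatch.
async function safeDispatch(
  engine: Engine,
  prompt: string,
): Promise<AgentDispatchResult> {
  try {
    return await engine(prompt);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, exitCode: 1, stderr: reason };
  }
}

// Per-item failures are ordinary results, so Promise.all cannot reject here
// and the caller gets one entry per prompt, in input order.
function dispatchBatch(
  engine: Engine,
  prompts: string[],
): Promise<AgentDispatchResult[]> {
  return Promise.all(prompts.map((p) => safeDispatch(engine, p)));
}
```

With this shape a test can assert on `results[i].ok` directly, without unwrapping a `status: "fulfilled" | "rejected"` layer first.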

6.4 Should `onEvent` Be a First-Class Subscription or a Per-Call Callback?

Currently onEvent is threaded as a callback through every level of the call stack: runSubagent → dispatchAndWait → runAgentTurn → runHeadlessPrompt → runSubagent options. This makes event forwarding brittle — if any layer forgets to forward it, events are silently dropped.

The proposed runtime.subscribe(callback) is a first-class subscription on the runtime instance. Any dispatch call on that runtime instance automatically forwards events to all subscribers. Per-call onEvent remains available in DispatchOptions for callers that want call-scoped filtering (e.g., runUnitViaSwarm's checkpoint-detection collector).

Open question: If multiple dispatch calls are in flight concurrently, should the subscriber receive events tagged with the originating unitId / targetAgent? Without tagging, a subscriber cannot distinguish which of two concurrent dispatches emitted a given toolcall_end. The AgentSessionEvent does not currently carry a correlation id.

Proposed resolution: Add a correlationId field to the events emitted by DefaultAgentRuntime.dispatch. The correlation id is the unitId from the envelope or a generated UUID for direct-spec dispatches. Existing subscribers that don't care about correlation ignore the field.
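The subscription-plus-correlation idea could look roughly like the sketch below. `EventHub`, `emitterFor`, and the `nextId` fallback are hypothetical names introduced for the example; the real `AgentSessionEvent` has more fields, and a real implementation would presumably generate a UUID rather than use a counter.

```typescript
// Reduced event shape for illustration; correlationId is the proposed field.
interface AgentSessionEvent {
  type: string;
  correlationId?: string;
}

type Subscriber = (event: AgentSessionEvent) => void;

class EventHub {
  private subscribers = new Set<Subscriber>();
  private counter = 0;

  // First-class subscription; returns an unsubscribe function.
  subscribe(cb: Subscriber): () => void {
    this.subscribers.add(cb);
    return () => {
      this.subscribers.delete(cb);
    };
  }

  // Counter-based stand-in for a generated UUID (assumption for this sketch).
  private nextId(): string {
    this.counter += 1;
    return `dispatch-${this.counter}`;
  }

  // Builds the emitter for one dispatch call. Every event is tagged with the
  // unitId (or a generated id) and fanned out to the optional per-call
  // onEvent callback and to all runtime-level subscribers.
  emitterFor(unitId?: string, onEvent?: Subscriber): Subscriber {
    const correlationId = unitId ?? this.nextId();
    return (event) => {
      const tagged: AgentSessionEvent = { ...event, correlationId };
      if (onEvent) onEvent(tagged);
      this.subscribers.forEach((s) => s(tagged));
    };
  }
}
```

Two concurrent dispatches then emit through two different emitters, so a subscriber can tell their `toolcall_end` events apart by `correlationId` alone.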

7. Relationship to Existing ADRs

ADR-0000 (Purpose-to-Software Compiler)

The AgentRuntime is an implementation detail of step 5 of the compiler pipeline ("generate milestone, slice, task, and artifact contracts from structured state"). It does not change what agents do; it standardizes how they are dispatched. The compiler's product contract — PDD fields, run-control policy, gate runners — is unchanged.

ADR-0075 (UOK Gate Architecture)

Gate runners currently call runAgentTurn indirectly through swarmDispatchAndWait. After migration, they will call runtime.dispatch with persistence: "persistent" and routing: "capability-match". The gate contract (execute(ctx, attempt) → GateResult) is unchanged; only the internal dispatch mechanism changes.

ADR-0079 (Solver / Executor Separation)

The solver pass proposed in ADR-0079 is exactly runtime.dispatch(solverSpec, executorTranscript, { persistence: "ephemeral", isolation: "in-process", routing: "direct" }). The solver model selection (resolveSolverModel) remains in model-router.js; the runtime receives the resolved model via AgentSpec.model. The two-pass architecture (executor dispatch → solver dispatch) maps cleanly onto runtime.dispatch called twice from runUnit.
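As a rough sketch of that two-pass shape, under the assumption that `dispatch` takes `(spec, input, options)` as in the call quoted above: `RuntimeLike`, `runTwoPass`, and the `output` field are hypothetical, and the executor's options are passed in rather than fixed because this section only specifies the solver's.

```typescript
// Reduced, assumed shapes; the full interface is defined in Section 2.2.
interface AgentSpec {
  name: string;
  model?: string; // resolved elsewhere (e.g. by the model router)
}

interface DispatchOptions {
  persistence: "ephemeral" | "persistent";
  isolation: "in-process" | "worker-thread" | "subprocess";
  routing: "direct" | "capability-match";
}

interface AgentDispatchResult {
  ok: boolean;
  exitCode: number;
  stderr: string;
  output?: string; // assumed to carry the transcript for this sketch
}

interface RuntimeLike {
  dispatch(
    spec: AgentSpec,
    input: string,
    options: DispatchOptions,
  ): Promise<AgentDispatchResult>;
}

// Solver options as stated in the ADR-0079 mapping above.
const SOLVER_OPTIONS: DispatchOptions = {
  persistence: "ephemeral",
  isolation: "in-process",
  routing: "direct",
};

// Executor pass, then solver pass over the executor's transcript.
// A failed executor pass short-circuits: there is nothing to solve over.
async function runTwoPass(
  runtime: RuntimeLike,
  executorSpec: AgentSpec,
  executorOptions: DispatchOptions,
  solverSpec: AgentSpec,
  task: string,
): Promise<AgentDispatchResult> {
  const executed = await runtime.dispatch(executorSpec, task, executorOptions);
  if (!executed.ok) return executed;
  return runtime.dispatch(solverSpec, executed.output ?? "", SOLVER_OPTIONS);
}
```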

Appendix A: File Locations After Migration

New File	Purpose
`packages/coding-agent/src/core/agent-runtime.ts`	`AgentRuntime` interface + `DefaultAgentRuntime` class
`packages/coding-agent/src/core/agent-runtime-worker.ts`	Worker thread entry point for `isolation: "worker-thread"`

Modified File	Change
`src/headless-triage.ts`	Import `createAgentRuntime` instead of `runSubagent`
`src/resources/extensions/sf/auto/run-unit.js`	`runUnitViaSwarm` uses `runtime.dispatch`
`src/resources/extensions/sf/subagent/index.js`	`runSingleAgent` + `runSingleAgentViaSwarm` use `runtime.dispatch`
`src/resources/extensions/sf/uok/agent-runner.js`	`runHeadlessPrompt` uses `runtime.dispatch`
`src/resources/extensions/sf/slice-parallel-orchestrator.js`	`spawnSliceWorker` uses `runtime.dispatch` with `isolation: "subprocess"`

No existing exported symbols from subagent-runner.ts, swarm-dispatch.js, or agent-runner.js are removed in the migration phase. They become internal implementation details behind the runtime abstraction and are removed only in Step 8.

Appendix B: Invariants the Runtime Must Preserve

- Never throw from dispatch or dispatchBatch under normal failure conditions (agent timeout, LLM error, routing miss). Return ok: false.
- runSubagent is the sole in-process LLM execution engine. The runtime wraps it; it does not duplicate its session-management logic.
- Inbox refresh contract: When routing is "capability-match" and persistence is "persistent", the runtime forces an inbox refresh before driving the agent turn. This fixes the inbox-state cache bug (Section 1.2) for all callers.
- Subprocess isolation guard: When isolation: "subprocess" and SF_PARALLEL_WORKER is set in the current process, dispatch returns ok: false with stderr: "cannot nest subprocess dispatch from within a parallel worker".
- Cancellation propagates to the execution engine. When the caller's AbortSignal fires, the runtime cancels the in-flight runSubagent call (or the subprocess, or the worker thread) and returns { ok: false, exitCode: 1, stderr: "cancelled" }. It does not leak the in-flight execution.
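Two of these invariants, the subprocess guard and cancellation, lend themselves to a small sketch. The helper names `checkSubprocessGuard` and `withCancellation` are hypothetical, the environment is passed in explicitly to keep the example self-contained, and `exitCode: 1` on the guard follows the "all error conditions" rule from Section 6.3.

```typescript
interface AgentDispatchResult {
  ok: boolean;
  exitCode: number;
  stderr: string;
}

// Returns a failure result when a parallel worker tries to nest a subprocess
// dispatch, or null when the dispatch may proceed.
function checkSubprocessGuard(
  isolation: string,
  env: Record<string, string | undefined>,
): AgentDispatchResult | null {
  if (isolation === "subprocess" && env.SF_PARALLEL_WORKER) {
    return {
      ok: false,
      exitCode: 1,
      stderr: "cannot nest subprocess dispatch from within a parallel worker",
    };
  }
  return null;
}

// Runs an engine call under the caller's AbortSignal. The same signal is
// handed to the engine so it can stop its own work; the wrapper resolves
// with the "cancelled" result as soon as the signal fires and never rejects.
function withCancellation(
  run: (signal: AbortSignal) => Promise<AgentDispatchResult>,
  signal: AbortSignal,
): Promise<AgentDispatchResult> {
  const cancelled: AgentDispatchResult = {
    ok: false,
    exitCode: 1,
    stderr: "cancelled",
  };
  if (signal.aborted) return Promise.resolve(cancelled);
  return new Promise((resolve) => {
    const onAbort = () => resolve(cancelled);
    signal.addEventListener("abort", onAbort, { once: true });
    run(signal).then(
      (result) => {
        signal.removeEventListener("abort", onAbort);
        resolve(signal.aborted ? cancelled : result);
      },
      (err) => {
        signal.removeEventListener("abort", onAbort);
        resolve({ ok: false, exitCode: 1, stderr: String(err) });
      },
    );
  });
}
```

The wrapper only guarantees the returned result; actually stopping the subprocess or worker thread is the engine's job once it observes the forwarded signal.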
