# Dispatch/Orchestration Architecture — Consolidation Plan
Author: Research synthesis
Date: 2026-05-08
Status: Draft — for review and promotion
## 1. Root Cause Diagnosis — Why Did This Proliferation Happen?
The five dispatch mechanisms plus one message bus grew to fill genuine gaps, not out of poor design. But the structural symptom matches that of every system that accumulates dispatch primitives without a unifying abstraction: no single concept ties them together.
Each addition was driven by a real gap at a different time:
| Mechanism | Gap filled | Structural symptom |
|---|---|---|
| subagent tool (`extensions/subagent/index.js`) | Ad-hoc delegation from within a TUI/headless session | First-class spawning of a full CLI process via `spawn()`; only 4 tools registered; no DB tools |
| parallel-orchestrator (`parallel-orchestrator.js`) | True parallel milestone execution with git worktree isolation | Mirrors subagent's spawn pattern but at milestone scope, with session status files, cost accumulation, and file-intent tracking |
| slice-parallel-orchestrator (`slice-parallel-orchestrator.js`) | Slice-level parallelism within a milestone | Copy-paste of parallel-orchestrator with the scope changed; ~90% identical code |
| UOK kernel (`uok/kernel.js`) | Deterministic autonomous loop with gates, observability, parity reporting | Grew into the central orchestration engine but does not subsume the dispatch primitives below it |
| MessageBus (`uok/message-bus.js`) | Durable SQLite-backed inter-agent messaging for multi-agent coordination | Modeled on Letta's SQLite-backed messaging; lives in UOK but is not wired into subagent or parallel-orchestrator dispatch paths |
| Cmux (`cmux/index.js`) | RPC multiplexing and terminal surface integration | Orthogonal to dispatch — a UI/surface concern, not an orchestration concern |
### The Missing Abstractions
Three missing abstractions drove the proliferation:
- **No unified "dispatch context"** — subagent, parallel-orchestrator, and UOK each create their own notion of "what am I running and with what environment." The result is three different spawn patterns, three different ways of tracking cost, and no shared vocabulary.
- **No shared dispatch registry** — there is no single place that tracks "what is currently running" across all parallelism dimensions. The parallel orchestrator tracks milestone workers via session status files; the slice-parallel orchestrator tracks slice workers separately; subagent tracks spawned processes in a `Set`. These are not unified.
- **No first-class "work unit" concept** — milestone, slice, and task are different tables with different lock semantics, not different states of the same work unit. This is why the slice-parallel orchestrator had to be a near-total copy of the milestone orchestrator rather than a parameterization.
The UOK kernel was designed as a single-agent loop. It runs inside the headless process and manages one autonomous run. It does not know about sibling workers, does not coordinate with the parallel orchestrator, and does not have a model for "I am one of N workers running concurrently."
Subagent tool was never designed to integrate with SF's state. It spawns sf CLI which is a full binary with its own extension registration. It cannot call SF tools like complete-task or plan-slice because those are registered in the headless RPC path, not in the subagent's spawned CLI context. The 4 registered tools are intentionally narrow to avoid dangerous nested dispatch.
MessageBus was designed for persistent agents, but SF doesn't have persistent agents yet. The Letta-style inbox model is architecturally correct but premature — you need durable named agents before durable named inboxes matter. Today the MessageBus is used for UOK internal observer chains but not for real multi-agent coordination.
### The `adversarial_partner`/`combatant`/`architect` Fields
These DB fields (in slices table, sf-db.js) are planning ceremony fields, not dispatch mechanism fields. They belong in the PDD planning layer and are rendered in markdown-renderer.js and workflow-projections.js as "Partner Review", "Combatant Review", and "Architect Review" sections in slice output. They have nothing to do with the dispatch layer — they are populated by planning tools, not by dispatch.
## 2. What Should Stay vs. Merge

### Stay (genuinely different concerns)
| Mechanism | Reason to keep |
|---|---|
| subagent tool (`extensions/subagent/index.js`) | Ad-hoc in-session delegation. The 4-tool surface (`subagent`, `await_subagent`, `cancel_subagent`, plus the `/subagent` command) is the right interface for human-in-the-loop or autonomous session agents that need to spin up a helper without leaving their context. The restriction to only those 4 tools is intentional and correct. The subagent spawns `sf --mode json` (not `sf headless`), which is correct for its shorter-lived, interactive nature. |
| UOK kernel (`uok/kernel.js`, `uok/index.js`) | The deterministic autonomous loop with gate evaluation, parity reporting, audit envelopes, and run-control policy. This is the controller in the architecture sense: it decides what to run next; it does not implement how to run it. The `runAutoLoopWithUok` function is correctly scoped. |
| MessageBus (`uok/message-bus.js`) | Durable SQLite-backed inter-agent messaging. The `send`, `broadcast`, `sendOnce`, `getConversation`, and `AgentInbox` primitives are genuinely useful for multi-agent coordination. The Letta-style design is sound. The problem is that it is not wired into the dispatch path — agents spawned by subagent or parallel-orchestrator cannot use it. |
| Cmux (`cmux/index.js`) | RPC multiplexing and terminal surface integration. Orthogonal to dispatch and correctly scoped as a UI/shell concern, not an orchestration concern. |
| Execution graph (`uok/execution-graph.js`) | The file-conflict DAG that computes which milestones/slices can run in parallel. This is the constraint solver — it knows about file overlaps but not about process lifecycle. |
| CoordinationStore (`uok/coordination-store.js`) | Redis-like primitives (TTL KV, streams, lease-based queues) on SQLite. The right building block for durable background coordination without a server process. |
### Merge (duplication with no semantic difference)

| Duplicated | Problem | Resolution |
|---|---|---|
| `parallel-orchestrator.js` + `slice-parallel-orchestrator.js` | ~90% identical code. The only meaningful differences: scope (milestone vs. slice), lock env vars (`SF_MILESTONE_LOCK` vs. `SF_SLICE_LOCK` + `SF_MILESTONE_LOCK`), and status file naming (`milestoneId` vs. `milestoneId/sliceId`). The conflict detection, worktree management, worker lifecycle, NDJSON parsing, and cost tracking are copy-pasted. | Merge into a single WorktreeOrchestrator class parameterized by `{ scope: 'milestone' \| 'slice' }` |
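The scope parameterization above can be sketched as data rather than duplicated code. This is only an illustration, assuming hypothetical `SCOPE_CONFIG`, `workerEnv`, and `statusFile` names; the real merged class would also carry the shared worktree and worker-lifecycle logic:

```javascript
// Illustrative sketch: everything scope-specific becomes data,
// so the lifecycle code exists exactly once.
const SCOPE_CONFIG = {
  milestone: {
    lockEnv: (unit) => ({ SF_MILESTONE_LOCK: unit.milestoneId }),
    statusFileName: (unit) => `${unit.milestoneId}.json`,
  },
  slice: {
    // Slice workers hold both locks, mirroring the current env-var difference.
    lockEnv: (unit) => ({
      SF_SLICE_LOCK: unit.sliceId,
      SF_MILESTONE_LOCK: unit.milestoneId,
    }),
    statusFileName: (unit) => `${unit.milestoneId}/${unit.sliceId}.json`,
  },
};

class WorktreeOrchestrator {
  constructor({ scope }) {
    this.scopeConfig = SCOPE_CONFIG[scope];
  }
  workerEnv(unit) {
    return { ...this.scopeConfig.lockEnv(unit) };
  }
  statusFile(unit) {
    return this.scopeConfig.statusFileName(unit);
  }
}
```

With this shape, adding a third scope (e.g. task-level) would mean adding one config entry, not cloning a file.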
### Refactor (same need, wrong implementation)

| Current | Issue | Refactor |
|---|---|---|
| subagent spawning the `sf` CLI | The subagent tool spawns the `sf` CLI as a full binary. The 4-tool limitation is enforced by not registering other tools, not by a principled access model. | Keep spawning the `sf` CLI for security isolation, but formalize the access contract explicitly. See section 5. |
| parallel-orchestrator + slice-parallel using file-based IPC | Workers coordinate via `session-status-io.js` (filesystem polling) and `sendSignal`. This is a hand-rolled IPC layer. The filesystem polling is correct but fragile. | Replace with MessageBus-based coordination: workers publish status to the MessageBus; the coordinator subscribes. See section 3. |
## 3. Streamlined Architecture — The Unified Dispatch Layer

### Three-tier conceptual model
```
┌──────────────────────────────────────────────────────────────┐
│ UOK Kernel (controller)                                      │
│ Decides WHAT to run next; enforces gates, policy, parity     │
│ - Phase machine: Discuss → Plan → Execute → Merge → Complete │
│ - Calls DispatchLayer.dispatch() to execute                  │
└──────────────────────────┬───────────────────────────────────┘
                           │ DispatchEnvelope { scope, unitId, ... }
                           ▼
┌──────────────────────────────────────────────────────────────┐
│ DispatchLayer (mechanism)                                    │
│ Decides HOW to run: worktree? process? in-process?           │
│ - Worktree pool (git worktree per milestone/slice)           │
│ - Process registry (child_process per worker)                │
│ - Budget accumulator (cost tracking via NDJSON parsing)      │
│ - File-intent tracker (parallel-intent.js)                   │
│ - AgentInbox per worker (MessageBus integration)             │
└──────────────────────────┬───────────────────────────────────┘
                           │ spawns
                           ▼
┌──────────────────────────────────────────────────────────────┐
│ Worker (execution unit)                                      │
│ `sf headless --json autonomous` in a worktree                │
│ - Owns SQLite WAL connection to project DB                   │
│ - Has AgentInbox for MessageBus delivery                     │
│ - Emits NDJSON events consumed by DispatchLayer              │
└──────────────────────────────────────────────────────────────┘
```
### Subagent tool relationship to DispatchLayer
The subagent tool and DispatchLayer serve different dispatch scopes:
- **subagent tool**: in-session, ad-hoc, short-lived. The subagent is a separate `sf` CLI process spawned from within a running session, and its output is returned to the caller synchronously. It is not managed by the DispatchLayer's worktree pool or budget tracking. It spawns `sf --mode json` (not `sf headless`), which is correct for its interactive nature.
- **DispatchLayer**: autonomous, long-running, milestone/slice scoped. Workers are spawned and tracked by the DispatchLayer; they emit cost events back to the layer; they share the project DB via WAL.
These two paths should remain separate but can share the same worker-spawning plumbing, even though the subagent invokes `sf --mode json` and DispatchLayer workers invoke `sf headless --json autonomous`.
### DispatchLayer interface (proposed)
```ts
// lives in: src/resources/extensions/sf/dispatch-layer.js

interface DispatchOptions {
  scope: 'milestone' | 'slice';
  milestoneId: string;
  sliceId?: string;
  basePath: string;
  maxWorkers?: number;
  budgetCeiling?: number;
  workerTimeoutMs?: number;
  shellWrapper?: string[];
  useExecutionGraph?: boolean;
}

class DispatchLayer {
  // Returns eligible units filtered by execution-graph conflicts
  async prepare(opts: DispatchOptions): Promise<PrepareResult>;

  // Start workers for given unit IDs
  async start(ids: string[], opts: DispatchOptions): Promise<StartResult>;

  // Stop all or specific workers
  async stop(ids?: string[]): Promise<void>;

  // Pause/resume
  pause(ids?: string[]): void;
  resume(ids?: string[]): void;

  // Read current state (for dashboard)
  getStatus(): DispatchStatus;

  // Shared MessageBus instance
  readonly bus: MessageBus;

  // Budget
  totalCost(): number;
  isBudgetExceeded(): boolean;
}
```
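To make the budget methods concrete, here is a minimal in-memory sketch of the cost-accumulation half of the interface. The NDJSON event shape (`{"type":"cost","usd":0.12}`) and the `BudgetAccumulator` name are assumptions for illustration, not the actual worker event format:

```javascript
// Minimal sketch of the budget half of DispatchLayer.
// Assumes workers emit NDJSON lines such as {"type":"cost","usd":0.12};
// the real event schema may differ.
class BudgetAccumulator {
  constructor({ budgetCeiling = Infinity } = {}) {
    this.budgetCeiling = budgetCeiling;
    this.costs = new Map(); // workerId -> accumulated USD
  }

  // Feed one line of a worker's stdout stream.
  ingestLine(workerId, line) {
    let event;
    try { event = JSON.parse(line); } catch { return; } // ignore plain log lines
    if (event.type !== 'cost' || typeof event.usd !== 'number') return;
    this.costs.set(workerId, (this.costs.get(workerId) ?? 0) + event.usd);
  }

  totalCost() {
    let total = 0;
    for (const usd of this.costs.values()) total += usd;
    return total;
  }

  isBudgetExceeded() {
    return this.totalCost() >= this.budgetCeiling;
  }
}
```

The point of keeping this inside the DispatchLayer is that cost tracking currently lives in two copy-pasted orchestrators; one accumulator shared across all workers gives a single budget ceiling check.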
### How UOK kernel uses DispatchLayer
Today, uok/kernel.js runs the autonomous loop and calls into tools like execute_task which eventually spawn agents. The parallel orchestrator is started separately by the TUI dashboard or headless command. After unification:
- UOK kernel initializes `DispatchLayer` at autonomous loop start
- UOK calls `dispatchLayer.start(eligibleMilestoneIds)` for parallel milestones
- Workers emit NDJSON events → DispatchLayer parses cost → updates budget
- Workers emit completion → UOK kernel processes post-unit staging
- Workers can receive messages via their `AgentInbox` (MessageBus integration)
- `DispatchLayer.stop()` is called on autonomous loop exit
## 4. Multi-Dimensional Parallelism

### Axes of parallelism
| Axis | Mechanism | Status |
|---|---|---|
| Inter-project | Multiple `sf` invocations (manual or CI) | ✅ not SF's concern |
| Inter-milestone | DispatchLayer + worktrees | ✅ currently via parallel-orchestrator |
| Inter-slice | DispatchLayer + worktrees | ✅ currently via slice-parallel-orchestrator |
| Inter-task (in-process) | subagent parallel mode | ✅ implemented (`mapWithConcurrencyLimit`) |
| Inter-agent (debate/chain) | subagent debate/chain mode | ✅ implemented |
| Terminal-level | Cmux grid layout for parallel agents | ✅ implemented |
### What "true concurrency" means
The current architecture already achieves true process-level concurrency via worktrees and separate sf headless processes. The shared SQLite WAL means all workers can read the same DB concurrently — WAL allows concurrent readers with a single writer.
What is missing is not more parallelism axes but coordinated dispatch:
- The execution graph (`uok/execution-graph.js`) already computes file-conflict relationships between milestones and slices
- `selectConflictFreeBatch` picks a conflict-free subset for parallel dispatch
- But this is only wired into parallel-orchestrator, not into the slice-parallel path or the UOK autonomous loop's dispatch decisions
### Proposed coordination model
The execution graph is the source of truth for parallelism constraints. The DispatchLayer is the enforcer. The UOK kernel is the policy layer:
```
Execution Graph (file-conflict DAG)
        │
        ├── selectConflictFreeBatch() ──► DispatchLayer.start()
        │         Workers run in parallel
        │         Each worker has an AgentInbox
        │
UOK kernel
        │
        ├── reads unit readiness from DB
        ├── calls DispatchLayer.start(milestoneIds)
        └── calls DispatchLayer.start(sliceIds) for intra-milestone parallelism
```
Debate mode (subagent tool): runs multiple agents sequentially within a single process using mapWithConcurrencyLimit. This is not true process-level parallelism but is correct for LLM-based debate where shared context and a single conversation transcript are needed. The Cmux grid layout provides terminal-level parallelism for these agents via split panes.
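A concurrency-limited map in the spirit of `mapWithConcurrencyLimit` can be sketched as follows; the real helper's signature may differ. With a limit of 1 it degrades to the strictly sequential behavior debate mode relies on:

```javascript
// Sketch of a concurrency-limited async map. Runs at most `limit` calls to
// `fn` at a time and preserves input order in the results.
async function mapWithConcurrencyLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item
  async function worker() {
    while (next < items.length) {
      const i = next++; // claim an index (safe: no await between check and claim)
      results[i] = await fn(items[i], i);
    }
  }
  const workers = Array.from(
    { length: Math.max(1, Math.min(limit, items.length)) },
    worker
  );
  await Promise.all(workers);
  return results;
}
```

Because JavaScript is single-threaded, the claim of an index is atomic with respect to other workers; parallelism here is only in overlapping awaited I/O (LLM calls), not CPU work.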
Chain mode: purely sequential — each step's output feeds into the next step's prompt. No parallelism needed here.
## 5. DB Access from Subagents

### The current model
Subagents spawn sf CLI as a separate process with its own environment. The inheritance envelope (subagent-inheritance.js) propagates preferences, but the subagent's sf process opens its own SQLite connection to ~/.sf/sf.db (global state) or .sf/sf.db (project state). This is correct isolation — a subagent should not write to the project DB directly.
### The constraint is intentional

The subagent tool cannot call complete-task or plan-slice. This is not an accident of registration but a deliberate boundary:
- Only 4 tools are registered in the subagent extension manifest (`subagent`, `await_subagent`, `cancel_subagent`, and the `/subagent` command)
- The subagent is meant to be a task executor, not a state mutator
If a subagent could call complete-task, it could mark tasks done without the coordinator's knowledge, corrupting the UOK state machine.
### The right model: two-tier DB access

```
Coordinator (UOK kernel) ──► project .sf/sf.db (WAL mode)
                                 milestone/slice state
                                 task execution ledger

Subagent (sf process)    ──► ~/.sf/sf.db (global)
                                 memories, preferences
                                 agent-level state
                             ✗ project .sf/sf.db
```
Exception: The sf CLI that runs as a DispatchLayer worker (sf headless --json autonomous) is a different mode — it IS the coordinator for its worktree's scope and SHOULD write to the project DB. This is already how it works (workers open .sf/sf.db in the worktree, which syncs from the project root via syncSfStateToWorktree).
### What subagents CAN do with the DB

- Read project state via prompt injection (system context assembly already does this)
- Write to global `~/.sf/sf.db` for their own memories and preferences
- NOT write to the project `.sf/sf.db`
If a subagent needs to record a finding that the coordinator should see, the right pattern is:
- Subagent writes to its output (stdout/file)
- Coordinator reads and processes the output
- Coordinator calls DB tools
This is the same pattern as Letta agents — agents return results, the orchestrator decides what to persist.
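The return-and-persist pattern above can be sketched as follows. The `finding` event shape and the `recordFinding` callback are hypothetical; the point is that only the coordinator ever touches the project DB:

```javascript
// Hypothetical sketch of the return-and-persist pattern. The subagent emits
// structured lines on stdout; the coordinator parses them and decides what
// to persist via its own DB tools. The subagent itself never writes to the DB.
function processSubagentOutput(stdout, { recordFinding }) {
  for (const line of stdout.split('\n')) {
    let msg;
    try { msg = JSON.parse(line); } catch { continue; } // skip free-form text
    if (msg.type === 'finding') {
      recordFinding(msg.payload); // coordinator-side persistence decision
    }
  }
}
```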
### Architectural backing for the constraint
The "no DB tools for subagents" constraint should be backed by a principled access model, not just "we didn't register those tools." Proposed:
```js
// In the subagent tool — formalize the access contract
const SUBAGENT_DB_ACCESS = {
  read: ['project_context'],   // via prompt injection only
  write: ['~/.sf/sf.db'],      // global state only
  prohibited: ['project .sf/sf.db write operations']
};
```
The extension manifest's tools[] array currently enforces this by omission. A more explicit model would declare the access contract formally, making it auditable.
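One hedged sketch of what a runtime-enforced, auditable version of that contract could look like. The `SUBAGENT_ACCESS_CONTRACT` shape and the `assertDbAccess` helper are hypothetical, not existing APIs:

```javascript
// Hypothetical runtime guard built from a declarative contract.
// 'global' stands for ~/.sf/sf.db, 'project' for the project .sf/sf.db.
const SUBAGENT_ACCESS_CONTRACT = {
  write: ['global'],       // may write only to global state
  prohibited: ['project'], // project DB writes are rejected outright
};

function assertDbAccess(contract, db, op) {
  if (op === 'write' && contract.prohibited.includes(db)) {
    throw new Error(`subagent may not write to the ${db} DB`);
  }
  if (op === 'write' && !contract.write.includes(db)) {
    throw new Error(`no write grant for the ${db} DB`);
  }
}
```

Enforcement-by-omission (not registering the tools) and enforcement-by-contract are not mutually exclusive; the contract makes the existing behavior visible to audits.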
## 6. Naming — What Should the Mental Model Be?
The names are confusing because they mix three different layers of abstraction. Proposed renaming:
| Current name | Proposed name | Reason |
|---|---|---|
| `parallel-orchestrator.js` | `milestone-dispatcher.js` | Describes scope + role |
| `slice-parallel-orchestrator.js` | `slice-dispatcher.js` | Scope + role; merges into the unified DispatchLayer |
| DispatchLayer (new) | `dispatch-layer.js` | The unified class |
| `uok/kernel.js` | keep as-is | Kernel is the right metaphor for the controller |
| MessageBus | keep as-is | Standard pattern name |
| Cmux | keep as-is | Product name for terminal multiplexing |
| subagent tool | keep as-is | The user-facing tool name |
Mental model:
- Controller = UOK kernel (deterministic policy, what to run)
- Dispatcher = DispatchLayer (mechanism, how to run)
- Workers = `sf headless` processes in worktrees (the doing)
- Inbox = AgentInbox per worker (message receiving)
- Bus = MessageBus (durable inter-agent messaging)
- Subagent tool = in-session ad-hoc delegation (separate from the DispatchLayer path)
The confusion arises because "orchestrator" suggests it controls both what and how. In a clean architecture, orchestrator = controller (what), and dispatcher = mechanism (how). Today, parallel-orchestrator does both, which is why it feels heavyweight and why slice-parallel-orchestrator had to be cloned to change scope.
## 7. Implementation Priority

### Phase 1: Eliminate duplication (lowest risk, highest clarity)
1.1 — Merge parallel-orchestrator + slice-parallel-orchestrator
Extract shared logic into a DispatchLayer class parameterized by scope. The slice orchestrator's conflict-filtering logic (filterConflictingSlices) already lives in slice-parallel-conflict.ts and stays there. The merged dispatch-layer.js calls it.
Test: both the /parallel command and the slice-level parallelism continue to work identically. The parallel orchestrator dashboard continues to show milestone workers; slice-level parallelism shows slice workers.
File: new src/resources/extensions/sf/dispatch-layer.js (~400 LOC merged from both orchestrators).
### Phase 2: Wire MessageBus into DispatchLayer
2.1 — Add AgentInbox to each worker
Every sf headless worker opens a MessageBus inbox named after its milestone/slice ID. The coordinator can send messages to workers (e.g., "pause", "resume", "report status").
2.2 — Use MessageBus for coordinator → worker signaling
Replace file-based IPC signals (session-status-io.js, sendSignal) with MessageBus send(). The file-based signals can remain as a crash-recovery fallback; MessageBus adds durable at-least-once delivery.
Test: workers respond to coordinator pause/resume messages delivered via MessageBus instead of or in addition to file signals.
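To illustrate the signaling shape, here is an in-memory stand-in for the inbox semantics. The real MessageBus in uok/message-bus.js is SQLite-backed and durable; `MiniBus` and its method names are purely illustrative:

```javascript
// In-memory stand-in for the SQLite-backed MessageBus, for illustration only.
// The real bus persists messages, so delivery survives coordinator crashes.
class MiniBus {
  constructor() {
    this.inboxes = new Map(); // inbox name -> pending messages
  }
  inbox(name) {
    if (!this.inboxes.has(name)) this.inboxes.set(name, []);
    return this.inboxes.get(name);
  }
  send(to, message) {
    this.inbox(to).push(message);
  }
  // Remove and return all pending messages. Draining from memory is
  // at-most-once; the durable bus acknowledges per message for at-least-once.
  drain(name) {
    return this.inbox(name).splice(0);
  }
}
```

A coordinator would address a worker by a scoped inbox name (e.g. `worker:<milestoneId>`) and send `{ type: 'pause' }`-style control messages; the worker polls or drains its inbox between loop iterations.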
### Phase 3: UOK kernel adopts DispatchLayer
3.1 — Replace direct parallel-orchestrator calls with DispatchLayer
The autonomous loop's parallel dispatch path (analyzeParallelEligibility → startParallel) goes through DispatchLayer instead of calling parallel-orchestrator directly.
3.2 — UOK reads worker status from DispatchLayer
Dashboard refresh reads from dispatchLayer.getStatus() instead of directly from parallel-orchestrator's state.
File changes: uok/kernel.js imports DispatchLayer; parallel-orchestrator.js becomes a thin wrapper (or is removed if no other callers remain).
### Phase 4: Subagent tool gets optional MessageBus inbox

4.1 — Allow subagent workers to opt in to MessageBus
A subagent spawned with useMessageBus: true in params gets an AgentInbox injected into its prompt context. This enables the subagent to receive coordinator messages during long-running tasks.
Constraint: subagent still cannot write to project DB. MessageBus read access does not change this.
Test: long-running subagent receives a pause message from the coordinator via MessageBus.
### Phase 5: Naming cleanup (cosmetic but reduces confusion)
5.1 — Rename parallel-orchestrator.js → milestone-dispatcher.js
5.2 — Rename slice-parallel-orchestrator.js → slice-dispatcher.js
Update all import references.
5.3 — Trim uok/index.js exports
Move non-orchestration exports (skills, model policy, etc.) to their own barrels or remove from the UOK public API. The uok/index.js barrel re-exports ~60 symbols from ~30 sub-modules. Some exports (e.g., skill functions, model policy functions) are used only by specific tools and do not belong in an orchestration kernel export.
## Summary

The 5 dispatch mechanisms + 1 message bus represent 3 genuinely different needs (UOK autonomous loop, worktree-based isolation, durable inter-agent messaging) and 2 duplications (parallel-orchestrator + slice-parallel-orchestrator; hand-rolled file-based IPC duplicating what MessageBus should provide). The root cause is that dispatch, orchestration, and coordination evolved separately rather than being designed as layers of one system.
The plan is to:
- Merge `parallel-orchestrator` + `slice-parallel-orchestrator` into a single `DispatchLayer` class
- Wire MessageBus into DispatchLayer so workers become reachable via durable messaging (replacing file-based IPC)
- Make the UOK kernel the controller that calls DispatchLayer, not a parallel system
- Keep the subagent tool separate — it's ad-hoc in-session delegation, not autonomous dispatch; formalize its DB access contract
- Keep Cmux orthogonal — it's surface integration, not dispatch
The DB access model is already correct: subagents run in their own process with their own DB connection and cannot write to the project state. Workers (dispatched via DispatchLayer) are the project's own agents and do have project DB write access.
The adversarial_partner/adversarial_combatant/adversarial_architect fields are planning ceremony fields (Letta-inspired) that belong in the PDD planning layer (slice/milestone planning), not in the dispatch layer. They are populated by planning tools and rendered in slice output. The dispatch layer should remain purely about "how to run" — worktree lifecycle, process management, cost tracking, and message delivery.