# Dispatch/Orchestration Architecture — Consolidation Plan
**Author:** Research synthesis
**Date:** 2026-05-08
**Status:** Draft — for review and promotion
---
## 1. Root Cause Diagnosis — Why Did This Proliferation Happen?
The five dispatch mechanisms plus one message bus grew to fill genuine gaps, not out of poor design. But the structural symptom is the same one seen in every system that accumulates dispatch primitives without a unifying abstraction: **there is no single concept that unifies them.**
Each addition was driven by a real gap at a different time:
| Mechanism | Gap filled | Structural symptom |
|---|---|---|
| **subagent tool** (`extensions/subagent/index.js`) | Ad-hoc delegation from within a TUI/headless session | First-class spawning of a full CLI process via `spawn()`; only 4 tools registered; no DB tools |
| **parallel-orchestrator** (`parallel-orchestrator.js`) | True parallel milestone execution with git worktree isolation | Mirrors subagent's `spawn` pattern but at milestone scope with session status files, cost accumulation, and file-intent tracking |
| **slice-parallel-orchestrator** (`slice-parallel-orchestrator.js`) | Slice-level parallelism within a milestone | Copy-paste of parallel-orchestrator with scope changed; ~90% identical code |
| **UOK kernel** (`uok/kernel.js`) | Deterministic autonomous loop with gates, observability, parity reporting | Grew into the central orchestration engine but does not subsume the dispatch primitives below it |
| **MessageBus** (`uok/message-bus.js`) | Durable SQLite-backed inter-agent messaging for multi-agent coordination | Modeled on Letta's SQLite-backed messaging; lives in UOK but is not wired into subagent or parallel-orchestrator dispatch paths |
| **Cmux** (`cmux/index.js`) | RPC multiplexing and terminal surface integration | Orthogonal to dispatch — a UI/surface concern, not an orchestration concern |
### The Concretion
**Three missing abstractions drove the proliferation:**
1. **No unified "dispatch context"** — subagent, parallel-orchestrator, and UOK each create their own notion of "what am I running and with what environment." The result is three different spawn patterns, three different ways of tracking cost, and no shared vocabulary.
2. **No shared dispatch registry** — there is no single place that tracks "what is currently running" across all parallelism dimensions. The parallel orchestrator tracks milestone workers via session status files; the slice-parallel orchestrator tracks slice workers separately; subagent tracks spawned processes in a `Set`. These are not unified.
3. **No first-class "work unit" concept** — milestone, slice, and task are different tables with different lock semantics, not different states of the same work unit. This is why the slice-parallel orchestrator had to be a near-total copy of the milestone orchestrator rather than a parameterization.
**The UOK kernel was designed as a single-agent loop.** It runs inside the headless process and manages one autonomous run. It does not know about sibling workers, does not coordinate with the parallel orchestrator, and does not have a model for "I am one of N workers running concurrently."
**Subagent tool was never designed to integrate with SF's state.** It spawns `sf` CLI which is a full binary with its own extension registration. It cannot call SF tools like `complete-task` or `plan-slice` because those are registered in the headless RPC path, not in the subagent's spawned CLI context. The 4 registered tools are intentionally narrow to avoid dangerous nested dispatch.
**MessageBus was designed for persistent agents, but SF doesn't have persistent agents yet.** The Letta-style inbox model is architecturally correct but premature — you need durable named agents before durable named inboxes matter. Today the MessageBus is used for UOK internal observer chains but not for real multi-agent coordination.
### The `adversarial_partner/combatant/architect` Fields
These DB fields (in `slices` table, `sf-db.js`) are **planning ceremony fields**, not dispatch mechanism fields. They belong in the PDD planning layer and are rendered in `markdown-renderer.js` and `workflow-projections.js` as "Partner Review", "Combatant Review", and "Architect Review" sections in slice output. They have nothing to do with the dispatch layer — they are populated by planning tools, not by dispatch.
---
## 2. What Should Stay vs Merge
### Stay (genuinely different concerns)
| Mechanism | Reason to Keep |
|-----------|---------------|
| **subagent tool** (`extensions/subagent/index.js`) | Ad-hoc in-session delegation. The 4-tool surface (`subagent`, `await_subagent`, `cancel_subagent`, `/subagent` command) is the right interface for human-in-the-loop or autonomous session agents that need to spin up a helper without leaving their context. The restriction to only those 4 tools is intentional and correct. The subagent spawns `sf --mode json` (not `sf headless`), which is correct for its shorter-lived, interactive nature. |
| **UOK kernel** (`uok/kernel.js`, `uok/index.js`) | The deterministic autonomous loop with gate evaluation, parity reporting, audit envelopes, and run-control policy. This is the **controller** in the architecture sense. It decides what to run next; it does not implement how to run it. The `runAutoLoopWithUok` function is correctly scoped. |
| **MessageBus** (`uok/message-bus.js`) | Durable SQLite-backed inter-agent messaging. The `send`, `broadcast`, `sendOnce`, `getConversation`, and `AgentInbox` primitives are genuinely useful for multi-agent coordination. The Letta-style design is sound. The problem is it is not wired into the dispatch path — agents spawned by subagent or parallel-orchestrator cannot use it. |
| **Cmux** (`cmux/index.js`) | RPC multiplexing and terminal surface integration. Orthogonal to dispatch — a UI/surface concern, not an orchestration concern. Correctly scoped as a UI/shell concern. |
| **Execution graph** (`uok/execution-graph.js`) | The file-conflict DAG that computes which milestones/slices can run in parallel. This is the **constraint solver** — it knows about file overlaps but not about process lifecycle. |
| **CoordinationStore** (`uok/coordination-store.js`) | Redis-like primitives (TTL KV, streams, lease-based queues) on SQLite. Right building block for durable background coordination without a server process. |
### Merge (duplication with no semantic difference)
| Duplicated | Problem | Resolution |
|------------|---------|-----------|
| `parallel-orchestrator.js` + `slice-parallel-orchestrator.js` | ~90% identical code. The only meaningful differences: scope (milestone vs slice), lock env vars (`SF_MILESTONE_LOCK` vs `SF_SLICE_LOCK` + `SF_MILESTONE_LOCK`), and status file naming (`milestoneId` vs `milestoneId/sliceId`). The conflict detection, worktree management, worker lifecycle, NDJSON parsing, and cost tracking are copy-pasted. | **Merge into a single `WorktreeOrchestrator` class** parameterized by `{ scope: 'milestone' | 'slice', milestoneId, sliceId? }`. The conflict-filtering logic already lives in `slice-parallel-conflict.ts` and `selectConflictFreeBatch` in `execution-graph.js` — these stay separate as the constraint layer. |
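A minimal sketch of that parameterization (names below are illustrative, not existing SF code) shows how the per-scope differences listed in the table, the lock env vars and the status-file naming, become data rather than a second module:

```ts
// Hypothetical sketch: the only meaningful milestone-vs-slice differences
// (lock env vars, status-file key) collapse into a scope parameter.
type DispatchScope = 'milestone' | 'slice';

interface ScopeConfig {
  lockEnv: Record<string, string>; // env vars exported to the spawned worker
  statusKey: string;               // key used for the session status file
}

function scopeConfig(scope: DispatchScope, milestoneId: string, sliceId?: string): ScopeConfig {
  if (scope === 'milestone') {
    return { lockEnv: { SF_MILESTONE_LOCK: milestoneId }, statusKey: milestoneId };
  }
  if (!sliceId) throw new Error('slice scope requires a sliceId');
  return {
    lockEnv: { SF_SLICE_LOCK: sliceId, SF_MILESTONE_LOCK: milestoneId },
    statusKey: `${milestoneId}/${sliceId}`,
  };
}
```

Everything else (conflict detection, worktree management, worker lifecycle, NDJSON parsing, cost tracking) stays as one shared code path.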
### Refactor (same need, wrong implementation)
| Current | Issue | Refactor |
|---------|-------|----------|
| **subagent spawning `sf` CLI** | The subagent tool spawns `sf` CLI as a full binary. The 4-tool limitation is enforced by not registering other tools, not by a principled access model. | Keep spawning `sf` CLI for security isolation, but formalize the access contract explicitly. See section 5. |
| **parallel-orchestrator + slice-parallel using file-based IPC** | Workers coordinate via `session-status-io.js` (filesystem polling) and `sendSignal`. This is a hand-rolled IPC layer. The filesystem polling is correct but fragile. | Replace with MessageBus-based coordination. Workers publish status to MessageBus; coordinator subscribes. See section 3. |
---
## 3. Streamlined Architecture — The Unified Dispatch Layer
### Three-tier conceptual model
```
┌───────────────────────────────────────────────────────────────┐
│ UOK Kernel (controller)                                        │
│ Decides WHAT to run next; enforces gates, policy, parity       │
│  - Phase machine: Discuss → Plan → Execute → Merge → Complete  │
│  - Calls DispatchLayer.dispatch() to execute                   │
└──────────────────────────────┬────────────────────────────────┘
                               │ DispatchEnvelope { scope, unitId, ... }
┌──────────────────────────────┴────────────────────────────────┐
│ DispatchLayer (mechanism)                                      │
│ Decides HOW to run: worktree? process? in-process?             │
│  - Worktree pool (git worktree per milestone/slice)            │
│  - Process registry (child_process per worker)                 │
│  - Budget accumulator (cost tracking via NDJSON parsing)       │
│  - File-intent tracker (parallel-intent.js)                    │
│  - AgentInbox per worker (MessageBus integration)              │
└──────────────────────────────┬────────────────────────────────┘
                               │ spawns
┌──────────────────────────────┴────────────────────────────────┐
│ Worker (execution unit)                                        │
│ `sf headless --json autonomous` in a worktree                  │
│  - Owns SQLite WAL connection to project DB                    │
│  - Has AgentInbox for MessageBus delivery                      │
│  - Emits NDJSON events consumed by DispatchLayer               │
└───────────────────────────────────────────────────────────────┘
```
### Subagent tool relationship to DispatchLayer
The subagent tool and DispatchLayer serve **different dispatch scopes**:
- **subagent tool**: in-session, ad-hoc, short-lived. The subagent is a separate `sf` CLI process spawned from within a running session and its output is returned to the caller synchronously. It is **not** managed by the DispatchLayer's worktree pool or budget tracking. It spawns `sf --mode json` (not `sf headless`), which is correct for its interactive nature.
- **DispatchLayer**: autonomous, long-running, milestone/slice scoped. Workers are spawned and tracked by DispatchLayer; they emit cost events back to the layer; they share the project DB via WAL.
These two paths should remain separate but use the **same worker bootstrap** (`sf headless --json autonomous`).
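For concreteness, the shared bootstrap is just a child-process spawn of the `sf` binary. The sketch below is illustrative only: the worktree path and milestone ID are hypothetical, and the real flag/env handling lives in the orchestrators today.

```ts
import { spawn } from 'node:child_process';

// Illustrative only: hypothetical worktree path and milestone ID.
const worktreePath = '/path/to/project/.worktrees/m-12';
const milestoneId = 'm-12';

const worker = spawn('sf', ['headless', '--json', 'autonomous'], {
  cwd: worktreePath,                                    // per-unit git worktree
  env: { ...process.env, SF_MILESTONE_LOCK: milestoneId },
  stdio: ['ignore', 'pipe', 'pipe'],                    // NDJSON events arrive on stdout
});

worker.stdout?.on('data', (chunk) => {
  // The dispatcher parses NDJSON lines here for cost and status events.
});
```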
### DispatchLayer interface (proposed)
```ts
// lives in: src/resources/extensions/sf/dispatch-layer.js
interface DispatchOptions {
  scope: 'milestone' | 'slice';
  milestoneId: string;
  sliceId?: string;
  basePath: string;
  maxWorkers?: number;
  budgetCeiling?: number;
  workerTimeoutMs?: number;
  shellWrapper?: string[];
  useExecutionGraph?: boolean;
}

class DispatchLayer {
  // Returns eligible units filtered by execution-graph conflicts
  async prepare(opts: DispatchOptions): Promise<PrepareResult>;

  // Start workers for given unit IDs
  async start(ids: string[], opts: DispatchOptions): Promise<StartResult>;

  // Stop all or specific workers
  async stop(ids?: string[]): Promise<void>;

  // Pause/resume
  pause(ids?: string[]): void;
  resume(ids?: string[]): void;

  // Read current state (for dashboard)
  getStatus(): DispatchStatus;

  // Shared MessageBus instance
  readonly bus: MessageBus;

  // Budget
  totalCost(): number;
  isBudgetExceeded(): boolean;
}
```
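A usage sketch of that surface (control flow, IDs, and result field names below are illustrative, not existing kernel code):

```ts
// Illustrative only: how a caller might drive the proposed DispatchLayer.
const layer = new DispatchLayer();
const opts: DispatchOptions = {
  scope: 'milestone',
  milestoneId: 'm-12',            // hypothetical ID
  basePath: '/path/to/project',
  maxWorkers: 3,
  budgetCeiling: 25,
  useExecutionGraph: true,
};

const prepared = await layer.prepare(opts);   // eligible, conflict-free units (field name assumed)
await layer.start(prepared.ids, opts);        // spawn worktree workers

// Periodic poll from the dashboard or the autonomous loop:
const status = layer.getStatus();
if (layer.isBudgetExceeded()) {
  await layer.stop();                         // budget ceiling hit; halt all workers
}
```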
### How UOK kernel uses DispatchLayer
Today, `uok/kernel.js` runs the autonomous loop and calls into tools like `execute_task` which eventually spawn agents. The parallel orchestrator is started separately by the TUI dashboard or headless command. After unification:
1. UOK kernel initializes `DispatchLayer` at autonomous loop start
2. UOK calls `dispatchLayer.start(eligibleMilestoneIds)` for parallel milestones
3. Workers emit NDJSON events → DispatchLayer parses cost → updates budget
4. Workers emit completion → UOK kernel processes post-unit staging
5. Workers can receive messages via their `AgentInbox` (MessageBus integration)
6. `DispatchLayer.stop()` called on autonomous loop exit
---
## 4. Multi-Dimensional Parallelism
### Axes of parallelism
| Axis | Mechanism | Status |
|---|---|---|
| **Inter-project** | Multiple `sf` invocations (manual or CI) | ✅ not SF's concern |
| **Inter-milestone** | DispatchLayer + worktrees | ✅ currently via parallel-orchestrator |
| **Inter-slice** | DispatchLayer + worktrees | ✅ currently via slice-parallel-orchestrator |
| **Inter-task** (in-process) | subagent `parallel` mode | ✅ implemented (mapWithConcurrencyLimit) |
| **Inter-agent** (debate/chain) | subagent `debate`/`chain` mode | ✅ implemented |
| **Terminal-level** | Cmux grid layout for parallel agents | ✅ implemented |
### What "true concurrency" means
The current architecture already achieves true process-level concurrency via worktrees and separate `sf headless` processes. The shared SQLite WAL means all workers can read the same DB concurrently — WAL allows concurrent readers with a single writer.
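For reference, WAL is a per-database setting that each worker enables when it opens its connection. The snippet below uses better-sqlite3 purely as an example driver; that choice is an assumption, not a statement about SF's actual DB layer.

```ts
import Database from 'better-sqlite3';

// Example only: open the project DB so many processes can read concurrently
// while a single writer commits (SQLite WAL semantics).
const db = new Database('.sf/sf.db');
db.pragma('journal_mode = WAL');
db.pragma('busy_timeout = 5000'); // wait for the writer instead of failing immediately
```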
**What is missing is not more parallelism axes but coordinated dispatch:**
- The execution graph (`uok/execution-graph.js`) already computes file-conflict relationships between milestones and slices
- `selectConflictFreeBatch` picks a conflict-free subset for parallel dispatch
- But this is only wired into parallel-orchestrator, not into the slice-parallel path or the UOK autonomous loop's dispatch decisions
### Proposed coordination model
The execution graph is the **source of truth for parallelism constraints**. The DispatchLayer is the **enforcer**. The UOK kernel is the **policy layer**:
```
Execution Graph (file-conflict DAG)
└── selectConflictFreeBatch() ──► DispatchLayer.start()
        │  Workers run in parallel
        │  Each worker has AgentInbox

UOK kernel
├── reads unit readiness from DB
├── calls DispatchLayer.start(milestoneIds)
└── calls DispatchLayer.start(sliceIds) for intra-milestone parallelism
```
**Debate mode** (subagent tool): runs multiple agents within a single process via `mapWithConcurrencyLimit`. This is **not** true process-level parallelism, but it is correct for LLM-based debate, where shared context and a single conversation transcript are needed. The Cmux grid layout provides terminal-level parallelism for these agents via split panes.
**Chain mode**: purely sequential — each step's output feeds into the next step's prompt. No parallelism needed here.
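For context, `mapWithConcurrencyLimit` (referenced above for the parallel and debate modes) is the usual bounded-concurrency map. The sketch below shows the general shape only and is not SF's actual implementation:

```ts
// Generic sketch of a concurrency-limited map; not SF's actual implementation.
async function mapWithConcurrencyLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each "worker" repeatedly claims the next index until the list is exhausted.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Within one process this yields overlapping LLM calls but no OS-level parallelism, which matches the distinction drawn above.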
---
## 5. DB Access from Subagents
### The current model
Subagents spawn `sf` CLI as a **separate process** with its own environment. The inheritance envelope (`subagent-inheritance.js`) propagates preferences, but the subagent's `sf` process opens its own SQLite connection to `~/.sf/sf.db` (global state) or `.sf/sf.db` (project state). This is **correct isolation** — a subagent should not write to the project DB directly.
### The constraint is intentional
The subagent tool **cannot** call `complete-task` or `plan-slice`. This is not merely an accident of which tools happen to be registered; it is by design:
1. Only 4 tools are registered in the subagent extension manifest (`subagent`, `await_subagent`, `cancel_subagent`, and the `/subagent` command)
2. The subagent is meant to be a **task executor**, not a **state mutator**
If a subagent could call `complete-task`, it could mark tasks done without the coordinator's knowledge, corrupting the UOK state machine.
### The right model: two-tier DB access
```
Coordinator (UOK kernel) ──► project .sf/sf.db (WAL mode)
                               ├── milestone/slice state
                               └── task execution ledger

Subagent (sf process)    ──► ~/.sf/sf.db (global)
                               ├── memories, preferences
                               └── agent-level state
                          ✗  project .sf/sf.db
```
The subagent can read from the project DB for context (via system prompt injection), but writes only to global state. The `inheritanceEnvelope` already controls what context the subagent receives.
**Exception**: The `sf` CLI that runs as a DispatchLayer worker (`sf headless --json autonomous`) is a different mode — it IS the coordinator for its worktree's scope and SHOULD write to the project DB. This is already how it works (workers open `.sf/sf.db` in the worktree, which syncs from the project root via `syncSfStateToWorktree`).
### What subagents CAN do with the DB
- Read project state via **prompt injection** (system context assembly already does this)
- Write to global `~/.sf/sf.db` for their own memories and preferences
- **NOT** write to the project `.sf/sf.db`
If a subagent needs to record a finding that the coordinator should see, the right pattern is:
1. Subagent writes to its output (stdout/file)
2. Coordinator reads and processes the output
3. Coordinator calls DB tools
This is the same pattern as Letta agents — agents return results, the orchestrator decides what to persist.
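A minimal sketch of that hand-off (tool and field names here are hypothetical, not SF's actual API):

```ts
// Hypothetical names throughout: only the coordinator touches the project DB.
const currentSliceId = 's-3'; // hypothetical slice ID
const result = await runSubagent({ prompt: 'Audit the auth module and report findings as JSON' });

// 1. The subagent returned its findings as plain output (stdout / file).
const findings: { issues?: string[] } = JSON.parse(result.output);

// 2. The coordinator decides what, if anything, to persist.
for (const issue of findings.issues ?? []) {
  // 3. Only the coordinator calls DB-mutating tools.
  await callTool('record-finding', { sliceId: currentSliceId, note: issue });
}
```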
### Architectural backing for the constraint
The "no DB tools for subagents" constraint should be backed by a **principled access model**, not just "we didn't register those tools." Proposed:
```ts
// In subagent tool — formalize the access contract
const SUBAGENT_DB_ACCESS = {
  read: ['project_context'],       // via prompt injection only
  write: ['~/.sf/sf.db'],          // global state only
  prohibited: ['project .sf/sf.db write operations'],
};
```
The extension manifest's `tools[]` array currently enforces this by omission. A more explicit model would declare the access contract formally, making it auditable.
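One way to make it auditable is a load-time guard over the manifest's `tools[]` array. The following is a sketch under that assumption; the guard itself is hypothetical:

```ts
// Hypothetical guard: fail fast if a DB-mutating tool ever leaks into the
// subagent extension manifest, rather than relying on omission alone.
const ALLOWED_SUBAGENT_SURFACE = new Set([
  'subagent',
  'await_subagent',
  'cancel_subagent',
  '/subagent', // the slash command
]);

function assertSubagentToolSurface(manifestTools: string[]): void {
  const extras = manifestTools.filter((name) => !ALLOWED_SUBAGENT_SURFACE.has(name));
  if (extras.length > 0) {
    throw new Error(`subagent manifest exposes unapproved tools: ${extras.join(', ')}`);
  }
}
```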
---
## 6. Naming — What Should the Mental Model Be?
The names are confusing because they mix three different layers of abstraction. Proposed renaming:
| Current name | Proposed name | Reason |
|---|---|---|
| `parallel-orchestrator.js` | `milestone-dispatcher.js` | Describes scope + role |
| `slice-parallel-orchestrator.js` | `slice-dispatcher.js` | Scope + role; merges into unified DispatchLayer |
| `DispatchLayer` (new) | `dispatch-layer.js` | The unified class |
| `uok/kernel.js` | keep as-is | Kernel is the right metaphor for the controller |
| `MessageBus` | keep as-is | Standard pattern name |
| `Cmux` | keep as-is | Product name for terminal multiplexing |
| `subagent tool` | keep as-is | The user-facing tool name |
**Mental model:**
- **Controller** = UOK kernel (deterministic policy, what to run)
- **Dispatcher** = DispatchLayer (mechanism, how to run)
- **Workers** = `sf headless` processes in worktrees (the doing)
- **Inbox** = AgentInbox per worker (message receiving)
- **Bus** = MessageBus (durable inter-agent messaging)
- **Subagent tool** = in-session ad-hoc delegation (separate from the DispatchLayer path)
The confusion arises because "orchestrator" suggests it controls both what and how. In a clean architecture, orchestrator = controller (what), and dispatcher = mechanism (how). Today, parallel-orchestrator does both, which is why it feels heavyweight and why slice-parallel-orchestrator had to be cloned to change scope.
---
## 7. Implementation Priority
### Phase 1: Eliminate duplication (lowest risk, highest clarity)
**1.1 — Merge parallel-orchestrator + slice-parallel-orchestrator**
Extract shared logic into a `DispatchLayer` class parameterized by scope. The slice orchestrator's conflict-filtering logic (`filterConflictingSlices`) already lives in `slice-parallel-conflict.ts` and stays there. The merged `dispatch-layer.js` calls it.
Test: both the `/parallel` command and the slice-level parallelism continue to work identically. The parallel orchestrator dashboard continues to show milestone workers; slice-level parallelism shows slice workers.
File: new `src/resources/extensions/sf/dispatch-layer.js` (~400 LOC merged from both orchestrators).
### Phase 2: Wire MessageBus into DispatchLayer
**2.1 — Add AgentInbox to each worker**
Every `sf headless` worker opens a `MessageBus` inbox named after its milestone/slice ID. The coordinator can send messages to workers (e.g., "pause", "resume", "report status").
**2.2 — Use MessageBus for coordinator → worker signaling**
Replace file-based IPC signals (`session-status-io.js`, `sendSignal`) with MessageBus `send()`. The file-based signals already exist as a fallback for crash recovery; MessageBus gives durable at-least-once delivery.
Test: workers respond to coordinator pause/resume messages delivered via MessageBus instead of or in addition to file signals.
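Sketch of the intended signaling path using the MessageBus primitives named in section 2; the exact signatures below are assumptions, not the verified `uok/message-bus.js` API:

```ts
// Coordinator side (DispatchLayer). Message shape and method signatures are assumed.
function signalWorker(bus: MessageBus, unitId: string, type: 'pause' | 'resume'): void {
  bus.send({
    from: 'dispatch-layer',
    to: `worker:${unitId}`, // inbox named after the milestone/slice, per Phase 2.1
    type,
    payload: {},
  });
}

// Worker side (inside `sf headless`): drain the inbox between loop iterations.
// AgentInbox.drain() is assumed here for illustration.
function handleSignals(inbox: AgentInbox, onPause: () => void, onResume: () => void): void {
  for (const msg of inbox.drain()) {
    if (msg.type === 'pause') onPause();
    if (msg.type === 'resume') onResume();
  }
}
```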
### Phase 3: UOK kernel adopts DispatchLayer
**3.1 — Replace direct parallel-orchestrator calls with DispatchLayer**
The autonomous loop's parallel dispatch path (`analyzeParallelEligibility` → `startParallel`) goes through DispatchLayer instead of calling parallel-orchestrator directly.
**3.2 — UOK reads worker status from DispatchLayer**
Dashboard refresh reads from `dispatchLayer.getStatus()` instead of directly from parallel-orchestrator's state.
File changes: `uok/kernel.js` imports `DispatchLayer`; parallel-orchestrator.js becomes a thin wrapper (or is removed if no other callers remain).
### Phase 4: Subagent tool gets optional MessageBus inbox
**4.1 — Allow subagent workers to opt-in to MessageBus**
A subagent spawned with `useMessageBus: true` in params gets an `AgentInbox` injected into its prompt context. This enables the subagent to receive coordinator messages during long-running tasks.
**Constraint**: subagent still cannot write to project DB. MessageBus read access does not change this.
Test: long-running subagent receives a pause message from the coordinator via MessageBus.
### Phase 5: Naming cleanup (cosmetic but reduces confusion)
**5.1 — Rename `parallel-orchestrator.js` → `milestone-dispatcher.js`**
**5.2 — Rename `slice-parallel-orchestrator.js` → `slice-dispatcher.js`**
Update all import references.
**5.3 — Trim `uok/index.js` exports**
Move non-orchestration exports (skills, model policy, etc.) to their own barrels or remove from the UOK public API. The `uok/index.js` barrel re-exports ~60 symbols from ~30 sub-modules. Some exports (e.g., skill functions, model policy functions) are used only by specific tools and do not belong in an orchestration kernel export.
---
## Summary
The five dispatch mechanisms plus one message bus represent three genuinely different needs (UOK autonomous loop, worktree-based isolation, durable inter-agent messaging) and two duplications (parallel-orchestrator vs slice-parallel-orchestrator; file-based IPC duplicating what MessageBus already provides). The root cause is that dispatch, orchestration, and coordination evolved separately rather than being designed as layers of one system.
**The plan is to:**
1. Merge `parallel-orchestrator` + `slice-parallel-orchestrator` into a single `DispatchLayer` class
2. Wire MessageBus into DispatchLayer so workers become reachable via durable messaging (replacing file-based IPC)
3. UOK kernel becomes the controller that calls DispatchLayer, not a parallel system
4. Subagent tool stays separate — it's ad-hoc in-session delegation, not autonomous dispatch; formalize its DB access contract
5. Cmux stays orthogonal — it's surface integration, not dispatch
The DB access model is already correct: subagents run in their own process with their own DB connection and cannot write to the project state. Workers (dispatched via DispatchLayer) are the project's own agents and do have project DB write access.
The `adversarial_partner`/`adversarial_combatant`/`adversarial_architect` fields are **planning ceremony fields** (Letta-inspired) that belong in the PDD planning layer (slice/milestone planning), not in the dispatch layer. They are populated by planning tools and rendered in slice output. The dispatch layer should remain purely about "how to run" — worktree lifecycle, process management, cost tracking, and message delivery.