# Dispatch Architecture Consolidation Plan
> **Status**: Draft — for review
> **Author**: Research synthesis from codebase analysis
> **Date**: 2026-05-08
---
## 1. Root Cause Diagnosis — Why the Proliferation Happened
The 5 dispatch mechanisms + 1 message bus are not accidental complexity — each is a response to a genuine gap that appeared at a different time, under different constraints. The structural symptom is that **dispatch, orchestration, and coordination are conflated into one system**, and SF grew new systems rather than extending existing ones when the use cases diverged.
### The Timeline of Divergence
| Era | Mechanism Added | Gap It Filled |
|-----|----------------|---------------|
| Early SF | `subagent tool` | Ad-hoc delegation: "run this agent for this task" |
| Parallel work | `parallel-orchestrator` | "run milestone X in a worktree, independently" — required isolation at process boundary |
| Slice-level work | `slice-parallel-orchestrator` | Same as above but at finer granularity — duplicate code, not a different concept |
| Autonomous loop | `UOK kernel` | "run the full PDD loop continuously, gated by confidence/risk" |
| Multi-agent messaging | `MessageBus` | "agents need to communicate across turns/sessions" (Letta-style) |
| Surface multiplexing | `Cmux` | "TUI needs multiple visible surfaces for parallel agents" |
### Structural Root Cause
**Single-process thinking drove process-per-unit.** The original SF was a single-agent CLI. When parallelism was needed, the natural answer was `spawn('sf headless')` — a new OS process per milestone. This is correct for isolation but wrong for shared-state coordination. SQLite WAL was bolted on to let workers share a DB, which created the "shared DB with file-based locking" model that all orchestrators now use.
**The UOK kernel was designed as a single-agent loop.** It runs inside the headless process and manages one autonomous run. It does not know about sibling workers, does not coordinate with the parallel orchestrator, and does not have a model for "I am one of N workers running concurrently."
**MessageBus was designed for persistent agents, but SF doesn't have persistent agents yet.** The Letta-style inbox model is architecturally correct but premature — you need durable named agents before durable named inboxes matter. Today the MessageBus is used for UOK internal observer chains but not for real multi-agent coordination.
**Subagent tool was never designed to integrate with SF's state.** It spawns the `sf` CLI, which is a full TUI/CLI binary. It cannot call SF tools like `complete-task` or `plan-slice` because those are registered in the headless RPC path, not in the subagent's spawned CLI context. The 4 registered tools (subagent, scout, reviewer, reporter) are intentionally narrow to avoid dangerous nested dispatch.
### The Crux
The proliferation is a **symptom of three missing abstractions**:
1. **No unified "dispatch context"** — subagent, parallel-orchestrator, and UOK each create their own notion of "what am I running and with what environment"
2. **No shared dispatch registry** — there is no single place that tracks "what is currently running" across all parallelism dimensions
3. **No first-class "work unit" concept** — milestone, slice, and task are different tables with different lock semantics, not different states of the same work unit
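To make the gap concrete, here is a minimal TypeScript sketch of a first-class work unit plus the shared registry that would track it. All names are illustrative — this is not SF's actual schema:
```
// Hypothetical sketch only — field and method names are illustrative.
// One WorkUnit type whose `scope` varies, instead of milestone/slice/task
// living in separate tables with separate lock semantics.

type WorkScope = 'milestone' | 'slice' | 'task';
type WorkState = 'pending' | 'running' | 'blocked' | 'done' | 'failed';

interface WorkUnit {
  id: string;          // e.g. "M1" or "M1/S01"
  scope: WorkScope;
  parentId?: string;   // slice → milestone, task → slice
  state: WorkState;
  lockOwner?: string;  // one lock field instead of per-scope env vars
}

// The missing shared registry: one answer to "what is running right now?"
// across milestones, slices, and subagent tasks.
interface DispatchRegistry {
  register(unit: WorkUnit, workerId: string): void;
  active(): WorkUnit[];
  release(unitId: string): void;
}
```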
---
## 2. What Should Stay, Merge, or Be Refactored
### Keep (Genuinely Different Needs)
| Mechanism | Reason to Keep |
|-----------|---------------|
| **UOK kernel** | This is the autonomous loop engine. It implements the PDD gate model (confidence/risk/reversibility/blast-radius/cost). Removing it means rewriting autonomous mode from scratch. It should be the *inner loop* of dispatch, not replaced by it. |
| **MessageBus** | SQLite-backed durable inbox is the right model for cross-turn coordination when agents are long-lived. This is a genuine infrastructure primitive. However: it should be *repurposed*, not extended — it serves UOK diagnostics today and should serve agent handoff tomorrow. |
| **Cmux** | This is surface-layer multiplexing (terminal UI). It belongs in `pi-tui`, not in the dispatch layer. It should be *decoupled* from dispatch entirely — the parallel orchestrator should not know about Cmux grid layouts. |
### Merge (Duplication Without Functional Difference)
| Duplicated | Problem | Resolution |
|------------|---------|-----------|
| `parallel-orchestrator.js` + `slice-parallel-orchestrator.js` | 90% identical code. The only difference is scope (milestone vs slice) and the lock env var name (`SF_MILESTONE_LOCK` vs `SF_SLICE_LOCK`). The conflict detection, worktree management, and worker lifecycle are copy-pasted. | **Merge into a single `WorktreeOrchestrator`** with a `scope` parameter. Share all file overlap detection, worktree lifecycle, and status tracking (see the sketch after this table). |
| **subagent tool's parallel/debate/chain modes** vs **parallel-orchestrator's milestone workers** | Both implement "run multiple things at the same time." The subagent tool does in-process `Promise.all` over spawned `sf` CLIs; the parallel orchestrator does the same over `sf headless` with worktrees. They use different IPC mechanisms and different isolation models. | **Subagent tool should delegate to the unified orchestrator** for multi-agent work, rather than managing its own concurrency pool. The subagent tool keeps single-agent dispatch (its core value) but offloads parallel/debate to the orchestrator layer. |
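A rough shape for the merged module — only the `scope` parameter and the two lock env var names come from this plan; the class and method names are assumptions:
```
// Sketch of the merged orchestrator. Only `scope` and the lock env var
// names come from this plan; method names are assumptions.

type Scope = 'milestone' | 'slice';

class WorktreeOrchestrator {
  constructor(private scope: Scope, private maxWorkers: number) {}

  // One derivation replaces two copy-pasted modules.
  private lockEnvVar(): string {
    return this.scope === 'milestone' ? 'SF_MILESTONE_LOCK' : 'SF_SLICE_LOCK';
  }

  async run(unitIds: string[]): Promise<void> {
    // Shared pipeline, identical for both scopes:
    //  1. detect file overlap across unitIds (conflict detection)
    //  2. create one worktree per unit, up to maxWorkers slots
    //  3. spawn each worker with { [this.lockEnvVar()]: unitId } in its env
    //  4. track status and tear down worktrees on completion
  }
}
```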
### Refactor (Same Need, Wrong Implementation)
| Current | Issue | Refactor |
|---------|-------|----------|
| **subagent spawning `sf` CLI** | Full CLI binary with TUI/headless mode detection. The subagent is a thin wrapper that spawns a binary, not a dispatch primitive. The 4-tool limitation is a workaround for not having a proper dispatch API. | Subagent should use a **headless RPC client** directly, not spawn `sf`. This allows it to call any SF tool, not just the 4 registered ones. |
| **parallel-orchestrator + slice-parallel using SQLite WAL + file IPC** | Workers coordinate via `sf headless` + session status files + signal files. This is a hand-rolled IPC layer. The status files are "poll the filesystem" coordination — correct but fragile. | Replace with **MessageBus-based coordination**. Workers publish status to MessageBus; coordinator subscribes. Eliminates file-based IPC and session status polling (see the sketch after this table). |
| **UOK kernel owning the autonomous loop** | The kernel runs inside a headless process. When the parallel orchestrator spawns `sf headless autonomous`, each worker has its own UOK kernel. Coordination between kernels requires external signals. | The UOK kernel should be the **runtime environment** for any autonomous dispatch, not a process-bound concept. The orchestrator manages worktree lifecycle; the kernel manages turn-level execution within each worktree. |
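For the MessageBus refactor in the middle row, the worker/coordinator split could look like the following sketch. The publish/subscribe signatures and topic name are assumptions; the real MessageBus API may differ:
```
// Assumed minimal MessageBus surface — not SF's actual API.
interface MessageBusLike {
  publish(topic: string, message: unknown): Promise<void>;
  subscribe(topic: string, handler: (message: unknown) => void): () => void;
}

interface WorkerStatus {
  workerId: string;
  unitId: string;
  phase: 'started' | 'progress' | 'done' | 'failed';
  detail?: string;
}

// Worker side: replaces writing a session status file to disk.
async function reportStatus(bus: MessageBusLike, status: WorkerStatus) {
  await bus.publish('dispatch/worker-status', status);
}

// Coordinator side: replaces polling the filesystem for status files.
function watchWorkers(bus: MessageBusLike, onStatus: (s: WorkerStatus) => void) {
  return bus.subscribe('dispatch/worker-status', (msg) => onStatus(msg as WorkerStatus));
}
```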
---
## 3. Streamlined Architecture
### The Unified Dispatch Layer
```
┌────────────────────────────────────────────────────────────────────┐
│                     Unified Dispatch API (UDA)                     │
├────────────────────────────────────────────────────────────────────┤
│ dispatch.work({ unit, mode, model, tools, cwd, signal })           │
│ dispatch.batch([{ unit, ... }, { unit, ... }], { strategy })       │
│ dispatch.chain([{ unit, after }, ...])                             │
│ dispatch.debate([{ unit, role }, ...], { rounds })                 │
│ dispatch.subscribe(handler) // for events: start, end, error, log  │
│ dispatch.cancel(workId)                                            │
│ dispatch.status() → { active: WorkInfo[] }                         │
└────────────────────────────────────────────────────────────────────┘
```
**Modes**: `isolated` (worktree), `shared` (same process), `rpc` (separate process via headless)
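Transcribed into a TypeScript interface, the surface above could look like this; any `WorkSpec`/`WorkInfo` field beyond what the diagram shows is an assumption:
```
type DispatchMode = 'isolated' | 'shared' | 'rpc';

interface WorkSpec {
  unit: string;              // milestone/slice/task id
  mode?: DispatchMode;
  model?: string;
  tools?: string[];
  cwd?: string;
  signal?: AbortSignal;
}

interface WorkInfo {
  workId: string;
  unit: string;
  mode: DispatchMode;
}

interface DispatchEvent {
  type: 'start' | 'end' | 'error' | 'log';
  workId: string;
  payload?: unknown;
}

interface Dispatch {
  work(spec: WorkSpec): Promise<WorkInfo>;
  batch(specs: WorkSpec[], opts?: { strategy?: string }): Promise<WorkInfo[]>;
  chain(specs: Array<WorkSpec & { after?: string }>): Promise<WorkInfo[]>;
  debate(specs: Array<WorkSpec & { role: string }>, opts?: { rounds?: number }): Promise<WorkInfo[]>;
  subscribe(handler: (event: DispatchEvent) => void): () => void;
  cancel(workId: string): Promise<void>;
  status(): { active: WorkInfo[] };
}
```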
### How the Existing Components Map
| Component | Role in Unified Architecture |
|-----------|------------------------------|
| **subagent tool** | Becomes a thin **UDA client** in the TUI. Single-agent dispatch with full SF tool access. Keeps the 4-mode interface (single/parallel/debate/chain) but implemented via UDA, not a spawned CLI. |
| **parallel-orchestrator + slice-parallel** | Merge into **WorktreeOrchestrator** — a UDA backend that manages worktree lifecycle and multi-slot execution. Implements `dispatch.work({ mode: 'isolated' })` for milestone/slice workers. |
| **UOK kernel** | Becomes **UOK runtime** — a UDA execution mode that wraps any dispatch with the PDD gate model. A `dispatch.work({ unit, runControl: 'autonomous' })` automatically uses the UOK runtime. The kernel is not a separate process; it's the execution strategy. |
| **MessageBus** | Becomes the **UDA event/logging backbone**. All dispatch events (start, end, tool call, error, cost) are published to MessageBus. The parallel orchestrator's file-based IPC is replaced by MessageBus subscriptions. |
| **Cmux** | **Decoupled entirely**. Cmux listens to MessageBus for dispatch events and renders grid layouts accordingly. The dispatch layer does not know about Cmux. |
### The Mental Model: Dispatch Is a Service, Not a Tool
The unified dispatch API is a service (backed by WorktreeOrchestrator + UOK runtime) that SF agents and tools call. It is not a tool itself and is not registered as one.
```
Agent/Tool                       Dispatch Service
    │                                   │
    ├── dispatch.work() ───────────────►│  Spawns worktree, runs UOK loop
    │                                   │
    │◄────── work.start event ─────────┤
    │◄────── work.end event ───────────┤
    ├── dispatch.batch() ──────────────►│  Runs N work items in parallel
    │                                   │  (via WorktreeOrchestrator)
    ├── dispatch.chain() ──────────────►│  Runs N items sequentially, passes
    │                                   │  previous output as {previous} input
```
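From the caller's side the model reduces to "hold a handle, subscribe, await". A sketch, reusing the `Dispatch` interface sketched above (the unit id is illustrative):
```
// Caller's-eye sketch. `Dispatch` is the interface from section 3.
async function runMilestoneIsolated(dispatch: Dispatch): Promise<void> {
  const unsubscribe = dispatch.subscribe((event) => {
    if (event.type === 'start') console.log(`work ${event.workId} started`);
    if (event.type === 'end') console.log(`work ${event.workId} finished`);
  });

  // The service spawns the worktree and runs the UOK loop; the caller
  // sees only events and the returned WorkInfo.
  const info = await dispatch.work({ unit: 'M1', mode: 'isolated' });
  console.log('dispatched', info.workId);

  unsubscribe();
}
```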
---
## 4. Multi-Dimensional Parallelism
SF needs to run multiple things concurrently at multiple levels:
| Dimension | Example | Current Implementation |
|-----------|---------|----------------------|
| **Unit (milestone/slice)** | Two milestones simultaneously | `parallel-orchestrator` (worktree-per-milestone) |
| **Agent within unit** | Two agents working on the same slice | `subagent parallel mode` (Promise.all over spawned CLIs) |
| **Turn within agent** | Agent running autonomous loop | `UOK kernel` (single-threaded, event loop) |
| **Tool within turn** | Concurrent tool executions | Not supported (single-threaded LLM dispatch) |
### What Should Actually Be Parallel
**The real parallelism need is at the unit level**, not at the agent level. Milestones and slices are the natural parallelism boundary because:
- They have independent file scope (reduced conflict surface)
- They are tracked independently in the DB
- They have independent cost budgets
- They can recover independently from failure
**Agent-level parallelism within a unit** (subagent parallel/debate) is useful for review and research tasks but is not the primary parallelism mode. It should remain but as a secondary mechanism.
### Proposed Multi-Dimensional Model
```
WorktreeOrchestrator
├── slot[0] → worktree for milestone M1
│   └── UOK kernel running autonomous loop
│       ├── turn[0]: agent dispatch
│       └── turn[1]: agent dispatch (sequential within unit)
├── slot[1] → worktree for milestone M2
│   └── UOK kernel running autonomous loop
└── slot[2] → worktree for slice S1 (within M1)
    └── UOK kernel running autonomous loop
```
**Constraints:**
- Worktrees provide filesystem isolation (required for concurrent file mutations)
- Each worktree runs one UOK kernel (not multiple concurrent kernels per worktree)
- The kernel turn loop is sequential within a worktree (correct — you can't have two LLM turns modifying state simultaneously)
- Tool-level parallelism (e.g., running `grep` and `read` simultaneously) is not needed — the LLM dispatches tools serially
### Concurrency Limits
| Level | Max Concurrent |
|-------|---------------|
| Project (milestones) | `parallel.max_workers` config (default: CPU cores / 2) |
| Milestone (slices) | `parallel.slice_max_workers` config (default: 2) |
| Subagent parallel tasks | `MAX_CONCURRENCY = 4` (currently hardcoded) |
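Consolidated into one config object, the limits above might look like this; only the two config keys and the hardcoded 4 come from the table — the shape and field names are illustrative:
```
import { cpus } from 'node:os';

// Illustrative consolidation of the three limits in the table above.
interface ParallelConfig {
  maxWorkers: number;          // project level: concurrent milestones
  sliceMaxWorkers: number;     // milestone level: concurrent slices
  subagentConcurrency: number; // subagent parallel tasks
}

function defaultParallelConfig(): ParallelConfig {
  return {
    maxWorkers: Math.max(1, Math.floor(cpus().length / 2)), // default: CPU cores / 2
    sliceMaxWorkers: 2,
    subagentConcurrency: 4, // currently hardcoded as MAX_CONCURRENCY
  };
}
```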
---
## 5. DB Access from Subagents
### The Current Constraint
The subagent tool cannot call SF DB tools (`complete-task`, `plan-slice`, etc.) because:
1. It spawns `sf` CLI which is a full binary with its own extension registration
2. The spawned CLI does not share the parent process's RPC connection
3. The 4 registered tools (subagent, scout, reviewer, reporter) are intentionally all that's available
This is **correct security isolation**, not a bug. A spawned `sf` CLI with full SF tool access running in a user-specified `cwd` is a significant attack surface.
### The Right Model
**Layer 1 — No direct DB access from subagents (correct, keep it)**
Subagents should not have direct SQLite access. The DB is the source of truth for the primary agent's state; subagents reading it creates consistency hazards.
**Layer 2 — Structured output from subagents (keep and expand)**
Subagents return structured output (via `--mode json` + event stream). The parent agent is responsible for interpreting the output and calling the appropriate DB tools. This is the "subagent as a function" model — it returns data, not mutations.
**Layer 3 — Intention declaration for later commit**
For cases where a subagent needs to propose a state change (e.g., "I found this issue, mark the slice as blocked"), the subagent should return a structured **intention** (e.g., `{ intended_action: "block_slice", slice_id: "S01", reason: "..." }`). The parent agent reviews and commits it via its own DB tools.
**Layer 4 — Shared WAL for read-your-own-writes consistency (future)**
When the UDA runs subagents in the same process (not spawned CLI), it can share the DB connection. This enables the subagent to read what the parent just wrote in the same transaction. This requires the subagent to run as a headless RPC client, not a spawned CLI.
### Recommendation
**Keep the current constraint for spawned-CLI subagents.** The 4-tool limit is a security boundary, not a limitation to be fixed.
**Add a new subagent mode** — `dispatch.work({ mode: 'rpc' })` — where the subagent runs as an RPC client in the same process, gaining access to all SF tools. This is the headless equivalent of the subagent tool. Use this for internal SF workflows (e.g., "dispatch a review subagent that calls `complete-task`").
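What an rpc-mode call might look like from an internal workflow; the explicit tool allowlist is an assumption about how the security boundary would be expressed:
```
// Assumes the Dispatch interface from section 3. Unit id and tool
// names are illustrative.
async function dispatchReviewer(dispatch: Dispatch) {
  return dispatch.work({
    unit: 'S01-review',
    mode: 'rpc',                              // same-process RPC client, no spawned CLI
    tools: ['read', 'grep', 'complete-task'], // explicit allowlist, not "everything"
  });
}
```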
---
## 6. Naming — The Mental Model
The current names reflect implementation history, not user intent. Here is what they should be:
### Current → Proposed
| Current | Problem | Proposed | Rationale |
|---------|---------|----------|-----------|
| `subagent` tool | "subagent" implies a lesser agent, not a dispatch primitive | `dispatch` tool (in TUI) | The tool *is* the dispatch API surface |
| `parallel-orchestrator` | "orchestrator" is vague; doesn't convey worktree isolation | `worktree-pool` or `worktree-scheduler` | Conveys the resource model |
| `slice-parallel-orchestrator` | Duplicate of above | Merge into `worktree-pool` | See section 2 |
| `UOK kernel` | "kernel" implies OS-level; "UOK" is jargon | `autonomous-runtime` (or keep `UOK` if we accept the acronym) | "UOK" means "unit-of-work kernel" internally; acceptable if documented |
| `MessageBus` | Generic; doesn't convey durability | Keep `MessageBus` | `agent-inbox` was considered, but this genuinely is a bus pattern; the name fits |
| `Cmux` | "cmux" is an implementation detail of terminal multiplexing | `surface-grid` | User-facing concept: "show agents in a grid" |
### The Unified Naming Hierarchy
```
dispatch — The high-level API and TUI tool name
├── work() — Run a single unit (milestone/slice/task)
├── batch() — Run multiple units in parallel (worktree pool)
├── chain() — Run units sequentially, passing output
├── debate() — Run units as adversarial roles
└── subscribe() — Listen to dispatch events
worktree-pool — The backend that manages worktree lifecycle
autonomous-runtime — The PDD-gated execution loop (UOK kernel)
MessageBus — Durable inter-agent messaging
```
---
## 7. Implementation Priority
This is a large refactor. The work should be sequenced to avoid breaking the current system while building the new one underneath.
### Phase 1 — Foundation (Weeks 1-3)
**Goal**: Establish the UDA backbone without changing existing behavior.
| Task | Why | Risk |
|------|-----|------|
| Extract a minimal `dispatch-worktree` module from `parallel-orchestrator.js` that just manages worktree lifecycle (create/remove/heartbeat) | The worktree management is the most isolated piece and the easiest to extract first | Low |
| Add MessageBus subscriptions to `dispatch-worktree` for worker status (replacing session status file polling) | MessageBus already exists; this just redirects the existing file-based IPC | Low |
| Create `dispatch-chain` module that takes an array of `{ unit, afterId }` and runs them sequentially, passing output (see the sketch after this table) | Reuses worktree-pool; no new parallelism semantics | Low |
| **Do NOT change subagent tool or parallel-orchestrator yet** | These must keep working while foundation is laid | — |
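The `dispatch-chain` module could be as small as the following sketch; names are illustrative, and `runOne` stands in for whatever executes one unit via the worktree pool:
```
// Minimal sketch of dispatch-chain: run items in order, feeding each
// the previous item's output as its {previous} input.

interface ChainItem {
  unit: string;
  afterId?: string; // dependency hint; here order is simply array order
}

async function runChain(
  items: ChainItem[],
  runOne: (unit: string, previous?: string) => Promise<string>,
): Promise<string[]> {
  const outputs: string[] = [];
  let previous: string | undefined;
  for (const item of items) {
    previous = await runOne(item.unit, previous); // sequential by design
    outputs.push(previous);
  }
  return outputs;
}
```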
### Phase 2 — Merge (Weeks 4-6)
**Goal**: Eliminate duplication, keep external behavior identical.
| Task | Why | Risk |
|------|-----|------|
| Merge `slice-parallel-orchestrator.js` into `dispatch-worktree` as `scope: 'slice'` parameter | 90% code duplication; this is a pure refactor | Medium |
| Replace `parallel-orchestrator`'s file-based IPC with MessageBus subscriptions | Changes the coordination mechanism but not the external API | Medium |
| Add `dispatch-batch()` that calls `dispatch-worktree` for N units | Reuses the same worktree pool; just adds the batch interface | Low |
| Verify all existing parallel orchestrator tests still pass | Regression protection | Low |
### Phase 3 — Subagent RPC Mode (Weeks 7-8)
**Goal**: Subagent gains headless RPC access without spawning CLI.
| Task | Why | Risk |
|------|-----|------|
| Add `dispatch.rpc()` — spawn a headless RPC client (not CLI) for a subagent | The 4-tool limitation goes away when subagent is an RPC client | Medium |
| Wire `subagent({ mode: 'rpc' })` to use `dispatch.rpc()` | Subagent keeps its 4-mode interface; the implementation changes | Medium |
| Ensure subagent RPC mode cannot access tools the parent mode doesn't permit | Security boundary must be preserved | Medium |
### Phase 4 — UOK as Execution Mode (Weeks 9-10)
**Goal**: UOK kernel becomes a dispatch execution mode, not a separate process.
| Task | Why | Risk |
|------|-----|------|
| Refactor `runAutoLoopWithUok` to be `dispatch.autonomous()` — a UDA execution mode | The autonomous loop becomes a configuration of dispatch, not a separate entry point | Medium |
| `sf headless autonomous` calls `dispatch.batch()` with UOK runtime per slot | The headless binary becomes a thin launcher for the dispatch service | Medium |
| Remove the notion of "UOK kernel" as a separate coordination entity | The kernel is an execution context; coordination is dispatch's job | Medium |
### Phase 5 — Cmux Decoupling (Week 11)
**Goal**: Cmux becomes a MessageBus subscriber, not a dispatch-aware component.
| Task | Why | Risk |
|------|-----|------|
| Make Cmux grid layout creation driven by MessageBus events, not by dispatch calling Cmux directly (see the sketch after this table) | Dispatch should not know about terminal surface implementation | Low |
| Remove `cmuxSplitsEnabled` from subagent tool | This is the concrete coupling point — dispatch knows about Cmux grid layouts | Low |
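The decoupled direction, sketched — the event names, topic, and grid interface are all assumptions:
```
// Cmux owns the subscription; dispatch never imports Cmux.
interface MessageBusLike {
  subscribe(topic: string, handler: (message: unknown) => void): () => void;
}

interface SurfaceGrid {
  addPane(workId: string): void;
  removePane(workId: string): void;
}

function wireCmuxToBus(bus: MessageBusLike, grid: SurfaceGrid): () => void {
  return bus.subscribe('dispatch/events', (msg) => {
    const event = msg as { type: string; workId: string };
    if (event.type === 'start') grid.addPane(event.workId);
    if (event.type === 'end' || event.type === 'error') grid.removePane(event.workId);
  });
}
```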
### Phase 6 — Naming Cleanup (Week 12)
**Goal**: Rename things to match the mental model once the refactor is stable.
| Task | Why | Risk |
|------|-----|------|
| Rename `subagent` tool to `dispatch` in the TUI (keep `subagent` as alias) | User-facing naming should match the mental model | Low |
| Rename `parallel-orchestrator` file to `worktree-pool.js` | Internal naming | Low |
| Document the architecture in `ARCHITECTURE.md` | The current dispatch docs are scattered | Low |
---
## Summary
The 5 dispatch mechanisms + 1 message bus represent 3 genuinely different needs (UOK autonomous loop, worktree-based isolation, durable inter-agent messaging) and 3 duplications (parallel-orchestrator + slice-parallel-orchestrator; subagent parallel mode + parallel-orchestrator; Cmux tight coupling). The root cause is that dispatch, orchestration, and coordination evolved separately rather than being designed as layers of one system.
**The plan is to:**
1. Merge `parallel-orchestrator` + `slice-parallel-orchestrator` into a single `WorktreePool`
2. Make subagent an RPC client of a unified `Dispatch` service, not a spawned CLI
3. Make UOK an execution *mode* of the dispatch service, not a separate process
4. Make MessageBus the event backbone replacing all file-based IPC
5. Decouple Cmux from dispatch entirely (it subscribes to MessageBus)
6. Sequence the refactor so existing behavior is preserved at each step