Dispatch Architecture Consolidation Plan
Status: Draft — for review
Author: Research synthesis from codebase analysis
Date: 2026-05-08
1. Root Cause Diagnosis — Why the Proliferation Happened
The 5 dispatch mechanisms + 1 message bus are not accidental complexity — each is a response to a genuine gap that appeared at a different time, under different constraints. The structural symptom is that dispatch, orchestration, and coordination are conflated into one system: SF grew new systems rather than extending existing ones when the use cases diverged.
The Timeline of Divergence
| Era | Mechanism Added | Gap It Filled |
|---|---|---|
| Early SF | `subagent` tool | Ad-hoc delegation: "run this agent for this task" |
| Parallel work | `parallel-orchestrator` | "Run milestone X in a worktree, independently" — required isolation at the process boundary |
| Slice-level work | `slice-parallel-orchestrator` | Same as above but at finer granularity — duplicate code, not a different concept |
| Autonomous loop | UOK kernel | "Run the full PDD loop continuously, gated by confidence/risk" |
| Multi-agent messaging | MessageBus | "Agents need to communicate across turns/sessions" (Letta-style) |
| Surface multiplexing | Cmux | "TUI needs multiple visible surfaces for parallel agents" |
Structural Root Cause
Single-process thinking drove process-per-unit. The original SF was a single-agent CLI. When parallelism was needed, the natural answer was spawn('sf headless') — a new OS process per milestone. This is correct for isolation but wrong for shared-state coordination. SQLite WAL was bolted on to let workers share a DB, which created the "shared DB with file-based locking" model that all orchestrators now use.
The UOK kernel was designed as a single-agent loop. It runs inside the headless process and manages one autonomous run. It does not know about sibling workers, does not coordinate with the parallel orchestrator, and does not have a model for "I am one of N workers running concurrently."
MessageBus was designed for persistent agents, but SF doesn't have persistent agents yet. The Letta-style inbox model is architecturally correct but premature — you need durable named agents before durable named inboxes matter. Today the MessageBus is used for UOK internal observer chains but not for real multi-agent coordination.
Subagent tool was never designed to integrate with SF's state. It spawns sf CLI which is a full TUI/CLI binary. It cannot call SF tools like complete-task or plan-slice because those are registered in the headless RPC path, not in the subagent's spawned CLI context. The 4 registered tools (subagent, scout, reviewer, reporter) are intentionally narrow to avoid dangerous nested dispatch.
The Missing Abstractions
The proliferation is a symptom of three missing abstractions:
- No unified "dispatch context" — subagent, parallel-orchestrator, and UOK each create their own notion of "what am I running and with what environment"
- No shared dispatch registry — there is no single place that tracks "what is currently running" across all parallelism dimensions
- No first-class "work unit" concept — milestone, slice, and task are different tables with different lock semantics, not different states of the same work unit
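To make the third gap concrete, here is a minimal sketch of what a first-class work unit could look like — one record shape and one claim/transition protocol shared by milestones, slices, and tasks. Everything here (`createWorkUnit`, the state names, the transition map) is illustrative, not existing SF code.

```javascript
// Hypothetical "work unit" shape: milestone, slice, and task become kinds
// of one record instead of three tables with different lock semantics.
const NEXT = {
  pending: ['claimed'],
  claimed: ['running'],
  running: ['done', 'failed'],
};

function createWorkUnit({ id, kind, parentId = null }) {
  if (!['milestone', 'slice', 'task'].includes(kind)) {
    throw new Error(`unknown kind: ${kind}`);
  }
  return { id, kind, parentId, state: 'pending', owner: null };
}

// One claim protocol for every kind — this is what would replace the
// per-table lock semantics (SF_MILESTONE_LOCK vs SF_SLICE_LOCK).
function claim(unit, workerId) {
  if (unit.state !== 'pending') {
    throw new Error(`cannot claim ${unit.id} in state ${unit.state}`);
  }
  return { ...unit, state: 'claimed', owner: workerId };
}

function transition(unit, next) {
  if (!(NEXT[unit.state] || []).includes(next)) {
    throw new Error(`illegal transition ${unit.state} -> ${next}`);
  }
  return { ...unit, state: next };
}
```

A shared dispatch registry then becomes a query over these records ("all units in state `running`") rather than three per-table scans.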
2. What Should Stay vs Merge
Keep (Genuinely Different Needs)
| Mechanism | Reason to Keep |
|---|---|
| UOK kernel | This is the autonomous loop engine. It implements the PDD gate model (confidence/risk/reversibility/blast-radius/cost). Removing it means rewriting autonomous mode from scratch. It should be the inner loop of dispatch, not replaced by it. |
| MessageBus | SQLite-backed durable inbox is the right model for cross-turn coordination when agents are long-lived. This is a genuine infrastructure primitive. However: it should be repurposed, not extended — it serves UOK diagnostics today and should serve agent handoff tomorrow. |
| Cmux | This is surface-layer multiplexing (terminal UI). It belongs in pi-tui, not in the dispatch layer. It should be decoupled from dispatch entirely — the parallel orchestrator should not know about Cmux grid layouts. |
Merge (Duplication Without Functional Difference)
| Duplicated | Problem | Resolution |
|---|---|---|
| `parallel-orchestrator.js` + `slice-parallel-orchestrator.js` | 90% identical code. The only differences are scope (milestone vs slice) and the lock env var name (`SF_MILESTONE_LOCK` vs `SF_SLICE_LOCK`). The conflict detection, worktree management, and worker lifecycle are copy-pasted. | Merge into a single `WorktreeOrchestrator` with a scope parameter. Share all file-overlap detection, worktree lifecycle, and status tracking. |
| `subagent` tool's parallel/debate/chain modes vs `parallel-orchestrator`'s milestone workers | Both implement "run multiple things at the same time." The subagent tool does an in-process `Promise.all` over spawned `sf` CLIs; the parallel orchestrator does the same over `sf headless` with worktrees. They use different IPC mechanisms and different isolation models. | The subagent tool should delegate multi-agent work to the unified orchestrator rather than managing its own concurrency pool. It keeps single-agent dispatch (its core value) but offloads parallel/debate to the orchestrator layer. |
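A sketch of the proposed merge, to show how small the real difference between the two orchestrators is. The `WorktreeOrchestrator` class and the lock-var mapping below are assumptions about the refactor, not existing code; only the two env var names come from the current files.

```javascript
// One orchestrator, parameterized by scope, instead of two copy-pasted files.
const LOCK_VARS = {
  milestone: 'SF_MILESTONE_LOCK', // from parallel-orchestrator.js
  slice: 'SF_SLICE_LOCK',         // from slice-parallel-orchestrator.js
};

class WorktreeOrchestrator {
  constructor({ scope, maxWorkers }) {
    if (!LOCK_VARS[scope]) throw new Error(`unknown scope: ${scope}`);
    this.scope = scope;
    this.maxWorkers = maxWorkers;
    this.slots = [];
  }

  // The env block each worker receives; only the lock var differs by
  // scope, which is why the two orchestrators never needed separate code
  // for conflict detection, worktree lifecycle, or status tracking.
  workerEnv(unitId) {
    return { [LOCK_VARS[this.scope]]: unitId };
  }
}
```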
Refactor (Same Need, Wrong Implementation)
| Current | Issue | Refactor |
|---|---|---|
| `subagent` spawning `sf` CLI | Spawns the full CLI binary with its TUI/headless mode switching. Subagent is a thin wrapper around a spawned binary, not a dispatch primitive. The 4-tool limitation is a workaround for not having a proper dispatch API. | Subagent should use a headless RPC client directly, not spawn `sf`. This lets it call any SF tool, not just the 4 registered ones. |
| `parallel-orchestrator` + `slice-parallel` using SQLite WAL + file IPC | Workers coordinate via `sf headless` + session status files + signal files. This is a hand-rolled IPC layer. The status files are "poll the filesystem" coordination — correct but fragile. | Replace with MessageBus-based coordination. Workers publish status to MessageBus; the coordinator subscribes. Eliminates file-based IPC and session-status polling. |
| UOK kernel owning the autonomous loop | The kernel runs inside a headless process. When the parallel orchestrator spawns `sf headless autonomous`, each worker has its own UOK kernel. Coordination between kernels requires external signals. | The UOK kernel should be the runtime environment for any autonomous dispatch, not a process-bound concept. The orchestrator manages worktree lifecycle; the kernel manages turn-level execution within each worktree. |
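The MessageBus-based coordination can be sketched minimally. SF's real MessageBus is SQLite-backed and durable; the in-memory `Bus` stub below only models the publish/subscribe contract that replaces status-file polling, and all topic and field names are illustrative.

```javascript
// Minimal pub/sub stub standing in for the SQLite-backed MessageBus.
class Bus {
  constructor() { this.subs = new Map(); }
  subscribe(topic, fn) {
    if (!this.subs.has(topic)) this.subs.set(topic, []);
    this.subs.get(topic).push(fn);
  }
  publish(topic, msg) {
    for (const fn of this.subs.get(topic) || []) fn(msg);
  }
}

const bus = new Bus();
const statuses = new Map();

// Coordinator side: no filesystem polling, just a subscription.
bus.subscribe('worker.status', ({ workerId, state }) => {
  statuses.set(workerId, state);
});

// Worker side: publish status instead of writing session status files.
bus.publish('worker.status', { workerId: 'w0', state: 'running' });
bus.publish('worker.status', { workerId: 'w1', state: 'running' });
bus.publish('worker.status', { workerId: 'w0', state: 'done' });
```

With a durable bus, a coordinator that restarts can also replay missed status messages — something the signal-file model handles poorly.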
3. Streamlined Architecture
The Unified Dispatch Layer
┌─────────────────────────────────────────────────────────────────────┐
│ Unified Dispatch API (UDA) │
├─────────────────────────────────────────────────────────────────────┤
│ dispatch.work({ unit, mode, model, tools, cwd, signal }) │
│ dispatch.batch([{ unit, ... }, { unit, ... }], { strategy }) │
│ dispatch.chain([{ unit, after }, ...]) │
│ dispatch.debate([{ unit, role }, ...], { rounds }) │
│ dispatch.subscribe(handler) // for events: start, end, error, log │
│ dispatch.cancel(workId) │
│ dispatch.status() → { active: WorkInfo[] } │
└─────────────────────────────────────────────────────────────────────┘
Modes: isolated (worktree), shared (same process), rpc (separate process via headless)
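A caller's-eye sketch of this API, assuming the three modes above. None of this exists yet — the backends are one-line stubs purely to show how a single `dispatch` facade can route modes to different isolation strategies.

```javascript
// Hypothetical UDA facade: one entry point, three execution backends.
const backends = {
  isolated: (req) => `worktree:${req.unit}`, // WorktreeOrchestrator slot
  shared:   (req) => `inproc:${req.unit}`,   // same-process execution
  rpc:      (req) => `headless:${req.unit}`, // headless RPC client
};

const dispatch = {
  work({ unit, mode = 'isolated' }) {
    const backend = backends[mode];
    if (!backend) throw new Error(`unknown mode: ${mode}`);
    return { workId: backend({ unit }), unit, mode };
  },
  batch(reqs, { strategy = 'parallel' } = {}) {
    return { strategy, items: reqs.map((r) => dispatch.work(r)) };
  },
};
```

Defaulting `mode` to `isolated` matches the document's position that worktree isolation is the primary parallelism mechanism.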
How the Existing Components Map
| Component | Role in Unified Architecture |
|---|---|
| subagent tool | Becomes a thin UDA client in the TUI: single-agent dispatch with full SF tool access. Keeps the 4-mode interface (single/parallel/debate/chain) but implements it via UDA, not a spawned CLI. |
| parallel-orchestrator + slice-parallel | Merge into WorktreeOrchestrator — a UDA backend that manages worktree lifecycle and multi-slot execution. Implements dispatch.work({ mode: 'isolated' }) for milestone/slice workers. |
| UOK kernel | Becomes UOK runtime — a UDA execution mode that wraps any dispatch with the PDD gate model. A dispatch.work({ unit, runControl: 'autonomous' }) automatically uses the UOK runtime. The kernel is not a separate process; it's the execution strategy. |
| MessageBus | Becomes the UDA event/logging backbone. All dispatch events (start, end, tool call, error, cost) are published to MessageBus. The parallel orchestrator's file-based IPC is replaced by MessageBus subscriptions. |
| Cmux | Decoupled entirely. Cmux listens to MessageBus for dispatch events and renders grid layouts accordingly. The dispatch layer does not know about Cmux. |
The Mental Model: Dispatch Is a Service, Not a Tool
The unified dispatch API is a service (backed by WorktreeOrchestrator + UOK runtime) that SF agents and tools call. It is not a tool itself and is not registered as one.
Agent/Tool                          Dispatch Service
    │                                     │
    ├── dispatch.work() ────────────────►│  Spawns worktree, runs UOK loop
    │                                     │
    │◄──── work.start event ─────────────┤
    │◄──── work.end event ───────────────┤
    │
    ├── dispatch.batch() ───────────────►│  Runs N work items in parallel
    │                                     │  (via WorktreeOrchestrator)
    │
    ├── dispatch.chain() ───────────────►│  Runs N items sequentially, passes
    │                                     │  previous output as {previous} input
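The chain semantics in the diagram can be sketched as: each item receives the previous item's output as a `previous` input. The synchronous runner here is a stand-in; in the real service it would be a `dispatch.work()` call.

```javascript
// Chain: run units in order, threading each output into the next input.
function chain(units, run) {
  const results = [];
  let previous = null;
  for (const unit of units) {
    const output = run({ unit, previous }); // first item sees previous = null
    results.push(output);
    previous = output;
  }
  return results;
}
```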
4. Multi-Dimensional Parallelism
SF needs to run multiple things concurrently at multiple levels:
| Dimension | Example | Current Implementation |
|---|---|---|
| Unit (milestone/slice) | Two milestones simultaneously | parallel-orchestrator (worktree-per-milestone) |
| Agent within unit | Two agents working on the same slice | subagent parallel mode (Promise.all over spawned CLIs) |
| Turn within agent | Agent running autonomous loop | UOK kernel (single-threaded, event loop) |
| Tool within turn | Concurrent tool executions | Not supported (single-threaded LLM dispatch) |
What Should Actually Be Parallel
The real parallelism need is at the unit level, not at the agent level. Milestones and slices are the natural parallelism boundary because:
- They have independent file scope (reduced conflict surface)
- They are tracked independently in the DB
- They have independent cost budgets
- They can recover independently from failure
Agent-level parallelism within a unit (subagent parallel/debate) is useful for review and research tasks but is not the primary parallelism mode. It should remain but as a secondary mechanism.
Proposed Multi-Dimensional Model
WorktreeOrchestrator
├── slot[0] → worktree for milestone M1
│ └── UOK kernel running autonomous loop
│ ├── turn[0]: agent dispatch
│ └── turn[1]: agent dispatch (sequential within unit)
├── slot[1] → worktree for milestone M2
│ └── UOK kernel running autonomous loop
└── slot[2] → worktree for slice S1 (within M1)
└── UOK kernel running autonomous loop
Constraints:
- Worktrees provide filesystem isolation (required for concurrent file mutations)
- Each worktree runs one UOK kernel (not multiple concurrent kernels per worktree)
- The kernel turn loop is sequential within a worktree (correct — you can't have two LLM turns modifying state simultaneously)
- Tool-level parallelism (e.g., running `grep` and `read` simultaneously) is not needed — the LLM dispatches tools serially
Concurrency Limits
| Level | Max Concurrent |
|---|---|
| Project (milestones) | parallel.max_workers config (default: CPU cores / 2) |
| Milestone (slices) | parallel.slice_max_workers config (default: 2) |
| Subagent parallel tasks | `MAX_CONCURRENCY = 4` (currently hardcoded) |
5. DB Access from Subagents
The Current Constraint
The subagent tool cannot call SF DB tools (complete-task, plan-slice, etc.) because:
- It spawns the `sf` CLI, which is a full binary with its own extension registration
- The spawned CLI does not share the parent process's RPC connection
- The 4 registered tools (subagent, scout, reviewer, reporter) are intentionally all that's available
This is deliberate security isolation, not a bug. A spawned `sf` CLI with full SF tool access, running in a user-specified cwd, is a significant attack surface.
The Right Model
Layer 1 — No direct DB access from subagents (correct, keep it)
Subagents should not have direct SQLite access. The DB is the source of truth for the primary agent's state; subagents reading it creates consistency hazards.
Layer 2 — Structured output from subagents (keep and expand)
Subagents return structured output (via --mode json + event stream). The parent agent is responsible for interpreting the output and calling the appropriate DB tools. This is the "subagent as a function" model — it returns data, not mutations.
Layer 3 — Intention declaration for later commit
For cases where a subagent needs to propose a state change (e.g., "I found this issue, mark the slice as blocked"), the subagent should return a structured intention (e.g., { intended_action: "block_slice", slice_id: "S01", reason: "..." }). The parent agent reviews and commits it via its own DB tools.
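A sketch of that handoff: the subagent returns an intention object, and the parent validates it against an allowlist before committing through its own DB tools. The action names and the `dbTools.apply` stub are illustrative, not existing SF tool names.

```javascript
// Actions the parent is willing to commit on a subagent's behalf
// (illustrative allowlist).
const ALLOWED_INTENTIONS = new Set(['block_slice', 'flag_task']);

// Parent-side review: refuse anything outside the allowlist, then commit
// via the parent's own DB tools (stubbed as dbTools.apply).
function commitIntention(intention, dbTools) {
  if (!ALLOWED_INTENTIONS.has(intention.intended_action)) {
    return { committed: false, reason: `disallowed: ${intention.intended_action}` };
  }
  dbTools.apply(intention);
  return { committed: true };
}

// What a subagent would return instead of touching the DB itself:
const intention = {
  intended_action: 'block_slice',
  slice_id: 'S01',
  reason: 'dependency on unreleased API',
};
```

The key property: the subagent stays a pure function over its inputs, and every mutation passes through the parent's review.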
Layer 4 — Shared WAL for read-your-own-writes consistency (future)
When the UDA runs subagents in the same process (not a spawned CLI), it can share the DB connection. This lets the subagent read what the parent just wrote in the same transaction. It requires the subagent to run as a headless RPC client, not a spawned CLI.
Recommendation
Keep the current constraint for spawned-CLI subagents. The 4-tool limit is a security boundary, not a limitation to be fixed.
Add a new subagent mode — dispatch.work({ mode: 'rpc' }) — where the subagent runs as an RPC client in the same process, gaining access to all SF tools. This is the headless equivalent of the subagent tool. Use this for internal SF workflows (e.g., "dispatch a review subagent that calls complete-task").
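The security requirement from Phase 3 — rpc mode must never exceed the parent's permissions — can be sketched as an intersection rule. Tool names and the function shape are illustrative; only the 4-tool spawned-CLI set comes from the document.

```javascript
// The 4 tools available to spawned-CLI subagents today.
const SPAWNED_CLI_TOOLS = ['subagent', 'scout', 'reviewer', 'reporter'];

// Effective toolset = requested ∩ (mode's base set ∩ parent's set).
// rpc mode widens the base to everything the parent has, but never beyond.
function effectiveTools({ mode, requested, parentTools }) {
  const base = mode === 'rpc' ? parentTools : SPAWNED_CLI_TOOLS;
  const allowed = new Set(base.filter((t) => parentTools.includes(t)));
  return requested.filter((t) => allowed.has(t));
}
```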
6. Naming — The Mental Model
The current names reflect implementation history, not user intent. Here is what they should be:
Current → Proposed
| Current | Problem | Proposed | Rationale |
|---|---|---|---|
| `subagent` tool | "subagent" implies a lesser agent, not a dispatch primitive | `dispatch` tool (in TUI) | The tool is the dispatch API surface |
| `parallel-orchestrator` | "orchestrator" is vague; doesn't convey worktree isolation | `worktree-pool` or `worktree-scheduler` | Conveys the resource model |
| `slice-parallel-orchestrator` | Duplicate of the above | Merge into `worktree-pool` | See section 2 |
| UOK kernel | "kernel" implies OS-level; "UOK" is jargon | `autonomous-runtime`, or keep UOK if we accept the acronym | "UOK" means "unit-of-work kernel" internally; it can stay if documented |
| MessageBus | Generic; doesn't convey durability | Keep MessageBus (`agent-inbox` was considered, but it is more than an inbox) | MessageBus is accurate — it is a bus pattern |
| Cmux | "cmux" is an implementation detail of terminal multiplexing | `surface-grid` | User-facing concept: "show agents in a grid" |
The Unified Naming Hierarchy
dispatch — The high-level API and TUI tool name
├── work() — Run a single unit (milestone/slice/task)
├── batch() — Run multiple units in parallel (worktree pool)
├── chain() — Run units sequentially, passing output
├── debate() — Run units as adversarial roles
└── subscribe() — Listen to dispatch events
worktree-pool — The backend that manages worktree lifecycle
autonomous-runtime — The PDD-gated execution loop (UOK kernel)
MessageBus — Durable inter-agent messaging
7. Implementation Priority
This is a large refactor. The work should be sequenced to avoid breaking the current system while building the new one underneath.
Phase 1 — Foundation (Weeks 1-3)
Goal: Establish the UDA backbone without changing existing behavior.
| Task | Why | Risk |
|---|---|---|
| Extract a minimal `dispatch-worktree` module from `parallel-orchestrator.js` that just manages worktree lifecycle (create/remove/heartbeat) | The worktree management is the most isolated piece and the easiest to extract first | Low |
| Add MessageBus subscriptions to `dispatch-worktree` for worker status (replacing session status file polling) | MessageBus already exists; this just redirects the existing file-based IPC | Low |
| Create a `dispatch-chain` module that takes an array of `{ unit, afterId }` and runs them sequentially, passing output | Reuses `worktree-pool`; no new parallelism semantics | Low |
| Do NOT change the subagent tool or `parallel-orchestrator` yet | These must keep working while the foundation is laid | — |
Phase 2 — Merge (Weeks 4-6)
Goal: Eliminate duplication, keep external behavior identical.
| Task | Why | Risk |
|---|---|---|
| Merge `slice-parallel-orchestrator.js` into `dispatch-worktree` behind a `scope: 'slice'` parameter | 90% code duplication; this is a pure refactor | Medium |
| Replace `parallel-orchestrator`'s file-based IPC with MessageBus subscriptions | Changes the coordination mechanism but not the external API | Medium |
| Add `dispatch.batch()`, which calls `dispatch-worktree` for N units | Reuses the same worktree pool; just adds the batch interface | Low |
| Verify all existing parallel orchestrator tests still pass | Regression protection | Low |
Phase 3 — Subagent RPC Mode (Weeks 7-8)
Goal: Subagent gains headless RPC access without spawning CLI.
| Task | Why | Risk |
|---|---|---|
| Add `dispatch.rpc()` — spawn a headless RPC client (not a CLI) for a subagent | The 4-tool limitation goes away when the subagent is an RPC client | Medium |
| Wire `subagent({ mode: 'rpc' })` to use `dispatch.rpc()` | Subagent keeps its 4-mode interface; the implementation changes | Medium |
| Ensure subagent RPC mode cannot access tools the parent mode doesn't permit | The security boundary must be preserved | Medium |
Phase 4 — UOK as Execution Mode (Weeks 9-10)
Goal: UOK kernel becomes a dispatch execution mode, not a separate process.
| Task | Why | Risk |
|---|---|---|
| Refactor `runAutoLoopWithUok` into `dispatch.autonomous()` — a UDA execution mode | The autonomous loop becomes a configuration of dispatch, not a separate entry point | Medium |
| Make `sf headless autonomous` call `dispatch.batch()` with the UOK runtime per slot | The headless binary becomes a thin launcher for the dispatch service | Medium |
| Remove the notion of the "UOK kernel" as a separate coordination entity | The kernel is an execution context; coordination is dispatch's job | Medium |
Phase 5 — Cmux Decoupling (Week 11)
Goal: Cmux becomes a MessageBus subscriber, not a dispatch-aware component.
| Task | Why | Risk |
|---|---|---|
| Make Cmux grid layout creation driven by MessageBus events, not by dispatch calling Cmux directly | Dispatch should not know about the terminal surface implementation | Low |
| Remove `cmuxSplitsEnabled` from the subagent tool | This is the concrete coupling point — dispatch knows about Cmux grid layouts | Low |
Phase 6 — Naming Cleanup (Week 12)
Goal: Rename things to match the mental model once the refactor is stable.
| Task | Why | Risk |
|---|---|---|
| Rename the `subagent` tool to `dispatch` in the TUI (keep `subagent` as an alias) | User-facing naming should match the mental model | Low |
| Rename the `parallel-orchestrator` file to `worktree-pool.js` | Internal naming | Low |
| Document the architecture in `ARCHITECTURE.md` | The current dispatch docs are scattered | Low |
Summary
The 5 dispatch mechanisms + 1 message bus represent 3 genuinely different needs (UOK autonomous loop, worktree-based isolation, durable inter-agent messaging) and 3 duplications (parallel-orchestrator + slice-parallel-orchestrator; subagent parallel mode + parallel-orchestrator; Cmux tight coupling). The root cause is that dispatch, orchestration, and coordination evolved separately rather than being designed as layers of one system.
The plan is to:
- Merge `parallel-orchestrator` + `slice-parallel-orchestrator` into a single `WorktreePool`
- Make subagent an RPC client of a unified `Dispatch` service, not a spawned CLI
- Make UOK an execution mode of the dispatch service, not a separate process
- Make MessageBus the event backbone replacing all file-based IPC
- Decouple Cmux from dispatch entirely (it subscribes to MessageBus)
- Sequence the refactor so existing behavior is preserved at each step