
Dispatch/Orchestration Architecture — Consolidation Plan

Author: Research synthesis
Date: 2026-05-08
Status: Draft — for review and promotion


1. Root Cause Diagnosis

The 5 dispatch mechanisms + 1 message bus grew to fill genuine gaps at different stages, but the structural root cause is a missing abstraction layer: there is no single concept that separates "what to run" (controller) from "how to run" (mechanism).

Timeline of divergence

| Era | Mechanism | Gap filled |
| --- | --- | --- |
| Early SF | subagent tool (extensions/subagent/index.js) | Ad-hoc delegation: "run this agent for this task" from within a session |
| Parallel work | parallel-orchestrator.js | "Run milestone X in a worktree, independently" — process isolation via spawn('sf headless') |
| Slice-level work | slice-parallel-orchestrator.js | Same as above but at slice granularity — 90% copy-paste of parallel-orchestrator |
| Autonomous loop | UOK kernel (uok/kernel.js) | "Run the full PDD loop continuously, gated by confidence/risk" |
| Multi-agent messaging | MessageBus (uok/message-bus.js) | "Agents need to communicate across turns/sessions" (Letta-style) |
| Surface multiplexing | Cmux (cmux/index.js) | "TUI needs multiple visible surfaces for parallel agents" |

Three structural problems

1. Single-process thinking drove process-per-unit.
SF was originally a single-agent CLI. When parallelism was needed, the natural answer was spawn('sf headless') — a new OS process per milestone. This is correct for filesystem isolation but requires bolted-on coordination (SQLite WAL, file-based IPC, session status polling).

2. UOK kernel was designed as a single-agent loop.
It runs inside a headless process and manages one autonomous run. It does not know about sibling workers spawned by parallel-orchestrator, does not coordinate with it, and has no model for "I am one of N workers running concurrently."

3. MessageBus was designed for persistent agents SF doesn't have yet.
The Letta-style inbox model is architecturally correct but premature — you need durable named agents before durable named inboxes matter. Today MessageBus is used for UOK internal observer chains but not for real multi-agent coordination between workers and coordinator.

The core problem

The proliferation is a missing abstraction problem at three levels:

  1. No unified "dispatch context" — subagent, parallel-orchestrator, and UOK each create their own notion of "what am I running and with what environment"
  2. No shared dispatch registry — no single place that tracks "what is currently running" across all parallelism dimensions
  3. No first-class "work unit" concept — milestone, slice, and task are different tables with different lock semantics, not different states of the same work unit
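
To make the missing abstraction concrete, here is a minimal sketch of a shared dispatch envelope and registry. The field names are hypothetical and nothing below is existing SF code; it only illustrates the shape such an abstraction could take.

```js
// Hypothetical sketch only — not existing SF code. One envelope shape for
// every work unit, plus a registry answering "what is currently running".

/**
 * @typedef {Object} DispatchEnvelope
 * @property {'milestone'|'slice'|'task'} scope - granularity of the work unit
 * @property {string} unitId - milestone/slice/task identifier
 * @property {'interactive'|'autonomous'} runControl - who drives the loop
 * @property {{cwd: string, worktree?: string}} environment - where it runs
 * @property {{maxCostUsd?: number, timeoutMs?: number}} limits - budget/time caps
 */

/** Single registry of in-flight work across all parallelism dimensions. */
const activeDispatches = new Map(); // unitId -> DispatchEnvelope

function register(envelope) {
  activeDispatches.set(envelope.unitId, envelope);
}

function release(unitId) {
  activeDispatches.delete(unitId);
}
```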

2. What Should Stay vs Merge

Stay (genuinely different needs)

| Mechanism | Reason to Keep |
| --- | --- |
| UOK kernel | Autonomous loop engine implementing the PDD gate model (confidence/risk/reversibility/blast-radius/cost). This is the controller — it decides what to run, not how. Removing it means rewriting autonomous mode from scratch. |
| MessageBus | SQLite-backed durable inbox is the right model for cross-turn coordination. This is genuine infrastructure. However: it should be repurposed — it serves UOK diagnostics today and should serve agent handoff when persistent agents land (v3.1 per BUILD_PLAN.md). |
| Cmux | Terminal UI surface multiplexing. Belongs in pi-tui, not the dispatch layer. Should be decoupled from dispatch entirely — the parallel orchestrator should not know about Cmux grid layouts. |
| Execution graph (uok/execution-graph.js) | File-conflict DAG that computes which milestones/slices can run in parallel. This is the constraint solver — stays separate from the dispatch mechanism. |

Merge (duplication without functional difference)

| Duplicated | Problem | Resolution |
| --- | --- | --- |
| parallel-orchestrator.js + slice-parallel-orchestrator.js | ~80% identical. Only diffs: scope (milestone vs slice), lock env vars, status file naming. The slice orchestrator additionally calls slice-parallel-conflict.ts for file overlap filtering. | Merge into a single WorktreeOrchestrator parameterized by `{ scope: 'milestone' \| 'slice' }`. |
| subagent tool's parallel/debate/chain modes vs parallel-orchestrator's milestone workers | Both implement "run multiple things at the same time." The subagent tool does an in-process Promise.all over spawned sf CLIs; parallel-orchestrator does the same with worktrees. Different IPC mechanisms, different isolation models. | Subagent keeps single-agent dispatch (its core value). For multi-agent work, subagent should delegate to the unified orchestrator rather than managing its own concurrency pool. |

Refactor (same need, wrong implementation)

| Current | Issue | Refactor |
| --- | --- | --- |
| subagent spawning the sf CLI | Thin wrapper spawning a full binary. Only 4 tools are registered, as a workaround for not having a proper dispatch API. | Subagent should use a headless RPC client directly, not spawn sf. Enables calling any SF tool, not just the 4 registered ones. |
| parallel/slice orchestrator using SQLite WAL + file IPC | Hand-rolled IPC via session status files + signal files. "Poll the filesystem" coordination — correct but fragile. | Replace with MessageBus-based coordination. Workers publish status to MessageBus; the coordinator subscribes. |
| UOK kernel owning the autonomous loop | Runs inside a headless process. When the parallel orchestrator spawns sf headless autonomous, each worker has its own UOK kernel with no coordination between kernels. | UOK kernel should be the runtime environment for any autonomous dispatch, not a process-bound concept. |

3. Streamlined Architecture

Three-tier dispatch model

┌──────────────────────────────────────────────────────────────────┐
│  UOK Kernel (controller)                                          │
│  Decides WHAT to run next; enforces PDD gates, policy, parity     │
│  - Phase machine: Discuss → Plan → Execute → Merge → Complete      │
│  - Calls WorktreeOrchestrator.dispatch() to execute                 │
└────────────────────────────┬──────────────────────────────────────┘
                             │ DispatchEnvelope { scope, unitId, ... }
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│  WorktreeOrchestrator (mechanism)                                   │
│  Decides HOW to run: worktree lifecycle, process registry, budget   │
│  - Worktree pool (git worktree per milestone/slice)                │
│  - Process registry (child_process per worker)                      │
│  - Cost accumulator (NDJSON parsing from worker stdout)             │
│  - File-intent tracker (parallel-intent.js)                        │
│  - MessageBus integration per worker (AgentInbox)                   │
└────────────────────────────┬──────────────────────────────────────┘
                             │ spawns
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│  Worker (execution unit)                                           │
│  `sf headless --json autonomous` in a worktree                    │
│  - Owns SQLite WAL connection to project DB                        │
│  - Has AgentInbox for MessageBus delivery                          │
│  - Emits NDJSON events consumed by WorktreeOrchestrator            │
└──────────────────────────────────────────────────────────────────┘

How existing components map

| Component | Role in Unified Architecture |
| --- | --- |
| subagent tool | Thin UDA client in the TUI. Single-agent dispatch with full SF tool access. Keeps the 4-mode interface (single/parallel/debate/chain) but implemented via UDA, not a spawned CLI. |
| parallel-orchestrator + slice-parallel | Merge into WorktreeOrchestrator — a UDA backend managing worktree lifecycle and multi-slot execution. |
| UOK kernel | Becomes autonomous-runtime — a UDA execution mode wrapping any dispatch with the PDD gate model. dispatch.work({ unit, runControl: 'autonomous' }) automatically uses the autonomous-runtime. |
| MessageBus | Becomes the UDA event/logging backbone. All dispatch events (start, end, error, cost) are published to MessageBus. File-based IPC is replaced by MessageBus subscriptions. |
| Cmux | Decoupled entirely. Cmux listens to MessageBus for dispatch events and renders grid layouts. The dispatch layer does not know about Cmux. |

WorktreeOrchestrator interface (proposed)

// File: src/resources/extensions/sf/worktree-orchestrator.js

interface DispatchOptions {
  scope: 'milestone' | 'slice';
  milestoneId: string;
  sliceId?: string;
  basePath: string;
  maxWorkers?: number;
  budgetCeiling?: number;
  workerTimeoutMs?: number;
  shellWrapper?: string[];
  useExecutionGraph?: boolean;
}

class WorktreeOrchestrator {
  // Returns eligible units filtered by execution-graph conflicts
  async prepare(opts: DispatchOptions): Promise<PrepareResult>;

  // Start workers for given unit IDs
  async start(ids: string[], opts: DispatchOptions): Promise<StartResult>;

  // Stop all or specific workers
  async stop(ids?: string[]): Promise<void>;

  // Pause/resume workers via MessageBus
  pause(ids?: string[]): void;
  resume(ids?: string[]): void;

  // Read current state (for dashboard)
  getStatus(): DispatchStatus;

  // Shared MessageBus instance
  readonly bus: MessageBus;

  // Budget tracking
  totalCost(): number;
  isBudgetExceeded(): boolean;
}

How UOK kernel uses WorktreeOrchestrator

Today, uok/kernel.js runs the autonomous loop and calls into tools that spawn agents. The parallel orchestrator is started separately by the TUI dashboard or headless command. After unification:

  1. UOK kernel initializes WorktreeOrchestrator at autonomous loop start
  2. UOK calls orchestrator.start(eligibleMilestoneIds) for parallel milestones
  3. Workers emit NDJSON events → orchestrator parses cost → updates budget
  4. Workers emit completion → UOK kernel processes post-unit staging
  5. Workers receive messages via their AgentInbox (MessageBus integration)
  6. orchestrator.stop() called on autonomous loop exit
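
A rough sketch of steps 1-6 from the kernel's side, written against the interface proposed above. Result fields such as eligibleIds and getStatus().running are assumptions, not settled API.

```js
// Sketch of the kernel-side loop against the proposed WorktreeOrchestrator.
// `prepare()` result fields and `getStatus().running` are assumptions.
async function runParallelBatch(orchestrator, opts) {
  // 1-2. Prepare eligible units (execution-graph filtered) and start workers.
  const { eligibleIds } = await orchestrator.prepare(opts);
  if (!eligibleIds.length) return;
  await orchestrator.start(eligibleIds, opts);

  // 3-5. Workers emit NDJSON; the orchestrator accumulates cost and status.
  while (orchestrator.getStatus().running > 0) {
    if (orchestrator.isBudgetExceeded()) break; // stop dispatching past the ceiling
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }

  // 6. Always tear down workers and worktrees on loop exit.
  await orchestrator.stop();
}
```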

4. Multi-Dimensional Parallelism

Current axes of parallelism

| Axis | Mechanism | Status |
| --- | --- | --- |
| Inter-project | Multiple sf invocations | not SF's concern |
| Inter-milestone | parallel-orchestrator + worktrees | implemented |
| Inter-slice | slice-parallel-orchestrator + worktrees | implemented |
| Inter-task (in-process) | subagent parallel mode (mapWithConcurrencyLimit) | implemented |
| Inter-agent (debate/chain) | subagent debate/chain mode | implemented |
| Terminal-level | Cmux grid layout for parallel agents | implemented |

What "true concurrency" means

The current architecture already achieves true process-level concurrency via worktrees and separate sf headless processes. The shared SQLite WAL allows concurrent readers with a single writer.

What is missing is coordinated dispatch, not more parallelism axes:

  • The execution graph (uok/execution-graph.js) already computes file-conflict relationships
  • selectConflictFreeBatch picks a conflict-free subset for parallel dispatch
  • But this is only wired into parallel-orchestrator, not into slice-parallel or the UOK autonomous loop's dispatch decisions

Proposed coordination model

Execution Graph (file-conflict DAG)
    │
    ├── selectConflictFreeBatch() ──► WorktreeOrchestrator.start()
    │                                  Workers run in parallel
    │                                  Each worker has AgentInbox
    │
UOK kernel
    │
    ├── reads unit readiness from DB
    ├── calls WorktreeOrchestrator.start(milestoneIds)
    └── calls WorktreeOrchestrator.start(sliceIds) for intra-milestone parallelism

Debate mode (subagent tool): runs multiple agents sequentially within a single process using mapWithConcurrencyLimit. This is not true process-level parallelism but is correct for LLM-based debate where shared context and a single conversation transcript are needed.

Chain mode: purely sequential — each step's output feeds into the next step's prompt.

Concurrency limits

| Level | Max Concurrent |
| --- | --- |
| Project (milestones) | parallel.max_workers config (default: CPU cores / 2) |
| Milestone (slices) | parallel.slice_max_workers config (default: 2) |
| Subagent parallel tasks | MAX_CONCURRENCY = 4 (hardcoded in subagent/index.js) |
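
For reference, a concurrency-limited map of the kind the name mapWithConcurrencyLimit suggests looks roughly like the following. This is a generic sketch, not the actual subagent/index.js implementation.

```js
// Generic sketch of a concurrency-limited map; not the actual
// subagent/index.js implementation.
async function mapWithConcurrencyLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;

  // `limit` workers each pull the next index until the items run out.
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  });

  await Promise.all(workers);
  return results;
}
```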

5. DB Access from Subagents

The current constraint is intentional

The subagent tool cannot call complete-task or plan-slice because:

  1. Only 4 tools are registered in the subagent extension manifest
  2. The subagent is meant to be a task executor, not a state mutator

This is deliberate security isolation, not a bug. A spawned sf CLI with full SF tool access running in a user-specified cwd is a significant attack surface.

The right model: two-tier DB access

Coordinator (UOK kernel)     ──►  project .sf/sf.db (WAL mode)
                                    milestone/slice state
                                    task execution ledger

Subagent (sf process)        ──►  ~/.sf/sf.db (global)
                                    memories, preferences
                                    agent-level state
                              ✗    project .sf/sf.db (write)

The subagent can read project state via prompt injection (system context assembly already does this). Writes only to global state.

If a subagent needs to record a finding

  1. Subagent writes to its output (stdout/file)
  2. Coordinator reads and processes the output
  3. Coordinator calls DB tools

This is the Letta pattern — agents return results, the orchestrator decides what to persist.
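
A minimal sketch of that handoff follows. The helper names runSubagent and callTool are illustrative, not existing SF APIs; only the shape of the flow is the point.

```js
// Sketch of the handoff: the subagent returns data, the coordinator owns all
// project-DB writes. runSubagent and callTool are illustrative names.
async function delegateAndPersist(coordinator, taskPrompt) {
  // 1. Subagent runs in isolation and returns a structured result on stdout.
  const output = await coordinator.runSubagent({ prompt: taskPrompt });

  // 2. Coordinator parses and validates the finding.
  const finding = JSON.parse(output); // e.g. { taskId, status, notes }

  // 3. Only the coordinator calls project-DB tools (.sf/sf.db writes).
  if (finding.status === 'complete') {
    await coordinator.callTool('complete-task', {
      taskId: finding.taskId,
      notes: finding.notes,
    });
  }
  return finding;
}
```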

Architectural backing for the constraint

// In subagent tool — formalize the access contract
const SUBAGENT_DB_ACCESS = {
  read: ['project_context'],    // via prompt injection only
  write: ['~/.sf/sf.db'],       // global state only
  prohibited: ['project .sf/sf.db write operations']
};

The extension manifest's tools[] array currently enforces this by omission. A more explicit model would declare the access contract formally, making it auditable.

Future: RPC-mode subagent

Keep the current constraint for spawned-CLI subagents.

Add a new subagent mode, dispatch.work({ mode: 'rpc' }), where the subagent runs as an RPC client in the same process, gaining access to all SF tools. This is the headless equivalent of the subagent tool. Use it for internal SF workflows (e.g., "dispatch a review subagent that calls complete-task").


6. Naming — The Mental Model

Current → Proposed

| Current | Problem | Proposed | Rationale |
| --- | --- | --- | --- |
| subagent tool | "subagent" implies a lesser agent | dispatch tool (in TUI) | The tool is the dispatch API surface |
| parallel-orchestrator | "orchestrator" is vague; doesn't convey worktree isolation | milestone-dispatcher | Conveys scope + role |
| slice-parallel-orchestrator | Duplicate of above | slice-dispatcher (merged into WorktreeOrchestrator) | See section 2 |
| WorktreeOrchestrator (new) |  | worktree-orchestrator | Backend that manages worktree lifecycle |
| UOK kernel | "kernel" implies OS-level; "UOK" is jargon | autonomous-runtime | PDD-gated execution loop |
| MessageBus | Generic | keep as-is | It is a bus pattern. Keep it. |
| Cmux | "cmux" is implementation detail | surface-grid | User-facing: "show agents in a grid" |

The mental model hierarchy

dispatch           — The high-level API and TUI tool name
  ├── work()       — Run a single unit (milestone/slice/task)
  ├── batch()      — Run multiple units in parallel (worktree pool)
  ├── chain()      — Run units sequentially, passing output
  ├── debate()     — Run units as adversarial roles
  └── subscribe()  — Listen to dispatch events (MessageBus)

worktree-orchestrator  — Backend: worktree lifecycle + process registry
autonomous-runtime     — PDD gate model (UOK kernel, renamed)
MessageBus             — Durable inter-agent messaging (keeps name)
surface-grid           — Cmux decoupled from dispatch
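
As a sketch of how that surface could be wired, the method names below mirror the hierarchy above, while createDispatch and the subscription topic are assumptions; chain() and debate() would keep delegating to the subagent tool's modes.

```js
// Sketch of the `dispatch` API surface. Method names mirror the hierarchy
// above; createDispatch and the subscription topic are assumptions.
function createDispatch({ orchestrator, bus }) {
  return {
    // Run a single unit through the worktree backend.
    work: (unit, opts = {}) =>
      orchestrator.start([unit.id], { ...opts, scope: unit.scope }),
    // Run a conflict-free batch in parallel worktrees.
    batch: (units, opts = {}) =>
      orchestrator.start(units.map((u) => u.id), opts),
    // Listen to dispatch events published on the MessageBus.
    subscribe: (handler) => bus.subscribe('dispatch', handler),
  };
}
```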

Why "kernel" is the right metaphor for UOK (keep it internally):
A kernel manages resources and enforces policy; it doesn't do the work itself. The UOK kernel evaluates confidence/risk gates, manages parity reporting, and decides when to proceed — but it delegates execution to WorktreeOrchestrator. The name fits.


7. Implementation Priority

Phase 1 — Merge the two orchestrators (Lowest risk, highest clarity)

1.1 — Extract WorktreeOrchestrator from parallel-orchestrator + slice-parallel

Create new dispatch-layer.js that merges the ~80% shared logic. Parameterized by { scope: 'milestone' | 'slice' }. The slice orchestrator's conflict-filtering logic (filterConflictingSlices in slice-parallel-conflict.ts) already lives separately — call it from the merged class.
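
A sketch of the scope parameterization is below. The lock env var names, status-file patterns, and the import path for the existing conflict filter are placeholders, not the real values.

```js
// Sketch of the scope parameterization. Env var names, status-file patterns,
// and the conflict-filter import path are placeholders, not the real values.
const { filterConflictingSlices } = require('./slice-parallel-conflict');

const SCOPE_CONFIG = {
  milestone: {
    lockEnvVar: 'SF_MILESTONE_LOCK',                   // placeholder name
    statusFile: (id) => `milestone-${id}.status.json`, // placeholder pattern
    filterConflicts: null,                             // execution graph handles milestone conflicts
  },
  slice: {
    lockEnvVar: 'SF_SLICE_LOCK',                       // placeholder name
    statusFile: (id) => `slice-${id}.status.json`,     // placeholder pattern
    filterConflicts: filterConflictingSlices,          // existing slice overlap filter
  },
};

class DispatchLayer {
  constructor({ scope, ...opts }) {
    this.scope = scope;
    this.cfg = SCOPE_CONFIG[scope];
    this.opts = opts;
  }

  // Slice scope filters by file overlap on top of the execution graph;
  // milestone scope relies on the execution graph alone.
  eligibleUnits(units) {
    return this.cfg.filterConflicts ? this.cfg.filterConflicts(units) : units;
  }
}

module.exports = { DispatchLayer };
```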

Files touched:

  • New: src/resources/extensions/sf/dispatch-layer.js
  • Refactor: parallel-orchestrator.js → thin wrapper calling dispatch-layer
  • Refactor: slice-parallel-orchestrator.js → thin wrapper calling dispatch-layer

Test: Both /parallel command and slice-level parallelism continue to work identically. Dashboard continues to show correct worker states.

Effort: ~1 week. Pure refactor, no behavior change.

Phase 2 — Wire MessageBus into WorktreeOrchestrator

2.1 — Add AgentInbox to each worker

Every sf headless worker opens a MessageBus inbox named after its milestone/slice ID. The coordinator can send messages to workers (pause, resume, report status).

2.2 — Replace file-based IPC with MessageBus

Replace session-status-io.js polling and sendSignal file-based signals with MessageBus send(). File-based signals remain as crash-recovery fallback.
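
A sketch of the MessageBus-backed path is below. The MessageBus method names (send, subscribe) and topic strings are assumptions, since the bus API is not spelled out in this plan; file-based signals remain the crash-recovery fallback.

```js
// Sketch of MessageBus-backed coordination. The MessageBus method names
// (send, subscribe) and topic strings are assumptions.

// Worker side: publish status instead of only writing a status file.
function publishWorkerStatus(bus, workerId, status) {
  bus.send({
    to: 'coordinator',
    from: `worker:${workerId}`,
    type: 'dispatch.status',
    body: { workerId, status, at: Date.now() },
  });
}

// Coordinator side: react to messages instead of polling the filesystem.
function watchWorkers(bus, onStatus) {
  return bus.subscribe('coordinator', (msg) => {
    if (msg.type === 'dispatch.status') onStatus(msg.body);
  });
}
```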

Files touched:

  • dispatch-layer.js (new)
  • session-status-io.js (add MessageBus-backed path)
  • Worker bootstrap in both orchestrators

Test: Workers respond to coordinator pause/resume messages delivered via MessageBus.

Effort: ~3 days.

Phase 3 — Subagent RPC Mode

3.1 — Add dispatch.rpc() — spawn a headless RPC client (not CLI)

The 4-tool limitation goes away when subagent is an RPC client. The subagent keeps its 4-mode interface; the implementation changes.

3.2 — Ensure subagent RPC mode cannot access tools the parent mode doesn't permit

Security boundary must be preserved. This is where the access contract from section 5 gets enforced.
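
A sketch of that enforcement point follows: the RPC-mode subagent's tool surface is the intersection of what it requests and what the parent session permits. The names allowedTools, callTool, and createScopedRpcClient are illustrative, not existing APIs.

```js
// Sketch of the enforcement point. Names (allowedTools, callTool,
// createScopedRpcClient) are illustrative, not existing SF APIs.
function createScopedRpcClient(parentSession, requestedTools) {
  const allowed = new Set(parentSession.allowedTools);
  const granted = requestedTools.filter((t) => allowed.has(t));

  return {
    grantedTools: granted,
    callTool(name, args) {
      if (!granted.includes(name)) {
        throw new Error(`subagent rpc: tool "${name}" not permitted by parent session`);
      }
      return parentSession.callTool(name, args);
    },
  };
}
```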

Files touched:

  • extensions/subagent/index.js (add RPC mode path)
  • extensions/subagent/rpc-client.js (new)

Test: A subagent with mode: 'rpc' can call complete-task and other SF tools.

Effort: ~1 week.

Phase 4 — UOK Kernel Adopts WorktreeOrchestrator

4.1 — Replace direct parallel-orchestrator calls with WorktreeOrchestrator

The autonomous loop's parallel dispatch path (analyzeParallelEligibility → startParallel) goes through WorktreeOrchestrator instead of calling parallel-orchestrator directly.

4.2 — UOK reads worker status from WorktreeOrchestrator

Dashboard refresh reads from orchestrator.getStatus() instead of directly from parallel-orchestrator's state.

Files touched:

  • uok/kernel.js (import WorktreeOrchestrator)
  • parallel-orchestrator.js (becomes wrapper or is removed)

Test: Autonomous mode with parallel milestones works identically to current behavior.

Effort: ~3 days.

Phase 5 — Cmux Decoupling

5.1 — Make Cmux grid layout driven by MessageBus events

Dispatch should not call Cmux directly. Cmux subscribes to MessageBus dispatch events and creates/destroys grid surfaces accordingly.

5.2 — Remove cmuxSplitsEnabled from subagent tool

This is the concrete coupling point — dispatch knows about Cmux grid layouts. Remove it; let Cmux manage its own surface allocation based on dispatch events.
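
A sketch of the subscriber side is below. The event names and the surface API (createSurface/destroySurface) are assumptions, not the current Cmux API.

```js
// Sketch of the subscriber side. Event names and the surface API
// (createSurface/destroySurface) are assumptions, not the current Cmux API.
function attachCmuxToDispatch(bus, cmux) {
  return bus.subscribe('dispatch', (msg) => {
    switch (msg.type) {
      case 'dispatch.worker.started':
        cmux.createSurface({ id: msg.body.workerId, title: msg.body.unitId });
        break;
      case 'dispatch.worker.finished':
      case 'dispatch.worker.failed':
        cmux.destroySurface(msg.body.workerId);
        break;
    }
  });
}
```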

Files touched:

  • cmux/index.js (add MessageBus subscriber)
  • extensions/subagent/index.js (remove cmuxSplitsEnabled)

Effort: ~2 days.

Phase 6 — Naming Cleanup

6.1 — Rename dispatch-layer.js → worktree-orchestrator.js
6.2 — Rename parallel-orchestrator wrapper → milestone-dispatcher.js
6.3 — Rename slice-parallel-orchestrator wrapper → slice-dispatcher.js
6.4 — Update all import references

Effort: ~1 day.

Phase 7 — Document the Architecture

7.1 — Update ARCHITECTURE.md
Add a section on the unified dispatch architecture. Current dispatch docs are scattered across inline comments and session-status-io.

Effort: ~1 day.


Summary

The 5 dispatch mechanisms + 1 message bus represent 3 genuinely different needs (UOK autonomous loop, worktree-based isolation, durable inter-agent messaging) and 3 duplications (parallel-orchestrator + slice-parallel-orchestrator; subagent parallel mode + parallel-orchestrator; Cmux tight coupling).

The plan:

  1. Merge parallel-orchestrator + slice-parallel-orchestrator into WorktreeOrchestrator
  2. Wire MessageBus into WorktreeOrchestrator — workers become reachable via durable messaging
  3. UOK kernel becomes the controller that calls WorktreeOrchestrator, not a parallel system
  4. Subagent tool stays separate — it's ad-hoc in-session delegation, with an optional RPC mode for internal workflows
  5. Cmux becomes a MessageBus subscriber, decoupled from dispatch
  6. DB access model is already correct: spawned subagents cannot write to project DB; workers dispatched via WorktreeOrchestrator can

The adversarial_partner/adversarial_combatant/adversarial_architect fields already in the DB are planning ceremony fields (Letta-inspired), not dispatch mechanism fields. They belong in the PDD planning layer, not in the dispatch layer.

Total effort estimate: 3-4 weeks across 7 phases, sequenced to preserve existing behavior at each step.