
ADR-019: Workspace VM Convergence Architecture

Status: Proposed
Date: 2026-05-01
Revised: 2026-05-02 — wire-format scope superseded by ADR-020
Deciders: Mikael Hugo
Context repos: singularity-forge (SF), ace-coder (ACE)

Cross-repo note: The matching document in ace-coder lives at docs/architecture/sf-ace-convergence.md. Both documents describe the same architecture from each codebase's perspective. Keep them in sync when either changes.


Context

Two autonomous agent systems are being developed in parallel:

  • SF (singularity-forge) — TypeScript orchestrator. Works today. Dispatches Claude Code sessions as ephemeral units (milestone → slice → task). Isolation via git worktrees. Single-machine, single-user, single-repo by design. That scope is its character, not a limitation. SF stays a standalone app permanently; it does not grow into a platform.

  • ACE (ace-coder) — Python platform. Partially operational. HTDAG execution backbone, Project Manager ownership, 20 defined agent personas, LiteLLM multi-provider, RBAC, PGMQ task queue, tiered memory. Multi-tenant data model (tenant_id) exists; per-task execution isolation does not. ACE is where multi-tenant, multi-repo, federated workloads live.

  • singularity-memory — Separate Go service (migrating from Python per ADR-014). Postgres + vchord vector store. Federated knowledge layer.

    • Internal consumers (SF, ACE, future first-party services) talk to it via typed direct clients (HTTP/gRPC generated from the Go API). No MCP, no JSON-RPC framing, no protocol cost.
    • External coding tools (Claude Code, Cursor, third-party LLM clients) get an MCP façade. This is a temporary scaffold so external coders can read/write memory while they help build the system; it is not the production wire for internal services and is expected to shrink once the system is self-hosting.

The two systems are not converging into one app. They occupy different niches:

  • SF is the local single-user developer tool — fast, generic, runs on the developer's machine on whatever repo they're working on.
  • ACE is the multi-tenant platform — federated, multi-repo, scales beyond one user.

Convergence in this ADR refers to shared substrate, not application merging: shared wire schemas (singularity-grpc) and a shared execution-isolation primitive (Firecracker workspaces) when SF chooses to dispatch into one. SF can live entirely on its own without ACE; ACE doesn't depend on SF.

The strategy is incremental pattern transfer: SF continues to work as a standalone single-user tool while autonomously helping build out ACE. ACE ports proven patterns from SF as it matures. SF gains an optional engine adapter for dispatching units into ACE workspaces when multi-tenant or multi-repo work is needed. Neither replaces the other.


Decision

The unifying primitive: Workspace

workspace = VM (microVM) + tenant_id + [repo_1, repo_2, ...] + scoped_credentials

A workspace is the execution isolation unit for both systems. It replaces:

  • SF's git worktree per milestone
  • ACE's process-level execution/worker.py per task

A workspace is:

  • A microVM (Firecracker) — hard process/filesystem/network isolation at the hypervisor level. Firecracker was built by AWS specifically for multi-tenant Lambda; it provides the isolation both systems need without reimplementing it.
  • Tenant-scoped — maps to ACE's existing tenant_id on agent_memory and task_queue. The VM boundary is the enforcement point; the database tenant_id is the tracking point.
  • Multi-repo — the orchestrator tells the VM which repos to clone/mount. The VM operates across all of them. Cross-repo work is trivially a list.
  • Credential-scoped — per-workspace credentials (git tokens, API keys) are injected at VM start and destroyed at VM exit. Never shared across tenants.
  • Snapshot/restore — VM state snapshots replace .sf/paused-session.json and ACE's checkpoint_service. A "persistent agent" is a named snapshot: restore it, the agent wakes with full memory and context intact.
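
A minimal sketch of what the lifecycle described by the properties above could look like from the orchestrator side, in TypeScript. Everything here is hypothetical: the names (WorkspaceHandle, startWorkspace, snapshotWorkspace, restoreWorkspace, destroyWorkspace) are invented for illustration and exist in neither SF nor ACE today.

// Hypothetical orchestrator-side view of the workspace lifecycle; nothing
// here is implemented yet, and all names are placeholders.
interface RepoMount {
  url: string;   // clone source
  ref: string;   // branch, tag, or commit to check out
  mount: string; // path inside the VM
}

interface WorkspaceHandle {
  id: string;
  tenantId: string;   // maps to ACE's tenant_id at the data layer
  repos: RepoMount[]; // the VM operates across all of these
}

// Credentials are injected at VM start and destroyed at VM exit; they never
// outlive the workspace and are never shared across tenants.
declare function startWorkspace(
  tenantId: string,
  repos: RepoMount[],
  credentialRefs: string[], // opaque references into a credential store
): Promise<WorkspaceHandle>;

// A "persistent agent" is a named snapshot: restoring it wakes the agent
// with full memory and context intact.
declare function snapshotWorkspace(ws: WorkspaceHandle, name: string): Promise<string>;
declare function restoreWorkspace(snapshotId: string): Promise<WorkspaceHandle>;
declare function destroyWorkspace(ws: WorkspaceHandle): Promise<void>;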

The three-layer architecture

┌─────────────────────────────────────────────────────────────────────┐
│  Orchestration layer                                                 │
│                                                                      │
│  Near-term: SF (TS state machine) dispatches to workspace VMs       │
│  Long-term: ACE HTDAG/PM becomes the orchestration backbone;        │
│             SF state machine becomes an ACE PM persona              │
│                                                                      │
│  Language: TS (SF today) → Python (ACE, when reliable)             │
├─────────────────────────────────────────────────────────────────────┤
│  Knowledge layer                                                     │
│                                                                      │
│  singularity-memory: Go + Postgres + vchord                         │
│  Internal services (SF, ACE) use typed direct clients (HTTP/gRPC).  │
│  External coding tools (Claude Code, Cursor) use an MCP façade —    │
│  temporary scaffold while external coders help build the system.    │
│  Tenant-scoped knowledge banks (to be designed — see below).        │
│                                                                      │
│  Language: Go (ADR-014 migration, phases 0–3 only — NOT phase 4)   │
├─────────────────────────────────────────────────────────────────────┤
│  Execution layer                                                     │
│                                                                      │
│  workspace = VM + tenant + repos + credentials                      │
│  One workspace per dispatch unit.                                   │
│  VM management shim: Rust (Firecracker API is Rust-native).         │
│  Agent runtime inside VM: whatever the task requires.               │
│                                                                      │
│  Language: Rust (VM shim) + anything (inside VM)                   │
└─────────────────────────────────────────────────────────────────────┘

ADR-014 Phase 4 is reassigned

ADR-014 proposed building a "central persistent-agent runtime" in Go using charmbracelet/fantasy. This is cancelled. Persistent agents live as VM snapshots managed by ACE's orchestration layer — not as a separate Go runtime. singularity-memory (Go) scopes to the knowledge layer only (ADR-014 phases 0–3).


Multi-tenant design

What exists (ACE):

  • tenant_id on agent_memory, task_queue
  • RBAC (rbac_capability_policy.py, agent permission levels)
  • PM-driven governance and approval gates

What needs to be built:

  • Tenant-scoped knowledge banks in singularity-memory (each tenant's memory is isolated; cross-tenant sharing requires explicit federation grants)
  • VM pool with per-tenant resource quotas (CPU, RAM, disk, LLM token budget)
  • Cost accounting: LLM gateway already tracks per-worker; extend with tenant_id
  • Headscale ACL rules per tenant (each tenant's VMs on their own tailnet ACL)

The enforcement model:

tenant_id (DB) + VM boundary (hypervisor) + Headscale ACL (network) = full isolation

No single layer is sufficient alone. All three enforce the same boundary from different angles.
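
As a sketch only, the per-tenant resources and the three-layer boundary from this section might carry roughly the following. The field names are invented for illustration in TypeScript and are not an existing schema in either system.

// Invented shapes for illustration; not an existing schema.
type TenantQuota = {
  maxVcpus: number;            // CPU across the tenant's VM pool
  maxMemoryMb: number;         // RAM
  maxDiskGb: number;           // disk
  maxLlmTokensPerDay: number;  // LLM token budget, fed by gateway cost accounting
};

type TenantBoundary = {
  tenantId: string;         // data layer: tenant_id on agent_memory / task_queue
  vmIds: string[];          // execution layer: Firecracker VMs owned by the tenant
  headscaleAclTag: string;  // network layer: ACL tag scoping the tenant's tailnet
};

// No single layer is treated as sufficient: a dispatch only counts as
// isolated when all three layers are in place for the tenant.
function isFullyIsolated(b: TenantBoundary): boolean {
  return b.tenantId.length > 0 && b.vmIds.length > 0 && b.headscaleAclTag.length > 0;
}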


Multi-repo design

The workspace repo list:

// SF orchestrator side
type WorkspaceSpec = {
  tenant_id: string;                                          // tenant boundary; maps to ACE's tenant_id
  repos: Array<{ url: string; ref: string; mount: string }>;  // repos to clone/mount inside the VM
  credentials: CredentialRef[];                               // injected at VM start, destroyed at VM exit
  snapshot_id?: string;                                       // resume from saved state
};
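
For concreteness, a hypothetical spec instance (the repo URLs, tenant, and snapshot name are made up) showing one workspace spanning two repos and resuming from a saved snapshot:

// Illustrative values only.
const spec: WorkspaceSpec = {
  tenant_id: "tenant-acme",
  repos: [
    { url: "git@example.com:acme/app.git",   ref: "main",        mount: "/work/app" },
    { url: "git@example.com:acme/infra.git", ref: "release-1.4", mount: "/work/infra" },
  ],
  credentials: [],                  // CredentialRef values come from the credential store
  snapshot_id: "agent-reviewer-v3", // optional: resume a persistent agent from this snapshot
};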

Cross-repo artifact DAG (new primitive, not in either system yet):

When a task produces artifacts spanning multiple repos, HTDAG needs to track which commits in which repos constitute "done". This is the cross-repo task graph — probably a new node type in HTDAG's DAG structure. Design deferred until the workspace VM primitive is stable.
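
The design is deferred, but purely as a placeholder for the shape of the problem, a cross-repo "done" record might need to carry roughly the following. The type name and fields are invented here and are not part of HTDAG.

// Invented placeholder, not an HTDAG node type: the minimum a cross-repo
// completion record would need to track.
type CrossRepoArtifact = {
  taskId: string;
  commits: Array<{ repoUrl: string; sha: string }>; // one entry per repo the task touched
  complete: boolean; // true only once every listed repo has its commit landed
};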


MCP scope

Superseded by ADR-020: This section's original proposal to use MCP for internal service wires is replaced. ADR-020 mandates gRPC for first-party services (SF, ACE, memory); MCP is reserved for external coding tools (Claude Code, Cursor) only.

[Originally proposed: MCP for internal services — superseded by ADR-020 in favor of gRPC.] Internal services use typed direct clients (gRPC for first-party). MCP is reserved for external coding tools (Claude Code, Cursor) that don't share our build system. See ADR-020 for the full wire-format table and rationale.


Incremental convergence path

Phase 1 — SF continues, ACE gets built (now)

  • SF runs autonomous milestones on ace-coder. No changes to SF.
  • ACE develops its HTDAG, PM, and worker primitives independently.
  • Both systems mature on their own tracks.

Phase 2 — Federated memory for ACE (near-term, ADR-012 Tier 1)

  • ACE connects to singularity-memory via a typed Python client (generated from the Go API — not MCP). Internal services do not pay the MCP tax. [Wire format confirmed by ADR-020: gRPC for first-party services.]
  • SF stays local. SF is single-machine, single-user, local-first by design. memory-store.ts continues to work on .sf/memory/; no remote mode wired in SF core. When SF runs inside an ACE-managed workspace, the workspace surfaces federated context through the ACE engine adapter as additional KNOWLEDGE injection — SF doesn't know that's where it came from. Federation is an ACE concern, not a SF concern.
  • The MCP façade on singularity-memory is reserved for external coding tools (Claude Code, Cursor) that need to read/write memory while helping build the system. Temporary scaffold; not a production wire.
  • Outcome: federated knowledge layer operational for ACE; SF unchanged and unaware of memory federation infrastructure.
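
A rough sketch of how the engine adapter described in the SF bullet above might surface federated context as extra KNOWLEDGE text. The client shape (memoryClient.search) and every name here are assumptions, not existing APIs; the point is that SF core never sees where the content came from.

// All names invented for illustration; SF core only receives plain
// KNOWLEDGE text and stays unaware of the federation behind it.
type KnowledgeInjection = { source: string; content: string };

// Assumed shape of a typed client to singularity-memory (gRPC per ADR-020).
declare const memoryClient: {
  search(req: { tenantId: string; query: string; limit: number }): Promise<Array<{ text: string }>>;
};

// Runs on the ACE side of the engine adapter, inside the workspace.
async function buildKnowledgeForUnit(tenantId: string, unitGoal: string): Promise<KnowledgeInjection[]> {
  const entries = await memoryClient.search({ tenantId, query: unitGoal, limit: 10 });
  return entries.map((e) => ({ source: "ace-engine-adapter", content: e.text }));
}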

Phase 3 — Workspace VM opt-in for SF (medium-term)

  • Build sf-workspace shim: thin Rust binary that manages Firecracker VMs.
  • SF's runUnit() dispatches to a workspace VM instead of a raw Claude Code session when the project preference workspace.isolation: "vm" is set.
  • Git worktree path remains for projects that haven't opted in.
  • Outcome: SF can run multi-repo and multi-tenant workloads experimentally.
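
A sketch of what the opt-in branch above might look like. runUnit() exists in SF per the text, but the signature shown here, the preference accessor, and both dispatch helpers are invented for illustration.

// Placeholder types and helpers, invented for illustration only.
type Unit = { id: string };
type UnitResult = { ok: boolean };
type ProjectPreferences = { workspace?: { isolation?: "vm" | "worktree" } };

declare function dispatchToWorkspaceVm(unit: Unit, prefs: ProjectPreferences): Promise<UnitResult>;
declare function dispatchToWorktreeSession(unit: Unit, prefs: ProjectPreferences): Promise<UnitResult>;

// Sketch of the opt-in branch; not SF's actual runUnit() implementation.
async function runUnit(unit: Unit, prefs: ProjectPreferences): Promise<UnitResult> {
  if (prefs.workspace?.isolation === "vm") {
    // New path: hand the unit to the sf-workspace shim (Rust), which boots a
    // Firecracker VM, mounts the repos, and runs the agent inside it.
    return dispatchToWorkspaceVm(unit, prefs);
  }
  // Existing path: git worktree plus Claude Code session, unchanged.
  return dispatchToWorktreeSession(unit, prefs);
}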

Phase 4 — ACE workers → workspace VMs (parallel to Phase 3)

  • ACE's execution/worker.py (async task pool) gains workspace VM dispatch path.
  • ACE fills the explicit gap noted in its own competitive analysis: "No per-task container sandboxing... ACE has process-level sandboxing only."
  • RBAC + tenant_id at data layer + VM at execution layer = full multi-tenant ACE.
  • Outcome: ACE can handle multi-tenant, multi-repo workloads.

Phase 5 — Shared workspace protocol

  • SF and ACE converge on the same WorkspaceSpec wire format.
  • SF's orchestrator can dispatch to ACE's worker pool (and vice versa).
  • The sf-workspace shim and ACE's VM dispatch path are the same binary.
  • Outcome: two orchestrators, one execution substrate.

Phase 6 — Pattern transfer (long-term)

SF remains a separate, standalone app — permanently. It is not absorbed, re-platformed, or re-implemented inside ACE. The convergence is at the wire and execution-substrate layers (Phases 3–5), not at the application layer.

What Phase 6 actually means:

  • ACE ports proven patterns from SF — idempotency primitives, state-derivation discipline, the structured notification model, the watchdog pattern, project preferences as a config layer, scaffold-as-contract. These become ACE's own primitives, written in Python, owned by ACE.
  • SF stays single-machine, single-user, local-first — its character. SF gets generally better as a standalone tool: better project detection, cleaner engine adapter extension point, harder-tested crash recovery.
  • SF and ACE remain independent runtimes. SF can be dispatched into an ACE workspace (Phase 5) for multi-tenant or multi-repo work, but it is also fully usable on its own with no ACE present.
  • Outcome: two distinct apps that share wire schemas (singularity-grpc) and optionally an execution substrate (Firecracker). Neither replaces the other.

What is NOT in scope for this ADR

  • Cross-tenant knowledge federation (single trust domain per deployment for now)
  • Public-internet exposure (tailnet-only, per ADR-013)
  • Replacing SF's state machine before Phase 6 — it works, don't touch it
  • Choosing the agent runtime inside the VM — language-agnostic by design
  • Cross-repo artifact DAG implementation — deferred to after Phase 3

Risks

  • Risk: Firecracker cold-start latency (~125ms) is too slow for short SF tasks.
    Mitigation: Keep git-worktree path as fallback; VMs for tasks >5min.
  • Risk: VM snapshot size grows unboundedly for persistent agents.
    Mitigation: Snapshot rotation policy, same as activity log retention.
  • Risk: ACE HTDAG not stable enough for Phase 5.
    Mitigation: Phase 5 is gated on ACE reliability, not a timeline. SF works fine until then.
  • Risk: singularity-memory Go migration stalls.
    Mitigation: Phase 2 can use the Python server; migration is not on the critical path.
  • Risk: Cross-repo DAG design takes longer than expected.
    Mitigation: Phases 1–5 work without it; single-repo workspace is the common case.

References

  • SF docs/dev/ADR-012-multi-instance-federation.md — federation surfaces
  • SF docs/dev/ADR-013-network-and-remote-execution.md — tailnet + SSH workers
  • SF docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md — Go migration (phases 0–3 only)
  • ACE docs/architecture/sf-ace-convergence.md — this ADR from ACE's perspective
  • ACE ARCHITECTURE.md §Sandbox — "No per-task container sandboxing" gap
  • ACE docs/architecture/data-and-storage.md — tenant_id schema
  • Firecracker — microVM hypervisor