docs: add ADR-019 workspace VM convergence architecture

Captures the SF↔ACE incremental convergence strategy: workspace VMs
(Firecracker) as the unified execution isolation primitive, the three-layer
architecture (orchestration/knowledge/execution), the 6-phase convergence
path, and ADR-014 Phase 4 cancellation (persistent-agent runtime reassigned
to ACE). Cross-references the matching ACE document at
docs/architecture/sf-ace-convergence.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mikael Hugo 2026-05-01 23:21:23 +02:00
parent 10936277a5
commit 0976bbbb83
2 changed files with 231 additions and 0 deletions


@@ -23,6 +23,7 @@ in `docs/dev/`. Lighter design docs (problem framing, event model decisions) liv
| [ADR-016](../dev/ADR-016-charm-ai-stack-adoption.md) | Charm AI Stack Adoption | Proposed |
| [ADR-017](../dev/ADR-017-charm-tui-client.md) | Charm TUI Client | Proposed |
| [ADR-018](../dev/ADR-018-repo-native-harness-evolution.md) | Repo-Native Harness Evolution | Proposed — staged impl |
| [ADR-019](../dev/ADR-019-workspace-vm-convergence.md) | Workspace VM Convergence — SF↔ACE incremental convergence via microVM execution layer | Proposed |
## Design Docs (this directory)


@@ -0,0 +1,230 @@
# ADR-019: Workspace VM Convergence Architecture
**Status:** Proposed
**Date:** 2026-05-01
**Deciders:** Mikael Hugo
**Context repos:** `singularity-forge` (SF), `ace-coder` (ACE)
> **Cross-repo note:** The matching document in ACE-coder lives at
> `docs/architecture/sf-ace-convergence.md`. Both documents describe the same
> architecture from each codebase's perspective. Keep them in sync when either
> changes.
---
## Context
Two autonomous agent systems are being developed in parallel:
- **SF** (`singularity-forge`) — TypeScript orchestrator. Works today. Dispatches
Claude Code sessions as ephemeral units (milestone → slice → task). Isolation
via git worktrees. Single-repo, single-user.
- **ACE** (`ace-coder`) — Python platform. Partially operational. HTDAG execution
backbone, Project Manager ownership, 20 defined agent personas, LiteLLM
multi-provider, RBAC, PGMQ task queue, tiered memory. Multi-tenant data model
(`tenant_id`) exists; per-task execution isolation does not.
- **singularity-memory** — Separate Go service (migrating from Python per ADR-014).
Postgres + vchord vector store. Federated knowledge layer shared across SF, ACE,
Claude Code, Cursor, and other tools over MCP.
Both systems share the same end destination but are approaching it from different
directions. SF is production-reliable but architecturally constrained (single-repo,
git-worktree isolation). ACE has the right orchestration primitives (HTDAG, PM,
RBAC, tenant model) but lacks execution isolation and is not yet production-reliable.
The strategy is **incremental convergence**: SF continues to work and delivers value
while autonomously helping build out ACE. As ACE becomes reliable, SF's dispatch
model transitions to use ACE's execution substrate. They meet at the workspace VM
boundary.
---
## Decision
### The unifying primitive: Workspace
```
workspace = VM (microVM) + tenant_id + [repo_1, repo_2, ...] + scoped_credentials
```
A **workspace** is the execution isolation unit for both systems. It replaces:
- SF's git worktree per milestone
- ACE's process-level `execution/worker.py` per task
A workspace is:
- **A microVM** (Firecracker) — hard process/filesystem/network isolation at the
hypervisor level. Firecracker was built by AWS specifically for multi-tenant
Lambda; it provides the isolation both systems need without reimplementing it.
- **Tenant-scoped** — maps to ACE's existing `tenant_id` on `agent_memory` and
`task_queue`. The VM boundary is the enforcement point; the database tenant_id
is the tracking point.
- **Multi-repo** — the orchestrator tells the VM which repos to clone/mount. The
VM operates across all of them. Cross-repo work is trivially a list.
- **Credential-scoped** — per-workspace credentials (git tokens, API keys) are
injected at VM start and destroyed at VM exit. Never shared across tenants.
- **Snapshot/restore** — VM state snapshots replace `.sf/paused-session.json` and
  ACE's `checkpoint_service`. A "persistent agent" is a named snapshot: restore it and
  the agent wakes with full memory and context intact.
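
To make this concrete, here is a minimal lifecycle sketch as an orchestrator would drive it, assuming a handle/manager split. None of these names exist yet in SF or ACE; the spec type is the `WorkspaceSpec` defined under Multi-repo design below.

```typescript
// Illustrative sketch only; no such interfaces exist yet in SF or ACE.
// WorkspaceSpec is the spec type defined under "Multi-repo design" below.
interface WorkspaceHandle {
  id: string;
  tenantId: string;
  // Run a command inside the microVM (the agent runtime, builds, tests, ...).
  exec(cmd: string[]): Promise<{ exitCode: number; stdout: string }>;
  // Persist full VM state under a name; a named snapshot is a "persistent agent".
  snapshot(name: string): Promise<string>; // returns a snapshot_id
  // Tear down the VM; per-workspace credentials are destroyed with it.
  destroy(): Promise<void>;
}

interface WorkspaceManager {
  // Boot a fresh microVM, clone/mount the listed repos, inject scoped credentials.
  create(spec: WorkspaceSpec): Promise<WorkspaceHandle>;
  // Resume a named snapshot: the agent wakes with memory and context intact.
  restore(snapshotId: string): Promise<WorkspaceHandle>;
}
```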
### The three-layer architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Orchestration layer │
│ │
│ Near-term: SF (TS state machine) dispatches to workspace VMs │
│ Long-term: ACE HTDAG/PM becomes the orchestration backbone; │
│ SF state machine becomes an ACE PM persona │
│ │
│ Language: TS (SF today) → Python (ACE, when reliable) │
├─────────────────────────────────────────────────────────────────────┤
│ Knowledge layer │
│ │
│ singularity-memory: Go + Postgres + vchord + MCP server │
│ Serves all consumers (SF, ACE, Claude Code, Cursor) over MCP. │
│ Tenant-scoped knowledge banks (to be designed — see below). │
│ │
│ Language: Go (ADR-014 migration, phases 03 only — NOT phase 4) │
├─────────────────────────────────────────────────────────────────────┤
│ Execution layer │
│ │
│ workspace = VM + tenant + repos + credentials │
│ One workspace per dispatch unit. │
│ VM management shim: Rust (Firecracker API is Rust-native). │
│ Agent runtime inside VM: whatever the task requires. │
│ │
│ Language: Rust (VM shim) + anything (inside VM) │
└─────────────────────────────────────────────────────────────────────┘
```
### ADR-014 Phase 4 is reassigned
ADR-014 proposed building a "central persistent-agent runtime" in Go using
`charmbracelet/fantasy`. This is **cancelled**. Persistent agents live as VM
snapshots managed by ACE's orchestration layer — not as a separate Go runtime.
singularity-memory (Go) scopes to the knowledge layer only (ADR-014 phases 0–3).
---
## Multi-tenant design
**What exists (ACE):**
- `tenant_id` on `agent_memory`, `task_queue` ✅
- RBAC (`rbac_capability_policy.py`, agent permission levels) ✅
- PM-driven governance and approval gates ✅
**What needs to be built:**
- Tenant-scoped knowledge banks in singularity-memory (each tenant's memory is
isolated; cross-tenant sharing requires explicit federation grants)
- VM pool with per-tenant resource quotas (CPU, RAM, disk, LLM token budget)
- Cost accounting: LLM gateway already tracks per-worker; extend with `tenant_id`
- Headscale ACL rules per tenant (each tenant's VMs on their own tailnet ACL)
**The enforcement model:**
```
tenant_id (DB) + VM boundary (hypervisor) + Headscale ACL (network) = full isolation
```
No single layer is sufficient alone. All three enforce the same boundary from
different angles.
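
As a sketch of how the layers stay in lockstep, a hypothetical provisioning step carries the same `tenant_id` into the database row, the VM pool, and the tailnet tag that the Headscale ACL matches on. Every interface and name below is illustrative, not an existing API:

```typescript
// Illustrative only: stand-ins for ACE's task queue, the VM pool, and the
// Headscale admin surface. The point is that one tenant_id crosses all three layers.
interface TaskDb {
  insertTask(row: { tenant_id: string; payload: unknown }): Promise<void>;
}
interface VmPool {
  boot(opts: { tenant: string; quota: { cpus: number; memMiB: number } }): Promise<{ nodeKey: string }>;
}
interface TailnetAdmin {
  tagNode(nodeKey: string, tag: string): Promise<void>;
}

async function provisionWorkspace(
  spec: { tenant_id: string; payload: unknown },
  db: TaskDb,
  pool: VmPool,
  tailnet: TailnetAdmin,
): Promise<void> {
  // DB layer: the tracking point.
  await db.insertTask({ tenant_id: spec.tenant_id, payload: spec.payload });
  // Hypervisor layer: the enforcement point, booted inside the tenant's quota.
  const vm = await pool.boot({ tenant: spec.tenant_id, quota: { cpus: 2, memMiB: 2048 } });
  // Network layer: the ACL boundary; Headscale rules match on the tenant tag.
  await tailnet.tagNode(vm.nodeKey, `tag:tenant-${spec.tenant_id}`);
}
```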
---
## Multi-repo design
**The workspace repo list:**
```typescript
// SF orchestrator side
type WorkspaceSpec = {
tenant_id: string;
repos: Array<{ url: string; ref: string; mount: string }>;
credentials: CredentialRef[];
snapshot_id?: string; // resume from saved state
};
```
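
For illustration, a spec for a two-repo task resumed from a saved snapshot might look like this (tenant id, URLs, refs, and snapshot id are all made up):

```typescript
const spec: WorkspaceSpec = {
  tenant_id: "tenant-acme",
  repos: [
    { url: "git@github.com:acme/service-api.git", ref: "main", mount: "/work/service-api" },
    { url: "git@github.com:acme/service-web.git", ref: "feat/login", mount: "/work/service-web" },
  ],
  credentials: [], // per-workspace git token / API keys, injected at VM start
  snapshot_id: "snap-login-refactor-01", // omit to boot a fresh workspace
};
```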
**Cross-repo artifact DAG (new primitive, not in either system yet):**
When a task produces artifacts spanning multiple repos, HTDAG needs to track which
commits in which repos constitute "done". This is the **cross-repo task graph**,
likely a new node type in the HTDAG structure. Its design is deferred until the
workspace VM primitive is stable; one possible shape is sketched below.
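
Purely as a placeholder for that deferred design, the node could carry a set of expected commits, one per repo, and count as complete only once every repo has its landed commit recorded:

```typescript
// Placeholder sketch; not part of HTDAG today.
type CrossRepoArtifactNode = {
  kind: "cross_repo_artifact";
  workspace_id: string;
  artifacts: Array<{ repo_url: string; branch: string; commit_sha?: string }>;
};

// "Done" means every repo in the list has a recorded commit.
const isDone = (node: CrossRepoArtifactNode): boolean =>
  node.artifacts.every((a) => a.commit_sha !== undefined);
```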
---
## Incremental convergence path
### Phase 1 — SF continues, ACE gets built (now)
- SF runs autonomous milestones on `ace-coder`. No changes to SF.
- ACE develops its HTDAG, PM, and worker primitives independently.
- Both systems mature on their own tracks.
### Phase 2 — Federated memory (near-term, ADR-012 Tier 1)
- Wire `memory-store.ts` remote-mode → singularity-memory HTTP endpoint.
- SF instances on different machines share learnings.
- ACE connects to the same singularity-memory endpoint (same MCP wire).
- **Outcome:** shared knowledge layer operational before execution convergence.
### Phase 3 — Workspace VM opt-in for SF (medium-term)
- Build `sf-workspace` shim: thin Rust binary that manages Firecracker VMs.
- SF's `runUnit()` dispatches to a workspace VM instead of a raw Claude Code session
  when the project preference `workspace.isolation: "vm"` is set (see the sketch after this list).
- Git worktree path remains for projects that haven't opted in.
- **Outcome:** SF can run multi-repo and multi-tenant workloads experimentally.
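
The opt-in branch itself is small. In the sketch below, only `runUnit()` and the `workspace.isolation` preference come from this ADR; the types and dispatch helpers are hypothetical stand-ins:

```typescript
// Sketch of the Phase 3 branch inside SF's dispatcher (helper names are hypothetical).
type DispatchUnit = { id: string; task: string };
type ProjectPrefs = { workspace?: { isolation?: "vm" | "worktree" } };
type UnitResult = { unitId: string; status: "done" | "failed" };

declare function dispatchToWorkspaceVm(unit: DispatchUnit): Promise<UnitResult>;     // sf-workspace shim → Firecracker
declare function dispatchToWorktreeSession(unit: DispatchUnit): Promise<UnitResult>; // existing git-worktree path

async function runUnit(unit: DispatchUnit, prefs: ProjectPrefs): Promise<UnitResult> {
  if (prefs.workspace?.isolation === "vm") {
    return dispatchToWorkspaceVm(unit);
  }
  return dispatchToWorktreeSession(unit);
}
```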
### Phase 4 — ACE workers → workspace VMs (parallel to Phase 3)
- ACE's `execution/worker.py` (async task pool) gains workspace VM dispatch path.
- ACE fills the explicit gap noted in its own competitive analysis:
*"No per-task container sandboxing... ACE has process-level sandboxing only."*
- RBAC + tenant_id at data layer + VM at execution layer = full multi-tenant ACE.
- **Outcome:** ACE can handle multi-tenant, multi-repo workloads.
### Phase 5 — Shared workspace protocol
- SF and ACE converge on the same `WorkspaceSpec` wire format (one possible envelope is sketched after this list).
- SF's orchestrator can dispatch to ACE's worker pool (and vice versa).
- The `sf-workspace` shim and ACE's VM dispatch path are the same binary.
- **Outcome:** two orchestrators, one execution substrate.
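
One possible envelope for that shared protocol, illustrative only: the agreed piece is `WorkspaceSpec`; the surrounding fields are hypothetical.

```typescript
// Hypothetical Phase 5 envelope. Either orchestrator can emit it; the receiving
// worker pool only needs the spec, and routes results back via task_ref/callback.
type WorkspaceDispatch = {
  spec: WorkspaceSpec;
  origin: "sf" | "ace";   // which orchestrator dispatched the work
  task_ref: string;       // SF unit id or ACE HTDAG node id
  callback: string;       // where completion events are delivered
};
```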
### Phase 6 — Orchestration convergence (long-term)
- SF's state machine (milestone → slice → task) becomes an ACE PM persona.
- ACE's HTDAG becomes the unified orchestration backbone.
- SF's CLI and headless mode remain as user-facing entry points (they don't go away —
they become ACE clients over MCP).
- **Outcome:** one system with SF's reliability and ACE's generality.
---
## What is NOT in scope for this ADR
- Cross-tenant knowledge federation (single trust domain per deployment for now)
- Public-internet exposure (tailnet-only, per ADR-013)
- Replacing SF's state machine before Phase 6 — it works, don't touch it
- Choosing the agent runtime inside the VM — language-agnostic by design
- Cross-repo artifact DAG implementation — deferred to after Phase 3
---
## Risks
| Risk | Mitigation |
|------|-----------|
| Firecracker cold-start latency (~125ms) is too slow for short SF tasks | Keep git-worktree path as fallback; VMs for tasks >5min |
| VM snapshot size grows unboundedly for persistent agents | Snapshot rotation policy, same as activity log retention |
| ACE HTDAG not stable enough for Phase 5 | Phase 5 is gated on ACE reliability, not a timeline. SF works fine until then. |
| singularity-memory Go migration stalls | Phase 2 can use the Python server; migration is not on the critical path |
| Cross-repo DAG design takes longer than expected | Phases 1–5 work without it; single-repo workspace is the common case |
---
## References
- SF `docs/dev/ADR-012-multi-instance-federation.md` — federation surfaces
- SF `docs/dev/ADR-013-network-and-remote-execution.md` — tailnet + SSH workers
- SF `docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md` — Go migration (phases 0–3 only)
- ACE `docs/architecture/sf-ace-convergence.md` — this ADR from ACE's perspective
- ACE `ARCHITECTURE.md` §Sandbox — "No per-task container sandboxing" gap
- ACE `docs/architecture/data-and-storage.md` — tenant_id schema
- [Firecracker](https://firecracker-microvm.github.io/) — microVM hypervisor