docs: add ADR-019 workspace VM convergence architecture

Captures the SF↔ACE incremental convergence strategy: workspace VMs
(Firecracker) as the unified execution isolation primitive, the three-layer
architecture (orchestration/knowledge/execution), the 6-phase convergence
path, and ADR-014 Phase 4 cancellation (persistent-agent runtime reassigned
to ACE). Cross-references the matching ACE document at
docs/architecture/sf-ace-convergence.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mikael Hugo 2026-05-01 23:21:23 +02:00
parent 10936277a5
commit 0976bbbb83
2 changed files with 231 additions and 0 deletions


@@ -23,6 +23,7 @@ in `docs/dev/`. Lighter design docs (problem framing, event model decisions) liv
| [ADR-016](../dev/ADR-016-charm-ai-stack-adoption.md) | Charm AI Stack Adoption | Proposed |
| [ADR-017](../dev/ADR-017-charm-tui-client.md) | Charm TUI Client | Proposed |
| [ADR-018](../dev/ADR-018-repo-native-harness-evolution.md) | Repo-Native Harness Evolution | Proposed — staged impl |
| [ADR-019](../dev/ADR-019-workspace-vm-convergence.md) | Workspace VM Convergence — SF↔ACE incremental convergence via microVM execution layer | Proposed |
## Design Docs (this directory)


@@ -0,0 +1,230 @@
# ADR-019: Workspace VM Convergence Architecture
**Status:** Proposed
**Date:** 2026-05-01
**Deciders:** Mikael Hugo
**Context repos:** `singularity-forge` (SF), `ace-coder` (ACE)
> **Cross-repo note:** The matching document in ACE-coder lives at
> `docs/architecture/sf-ace-convergence.md`. Both documents describe the same
> architecture from each codebase's perspective. Keep them in sync when either
> changes.
---
## Context
Two autonomous agent systems are being developed in parallel:
- **SF** (`singularity-forge`) — TypeScript orchestrator. Works today. Dispatches
Claude Code sessions as ephemeral units (milestone → slice → task). Isolation
via git worktrees. Single-repo, single-user.
- **ACE** (`ace-coder`) — Python platform. Partially operational. HTDAG execution
backbone, Project Manager ownership, 20 defined agent personas, LiteLLM
multi-provider, RBAC, PGMQ task queue, tiered memory. Multi-tenant data model
(`tenant_id`) exists; per-task execution isolation does not.
- **singularity-memory** — Separate Go service (migrating from Python per ADR-014).
Postgres + vchord vector store. Federated knowledge layer shared across SF, ACE,
Claude Code, Cursor, and other tools over MCP.
Both systems share the same end destination but are approaching it from different
directions. SF is production-reliable but architecturally constrained (single-repo,
git-worktree isolation). ACE has the right orchestration primitives (HTDAG, PM,
RBAC, tenant model) but lacks execution isolation and is not yet production-reliable.
The strategy is **incremental convergence**: SF continues to work and delivers value
while autonomously helping build out ACE. As ACE becomes reliable, SF's dispatch
model transitions to use ACE's execution substrate. They meet at the workspace VM
boundary.
---
## Decision
### The unifying primitive: Workspace
```
workspace = VM (microVM) + tenant_id + [repo_1, repo_2, ...] + scoped_credentials
```
A **workspace** is the execution isolation unit for both systems. It replaces:
- SF's git worktree per milestone
- ACE's process-level `execution/worker.py` per task
A workspace is:
- **A microVM** (Firecracker) — hard process/filesystem/network isolation at the
hypervisor level. Firecracker was built by AWS specifically for multi-tenant
Lambda; it provides the isolation both systems need without reimplementing it.
- **Tenant-scoped** — maps to ACE's existing `tenant_id` on `agent_memory` and
`task_queue`. The VM boundary is the enforcement point; the database tenant_id
is the tracking point.
- **Multi-repo** — the orchestrator tells the VM which repos to clone/mount. The
VM operates across all of them. Cross-repo work is trivially a list.
- **Credential-scoped** — per-workspace credentials (git tokens, API keys) are
injected at VM start and destroyed at VM exit. Never shared across tenants.
- **Snapshot/restore** — VM state snapshots replace `.sf/paused-session.json` and
  ACE's `checkpoint_service`. A "persistent agent" is a named snapshot: restore it and
  the agent wakes with full memory and context intact.
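
To make this concrete, here is a minimal lifecycle sketch as an orchestrator would drive it, assuming a handle/manager split. None of these names exist yet in SF or ACE; the spec type is the `WorkspaceSpec` defined under Multi-repo design below.

```typescript
// Illustrative sketch only; no such interfaces exist yet in SF or ACE.
// WorkspaceSpec is the spec type defined under "Multi-repo design" below.
interface WorkspaceHandle {
  id: string;
  tenantId: string;
  // Run a command inside the microVM (the agent runtime, builds, tests, ...).
  exec(cmd: string[]): Promise<{ exitCode: number; stdout: string }>;
  // Persist full VM state under a name; a named snapshot is a "persistent agent".
  snapshot(name: string): Promise<string>; // returns a snapshot_id
  // Tear down the VM; per-workspace credentials are destroyed with it.
  destroy(): Promise<void>;
}

interface WorkspaceManager {
  // Boot a fresh microVM, clone/mount the listed repos, inject scoped credentials.
  create(spec: WorkspaceSpec): Promise<WorkspaceHandle>;
  // Resume a named snapshot: the agent wakes with memory and context intact.
  restore(snapshotId: string): Promise<WorkspaceHandle>;
}
```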
### The three-layer architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Orchestration layer │
│ │
│ Near-term: SF (TS state machine) dispatches to workspace VMs │
│ Long-term: ACE HTDAG/PM becomes the orchestration backbone; │
│ SF state machine becomes an ACE PM persona │
│ │
│ Language: TS (SF today) → Python (ACE, when reliable) │
├─────────────────────────────────────────────────────────────────────┤
│ Knowledge layer │
│ │
│ singularity-memory: Go + Postgres + vchord + MCP server │
│ Serves all consumers (SF, ACE, Claude Code, Cursor) over MCP. │
│ Tenant-scoped knowledge banks (to be designed — see below). │
│ │
│ Language: Go (ADR-014 migration, phases 03 only — NOT phase 4) │
├─────────────────────────────────────────────────────────────────────┤
│ Execution layer │
│ │
│ workspace = VM + tenant + repos + credentials │
│ One workspace per dispatch unit. │
│ VM management shim: Rust (Firecracker API is Rust-native). │
│ Agent runtime inside VM: whatever the task requires. │
│ │
│ Language: Rust (VM shim) + anything (inside VM) │
└─────────────────────────────────────────────────────────────────────┘
```
### ADR-014 Phase 4 is reassigned
ADR-014 proposed building a "central persistent-agent runtime" in Go using
`charmbracelet/fantasy`. This is **cancelled**. Persistent agents live as VM
snapshots managed by ACE's orchestration layer — not as a separate Go runtime.
singularity-memory (Go) scopes to the knowledge layer only (ADR-014 phases 0–3).
---
## Multi-tenant design
**What exists (ACE):**
- `tenant_id` on `agent_memory`, `task_queue` ✅
- RBAC (`rbac_capability_policy.py`, agent permission levels) ✅
- PM-driven governance and approval gates ✅
**What needs to be built:**
- Tenant-scoped knowledge banks in singularity-memory (each tenant's memory is
isolated; cross-tenant sharing requires explicit federation grants)
- VM pool with per-tenant resource quotas (CPU, RAM, disk, LLM token budget)
- Cost accounting: LLM gateway already tracks per-worker; extend with `tenant_id`
- Headscale ACL rules per tenant (each tenant's VMs on their own tailnet ACL)
**The enforcement model:**
```
tenant_id (DB) + VM boundary (hypervisor) + Headscale ACL (network) = full isolation
```
No single layer is sufficient alone. All three enforce the same boundary from
different angles.
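
As a sketch of how the layers stay in lockstep, a hypothetical provisioning step carries the same `tenant_id` into the database row, the VM pool, and the tailnet tag that the Headscale ACL matches on. Every interface and name below is illustrative, not an existing API:

```typescript
// Illustrative only: stand-ins for ACE's task queue, the VM pool, and the
// Headscale admin surface. The point is that one tenant_id crosses all three layers.
interface TaskDb {
  insertTask(row: { tenant_id: string; payload: unknown }): Promise<void>;
}
interface VmPool {
  boot(opts: { tenant: string; quota: { cpus: number; memMiB: number } }): Promise<{ nodeKey: string }>;
}
interface TailnetAdmin {
  tagNode(nodeKey: string, tag: string): Promise<void>;
}

async function provisionWorkspace(
  spec: { tenant_id: string; payload: unknown },
  db: TaskDb,
  pool: VmPool,
  tailnet: TailnetAdmin,
): Promise<void> {
  // DB layer: the tracking point.
  await db.insertTask({ tenant_id: spec.tenant_id, payload: spec.payload });
  // Hypervisor layer: the enforcement point, booted inside the tenant's quota.
  const vm = await pool.boot({ tenant: spec.tenant_id, quota: { cpus: 2, memMiB: 2048 } });
  // Network layer: the ACL boundary; Headscale rules match on the tenant tag.
  await tailnet.tagNode(vm.nodeKey, `tag:tenant-${spec.tenant_id}`);
}
```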
---
## Multi-repo design
**The workspace repo list:**
```typescript
// SF orchestrator side
type WorkspaceSpec = {
tenant_id: string;
repos: Array<{ url: string; ref: string; mount: string }>;
credentials: CredentialRef[];
snapshot_id?: string; // resume from saved state
};
```
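
For illustration, a spec for a two-repo task resumed from a saved snapshot might look like this (tenant id, URLs, refs, and snapshot id are all made up):

```typescript
const spec: WorkspaceSpec = {
  tenant_id: "tenant-acme",
  repos: [
    { url: "git@github.com:acme/service-api.git", ref: "main", mount: "/work/service-api" },
    { url: "git@github.com:acme/service-web.git", ref: "feat/login", mount: "/work/service-web" },
  ],
  credentials: [], // per-workspace git token / API keys, injected at VM start
  snapshot_id: "snap-login-refactor-01", // omit to boot a fresh workspace
};
```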
**Cross-repo artifact DAG (new primitive, not in either system yet):**
When a task produces artifacts spanning multiple repos, HTDAG needs to track which
commits in which repos constitute "done". This is the **cross-repo task graph**,
likely a new node type in the HTDAG structure. Its design is deferred until the
workspace VM primitive is stable; one possible shape is sketched below.
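
Purely as a placeholder for that deferred design, the node could carry a set of expected commits, one per repo, and count as complete only once every repo has its landed commit recorded:

```typescript
// Placeholder sketch; not part of HTDAG today.
type CrossRepoArtifactNode = {
  kind: "cross_repo_artifact";
  workspace_id: string;
  artifacts: Array<{ repo_url: string; branch: string; commit_sha?: string }>;
};

// "Done" means every repo in the list has a recorded commit.
const isDone = (node: CrossRepoArtifactNode): boolean =>
  node.artifacts.every((a) => a.commit_sha !== undefined);
```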
---
## Incremental convergence path
### Phase 1 — SF continues, ACE gets built (now)
- SF runs autonomous milestones on `ace-coder`. No changes to SF.
- ACE develops its HTDAG, PM, and worker primitives independently.
- Both systems mature on their own tracks.
### Phase 2 — Federated memory (near-term, ADR-012 Tier 1)
- Wire `memory-store.ts` remote-mode → singularity-memory HTTP endpoint.
- SF instances on different machines share learnings.
- ACE connects to the same singularity-memory endpoint (same MCP wire).
- **Outcome:** shared knowledge layer operational before execution convergence.
### Phase 3 — Workspace VM opt-in for SF (medium-term)
- Build `sf-workspace` shim: thin Rust binary that manages Firecracker VMs.
- SF's `runUnit()` dispatches to a workspace VM instead of a raw Claude Code session
  when the project preference `workspace.isolation: "vm"` is set (see the sketch after this list).
- Git worktree path remains for projects that haven't opted in.
- **Outcome:** SF can run multi-repo and multi-tenant workloads experimentally.
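
The opt-in branch itself is small. In the sketch below, only `runUnit()` and the `workspace.isolation` preference come from this ADR; the types and dispatch helpers are hypothetical stand-ins:

```typescript
// Sketch of the Phase 3 branch inside SF's dispatcher (helper names are hypothetical).
type DispatchUnit = { id: string; task: string };
type ProjectPrefs = { workspace?: { isolation?: "vm" | "worktree" } };
type UnitResult = { unitId: string; status: "done" | "failed" };

declare function dispatchToWorkspaceVm(unit: DispatchUnit): Promise<UnitResult>;     // sf-workspace shim → Firecracker
declare function dispatchToWorktreeSession(unit: DispatchUnit): Promise<UnitResult>; // existing git-worktree path

async function runUnit(unit: DispatchUnit, prefs: ProjectPrefs): Promise<UnitResult> {
  if (prefs.workspace?.isolation === "vm") {
    return dispatchToWorkspaceVm(unit);
  }
  return dispatchToWorktreeSession(unit);
}
```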
### Phase 4 — ACE workers → workspace VMs (parallel to Phase 3)
- ACE's `execution/worker.py` (async task pool) gains workspace VM dispatch path.
- ACE fills the explicit gap noted in its own competitive analysis:
*"No per-task container sandboxing... ACE has process-level sandboxing only."*
- RBAC + tenant_id at data layer + VM at execution layer = full multi-tenant ACE.
- **Outcome:** ACE can handle multi-tenant, multi-repo workloads.
### Phase 5 — Shared workspace protocol
- SF and ACE converge on the same `WorkspaceSpec` wire format (one possible envelope is sketched after this list).
- SF's orchestrator can dispatch to ACE's worker pool (and vice versa).
- The `sf-workspace` shim and ACE's VM dispatch path are the same binary.
- **Outcome:** two orchestrators, one execution substrate.
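
One possible envelope for that shared protocol, illustrative only: the agreed piece is `WorkspaceSpec`; the surrounding fields are hypothetical.

```typescript
// Hypothetical Phase 5 envelope. Either orchestrator can emit it; the receiving
// worker pool only needs the spec, and routes results back via task_ref/callback.
type WorkspaceDispatch = {
  spec: WorkspaceSpec;
  origin: "sf" | "ace";   // which orchestrator dispatched the work
  task_ref: string;       // SF unit id or ACE HTDAG node id
  callback: string;       // where completion events are delivered
};
```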
### Phase 6 — Orchestration convergence (long-term)
- SF's state machine (milestone → slice → task) becomes an ACE PM persona.
- ACE's HTDAG becomes the unified orchestration backbone.
- SF's CLI and headless mode remain as user-facing entry points (they don't go away —
they become ACE clients over MCP).
- **Outcome:** one system with SF's reliability and ACE's generality.
---
## What is NOT in scope for this ADR
- Cross-tenant knowledge federation (single trust domain per deployment for now)
- Public-internet exposure (tailnet-only, per ADR-013)
- Replacing SF's state machine before Phase 6 — it works, don't touch it
- Choosing the agent runtime inside the VM — language-agnostic by design
- Cross-repo artifact DAG implementation — deferred to after Phase 3
---
## Risks
| Risk | Mitigation |
|------|-----------|
| Firecracker cold-start latency (~125ms) is too slow for short SF tasks | Keep git-worktree path as fallback; VMs for tasks >5min |
| VM snapshot size grows unboundedly for persistent agents | Snapshot rotation policy, same as activity log retention |
| ACE HTDAG not stable enough for Phase 5 | Phase 5 is gated on ACE reliability, not a timeline. SF works fine until then. |
| singularity-memory Go migration stalls | Phase 2 can use the Python server; migration is not on the critical path |
| Cross-repo DAG design takes longer than expected | Phases 1–5 work without it; single-repo workspace is the common case |
---
## References
- SF `docs/dev/ADR-012-multi-instance-federation.md` — federation surfaces
- SF `docs/dev/ADR-013-network-and-remote-execution.md` — tailnet + SSH workers
- SF `docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md` — Go migration (phases 0–3 only)
- ACE `docs/architecture/sf-ace-convergence.md` — this ADR from ACE's perspective
- ACE `ARCHITECTURE.md` §Sandbox — "No per-task container sandboxing" gap
- ACE `docs/architecture/data-and-storage.md` — tenant_id schema
- [Firecracker](https://firecracker-microvm.github.io/) — microVM hypervisor