docs: add ADR-019 workspace VM convergence architecture
Captures the SF↔ACE incremental convergence strategy: workspace VMs
(Firecracker) as the unified execution isolation primitive, the three-layer
architecture (orchestration/knowledge/execution), the 6-phase convergence
path, and ADR-014 Phase 4 cancellation (persistent-agent runtime reassigned
to ACE). Cross-references the matching ACE document at
docs/architecture/sf-ace-convergence.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
parent
10936277a5
commit
0976bbbb83
2 changed files with 231 additions and 0 deletions
@@ -23,6 +23,7 @@ in `docs/dev/`. Lighter design docs (problem framing, event model decisions) liv
| [ADR-016](../dev/ADR-016-charm-ai-stack-adoption.md) | Charm AI Stack Adoption | Proposed |
| [ADR-017](../dev/ADR-017-charm-tui-client.md) | Charm TUI Client | Proposed |
| [ADR-018](../dev/ADR-018-repo-native-harness-evolution.md) | Repo-Native Harness Evolution | Proposed — staged impl |
| [ADR-019](../dev/ADR-019-workspace-vm-convergence.md) | Workspace VM Convergence — SF↔ACE incremental convergence via microVM execution layer | Proposed |
## Design Docs (this directory)
docs/dev/ADR-019-workspace-vm-convergence.md — 230 lines — Normal file
@@ -0,0 +1,230 @@

# ADR-019: Workspace VM Convergence Architecture

**Status:** Proposed
**Date:** 2026-05-01
**Deciders:** Mikael Hugo
**Context repos:** `singularity-forge` (SF), `ace-coder` (ACE)

> **Cross-repo note:** The matching document in ACE-coder lives at
> `docs/architecture/sf-ace-convergence.md`. Both documents describe the same
> architecture from each codebase's perspective. Keep them in sync when either
> changes.

---

## Context

Two autonomous agent systems are being developed in parallel:

- **SF** (`singularity-forge`) — TypeScript orchestrator. Works today. Dispatches
  Claude Code sessions as ephemeral units (milestone → slice → task). Isolation
  via git worktrees. Single-repo, single-user.

- **ACE** (`ace-coder`) — Python platform. Partially operational. HTDAG execution
  backbone, Project Manager ownership, 20 defined agent personas, LiteLLM
  multi-provider, RBAC, PGMQ task queue, tiered memory. Multi-tenant data model
  (`tenant_id`) exists; per-task execution isolation does not.

- **singularity-memory** — Separate Go service (migrating from Python per ADR-014).
  Postgres + vchord vector store. Federated knowledge layer shared across SF, ACE,
  Claude Code, Cursor, and other tools over MCP.

Both systems share the same end destination but are approaching it from different
directions. SF is production-reliable but architecturally constrained (single-repo,
git-worktree isolation). ACE has the right orchestration primitives (HTDAG, PM,
RBAC, tenant model) but lacks execution isolation and is not yet production-reliable.

The strategy is **incremental convergence**: SF continues to work and delivers value
while autonomously helping build out ACE. As ACE becomes reliable, SF's dispatch
model transitions to use ACE's execution substrate. They meet at the workspace VM
boundary.

---

## Decision

### The unifying primitive: Workspace

```
workspace = VM (microVM) + tenant_id + [repo_1, repo_2, ...] + scoped_credentials
```

A **workspace** is the execution isolation unit for both systems. It replaces:

- SF's git worktree per milestone
- ACE's process-level `execution/worker.py` per task

A workspace is:

- **A microVM** (Firecracker) — hard process/filesystem/network isolation at the
  hypervisor level. Firecracker was built by AWS specifically for multi-tenant
  Lambda; it provides the isolation both systems need without reimplementing it.
- **Tenant-scoped** — maps to ACE's existing `tenant_id` on `agent_memory` and
  `task_queue`. The VM boundary is the enforcement point; the database tenant_id
  is the tracking point.
- **Multi-repo** — the orchestrator tells the VM which repos to clone/mount. The
  VM operates across all of them. Cross-repo work is trivially a list.
- **Credential-scoped** — per-workspace credentials (git tokens, API keys) are
  injected at VM start and destroyed at VM exit. Never shared across tenants.
- **Snapshot/restore** — VM state snapshots replace `.sf/paused-session.json` and
  ACE's `checkpoint_service`. A "persistent agent" is a named snapshot: restore it,
  the agent wakes with full memory and context intact.
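
The lifecycle these bullets imply can be sketched in TypeScript. Everything below is illustrative — none of these names exist in SF or ACE today, and a real implementation would drive the Firecracker API from the Rust shim rather than model state in the orchestrator:

```typescript
// Illustrative sketch only: the workspace lifecycle described above.
// All names are hypothetical; none exist in SF or ACE yet.
type WorkspaceState = "running" | "snapshotted" | "destroyed";

interface Workspace {
  id: string;
  tenantId: string;
  repos: string[];       // repos cloned/mounted into the VM
  state: WorkspaceState;
  snapshotId?: string;   // set once the workspace has been snapshotted
}

function createWorkspace(tenantId: string, repos: string[]): Workspace {
  // In a real shim: boot the microVM, clone repos, inject scoped credentials.
  return { id: `ws-${Date.now()}`, tenantId, repos, state: "running" };
}

function snapshot(ws: Workspace): Workspace {
  // A "persistent agent" is exactly this: a named snapshot to restore later.
  return { ...ws, state: "snapshotted", snapshotId: `snap-${ws.id}` };
}

function destroy(ws: Workspace): Workspace {
  // Credentials injected at VM start die with the VM.
  return { ...ws, state: "destroyed" };
}
```

The point of the sketch is the state transitions, not the data shape: snapshot/restore is a first-class state, not a side file like `.sf/paused-session.json`.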

### The three-layer architecture

```
┌──────────────────────────────────────────────────────────────────┐
│ Orchestration layer                                              │
│                                                                  │
│ Near-term: SF (TS state machine) dispatches to workspace VMs     │
│ Long-term: ACE HTDAG/PM becomes the orchestration backbone;      │
│            SF state machine becomes an ACE PM persona            │
│                                                                  │
│ Language: TS (SF today) → Python (ACE, when reliable)            │
├──────────────────────────────────────────────────────────────────┤
│ Knowledge layer                                                  │
│                                                                  │
│ singularity-memory: Go + Postgres + vchord + MCP server          │
│ Serves all consumers (SF, ACE, Claude Code, Cursor) over MCP.    │
│ Tenant-scoped knowledge banks (to be designed — see below).      │
│                                                                  │
│ Language: Go (ADR-014 migration, phases 0–3 only — NOT phase 4)  │
├──────────────────────────────────────────────────────────────────┤
│ Execution layer                                                  │
│                                                                  │
│ workspace = VM + tenant + repos + credentials                    │
│ One workspace per dispatch unit.                                 │
│ VM management shim: Rust (Firecracker API is Rust-native).       │
│ Agent runtime inside VM: whatever the task requires.             │
│                                                                  │
│ Language: Rust (VM shim) + anything (inside VM)                  │
└──────────────────────────────────────────────────────────────────┘
```

### ADR-014 Phase 4 is reassigned

ADR-014 proposed building a "central persistent-agent runtime" in Go using
`charmbracelet/fantasy`. This is **cancelled**. Persistent agents live as VM
snapshots managed by ACE's orchestration layer — not as a separate Go runtime.
singularity-memory (Go) scopes to the knowledge layer only (ADR-014 phases 0–3).

---

## Multi-tenant design

**What exists (ACE):**

- `tenant_id` on `agent_memory`, `task_queue` ✅
- RBAC (`rbac_capability_policy.py`, agent permission levels) ✅
- PM-driven governance and approval gates ✅

**What needs to be built:**

- Tenant-scoped knowledge banks in singularity-memory (each tenant's memory is
  isolated; cross-tenant sharing requires explicit federation grants)
- VM pool with per-tenant resource quotas (CPU, RAM, disk, LLM token budget)
- Cost accounting: LLM gateway already tracks per-worker; extend with `tenant_id`
- Headscale ACL rules per tenant (each tenant's VMs on their own tailnet ACL)

**The enforcement model:**

```
tenant_id (DB) + VM boundary (hypervisor) + Headscale ACL (network) = full isolation
```

No single layer is sufficient alone. All three enforce the same boundary from
different angles.
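
As a minimal sketch, the "no single layer is sufficient" rule is a conjunction: a dispatch is admitted only when every boundary agrees on the tenant. The type and function names below are invented for illustration:

```typescript
// Illustrative only: one check per enforcement layer named above.
interface IsolationCheck {
  dbTenantMatches: boolean;   // tenant_id on the task/memory rows (tracking point)
  vmBoundaryActive: boolean;  // task runs inside its own microVM (enforcement point)
  aclScopedToTenant: boolean; // Headscale ACL limits the VM's tailnet peers (network)
}

function isFullyIsolated(c: IsolationCheck): boolean {
  // All three must hold; a failure in any one layer rejects the dispatch.
  return c.dbTenantMatches && c.vmBoundaryActive && c.aclScopedToTenant;
}
```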

---

## Multi-repo design

**The workspace repo list:**

```typescript
// SF orchestrator side
type WorkspaceSpec = {
  tenant_id: string;
  repos: Array<{ url: string; ref: string; mount: string }>;
  credentials: CredentialRef[];
  snapshot_id?: string; // resume from saved state
};
```
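
For concreteness, a hypothetical two-repo spec in that shape. The repo URLs, mounts, and credential reference are invented, and `CredentialRef` is reduced to a string stand-in for whatever the real type becomes:

```typescript
// Self-contained restatement of the spec shape, with a string stand-in
// for CredentialRef. The concrete values below are purely illustrative.
type CredentialRef = string;

type WorkspaceSpec = {
  tenant_id: string;
  repos: Array<{ url: string; ref: string; mount: string }>;
  credentials: CredentialRef[];
  snapshot_id?: string; // resume from saved state
};

// A hypothetical cross-repo dispatch: SF and ACE checked out side by side.
const spec: WorkspaceSpec = {
  tenant_id: "tenant-a",
  repos: [
    { url: "git@example.com:org/singularity-forge.git", ref: "main", mount: "/work/sf" },
    { url: "git@example.com:org/ace-coder.git", ref: "main", mount: "/work/ace" },
  ],
  credentials: ["git-token-tenant-a"],
  // no snapshot_id: this is a fresh workspace, not a restored persistent agent
};
```

Cross-repo work really is "trivially a list": adding a third repo is one more array entry, not a new isolation mechanism.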

**Cross-repo artifact DAG (new primitive, not in either system yet):**

When a task produces artifacts spanning multiple repos, HTDAG needs to track which
commits in which repos constitute "done". This is the **cross-repo task graph** —
probably a new node type in HTDAG's DAG structure. Design deferred until the
workspace VM primitive is stable.
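
One possible shape for that record, purely as a sketch — the design is explicitly deferred, so every name here is speculative:

```typescript
// Speculative sketch only: "done" for a cross-repo task as a set of
// (repo, commit) pairs. The real HTDAG node type may look nothing like this.
interface CrossRepoResult {
  taskId: string;
  commits: Array<{ repo: string; sha: string }>;
}

function isDone(r: CrossRepoResult, expectedRepos: string[]): boolean {
  // Every repo the task was expected to touch must have a recorded commit.
  return expectedRepos.every((repo) => r.commits.some((c) => c.repo === repo));
}
```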

---

## Incremental convergence path

### Phase 1 — SF continues, ACE gets built (now)

- SF runs autonomous milestones on `ace-coder`. No changes to SF.
- ACE develops its HTDAG, PM, and worker primitives independently.
- Both systems mature on their own tracks.

### Phase 2 — Federated memory (near-term, ADR-012 Tier 1)

- Wire `memory-store.ts` remote-mode → singularity-memory HTTP endpoint.
- SF instances on different machines share learnings.
- ACE connects to the same singularity-memory endpoint (same MCP wire).
- **Outcome:** shared knowledge layer operational before execution convergence.
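
A sketch of what the remote-mode switch might look like. The config shape, function name, and local placeholder are assumptions for illustration, not the real `memory-store.ts` API:

```typescript
// Illustrative only: dispatching between the existing local store and the
// singularity-memory HTTP endpoint. Names and shapes are assumptions.
type MemoryMode = "local" | "remote";

interface MemoryConfig {
  mode: MemoryMode;
  remoteUrl?: string; // singularity-memory HTTP endpoint when mode === "remote"
}

function resolveTarget(cfg: MemoryConfig): string {
  if (cfg.mode === "remote") {
    if (!cfg.remoteUrl) throw new Error("remote mode requires remoteUrl");
    return cfg.remoteUrl;
  }
  return "local-store"; // placeholder for whatever the local backend is
}
```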

### Phase 3 — Workspace VM opt-in for SF (medium-term)

- Build `sf-workspace` shim: thin Rust binary that manages Firecracker VMs.
- SF's `runUnit()` dispatches to workspace VM instead of raw Claude Code session
  when project preference `workspace.isolation: "vm"` is set.
- Git worktree path remains for projects that haven't opted in.
- **Outcome:** SF can run multi-repo and multi-tenant workloads experimentally.
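
The opt-in branch could be as small as this sketch. Only the preference key `workspace.isolation: "vm"` comes from this ADR; the function and type names are invented:

```typescript
// Illustrative only: the opt-in branch runUnit() might take.
// Only the preference key "workspace.isolation" comes from the ADR text.
type IsolationMode = "worktree" | "vm";

function chooseIsolation(prefs: Record<string, string>): IsolationMode {
  // Git worktrees stay the default; VM isolation is per-project opt-in.
  return prefs["workspace.isolation"] === "vm" ? "vm" : "worktree";
}
```

Defaulting to `"worktree"` is what keeps Phase 3 non-breaking: projects that never set the preference see no behavior change.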

### Phase 4 — ACE workers → workspace VMs (parallel to Phase 3)

- ACE's `execution/worker.py` (async task pool) gains workspace VM dispatch path.
- ACE fills the explicit gap noted in its own competitive analysis:
  *"No per-task container sandboxing... ACE has process-level sandboxing only."*
- RBAC + tenant_id at data layer + VM at execution layer = full multi-tenant ACE.
- **Outcome:** ACE can handle multi-tenant, multi-repo workloads.

### Phase 5 — Shared workspace protocol

- SF and ACE converge on the same `WorkspaceSpec` wire format.
- SF's orchestrator can dispatch to ACE's worker pool (and vice versa).
- The `sf-workspace` shim and ACE's VM dispatch path are the same binary.
- **Outcome:** two orchestrators, one execution substrate.

### Phase 6 — Orchestration convergence (long-term)

- SF's state machine (milestone → slice → task) becomes an ACE PM persona.
- ACE's HTDAG becomes the unified orchestration backbone.
- SF's CLI and headless mode remain as user-facing entry points (they don't go away —
  they become ACE clients over MCP).
- **Outcome:** one system with SF's reliability and ACE's generality.

---

## What is NOT in scope for this ADR

- Cross-tenant knowledge federation (single trust domain per deployment for now)
- Public-internet exposure (tailnet-only, per ADR-013)
- Replacing SF's state machine before Phase 6 — it works, don't touch it
- Choosing the agent runtime inside the VM — language-agnostic by design
- Cross-repo artifact DAG implementation — deferred to after Phase 3

---

## Risks

| Risk | Mitigation |
|------|------------|
| Firecracker cold-start latency (~125ms) is too slow for short SF tasks | Keep git-worktree path as fallback; VMs for tasks >5min |
| VM snapshot size grows unboundedly for persistent agents | Snapshot rotation policy, same as activity log retention |
| ACE HTDAG not stable enough for Phase 5 | Phase 5 is gated on ACE reliability, not a timeline. SF works fine until then. |
| singularity-memory Go migration stalls | Phase 2 can use the Python server; migration is not on the critical path |
| Cross-repo DAG design takes longer than expected | Phases 1–5 work without it; single-repo workspace is the common case |

---

## References

- SF `docs/dev/ADR-012-multi-instance-federation.md` — federation surfaces
- SF `docs/dev/ADR-013-network-and-remote-execution.md` — tailnet + SSH workers
- SF `docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md` — Go migration (phases 0–3 only)
- ACE `docs/architecture/sf-ace-convergence.md` — this ADR from ACE's perspective
- ACE `ARCHITECTURE.md` §Sandbox — "No per-task container sandboxing" gap
- ACE `docs/architecture/data-and-storage.md` — tenant_id schema
- [Firecracker](https://firecracker-microvm.github.io/) — microVM hypervisor