diff --git a/docs/design-docs/index.md b/docs/design-docs/index.md
index 762f49f20..10101315f 100644
--- a/docs/design-docs/index.md
+++ b/docs/design-docs/index.md
@@ -23,6 +23,7 @@ in `docs/dev/`. Lighter design docs (problem framing, event model decisions) liv
 | [ADR-016](../dev/ADR-016-charm-ai-stack-adoption.md) | Charm AI Stack Adoption | Proposed |
 | [ADR-017](../dev/ADR-017-charm-tui-client.md) | Charm TUI Client | Proposed |
 | [ADR-018](../dev/ADR-018-repo-native-harness-evolution.md) | Repo-Native Harness Evolution | Proposed — staged impl |
+| [ADR-019](../dev/ADR-019-workspace-vm-convergence.md) | Workspace VM Convergence — SF↔ACE incremental convergence via microVM execution layer | Proposed |
 
 ## Design Docs (this directory)
diff --git a/docs/dev/ADR-019-workspace-vm-convergence.md b/docs/dev/ADR-019-workspace-vm-convergence.md
new file mode 100644
index 000000000..b3678a334
--- /dev/null
+++ b/docs/dev/ADR-019-workspace-vm-convergence.md
@@ -0,0 +1,230 @@
+# ADR-019: Workspace VM Convergence Architecture
+
+**Status:** Proposed
+**Date:** 2026-05-01
+**Deciders:** Mikael Hugo
+**Context repos:** `singularity-forge` (SF), `ace-coder` (ACE)
+
+> **Cross-repo note:** The matching document in ACE-coder lives at
+> `docs/architecture/sf-ace-convergence.md`. Both documents describe the same
+> architecture from each codebase's perspective. Keep them in sync when either
+> changes.
+
+---
+
+## Context
+
+Two autonomous agent systems are being developed in parallel:
+
+- **SF** (`singularity-forge`) — TypeScript orchestrator. Works today. Dispatches
+  Claude Code sessions as ephemeral units (milestone → slice → task). Isolation
+  via git worktrees. Single-repo, single-user.
+
+- **ACE** (`ace-coder`) — Python platform. Partially operational. HTDAG execution
+  backbone, Project Manager ownership, 20 defined agent personas, LiteLLM
+  multi-provider, RBAC, PGMQ task queue, tiered memory.
+  Multi-tenant data model (`tenant_id`) exists; per-task execution isolation
+  does not.
+
+- **singularity-memory** — Separate Go service (migrating from Python per ADR-014).
+  Postgres + vchord vector store. Federated knowledge layer shared across SF, ACE,
+  Claude Code, Cursor, and other tools over MCP.
+
+Both systems share the same end destination but are approaching it from different
+directions. SF is production-reliable but architecturally constrained (single-repo,
+git-worktree isolation). ACE has the right orchestration primitives (HTDAG, PM,
+RBAC, tenant model) but lacks execution isolation and is not yet production-reliable.
+
+The strategy is **incremental convergence**: SF continues to work and delivers value
+while autonomously helping build out ACE. As ACE becomes reliable, SF's dispatch
+model transitions to use ACE's execution substrate. They meet at the workspace VM
+boundary.
+
+---
+
+## Decision
+
+### The unifying primitive: Workspace
+
+```
+workspace = VM (microVM) + tenant_id + [repo_1, repo_2, ...] + scoped_credentials
+```
+
+A **workspace** is the execution isolation unit for both systems. It replaces:
+- SF's git worktree per milestone
+- ACE's process-level `execution/worker.py` per task
+
+A workspace is:
+- **A microVM** (Firecracker) — hard process/filesystem/network isolation at the
+  hypervisor level. Firecracker was built by AWS specifically for multi-tenant
+  Lambda; it provides the isolation both systems need without reimplementing it.
+- **Tenant-scoped** — maps to ACE's existing `tenant_id` on `agent_memory` and
+  `task_queue`. The VM boundary is the enforcement point; the database tenant_id
+  is the tracking point.
+- **Multi-repo** — the orchestrator tells the VM which repos to clone/mount. The
+  VM operates across all of them. Cross-repo work is trivially a list.
+- **Credential-scoped** — per-workspace credentials (git tokens, API keys) are
+  injected at VM start and destroyed at VM exit. Never shared across tenants.
+- **Snapshot/restore** — VM state snapshots replace `.sf/paused-session.json` and
+  ACE's `checkpoint_service`. A "persistent agent" is a named snapshot: restore it,
+  and the agent wakes with full memory and context intact.
+
+### The three-layer architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│ Orchestration layer                                                 │
+│                                                                     │
+│ Near-term: SF (TS state machine) dispatches to workspace VMs        │
+│ Long-term: ACE HTDAG/PM becomes the orchestration backbone;         │
+│            SF state machine becomes an ACE PM persona               │
+│                                                                     │
+│ Language: TS (SF today) → Python (ACE, when reliable)               │
+├─────────────────────────────────────────────────────────────────────┤
+│ Knowledge layer                                                     │
+│                                                                     │
+│ singularity-memory: Go + Postgres + vchord + MCP server             │
+│ Serves all consumers (SF, ACE, Claude Code, Cursor) over MCP.       │
+│ Tenant-scoped knowledge banks (to be designed — see below).         │
+│                                                                     │
+│ Language: Go (ADR-014 migration, phases 0–3 only — NOT phase 4)     │
+├─────────────────────────────────────────────────────────────────────┤
+│ Execution layer                                                     │
+│                                                                     │
+│ workspace = VM + tenant + repos + credentials                       │
+│ One workspace per dispatch unit.                                    │
+│ VM management shim: Rust (Firecracker API is Rust-native).          │
+│ Agent runtime inside VM: whatever the task requires.                │
+│                                                                     │
+│ Language: Rust (VM shim) + anything (inside VM)                     │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+### ADR-014 Phase 4 is reassigned
+
+ADR-014 proposed building a "central persistent-agent runtime" in Go using
+`charmbracelet/fantasy`. This is **cancelled**. Persistent agents live as VM
+snapshots managed by ACE's orchestration layer — not as a separate Go runtime.
+singularity-memory (Go) scopes to the knowledge layer only (ADR-014 phases 0–3).
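The workspace lifecycle implied above — boot with injected credentials, snapshot to get a "persistent agent", restore to wake it — can be sketched as state transitions. This is a hedged illustration only: none of these names exist in SF, ACE, or the Firecracker API; a real implementation would go through the Rust VM shim.

```typescript
// Hypothetical sketch of the workspace lifecycle. All names are invented
// for illustration; this models state transitions, not real VM calls.
type WorkspaceState = "running" | "snapshotted" | "destroyed";

interface Workspace {
  id: string;
  tenantId: string;
  state: WorkspaceState;
  snapshotId?: string;
}

// Credentials are injected at start and must not outlive the VM.
function startWorkspace(tenantId: string, _credentialRefs: string[]): Workspace {
  // A real implementation would ask the Rust shim to boot a Firecracker VM.
  return { id: `ws-${tenantId}-${Date.now()}`, tenantId, state: "running" };
}

// A "persistent agent" is just a named snapshot of a running workspace.
function snapshot(ws: Workspace, name: string): Workspace {
  if (ws.state !== "running") throw new Error("can only snapshot a running VM");
  return { ...ws, state: "snapshotted", snapshotId: name };
}

function restore(ws: Workspace): Workspace {
  if (ws.state !== "snapshotted" || !ws.snapshotId)
    throw new Error("nothing to restore");
  return { ...ws, state: "running" }; // agent wakes with memory intact
}
```

The point of the sketch is that "persistent agent" needs no dedicated runtime: it is a snapshot name plus a restore call, which is why ADR-014 Phase 4 becomes unnecessary.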
+
+---
+
+## Multi-tenant design
+
+**What exists (ACE):**
+- `tenant_id` on `agent_memory`, `task_queue` ✅
+- RBAC (`rbac_capability_policy.py`, agent permission levels) ✅
+- PM-driven governance and approval gates ✅
+
+**What needs to be built:**
+- Tenant-scoped knowledge banks in singularity-memory (each tenant's memory is
+  isolated; cross-tenant sharing requires explicit federation grants)
+- VM pool with per-tenant resource quotas (CPU, RAM, disk, LLM token budget)
+- Cost accounting: LLM gateway already tracks per-worker; extend with `tenant_id`
+- Headscale ACL rules per tenant (each tenant's VMs on their own tailnet ACL)
+
+**The enforcement model:**
+```
+tenant_id (DB) + VM boundary (hypervisor) + Headscale ACL (network) = full isolation
+```
+
+No single layer is sufficient alone. All three enforce the same boundary from
+different angles.
+
+---
+
+## Multi-repo design
+
+**The workspace repo list:**
+
+```typescript
+// SF orchestrator side
+type WorkspaceSpec = {
+  tenant_id: string;
+  repos: Array<{ url: string; ref: string; mount: string }>;
+  credentials: CredentialRef[];
+  snapshot_id?: string; // resume from saved state
+};
+```
+
+**Cross-repo artifact DAG (new primitive, not in either system yet):**
+
+When a task produces artifacts spanning multiple repos, HTDAG needs to track which
+commits in which repos constitute "done". This is the **cross-repo task graph** —
+probably a new node type in HTDAG's DAG structure. Design deferred until the
+workspace VM primitive is stable.
+
+---
+
+## Incremental convergence path
+
+### Phase 1 — SF continues, ACE gets built (now)
+- SF runs autonomous milestones on `ace-coder`. No changes to SF.
+- ACE develops its HTDAG, PM, and worker primitives independently.
+- Both systems mature on their own tracks.
+
+### Phase 2 — Federated memory (near-term, ADR-012 Tier 1)
+- Wire `memory-store.ts` remote-mode → singularity-memory HTTP endpoint.
+- SF instances on different machines share learnings.
+- ACE connects to the same singularity-memory endpoint (same MCP wire).
+- **Outcome:** shared knowledge layer operational before execution convergence.
+
+### Phase 3 — Workspace VM opt-in for SF (medium-term)
+- Build the `sf-workspace` shim: a thin Rust binary that manages Firecracker VMs.
+- SF's `runUnit()` dispatches to a workspace VM instead of a raw Claude Code
+  session when the project preference `workspace.isolation: "vm"` is set.
+- The git-worktree path remains for projects that haven't opted in.
+- **Outcome:** SF can run multi-repo and multi-tenant workloads experimentally.
+
+### Phase 4 — ACE workers → workspace VMs (parallel to Phase 3)
+- ACE's `execution/worker.py` (async task pool) gains a workspace VM dispatch path.
+- ACE fills the explicit gap noted in its own competitive analysis:
+  *"No per-task container sandboxing... ACE has process-level sandboxing only."*
+- RBAC + tenant_id at the data layer + VM at the execution layer = full
+  multi-tenant ACE.
+- **Outcome:** ACE can handle multi-tenant, multi-repo workloads.
+
+### Phase 5 — Shared workspace protocol
+- SF and ACE converge on the same `WorkspaceSpec` wire format.
+- SF's orchestrator can dispatch to ACE's worker pool (and vice versa).
+- The `sf-workspace` shim and ACE's VM dispatch path become the same binary.
+- **Outcome:** two orchestrators, one execution substrate.
+
+### Phase 6 — Orchestration convergence (long-term)
+- SF's state machine (milestone → slice → task) becomes an ACE PM persona.
+- ACE's HTDAG becomes the unified orchestration backbone.
+- SF's CLI and headless mode remain as user-facing entry points (they don't go
+  away — they become ACE clients over MCP).
+- **Outcome:** one system with SF's reliability and ACE's generality.
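Phase 5's "same wire format" implies both orchestrators validate specs identically before dispatch. A minimal sketch, reusing the `WorkspaceSpec` shape from the multi-repo section — the validator and round-trip helper are assumptions about what a shared protocol would need, not existing code in either repo:

```typescript
// WorkspaceSpec copied from the multi-repo section of this ADR.
type CredentialRef = string; // placeholder; the ADR leaves this type open

type WorkspaceSpec = {
  tenant_id: string;
  repos: Array<{ url: string; ref: string; mount: string }>;
  credentials: CredentialRef[];
  snapshot_id?: string;
};

// Both orchestrators must reject a spec the other could not execute.
function validateSpec(spec: WorkspaceSpec): string[] {
  const errors: string[] = [];
  if (!spec.tenant_id) errors.push("tenant_id is required");
  if (spec.repos.length === 0) errors.push("at least one repo is required");
  const mounts = new Set(spec.repos.map((r) => r.mount));
  if (mounts.size !== spec.repos.length) errors.push("mount points must be unique");
  return errors;
}

// The spec is plain JSON on the wire, so a round trip must be lossless.
function roundTrip(spec: WorkspaceSpec): WorkspaceSpec {
  return JSON.parse(JSON.stringify(spec)) as WorkspaceSpec;
}
```

Keeping the spec JSON-serializable is what lets the TS orchestrator and the Python worker pool share it without a codegen step.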
+
+---
+
+## What is NOT in scope for this ADR
+
+- Cross-tenant knowledge federation (single trust domain per deployment for now)
+- Public-internet exposure (tailnet-only, per ADR-013)
+- Replacing SF's state machine before Phase 6 — it works; don't touch it
+- Choosing the agent runtime inside the VM — language-agnostic by design
+- Cross-repo artifact DAG implementation — deferred to after Phase 3
+
+---
+
+## Risks
+
+| Risk | Mitigation |
+|------|------------|
+| Firecracker cold-start latency (~125ms) is too slow for short SF tasks | Keep the git-worktree path as a fallback; use VMs for tasks >5min |
+| VM snapshot size grows unboundedly for persistent agents | Snapshot rotation policy, same as activity-log retention |
+| ACE HTDAG not stable enough for Phase 5 | Phase 5 is gated on ACE reliability, not a timeline; SF works fine until then |
+| singularity-memory Go migration stalls | Phase 2 can use the Python server; the migration is not on the critical path |
+| Cross-repo DAG design takes longer than expected | Phases 1–5 work without it; the single-repo workspace is the common case |
+
+---
+
+## References
+
+- SF `docs/dev/ADR-012-multi-instance-federation.md` — federation surfaces
+- SF `docs/dev/ADR-013-network-and-remote-execution.md` — tailnet + SSH workers
+- SF `docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md` — Go migration (phases 0–3 only)
+- ACE `docs/architecture/sf-ace-convergence.md` — this ADR from ACE's perspective
+- ACE `ARCHITECTURE.md` §Sandbox — "No per-task container sandboxing" gap
+- ACE `docs/architecture/data-and-storage.md` — tenant_id schema
+- [Firecracker](https://firecracker-microvm.github.io/) — microVM hypervisor
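The first mitigation in the risks table — fall back to the git-worktree path when a task is too short to amortize Firecracker's cold start — could reduce to a small dispatch heuristic. A sketch under stated assumptions: the threshold constant and function name are invented here, and the ">5min" cutoff is taken directly from the table:

```typescript
// Hypothetical dispatch heuristic for the cold-start risk mitigation.
const VM_MIN_TASK_SECONDS = 5 * 60; // the ">5min" cutoff from the risks table

type Isolation = "vm" | "worktree";

function chooseIsolation(estimatedSeconds: number, vmOptIn: boolean): Isolation {
  // Projects that never set `workspace.isolation: "vm"` always use worktrees.
  if (!vmOptIn) return "worktree";
  // Short tasks cannot amortize the VM cold start, so they stay on the
  // worktree path even when the project has opted in.
  if (estimatedSeconds <= VM_MIN_TASK_SECONDS) return "worktree";
  return "vm";
}
```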