# ADR-019: Workspace VM Convergence Architecture

**Status:** Proposed
**Date:** 2026-05-01
**Deciders:** Mikael Hugo
**Context repos:** `singularity-forge` (SF), `ace-coder` (ACE)

> **Cross-repo note:** The matching document in ACE-coder lives at
> `docs/architecture/sf-ace-convergence.md`. Both documents describe the same
> architecture from each codebase's perspective. Keep them in sync when either
> changes.

---

## Context

Two autonomous agent systems are being developed in parallel:

- **SF** (`singularity-forge`) — TypeScript orchestrator. Works today. Dispatches
  Claude Code sessions as ephemeral units (milestone → slice → task). Isolation
  via git worktrees. **Single-machine, single-user, single-repo by design.** That
  scope is its character, not a limitation. SF stays a standalone app permanently;
  it does not grow into a platform.

- **ACE** (`ace-coder`) — Python platform. Partially operational. HTDAG execution
  backbone, Project Manager ownership, 20 defined agent personas, LiteLLM
  multi-provider, RBAC, PGMQ task queue, tiered memory. Multi-tenant data model
  (`tenant_id`) exists; per-task execution isolation does not. ACE is where
  multi-tenant, multi-repo, federated workloads live.

- **singularity-memory** — Separate Go service (migrating from Python per ADR-014).
  Postgres + vchord vector store. Federated knowledge layer.
  - **Internal consumers** (SF, ACE, future first-party services) talk to it via
    typed direct clients (HTTP/gRPC generated from the Go API). No MCP, no JSON-RPC
    framing, no protocol cost.
  - **External coding tools** (Claude Code, Cursor, third-party LLM clients) get
    an MCP façade. This is a temporary scaffold so external coders can read/write
    memory while they help build the system; it is not the production wire for
    internal services and is expected to shrink once the system is self-hosting.

The two systems are **not converging into one app.** They occupy different niches:

- SF is the local single-user developer tool — fast, generic, runs on the developer's
  machine on whatever repo they're working on.
- ACE is the multi-tenant platform — federated, multi-repo, scales beyond one user.

Convergence in this ADR refers to **shared substrate**, not application merging:
shared wire schemas (singularity-grpc) and a shared execution isolation primitive
(Firecracker workspaces) when SF chooses to dispatch into one. SF can live entirely
on its own without ACE; ACE doesn't depend on SF.

The strategy is **incremental pattern transfer**: SF continues to work as a
standalone single-user tool while autonomously helping build out ACE. ACE ports
proven patterns from SF as it matures. SF gains an optional engine adapter for
dispatching units into ACE workspaces when multi-tenant or multi-repo work is
needed. Neither replaces the other.

---

## Decision

### The unifying primitive: Workspace

```
workspace = VM (microVM) + tenant_id + [repo_1, repo_2, ...] + scoped_credentials
```

A **workspace** is the execution isolation unit for both systems. It replaces:
- SF's git worktree per milestone
- ACE's process-level `execution/worker.py` per task

A workspace is:
- **A microVM** (Firecracker) — hard process/filesystem/network isolation at the
  hypervisor level. Firecracker was built by AWS specifically for multi-tenant
  Lambda; it provides the isolation both systems need without reimplementing it.
- **Tenant-scoped** — maps to ACE's existing `tenant_id` on `agent_memory` and
  `task_queue`. The VM boundary is the enforcement point; the database `tenant_id`
  is the tracking point.
- **Multi-repo** — the orchestrator tells the VM which repos to clone/mount. The
  VM operates across all of them. Cross-repo work is trivially a list.
- **Credential-scoped** — per-workspace credentials (git tokens, API keys) are
  injected at VM start and destroyed at VM exit. Never shared across tenants.
- **Snapshot/restore** — VM state snapshots replace `.sf/paused-session.json` and
  ACE's `checkpoint_service`. A "persistent agent" is a named snapshot: restore it,
  and the agent wakes with full memory and context intact. A lifecycle sketch
  follows this list.
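
Below is a minimal TypeScript sketch of the lifecycle this implies. The interfaces and method names are illustrative assumptions, not a committed API; the real surface lands with the `sf-workspace` shim in Phase 3, and `WorkspaceSpec` is defined under "Multi-repo design" below.

```typescript
// Illustrative lifecycle surface only; the committed API arrives with the
// sf-workspace shim (Phase 3). WorkspaceSpec is defined under "Multi-repo
// design" below.
interface WorkspaceHandle {
  readonly id: string;
  readonly tenant_id: string;
  // A "persistent agent" is a named snapshot of full VM state.
  snapshot(name: string): Promise<{ snapshot_id: string }>;
  // Tear down the VM; injected credentials are destroyed with it.
  destroy(): Promise<void>;
}

interface WorkspaceManager {
  // Boot a fresh microVM, clone/mount the listed repos, inject credentials.
  start(spec: WorkspaceSpec): Promise<WorkspaceHandle>;
  // Resume a named snapshot: the agent wakes with memory and context intact.
  restore(snapshot_id: string): Promise<WorkspaceHandle>;
}
```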

### The three-layer architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│ Orchestration layer                                                 │
│                                                                     │
│ Near-term: SF (TS state machine) dispatches to workspace VMs       │
│ Long-term: ACE HTDAG/PM becomes the orchestration backbone;        │
│            SF state machine becomes an ACE PM persona              │
│                                                                     │
│ Language: TS (SF today) → Python (ACE, when reliable)              │
├─────────────────────────────────────────────────────────────────────┤
│ Knowledge layer                                                     │
│                                                                     │
│ singularity-memory: Go + Postgres + vchord                         │
│ Internal services (SF, ACE) use typed direct clients (HTTP/gRPC).  │
│ External coding tools (Claude Code, Cursor) use an MCP façade —    │
│ temporary scaffold while external coders help build the system.   │
│ Tenant-scoped knowledge banks (to be designed — see below).        │
│                                                                     │
│ Language: Go (ADR-014 migration, phases 0–3 only — NOT phase 4)    │
├─────────────────────────────────────────────────────────────────────┤
│ Execution layer                                                     │
│                                                                     │
│ workspace = VM + tenant + repos + credentials                      │
│ One workspace per dispatch unit.                                   │
│ VM management shim: Rust (Firecracker API is Rust-native).         │
│ Agent runtime inside VM: whatever the task requires.               │
│                                                                     │
│ Language: Rust (VM shim) + anything (inside VM)                    │
└─────────────────────────────────────────────────────────────────────┘
```

### ADR-014 Phase 4 is reassigned

ADR-014 proposed building a "central persistent-agent runtime" in Go using
`charmbracelet/fantasy`. That runtime is **cancelled**. Persistent agents live as VM
snapshots managed by ACE's orchestration layer — not as a separate Go runtime.
singularity-memory (Go) scopes to the knowledge layer only (ADR-014 phases 0–3).

---

## Multi-tenant design

**What exists (ACE):**
- `tenant_id` on `agent_memory`, `task_queue` ✅
- RBAC (`rbac_capability_policy.py`, agent permission levels) ✅
- PM-driven governance and approval gates ✅

**What needs to be built:**
- Tenant-scoped knowledge banks in singularity-memory (each tenant's memory is
  isolated; cross-tenant sharing requires explicit federation grants)
- VM pool with per-tenant resource quotas (CPU, RAM, disk, LLM token budget) —
  see the sketch after this list
- Cost accounting: the LLM gateway already tracks per-worker; extend with `tenant_id`
- Headscale ACL rules per tenant (each tenant's VMs on their own tailnet ACL)
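
As a sketch of the quota item above: a hypothetical per-tenant quota record in TypeScript. Field names and units are assumptions, not a committed schema.

```typescript
// Hypothetical per-tenant quota record for the VM pool. Field names and
// units are assumptions; nothing here is a committed schema.
type TenantQuota = {
  tenant_id: string;
  max_concurrent_vms: number;
  cpu_cores_per_vm: number;
  ram_mb_per_vm: number;
  disk_gb_per_vm: number;
  llm_tokens_per_day: number; // enforced at the LLM gateway, keyed by tenant_id
};
```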

**The enforcement model:**
```
tenant_id (DB) + VM boundary (hypervisor) + Headscale ACL (network) = full isolation
```

No single layer is sufficient alone. All three enforce the same boundary from
different angles.
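
A sketch of what that layered check could look like at dispatch time, in TypeScript. The helper signatures and the `CredentialRef` shape are assumptions; nothing here is an existing SF or ACE API.

```typescript
// Illustrative guard run before a workspace boots: all three layers must
// agree on the tenant. Helper signatures and the CredentialRef shape are
// assumptions, not existing APIs.
type CredentialRef = { tenant_id: string; secret_ref: string }; // shape assumed

async function assertTenantBoundary(
  spec: { tenant_id: string; credentials: CredentialRef[] },
  tenantExistsInDb: (id: string) => Promise<boolean>,  // tenant_id (DB)
  tailnetAclExists: (id: string) => Promise<boolean>,  // Headscale ACL (network)
): Promise<void> {
  if (!(await tenantExistsInDb(spec.tenant_id))) {
    throw new Error(`unknown tenant ${spec.tenant_id}`);
  }
  // VM boundary (hypervisor): credentials injected at VM start must belong
  // to this tenant alone; they are never shared across tenants.
  if (spec.credentials.some((c) => c.tenant_id !== spec.tenant_id)) {
    throw new Error("credential crosses the tenant boundary");
  }
  if (!(await tailnetAclExists(spec.tenant_id))) {
    throw new Error(`no tailnet ACL for tenant ${spec.tenant_id}`);
  }
}
```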

---

## Multi-repo design

**The workspace repo list:**

```typescript
// SF orchestrator side
type WorkspaceSpec = {
  tenant_id: string;
  repos: Array<{ url: string; ref: string; mount: string }>;
  credentials: CredentialRef[];
  snapshot_id?: string; // resume from saved state
};
```
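
A usage example: a two-repo workspace is literally a two-item list. The URLs, refs, and mount points are illustrative.

```typescript
// Cross-repo work is just a longer repo list. Values are illustrative.
const spec: WorkspaceSpec = {
  tenant_id: "tenant-a",
  repos: [
    { url: "git@example.com:org/singularity-forge.git", ref: "main", mount: "/work/sf" },
    { url: "git@example.com:org/ace-coder.git", ref: "main", mount: "/work/ace" },
  ],
  credentials: [], // CredentialRef values are injected at VM start
  // snapshot_id omitted: fresh VM rather than resuming a persistent agent
};
```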

**Cross-repo artifact DAG (new primitive, not in either system yet):**

When a task produces artifacts spanning multiple repos, HTDAG needs to track which
commits in which repos constitute "done". This is the **cross-repo task graph** —
probably a new node type in HTDAG's DAG structure. Design deferred until the
workspace VM primitive is stable; the sketch below only illustrates the shape of
the problem.
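
Purely to illustrate that shape (the design itself stays deferred), a hypothetical completion record might look like this:

```typescript
// Purely illustrative; the actual design is deferred until the workspace
// VM primitive is stable. The only point: "done" for a cross-repo task is
// a set of (repo, commit) pairs recorded against one HTDAG node.
type CrossRepoResult = {
  task_id: string;
  commits: Array<{ repo_url: string; commit_sha: string }>;
};
```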

---

## MCP scope

Internal services use typed direct clients (gRPC for first-party). MCP is reserved
for external coding tools (Claude Code, Cursor) that don't share our build system.
See [ADR-020](./ADR-020-internal-wire-architecture.md) for the full wire-format table and rationale.

---

## Incremental convergence path

### Phase 1 — SF continues, ACE gets built (now)
- SF runs autonomous milestones on `ace-coder`. No changes to SF.
- ACE develops its HTDAG, PM, and worker primitives independently.
- Both systems mature on their own tracks.

### Phase 2 — Federated memory for ACE (near-term, ADR-012 Tier 1)
- ACE connects to singularity-memory via a typed Python client (generated from
  the Go API — not MCP). Internal services do not pay the MCP tax.
- **SF stays local.** SF is single-machine, single-user, local-first by design.
  `memory-store.ts` continues to work on `.sf/memory/`; no remote mode is wired
  into SF core. When SF runs inside an ACE-managed workspace, the workspace
  surfaces federated context through the ACE engine adapter as additional
  KNOWLEDGE injection — SF doesn't know that's where it came from (see the
  sketch after this phase). Federation is an ACE concern, not an SF concern.
- The MCP façade on singularity-memory is reserved for external coding tools
  (Claude Code, Cursor) that need to read/write memory while helping build the
  system. Temporary scaffold; not a production wire.
- **Outcome:** federated knowledge layer operational for ACE; SF unchanged and
  unaware of memory federation infrastructure.
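
A sketch of that hand-off. None of these names are SF's real API; the point is only the direction of the dependency: SF consumes KNOWLEDGE sections, and the ACE adapter, when present, appends federated ones.

```typescript
// Hypothetical shape of the context hand-off. SF assembles KNOWLEDGE
// sections from .sf/memory/; an ACE-managed workspace injects a fetcher
// for federated sections. SF core never learns where they came from.
type KnowledgeSection = { title: string; body: string };

async function assembleKnowledge(
  localSections: KnowledgeSection[],                  // SF-local, .sf/memory/
  fetchFederated?: () => Promise<KnowledgeSection[]>, // injected by ACE adapter
): Promise<KnowledgeSection[]> {
  if (!fetchFederated) return localSections;          // standalone SF: no-op
  return [...localSections, ...(await fetchFederated())];
}
```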

### Phase 3 — Workspace VM opt-in for SF (medium-term)
- Build the `sf-workspace` shim: a thin Rust binary that manages Firecracker VMs.
- SF's `runUnit()` dispatches to a workspace VM instead of a raw Claude Code
  session when the project preference `workspace.isolation: "vm"` is set, as
  sketched below.
- The git worktree path remains for projects that haven't opted in.
- **Outcome:** SF can run multi-repo and multi-tenant workloads experimentally.
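
A sketch of the opt-in branch. Only the preference key `workspace.isolation: "vm"` and `runUnit()` come from this ADR; the surrounding names are hypothetical.

```typescript
// Sketch of the Phase 3 branch inside SF's dispatch path. Only the
// preference key comes from this ADR; the surrounding names are hypothetical.
async function dispatchUnit(
  unit: { id: string },
  prefs: { workspace?: { isolation?: "worktree" | "vm" } },
  runInWorktree: (unitId: string) => Promise<void>,    // existing path, unchanged
  runInWorkspaceVm: (unitId: string) => Promise<void>, // new sf-workspace path
): Promise<void> {
  if (prefs.workspace?.isolation === "vm") {
    await runInWorkspaceVm(unit.id);                   // Firecracker microVM
  } else {
    await runInWorktree(unit.id);                      // git worktree default
  }
}
```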

### Phase 4 — ACE workers → workspace VMs (parallel to Phase 3)
- ACE's `execution/worker.py` (async task pool) gains a workspace VM dispatch path.
- ACE fills the explicit gap noted in its own competitive analysis:
  *"No per-task container sandboxing... ACE has process-level sandboxing only."*
- RBAC + tenant_id at the data layer + VM at the execution layer = full multi-tenant ACE.
- **Outcome:** ACE can handle multi-tenant, multi-repo workloads.

### Phase 5 — Shared workspace protocol
- SF and ACE converge on the same `WorkspaceSpec` wire format.
- SF's orchestrator can dispatch to ACE's worker pool (and vice versa).
- The `sf-workspace` shim and ACE's VM dispatch path are the same binary.
- **Outcome:** two orchestrators, one execution substrate.

### Phase 6 — Pattern transfer (long-term)
**SF remains a separate, standalone app — permanently.** It is not absorbed,
re-platformed, or re-implemented inside ACE. The convergence is at the wire and
execution-substrate layers (Phases 3–5), not at the application layer.

What Phase 6 actually means:
- ACE ports proven patterns from SF — idempotency primitives, state-derivation
  discipline, the structured notification model, the watchdog pattern, project
  preferences as a config layer, scaffold-as-contract. These become ACE's own
  primitives, written in Python, owned by ACE.
- SF stays single-machine, single-user, local-first — its character. SF gets
  *generally* better as a standalone tool: better project detection, cleaner
  engine adapter extension point, harder-tested crash recovery.
- SF and ACE remain independent runtimes. SF can be dispatched into an ACE
  workspace (Phase 5) for multi-tenant or multi-repo work, but it is also fully
  usable on its own with no ACE present.
- **Outcome:** two distinct apps that share wire schemas (singularity-grpc) and
  optionally an execution substrate (Firecracker). Neither replaces the other.

---

## What is NOT in scope for this ADR

- Cross-tenant knowledge federation (single trust domain per deployment for now)
- Public-internet exposure (tailnet-only, per ADR-013)
- Replacing SF's state machine before Phase 6 — it works; don't touch it
- Choosing the agent runtime inside the VM — language-agnostic by design
- Cross-repo artifact DAG implementation — deferred to after Phase 3

---

## Risks

| Risk | Mitigation |
|------|------------|
| Firecracker cold-start latency (~125ms) is too slow for short SF tasks | Keep git-worktree path as fallback; VMs for tasks >5min |
| VM snapshot size grows without bound for persistent agents | Snapshot rotation policy, same as activity log retention |
| ACE HTDAG not stable enough for Phase 5 | Phase 5 is gated on ACE reliability, not a timeline. SF works fine until then. |
| singularity-memory Go migration stalls | Phase 2 can use the Python server; migration is not on the critical path |
| Cross-repo DAG design takes longer than expected | Phases 1–5 work without it; single-repo workspace is the common case |

---

## References

- SF `docs/dev/ADR-012-multi-instance-federation.md` — federation surfaces
- SF `docs/dev/ADR-013-network-and-remote-execution.md` — tailnet + SSH workers
- SF `docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md` — Go migration (phases 0–3 only)
- ACE `docs/architecture/sf-ace-convergence.md` — this ADR from ACE's perspective
- ACE `ARCHITECTURE.md` §Sandbox — "No per-task container sandboxing" gap
- ACE `docs/architecture/data-and-storage.md` — tenant_id schema
- [Firecracker](https://firecracker-microvm.github.io/) — microVM hypervisor