# ADR-019: Workspace VM Convergence Architecture
**Status:** Proposed
**Date:** 2026-05-01
**Revised:** 2026-05-02 — wire-format scope superseded by ADR-020
**Deciders:** Mikael Hugo
**Context repos:** `singularity-forge` (SF), `ace-coder` (ACE)
> **Cross-repo note:** The matching document in ACE-coder lives at
> `docs/architecture/sf-ace-convergence.md`. Both documents describe the same
> architecture from each codebase's perspective. Keep them in sync when either
> changes.
---
## Context
Two autonomous agent systems are being developed in parallel:
- **SF** (`singularity-forge`) — TypeScript orchestrator. Works today. Dispatches
Claude Code sessions as ephemeral units (milestone → slice → task). Isolation
via git worktrees. **Single-machine, single-user, single-repo by design.** That
scope is its character, not a limitation. SF stays a standalone app permanently;
it does not grow into a platform.
- **ACE** (`ace-coder`) — Python platform. Partially operational. HTDAG execution
backbone, Project Manager ownership, 20 defined agent personas, LiteLLM
multi-provider, RBAC, PGMQ task queue, tiered memory. Multi-tenant data model
(`tenant_id`) exists; per-task execution isolation does not. ACE is where
multi-tenant, multi-repo, federated workloads live.
- **singularity-memory** — Separate Go service (migrating from Python per ADR-014).
Postgres + vchord vector store. Federated knowledge layer.
- **Internal consumers** (SF, ACE, future first-party services) talk to it via
typed direct clients (HTTP/gRPC generated from the Go API). No MCP, no JSON-RPC
framing, no protocol cost.
- **External coding tools** (Claude Code, Cursor, third-party LLM clients) get
an MCP façade. This is a temporary scaffold so external coders can read/write
memory while they help build the system; it is not the production wire for
internal services and is expected to shrink once the system is self-hosting.
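For contrast between the two access paths, here is a minimal TypeScript sketch of the typed-client shape internal services would use. The interface and the in-memory stub are illustrative only; real clients would be generated from the Go API.

```typescript
// Illustrative typed client surface; the real client would be generated from the Go API.
interface MemoryClient {
  store(tenantId: string, key: string, value: string): void;
  recall(tenantId: string, key: string): string | undefined;
}

// In-memory stand-in showing the call shape: plain typed method calls,
// no JSON-RPC framing, no protocol negotiation.
function makeStubClient(): MemoryClient {
  const data = new Map<string, string>();
  return {
    store: (tenantId, key, value) => {
      data.set(`${tenantId}/${key}`, value);
    },
    recall: (tenantId, key) => data.get(`${tenantId}/${key}`),
  };
}
```

The MCP façade wraps the same operations in protocol framing; external tools pay that cost, internal callers do not.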
The two systems are **not converging into one app.** They occupy different niches:
- SF is the local single-user developer tool — fast, generic, runs on the developer's
machine on whatever repo they're working on.
- ACE is the multi-tenant platform — federated, multi-repo, scales beyond one user.
Convergence in this ADR refers to **shared substrate**, not application merging:
shared wire schemas (singularity-grpc), shared execution isolation primitive
(Firecracker workspaces) when SF chooses to dispatch into one. SF can live entirely
on its own without ACE; ACE doesn't depend on SF.
The strategy is **incremental pattern transfer**: SF continues to work as a
standalone single-user tool while autonomously helping build out ACE. ACE ports
proven patterns from SF as it matures. SF gains an optional engine adapter for
dispatching units into ACE workspaces when multi-tenant or multi-repo work is
needed. Neither replaces the other.
---
## Decision
### The unifying primitive: Workspace
```
workspace = VM (microVM) + tenant_id + [repo_1, repo_2, ...] + scoped_credentials
```
A **workspace** is the execution isolation unit for both systems. It replaces:
- SF's git worktree per milestone
- ACE's process-level `execution/worker.py` per task
A workspace is:
- **A microVM** (Firecracker) — hard process/filesystem/network isolation at the
hypervisor level. Firecracker was built by AWS specifically for multi-tenant
Lambda; it provides the isolation both systems need without reimplementing it.
- **Tenant-scoped** — maps to ACE's existing `tenant_id` on `agent_memory` and
`task_queue`. The VM boundary is the enforcement point; the database `tenant_id`
is the tracking point.
- **Multi-repo** — the orchestrator tells the VM which repos to clone/mount. The
VM operates across all of them. Cross-repo work is trivially a list.
- **Credential-scoped** — per-workspace credentials (git tokens, API keys) are
injected at VM start and destroyed at VM exit. Never shared across tenants.
- **Snapshot/restore** — VM state snapshots replace `.sf/paused-session.json` and
ACE's `checkpoint_service`. A "persistent agent" is a named snapshot: restore it
and the agent wakes with full memory and context intact.
### The three-layer architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Orchestration layer │
│ │
│ Near-term: SF (TS state machine) dispatches to workspace VMs │
│ Long-term: ACE HTDAG/PM becomes the orchestration backbone; │
│ SF state machine becomes an ACE PM persona │
│ │
│ Language: TS (SF today) → Python (ACE, when reliable) │
├─────────────────────────────────────────────────────────────────────┤
│ Knowledge layer │
│ │
│ singularity-memory: Go + Postgres + vchord │
│ Internal services (SF, ACE) use typed direct clients (HTTP/gRPC). │
│ External coding tools (Claude Code, Cursor) use an MCP façade — │
│ temporary scaffold while external coders help build the system. │
│ Tenant-scoped knowledge banks (to be designed — see below). │
│ │
│   Language: Go (ADR-014 migration, phases 0–3 only — NOT phase 4)  │
├─────────────────────────────────────────────────────────────────────┤
│ Execution layer │
│ │
│ workspace = VM + tenant + repos + credentials │
│ One workspace per dispatch unit. │
│ VM management shim: Rust (Firecracker API is Rust-native). │
│ Agent runtime inside VM: whatever the task requires. │
│ │
│ Language: Rust (VM shim) + anything (inside VM) │
└─────────────────────────────────────────────────────────────────────┘
```
### ADR-014 Phase 4 is reassigned
ADR-014 proposed building a "central persistent-agent runtime" in Go using
`charmbracelet/fantasy`. This is **cancelled**. Persistent agents live as VM
snapshots managed by ACE's orchestration layer — not as a separate Go runtime.
singularity-memory (Go) scopes to the knowledge layer only (ADR-014 phases 0–3).
---
## Multi-tenant design
**What exists (ACE):**
- `tenant_id` on `agent_memory`, `task_queue` ✅
- RBAC (`rbac_capability_policy.py`, agent permission levels) ✅
- PM-driven governance and approval gates ✅
**What needs to be built:**
- Tenant-scoped knowledge banks in singularity-memory (each tenant's memory is
isolated; cross-tenant sharing requires explicit federation grants)
- VM pool with per-tenant resource quotas (CPU, RAM, disk, LLM token budget)
- Cost accounting: LLM gateway already tracks per-worker; extend with `tenant_id`
- Headscale ACL rules per tenant (each tenant's VMs on their own tailnet ACL)
**The enforcement model:**
```
tenant_id (DB) + VM boundary (hypervisor) + Headscale ACL (network) = full isolation
```
No single layer is sufficient alone. All three enforce the same boundary from
different angles.
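The three checks can be read as a single predicate. A TypeScript sketch (the field names and the ACL tag convention are assumptions for illustration, not existing code):

```typescript
// Illustrative: the same tenant boundary verified at all three layers.
type DispatchRequest = {
  tenantId: string;        // tenant_id recorded at the data layer
  workspaceTenant: string; // tenant the VM was provisioned for (hypervisor boundary)
  aclTag: string;          // Headscale ACL tag on the VM's tailnet node
};

function isolationHolds(req: DispatchRequest): boolean {
  const dbOk = req.tenantId.length > 0;                      // every row carries tenant_id
  const vmOk = req.workspaceTenant === req.tenantId;         // VM scoped to the same tenant
  const netOk = req.aclTag === `tag:tenant-${req.tenantId}`; // network ACL matches too
  return dbOk && vmOk && netOk; // no single layer is sufficient alone
}
```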
---
## Multi-repo design
**The workspace repo list:**
```typescript
// SF orchestrator side
type WorkspaceSpec = {
  tenant_id: string;
  repos: Array<{ url: string; ref: string; mount: string }>;
  credentials: CredentialRef[];
  snapshot_id?: string; // resume from saved state
};
```
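A cross-repo dispatch is then just a two-entry `repos` list. A usage sketch; `CredentialRef` is redeclared here with an assumed shape so the example is self-contained, and the URLs are placeholders:

```typescript
// Self-contained restatement; CredentialRef's shape is an assumption.
type CredentialRef = { name: string };
type WorkspaceSpec = {
  tenant_id: string;
  repos: Array<{ url: string; ref: string; mount: string }>;
  credentials: CredentialRef[];
  snapshot_id?: string;
};

// A two-repo workspace: cross-repo work is trivially a list.
const spec: WorkspaceSpec = {
  tenant_id: "tenant-a",
  repos: [
    { url: "https://git.example/sf.git",  ref: "main", mount: "/work/sf" },
    { url: "https://git.example/ace.git", ref: "main", mount: "/work/ace" },
  ],
  credentials: [{ name: "git-token" }],
};
```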
**Cross-repo artifact DAG (new primitive, not in either system yet):**
When a task produces artifacts spanning multiple repos, HTDAG needs to track which
commits in which repos constitute "done". This is the **cross-repo task graph** —
probably a new node type in HTDAG's DAG structure. Design deferred until the
workspace VM primitive is stable.
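Since the design is deferred, the following is only one possible shape for such a node, sketched in TypeScript to make the "which commits constitute done" idea concrete. Every name here is hypothetical:

```typescript
// Hypothetical shape for a cross-repo completion record (design deferred).
type CrossRepoCompletion = {
  taskId: string;
  commits: Array<{ repoUrl: string; sha: string }>; // the commits that constitute "done"
};

// A task is done only when every expected repo has a recorded commit.
function isDone(node: CrossRepoCompletion, expectedRepos: string[]): boolean {
  const seen = new Set(node.commits.map((c) => c.repoUrl));
  return expectedRepos.every((r) => seen.has(r));
}
```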
---
## MCP scope
> **Superseded by ADR-020:** This section's proposal to use MCP for internal service wires is replaced. ADR-020 mandates **gRPC** for first-party services (SF, ACE, memory). MCP is reserved for **external coding tools** (Claude Code, Cursor) only. The original analysis below is preserved as historical context.
Internal services use typed direct clients (gRPC for first-party). MCP is reserved
for external coding tools (Claude Code, Cursor) that don't share our build system.
See [ADR-020](./ADR-020-internal-wire-architecture.md) for the full wire-format table and rationale.
---
## Incremental convergence path
### Phase 1 — SF continues, ACE gets built (now)
- SF runs autonomous milestones on `ace-coder`. No changes to SF.
- ACE develops its HTDAG, PM, and worker primitives independently.
- Both systems mature on their own tracks.
### Phase 2 — Federated memory for ACE (near-term, ADR-012 Tier 1)
- ACE connects to singularity-memory via a typed Python client (generated from
the Go API — not MCP). Internal services do not pay the MCP tax; ADR-020
confirms the wire format (gRPC for first-party services).
- **SF stays local.** SF is single-machine, single-user, local-first by design.
`memory-store.ts` continues to work on `.sf/memory/`; no remote mode wired in
SF core. When SF runs inside an ACE-managed workspace, the workspace surfaces
federated context through the ACE engine adapter as additional KNOWLEDGE
injection — SF doesn't know that's where it came from. Federation is an ACE
concern, not a SF concern.
- The MCP façade on singularity-memory is reserved for external coding tools
(Claude Code, Cursor) that need to read/write memory while helping build the
system. Temporary scaffold; not a production wire.
- **Outcome:** federated knowledge layer operational for ACE; SF unchanged and
unaware of memory federation infrastructure.
### Phase 3 — Workspace VM opt-in for SF (medium-term)
- Build `sf-workspace` shim: thin Rust binary that manages Firecracker VMs.
- SF's `runUnit()` dispatches to workspace VM instead of raw Claude Code session
when project preference `workspace.isolation: "vm"` is set.
- Git worktree path remains for projects that haven't opted in.
- **Outcome:** SF can run multi-repo and multi-tenant workloads experimentally.
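The opt-in branch can be sketched in a few lines. The preference key matches the text above; the helper name and `ProjectPrefs` type are hypothetical:

```typescript
// Hypothetical sketch of the Phase 3 dispatch decision.
type ProjectPrefs = { workspace?: { isolation?: "vm" | "worktree" } };

function pickDispatchPath(prefs: ProjectPrefs): "vm" | "worktree" {
  // VM isolation only when the project has explicitly opted in;
  // the git-worktree path remains the default.
  return prefs.workspace?.isolation === "vm" ? "vm" : "worktree";
}
```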
### Phase 4 — ACE workers → workspace VMs (parallel to Phase 3)
- ACE's `execution/worker.py` (async task pool) gains workspace VM dispatch path.
- ACE fills the explicit gap noted in its own competitive analysis:
*"No per-task container sandboxing... ACE has process-level sandboxing only."*
- RBAC + tenant_id at data layer + VM at execution layer = full multi-tenant ACE.
- **Outcome:** ACE can handle multi-tenant, multi-repo workloads.
### Phase 5 — Shared workspace protocol
- SF and ACE converge on the same `WorkspaceSpec` wire format.
- SF's orchestrator can dispatch to ACE's worker pool (and vice versa).
- The `sf-workspace` shim and ACE's VM dispatch path are the same binary.
- **Outcome:** two orchestrators, one execution substrate.
### Phase 6 — Pattern transfer (long-term)
**SF remains a separate, standalone app — permanently.** It is not absorbed,
re-platformed, or re-implemented inside ACE. The convergence is at the wire and
execution-substrate layers (Phases 3–5), not at the application layer.
What Phase 6 actually means:
- ACE ports proven patterns from SF — idempotency primitives, state-derivation
discipline, the structured notification model, the watchdog pattern, project
preferences as a config layer, scaffold-as-contract. These become ACE's own
primitives, written in Python, owned by ACE.
- SF stays single-machine, single-user, local-first — its character. SF gets
*generally* better as a standalone tool: better project detection, cleaner
engine adapter extension point, harder-tested crash recovery.
- SF and ACE remain independent runtimes. SF can be dispatched into an ACE
workspace (Phase 5) for multi-tenant or multi-repo work, but it is also fully
usable on its own with no ACE present.
- **Outcome:** two distinct apps that share wire schemas (singularity-grpc) and
optionally an execution substrate (Firecracker). Neither replaces the other.
---
## What is NOT in scope for this ADR
- Cross-tenant knowledge federation (single trust domain per deployment for now)
- Public-internet exposure (tailnet-only, per ADR-013)
- Replacing SF's state machine before Phase 6 — it works, don't touch it
- Choosing the agent runtime inside the VM — language-agnostic by design
- Cross-repo artifact DAG implementation — deferred to after Phase 3
---
## Risks
| Risk | Mitigation |
|------|-----------|
| Firecracker cold-start latency (~125ms) is too slow for short SF tasks | Keep git-worktree path as fallback; VMs for tasks >5min |
| VM snapshot size grows unboundedly for persistent agents | Snapshot rotation policy, same as activity log retention |
| ACE HTDAG not stable enough for Phase 5 | Phase 5 is gated on ACE reliability, not a timeline. SF works fine until then. |
| singularity-memory Go migration stalls | Phase 2 can use the Python server; migration is not on the critical path |
| Cross-repo DAG design takes longer than expected | Phases 1–5 work without it; single-repo workspace is the common case |
---
## References
- SF `docs/dev/ADR-012-multi-instance-federation.md` — federation surfaces
- SF `docs/dev/ADR-013-network-and-remote-execution.md` — tailnet + SSH workers
- SF `docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md` — Go migration (phases 0–3 only)
- ACE `docs/architecture/sf-ace-convergence.md` — this ADR from ACE's perspective
- ACE `ARCHITECTURE.md` §Sandbox — "No per-task container sandboxing" gap
- ACE `docs/architecture/data-and-storage.md` — tenant_id schema
- [Firecracker](https://firecracker-microvm.github.io/) — microVM hypervisor