# ADR-013: Network and remote-execution layer
**Date**: 2026-04-29
**Status**: proposed (deferred — capture for staged execution)
## Context
sf today runs as a single daemon per host. Three forces push it toward a multi-host topology:
- **SSH workers**: the orchestrator dispatches unit attempts to remote hosts (GPU, Windows, parallel scaling) — needs an SSH-served worker process.
- **Singularity Memory remote-mode** (ADR-012, ADR-014): the cross-instance knowledge layer runs as a service on the tailnet, reachable from SF and other clients.
- **Multi-instance federation** (ADR-012): future federated agents and benchmarks ride the same network substrate.
This ADR pins down the network and SSH-execution layer that all of the above depend on.
## Decision
- **Network substrate: tailnet** — Tailscale wire protocol with **Headscale** as the self-hosted control plane (the user already runs Headscale at `mikki-bunker`). sf core is wire-agnostic; it assumes addressable, authenticated peers.
- **SSH worker host stack: Go + `charmbracelet/wish` + `charmbracelet/x/xpty`** (Linux/macOS) and **`charmbracelet/x/conpty`** (Windows). One thin Go shim per worker host; the orchestrator (TS) talks SSH stdio to it (a minimal entrypoint sketch follows this list).
- **Worker observability: `charmbracelet/promwish`** — Prometheus middleware mounted on Wish gives `/metrics` for free.
- **Worker identity: `charmbracelet/x/sshkey` + `charmbracelet/melt`** — auto-provisioning + Ed25519-with-seed-words backup.
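A minimal entrypoint sketch under these choices, in Go. The listen address, host-key path, metrics address, and the `sf-worker` label are placeholders, public-key auth for the orchestrator is omitted, and the per-connection handler is stubbed (the agent-spawning version is sketched under Implementation Sketch below); the Wish version would be pinned per the risk note further down.
```go
// cmd/sf-worker/main.go (sketch) — Wish server with logging, elapsed, and
// promwish middleware. Addresses and paths are illustrative placeholders.
package main

import (
	"log"

	"github.com/charmbracelet/promwish"
	"github.com/charmbracelet/ssh"
	"github.com/charmbracelet/wish"
	"github.com/charmbracelet/wish/elapsed"
	"github.com/charmbracelet/wish/logging"
)

func main() {
	srv, err := wish.NewServer(
		wish.WithAddress("0.0.0.0:2222"),               // tailnet-only by deployment recommendation
		wish.WithHostKeyPath(".ssh/sf_worker_ed25519"), // host key provisioned via sshkey / melt
		// Public-key auth restricted to the orchestrator's key would be added here.
		wish.WithMiddleware(
			func(next ssh.Handler) ssh.Handler { // placeholder handler; see Implementation Sketch
				return func(sess ssh.Session) {
					wish.Println(sess, "sf-worker: ready")
					next(sess)
				}
			},
			elapsed.Middleware(),                             // per-session elapsed-time log line
			logging.Middleware(),                             // connection logging
			promwish.Middleware("0.0.0.0:9222", "sf-worker"), // serves /metrics for Prometheus
		),
	)
	if err != nil {
		log.Fatalf("sf-worker: %v", err)
	}
	log.Fatalln(srv.ListenAndServe())
}
```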
## Alternatives Considered
### Network substrate
- **Public internet + sshd + manual key management** — works, but key sprawl is a real problem (pairwise trust means roughly N×M keys to distribute and rotate), and dynamic IPs break stable hostnames. Tailnet's MagicDNS + ACLs replace both. Rejected.
- **Plain WireGuard mesh** — no control plane; manual peer config. Higher ops overhead than Headscale. Rejected.
- **Tailscale-the-service** — fine, but Headscale is already running and self-hosted means full ownership. Rejected.
- **ZeroTier / Netbird** — viable alternatives. Rejected because the user already has Headscale and switching would gain nothing.
### SSH worker stack
- **Node-based SSH server (`ssh2` lib)** — keeps everything TS but reinvents what Wish gives for free; no battle-tested middleware patterns. Rejected.
- **OpenSSH `sshd` with `ForceCommand`** — works for simple cases, terrible for multiplexed agent dispatch with per-connection state. Rejected.
- **Plain Go `crypto/ssh`** — lower-level than Wish, no middleware, no built-in metrics. Rejected — Wish wraps the right primitives.
## Consequences
**Positive**
- sf's network model is **explicit**: tailnet first, ACLs in Headscale's admin, no per-service auth invention.
- SSH worker host inherits Wish's mature middleware (`wish/logging`, `wish/elapsed`, etc.) and `promwish` observability.
- Cross-platform pty support (`xpty` Linux/macOS, `conpty` Windows) lets workers spawn real ttys for the agent — load-bearing for Windows-only test runs on `mikki-bunker-windows`.
- Stable hostnames via Headscale's MagicDNS — `mikki-bunker.tailnet.ts.hugo.dk` resolves regardless of network change.
- Identity story is clean: each worker host has its own Ed25519 keypair (`sshkey`), backed up via `melt` seed words.
**Negative**
- Tailnet dependency: when Headscale is down, *new* connections can't auth (existing connections survive). Mitigation: Headscale on a stable host with monitoring.
- Polyglot deployment: TS orchestrator + Go worker. One clean SSH-stdio boundary, but two languages to keep in CI. Acceptable per ADR-016 (parallel build).
- ACL drift: if Headscale ACLs forbid a worker host, sf degrades silently. A doctor check should detect this and surface it explicitly (see Sequencing, Tier 2).
**Risks and mitigations**
- *Risk:* SSH disconnect mid-turn produces zombie agent processes.
- *Mitigation:* worker cleanup script on disconnect; `--sf-run-id=<id>` marker on the agent process for `pgrep` / `kill` (cleanup sketched after this list).
- *Risk:* `wish` API churn pre-1.0.
- *Mitigation:* pin a version; planned upgrade window once per quarter.
- *Risk:* `xpty` / `conpty` edge cases on niche shells.
- *Mitigation:* worker has a flag to fall back to non-pty stdio; logged loudly.
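A sketch of that disconnect cleanup, assuming the agent is launched with a literal `--sf-run-id=<id>` argument and that `pkill` is available on the worker host (both are assumptions, not decided interfaces).
```go
// Cleanup helper (sketch): reap any agent processes tagged with a run ID.
package main

import (
	"fmt"
	"os/exec"
)

// killRun matches the --sf-run-id=<id> marker against full command lines.
func killRun(runID string) error {
	marker := "--sf-run-id=" + runID
	out, err := exec.Command("pkill", "-f", marker).CombinedOutput()
	if err != nil {
		// pkill exits 1 when nothing matched, which is the already-clean case.
		if ee, ok := err.(*exec.ExitError); ok && ee.ExitCode() == 1 {
			return nil
		}
		return fmt.Errorf("pkill %q: %v (%s)", marker, err, out)
	}
	return nil
}
```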
## Out of Scope
- **Multi-tenant network isolation** (one tailnet, multiple users with separate ACL domains) — defer until concrete need.
- **Public-internet exposure** — sf is tailnet-only by deployment recommendation. If a use case needs a public endpoint, it goes through `tailscale funnel` or a dedicated reverse proxy outside sf.
- **Cross-tailnet federation** — out of scope; one tailnet per deployment.
## Sequencing
| When | Action |
|---|---|
| Now | Capture this ADR as the deployment assumption. |
| Tier 1 (next 13 months) | Build sf-worker (Go + Wish + xpty/conpty + promwish) as a separate package or repo. Orchestrator-side dispatch path in TS already plans for `worker_host` per SPEC §22 — just point it at the SSH endpoint. |
| Tier 2 | Doctor check: validate that the tailnet ACL allows the orchestrator → all configured worker hosts. Surface failures in `sf doctor` (reachability sketch after this table). |
| Tier 3 | Worker auto-provisioning script: `sf worker bootstrap <host>` generates a key, registers with Headscale, drops the worker binary. |
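The doctor check itself lives in the TS orchestrator; the sketch below uses Go only for consistency with the other snippets here and illustrates what Tier 2 verifies: that the tailnet ACL actually lets the orchestrator reach each configured worker host's SSH port. Hostnames, the port, and the 3-second timeout are illustrative.
```go
// Doctor-style reachability probe (sketch): a dial failure is the typical
// symptom of ACL drift or an offline worker host.
package main

import (
	"net"
	"time"
)

func checkWorkerReachability(hosts []string) map[string]error {
	results := make(map[string]error, len(hosts))
	for _, h := range hosts {
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(h, "22"), 3*time.Second)
		if err != nil {
			results[h] = err // surface in `sf doctor` instead of degrading silently
			continue
		}
		conn.Close()
		results[h] = nil
	}
	return results
}
```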
## Implementation Sketch
```
[sf orchestrator (TS)] on the daemon host
│ ssh user@worker.tailnet.ts.hugo.dk -- carries sf-rpc envelope
[sf-worker (Go)] on each worker tailnet node
├── wish.Server with logging + elapsed + promwish middleware
├── per-connection handler spawns the agent via xpty/conpty
├── /metrics via promwish — scraped by your Prometheus
└── /healthz, /readyz simple HTTP for orchestrator health checks
```
The worker is **stateless** — claim, lease, retry, and persistence are all the orchestrator's job. Older SPEC notes captured this only as distributed-execution design evidence; the current implementation must persist accepted requirements through `.sf`/DB-backed state. The worker just executes one attempt at a time and streams output.
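A sketch of the per-connection handler in that shape, showing only the non-pty fallback path (the xpty/conpty path is left out rather than guessing at that API here). The agent binary path is a placeholder, the convention that the run ID arrives as the first element of the SSH command is an assumption, and `killRun` is the cleanup helper sketched under Risks.
```go
// Per-connection handler (sketch): one attempt per session, streamed over SSH stdio.
package main

import (
	"os/exec"

	"github.com/charmbracelet/ssh"
	"github.com/charmbracelet/wish"
)

func agentHandler(next ssh.Handler) ssh.Handler {
	return func(sess ssh.Session) {
		args := sess.Command() // sf-rpc envelope from the orchestrator
		if len(args) < 1 {
			wish.Fatalln(sess, "sf-worker: missing run id")
			return
		}
		runID := args[0] // assumption: run ID is the first argument

		// Non-pty fallback: wire the agent directly to the SSH stdio. The session
		// context is canceled on disconnect, so the direct child is killed; killRun
		// (sketched under Risks) sweeps any stragglers tagged with the marker.
		cmd := exec.CommandContext(sess.Context(), "/usr/local/bin/sf-agent",
			"--sf-run-id="+runID)
		cmd.Stdin = sess
		cmd.Stdout = sess
		cmd.Stderr = sess.Stderr()
		defer killRun(runID)

		if err := cmd.Run(); err != nil {
			wish.Fatalln(sess, "sf-worker: attempt failed:", err)
			return
		}
		next(sess)
	}
}
```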
## References
- Older distributed-execution SPEC notes — external design evidence only; project accepted facts into `.sf`/DB-backed state before treating them as operational.
- `ADR-012` — Multi-instance federation (this ADR provides the substrate).
- `ADR-014` — Singularity Knowledge + Agent Platform (deploys onto this substrate).
- `ADR-016` — Charm AI stack adoption strategy (frames why Go for new services).
- `charmbracelet/wish` — SSH server framework.
- `charmbracelet/x/xpty`, `charmbracelet/x/conpty` — pty primitives.
- `charmbracelet/promwish` — Prometheus middleware for Wish.
- Headscale — open-source Tailscale control plane.