# ADR-013: Network and remote-execution layer
**Date**: 2026-04-29
**Status**: proposed (deferred — capture for staged execution)
## Context
sf today runs as a single daemon per host. Three forces push it toward a multi-host topology:
- **SSH workers** (`SPEC.md` §22, NEW): the orchestrator dispatches unit attempts to remote hosts (GPU, Windows, parallel scaling) — needs an SSH-served worker process.
- **Singularity Memory remote-mode** (ADR-012, ADR-014, sf SPEC §16): the cross-instance knowledge layer runs as a service on the tailnet, reachable from sf, Hermes, OpenClaw, Claude Code, Cursor.
- **Multi-instance federation** (ADR-012): future federated agents and benchmarks ride the same network substrate.
This ADR pins down the network and SSH-execution layer that all of the above depend on.
## Decision
- **Network substrate: tailnet** — Tailscale wire protocol with **Headscale** as the self-hosted control plane (the user already runs Headscale at `mikki-bunker`). sf core is wire-agnostic; it assumes addressable, authenticated peers.
- **SSH worker host stack: Go + `charmbracelet/wish` + `charmbracelet/x/xpty`** (Linux/macOS) and **`charmbracelet/x/conpty`** (Windows). One thin Go shim per worker host; orchestrator (TS) talks SSH stdio to it.
- **Worker observability: `charmbracelet/promwish`** — Prometheus middleware mounted on Wish gives `/metrics` for free.
- **Worker identity: `charmbracelet/x/sshkey` + `charmbracelet/melt`** — auto-provisioning + Ed25519-with-seed-words backup.
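A minimal sketch of how these pieces wire together, assuming illustrative values for the address, host-key path, and metrics port (none of which are decided here):
```go
// sf-worker skeleton: Wish server with the decided middleware stack.
// Sketch only; address, host-key path, and metrics port are assumptions.
package main

import (
	"log"

	"github.com/charmbracelet/promwish"
	"github.com/charmbracelet/wish"
	"github.com/charmbracelet/wish/elapsed"
	"github.com/charmbracelet/wish/logging"
)

func main() {
	srv, err := wish.NewServer(
		wish.WithAddress(":2222"), // bind to the tailnet interface in practice
		wish.WithHostKeyPath("/var/lib/sf-worker/id_ed25519"),
		wish.WithMiddleware(
			promwish.Middleware("localhost:9222", "sf_worker"), // serves /metrics
			elapsed.Middleware(),
			logging.Middleware(),
			// The dispatch middleware that actually spawns the agent is
			// sketched under "Implementation Sketch" below.
		),
	)
	if err != nil {
		log.Fatalf("sf-worker: %v", err)
	}
	log.Fatalln(srv.ListenAndServe())
}
```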
## Alternatives Considered
### Network substrate
- **Public internet + sshd + manual key management** — works, but key sprawl is a real problem (the authorized-key matrix grows as N hosts × M clients, and every new host extends it), and dynamic IPs break stable hostnames. Tailnet's MagicDNS + ACLs replace both. Rejected.
- **Plain WireGuard mesh** — no control plane; manual peer config. Higher ops overhead than Headscale. Rejected.
- **Tailscale-the-service** — fine, but Headscale is already running and self-hosted means full ownership. Rejected.
- **ZeroTier / Netbird** — viable alternatives. Rejected because the user already runs Headscale; switching would cost effort with nothing to gain.
### SSH worker stack
- **Node-based SSH server (`ssh2` lib)** — keeps everything TS but reinvents what Wish gives for free; no battle-tested middleware patterns. Rejected.
- **OpenSSH `sshd` with `ForceCommand`** — works for simple cases, terrible for multiplexed agent dispatch with per-connection state. Rejected.
- **Plain Go `crypto/ssh`** — lower-level than Wish, no middleware, no built-in metrics. Rejected — Wish wraps the right primitives.
## Consequences
**Positive**
- sf's network model is **explicit**: tailnet first, ACLs in Headscale's admin, no per-service auth invention.
- SSH worker host inherits Wish's mature middleware (`wish/logging`, `wish/elapsed`, etc.) and `promwish` observability.
- Cross-platform pty support (`xpty` Linux/macOS, `conpty` Windows) lets workers spawn real ttys for the agent — load-bearing for Windows-only test runs on `mikki-bunker-windows`.
- Stable hostnames via Headscale's MagicDNS — `mikki-bunker.tailnet.ts.hugo.dk` resolves regardless of network changes.
- Identity story is clean: each worker host has its own Ed25519 keypair (`sshkey`), backed up via `melt` seed words.
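As a concrete anchor for the identity bullet, a stdlib-only sketch of what `sshkey`-style provisioning produces (the real stack would call `sshkey` itself; `ssh.MarshalPrivateKey` is from `golang.org/x/crypto/ssh`, and the melt backup step appears only as a comment):
```go
// Sketch of worker key provisioning: generate Ed25519, write OpenSSH PEM.
// Paths and the comment string are assumptions.
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/pem"
	"log"
	"os"

	"golang.org/x/crypto/ssh"
)

func main() {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		log.Fatal(err)
	}
	block, err := ssh.MarshalPrivateKey(priv, "sf-worker host key")
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("id_ed25519", pem.EncodeToMemory(block), 0o600); err != nil {
		log.Fatal(err)
	}
	// Backup: `melt ./id_ed25519` prints the seed words; melt's restore
	// subcommand recreates the key from them.
}
```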
**Negative**
- Tailnet dependency: when Headscale is down, *new* connections can't auth (existing connections survive). Mitigation: Headscale on a stable host with monitoring.
- Polyglot deployment: TS orchestrator + Go worker. One clean SSH-stdio boundary, but two languages to keep in CI. Acceptable per ADR-016 (parallel build).
- ACL drift: if Headscale ACLs block a worker host, sf degrades silently. The doctor check should detect this and surface it explicitly (see Sequencing, Tier 2).
**Risks and mitigations**
- *Risk:* SSH disconnect mid-turn produces zombie agent processes (SPEC §22.3).
- *Mitigation:* spec-mandated remote-cleanup script on disconnect; `--sf-run-id=<id>` marker on the agent process for `pgrep` / `kill`.
- *Risk:* `wish` API churn pre-1.0.
- *Mitigation:* pin a version; planned upgrade window once per quarter.
- *Risk:* `xpty` / `conpty` edge cases on niche shells.
- *Mitigation:* worker has a flag to fall back to non-pty stdio; logged loudly.
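A sketch of the cleanup mitigation for the first risk, assuming pkill(1) is present on the worker host (the Windows worker would need taskkill instead):
```go
// killOrphans implements the disconnect cleanup mandated by SPEC §22.3:
// terminate any agent process still carrying the run's --sf-run-id marker.
package worker

import (
	"fmt"
	"os/exec"
)

func killOrphans(runID string) error {
	// Match on the full command line (-f). The pattern skips the leading
	// dashes of the flag so pkill does not parse them as its own options.
	pattern := "sf-run-id=" + runID
	out, err := exec.Command("pkill", "-f", pattern).CombinedOutput()
	if ee, ok := err.(*exec.ExitError); ok && ee.ExitCode() == 1 {
		return nil // pkill exits 1 when nothing matched: no zombies left
	}
	if err != nil {
		return fmt.Errorf("pkill: %v: %s", err, out)
	}
	return nil
}
```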
## Out of Scope
- **Multi-tenant network isolation** (one tailnet, multiple users with separate ACL domains) — defer until concrete need.
- **Public-internet exposure** — sf is tailnet-only by deployment recommendation. If a use case needs a public endpoint, it goes through `tailscale funnel` or a dedicated reverse proxy outside sf.
- **Cross-tailnet federation** — out of scope; one tailnet per deployment.
## Sequencing
| When | Action |
|---|---|
| Now | Capture this ADR as the deployment assumption. |
| Tier 1 (next 1–3 months) | Build sf-worker (Go + Wish + xpty/conpty + promwish) as a separate package or repo. Orchestrator-side dispatch path in TS already plans for `worker_host` per SPEC §22 — just point it at the SSH endpoint. |
| Tier 2 | Doctor check: validate tailnet ACL allows the orchestrator → all configured worker hosts. Surface failures in `sf doctor`. |
| Tier 3 | Worker auto-provisioning script: `sf worker bootstrap <host>` generates a key, registers with Headscale, drops the worker binary. |
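A sketch of the Tier 2 doctor check: plain TCP dials over the tailnet are enough to surface most ACL drift, since a blocking ACL typically shows up as a dial timeout. The hostnames and worker port below are illustrative:
```go
// sf doctor sketch: verify the orchestrator can reach each configured
// worker host over the tailnet.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	workers := []string{ // hypothetical worker list
		"mikki-bunker.tailnet.ts.hugo.dk:2222",
		"mikki-bunker-windows.tailnet.ts.hugo.dk:2222",
	}
	for _, addr := range workers {
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err != nil {
			fmt.Printf("FAIL %s: %v (check Headscale ACLs)\n", addr, err)
			continue
		}
		conn.Close()
		fmt.Printf("ok   %s\n", addr)
	}
}
```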
## Implementation Sketch
```
[sf orchestrator (TS)] on the daemon host
│ ssh user@worker.tailnet.ts.hugo.dk -- carries sf-rpc envelope
[sf-worker (Go)] on each worker tailnet node
├── wish.Server with logging + elapsed + promwish middleware
├── per-connection handler spawns the agent via xpty/conpty
├── /metrics via promwish — scraped by your Prometheus
└── /healthz, /readyz simple HTTP for orchestrator health checks
```
The worker is **stateless** — claim, lease, retry, persistence are all the orchestrator's job (per SPEC §22). The worker just executes one attempt at a time and streams output.
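The per-connection handler from the diagram, sketched as a Wish middleware. `wish.Command` attaches the spawned process to the session (and to its pty when the client allocated one); `sf-agent` and `runIDFromSession` are stand-ins, and sf-rpc envelope parsing is elided:
```go
// dispatch spawns one agent attempt per connection and streams its
// output back over the session. Leasing and retry are elided on purpose.
package worker

import (
	"github.com/charmbracelet/ssh"
	"github.com/charmbracelet/wish"
)

func dispatch(next ssh.Handler) ssh.Handler {
	return func(sess ssh.Session) {
		runID := runIDFromSession(sess)
		// Tag the process so disconnect cleanup can pgrep/pkill it later.
		cmd := wish.Command(sess, "sf-agent", "--sf-run-id="+runID)
		if err := cmd.Run(); err != nil {
			wish.Errorln(sess, "attempt failed:", err)
			sess.Exit(1) // retry is the orchestrator's job; the worker stays stateless
			return
		}
		next(sess)
	}
}

// runIDFromSession is a hypothetical helper: in the real worker the run id
// arrives in the sf-rpc envelope; here it falls back to the ssh command line.
func runIDFromSession(sess ssh.Session) string {
	if cmd := sess.Command(); len(cmd) > 0 {
		return cmd[0]
	}
	return "unknown"
}
```
This slots into the `wish.WithMiddleware` chain of the server skeleton under Decision.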
## References
- `SPEC.md` §22 — Distributed Execution.
- `ADR-012` — Multi-instance federation (this ADR provides the substrate).
- `ADR-014` — Singularity Knowledge + Agent Platform (deploys onto this substrate).
- `ADR-016` — Charm AI stack adoption strategy (frames why Go for new services).
- `charmbracelet/wish` — SSH server framework.
- `charmbracelet/x/xpty`, `charmbracelet/x/conpty` — pty primitives.
- `charmbracelet/promwish` — Prometheus middleware for Wish.
- Headscale — open-source Tailscale control plane.