singularity-forge/docs/dev/ADR-012-multi-instance-federation.md

108 lines
8 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ADR-012: Multi-instance federation — when sf instances interlink
**Date**: 2026-04-29
**Status**: proposed (deferred — capture for future implementation)
## Context
sf today is **per-project**: each project has its own `.sf/sf.db`, and a single daemon (`packages/daemon`) on a host serves all projects under its scan roots. As deployment grows beyond one host (laptop, `mikki-bunker`, `aidev`), the question arises: should sf instances on different hosts (or different projects on the same host) interlink? And if so, on which surfaces?
Without thought-out federation, instances repeatedly re-learn the same lessons — anti-patterns, model outages, provider quirks — wasting tokens and duplicating mistakes. With over-eager federation, sf inherits cross-host trust, schema-version, and latency problems it doesn't need yet.
This ADR maps the federation surfaces, takes a position on each, and sequences the work.
## Decision
**Defer most federation. Wire Singularity Memory first as the single load-bearing federation primitive; defer federated benchmarks, cross-repo orchestration, and federated agents until the pain is concrete.**
## Federation Surfaces
### Surface 1 — Knowledge (anti-patterns, learnings, contracts)
**Status:** captured in older SPEC notes as §16; treat that as external design evidence, not current operational authority. The current SF working model must project accepted federation facts into `.sf`/DB-backed state. Singularity Memory (`sm`) is the proposed cross-instance knowledge layer: an HTTP API holding memories, learnings, and anti-patterns.
**Code reality:** not yet wired. `src/resources/extensions/sf/memory-store.ts` and `memory-extractor.ts` write to a local SQLite `memories` table. The spec's "remote-mode" isn't connected.
**Decision:** **wire it.** Singularity Memory is the load-bearing federation primitive. If Mikki learns "Provider X drops requests at 03:00 UTC", that anti-pattern should be reachable from any sf instance on the tailnet without re-learning. Once wired, ~80 % of the "should they interlink?" question answers itself.
### Surface 2 — Benchmarks and circuit breakers
**Status:** per-DB today. `benchmark_results` and `circuit_breakers` tables live in each project's `.sf/sf.db`. One instance trips a breaker on `kimi-coding/k2p5`; another instance has to independently rediscover the outage.
**Decision:** **defer; revisit after Singularity Memory lands.** Two clean options when we revisit:
- **Ride Singularity Memory** — store benchmark observations as a memory category, recall as needed. Cheap; semantically clean (benchmarks ARE learning).
- **Separate thin HTTP service** — purpose-built benchmark aggregator with statistical smoothing and a publish/subscribe channel for circuit-breaker events.
The pain ceiling is bounded today (per-instance discovery is at worst a few wasted dispatches). Only build when concrete cost emerges.
### Surface 3 — Cross-project unit dependencies
**Status:** not designed. sf has no concept of "milestone in repo A produces an artefact repo B depends on". The unit hierarchy (milestone → slice → task) is project-local.
**Decision:** **out of scope for sf.** Cross-repo orchestration is a different abstraction layer — it belongs in a meta-coordinator that consumes sf's daemon/RPC or headless interfaces, not in sf itself. Building it inside sf would conflate "agent that ships one project" with "fleet manager that ships an org's roadmap." Different products.
### Surface 4 — Federated persistent agents
**Status:** not designed. Older SPEC notes sketched persistent agents scoped to a single project's DB; those notes are evidence only until projected into current `.sf`/DB state.
**Decision:** **defer.** Per-instance for v3. If Mikki has a "code-reviewer" persistent agent, it lives in Mikki's DB. Federation requires:
- Cross-host auth (who can wake whose agents).
- Agent-state schema versioning (instances may run different sf versions).
- Leader-election story for shared-agent updates.
- A migration path from per-instance → federated.
None of this earns its keep until we have a concrete use case where one agent should genuinely serve multiple projects/hosts. Premature now.
### Surface 5 — Distributed execution (clarifying note, not federation)
**Status:** captured in older SPEC notes; not built. SSH workers mean one daemon dispatches units to remote worker hosts.
**Decision:** **clarify that this is NOT federation.** Distributed execution = one daemon owns many workers (parallel scaling). Federation = many daemons share state across hosts (knowledge sharing). Different problems. The spec already separates them; this ADR just affirms the line.
## Consequences
**Positive (after Singularity Memory lands)**
- **Knowledge sharing without re-learning** — anti-patterns, gotchas, contract findings reachable across hosts and other agent products on the tailnet.
- **Lower per-instance cost** — fewer wasted dispatches re-discovering provider quirks.
- **Reusable for non-sf agents** — Hermes, Claude Code, Cursor can also read/write Singularity Memory, so the network effect grows beyond sf.
**Negative**
- **Tailnet dependency** — when remote-mode Singularity Memory is configured, tailnet outage degrades sf to local-only. Mitigation: spec already allows embedded (in-process) mode; remote is opt-in.
- **Cross-instance prompt-injection surface** — a malicious memory written by one instance could leak into another's recall. Mitigation: Singularity Memory MUST track provenance per memory and let consumers filter by trusted source. Capture as a sub-ADR if/when implemented.
- **Schema versioning across instances** — different sf versions accessing the same memory store. Mitigation: memory schema must be append-only and additive; new fields are optional reads.
**Risks and mitigations**
- *Risk:* Singularity Memory becomes a bottleneck — sf can't dispatch when memory is down.
- *Mitigation:* sf MUST treat memory as best-effort. A memory-fetch failure logs degraded-mode and proceeds with empty recall. Local SQLite stays as the authoritative scheduler state.
- *Risk:* federated benchmarks make sf overconfident in stale data.
- *Mitigation:* every benchmark observation carries `recorded_at` and `host`. Consumers weight by recency and reject stale data older than `circuit_breaker_resets_at + N`.
- *Risk:* cross-instance attacker plants poisoned anti-patterns to steer agent behaviour.
- *Mitigation:* same as the prompt-injection mitigation above — provenance + trusted-source filter, plus rate-limiting per writer.
## Out of Scope
- **Cross-repo unit graph** — meta-coordinator territory.
- **Federated persistent-agent fleets** — defer until concrete pain.
- **Multi-tenant Singularity Memory** — current design assumes a single-user-or-team trust domain. Multi-tenant is a separate product.
- **Auto-sharding sf instances** — sf is one daemon per host; we don't horizontally split a single host's daemon.
## Sequencing
| When | Action |
|---|---|
| Tier 1+ (next 13 months) | Wire Singularity Memory remote-mode in `memory-store.ts`. Provider chain fallback: remote → embedded → local-only. Promote accepted runtime requirements into `.sf`/DB-backed state once landed. |
| After Singularity Memory in production for 1+ month | Decide whether to ride it for benchmarks (Surface 2) or build a separate service. Decision driven by observed cost of duplicated benchmark discovery. |
| If/when concrete cross-instance agent pain shows up | Reopen Surface 4 (federated persistent agents). Don't pre-build. |
| Never in sf | Surface 3 (cross-repo unit deps) — that's a separate product. |
## References
- Older SPEC notes for Singularity Memory, persistent agents, inter-agent messaging, and distributed execution — external design evidence only; project accepted facts into `.sf`/DB-backed state before treating them as operational.
- `src/resources/extensions/sf/memory-store.ts` — current local-only memory store.
- `packages/daemon/src/daemon.ts` — single-host daemon process.
- `docs/dev/ADR-011-swarm-chat-and-debate-mode.md` — related: ephemeral swarms within a single instance.