docs: clarify SF harness rollout boundaries

2026-04-29 17:47:51 +02:00
commit b32fe7acd1
parent d78c5ac198
4 changed files with 67 additions and 39 deletions
--- a/BUILD_PLAN.md
+++ b/BUILD_PLAN.md
@@ -98,7 +98,7 @@ These came up during recent ports and refactor passes — tracked here so they d
 | **Pi-mono SDK sync** | We pull from pi-mono directly (separate from gsd-2 sync stance). Periodically check `pi-mono/main` for SDK improvements worth taking. The remote is set up; cadence is not. | 3 | recurring |
 | **Caveman input-side compression** (manual) | Caveman skill installed (output compression, ~75% fewer agent tokens). Input side — sf's own prompts (`execute-task.md`, `discuss.md`, `plan-*.md`, etc.) — is verbose: 10-step instruction lists, `runtimeContext`, `memoriesSection`, `taskPlanInline`, `slicePlanExcerpt`. Manually rewrite the heaviest sections in caveman style (preserve intent + nuance, drop fluff). Test against current to confirm no quality regression. | 2 | 1-2 days |
 | **Runtime input preprocessor** (caveman-compress) | Add a transformation step in dispatch that pipes sf's rendered prompt through `caveman-compress` (sub-skill in juliusbrussee/caveman repo, ~46% input-token reduction) before LLM call. Only enable when a `terse_prompts: true` preference is set. Adds a layer that can drift from authored intent — needs a comparison harness. | 3 | 3-4 days |
-| **Swarm chat / debate mode** for `subagent` tool | Today `subagent({ tasks: [...] })` runs parallel fire-and-forget — adversarial reviewers never engage each other's strongest defence. Add `mode: "debate"` + `rounds: N` so each task sees prior rounds' outputs. See [ADR-011](docs/dev/ADR-011-swarm-chat-and-debate-mode.md) — Option A (round-robin debate) first, Option C (full inbox-based swarm chat) after the persistent-agent layer (SPEC §17–18) lands. | 2 | 1 week (Option A); ~3 weeks (Option C, depends on persistent-agent layer) |
+| **Full swarm chat for `subagent` tool** | Round-robin debate mode now exists as `subagent({ mode: "debate", rounds: N, tasks: [...] })`, so adversarial reviewers can engage prior-round arguments. Remaining work is Option C from [ADR-011](docs/dev/ADR-011-swarm-chat-and-debate-mode.md): full inbox-based swarm chat after the persistent-agent layer (SPEC §17–18) lands. | 3 | ~3 weeks (depends on persistent-agent layer) |
 | **Singularity Knowledge + Agent Platform (Go re-platform)** | Re-platform Singularity Memory from Python+FastAPI+Postgres+vchord to Go on Charm: charm-server patterns for auth/identity, fantasy as agent runtime, same Postgres+vchord for retrieval, exact wire-contract preserved. Load-bearing for cross-instance knowledge federation AND future central persistent agents (sf SPEC §17). See [ADR-014](docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md) and [`singularity-memory/MIGRATION.md`](https://github.com/singularity-ng/singularity-memory/blob/main/MIGRATION.md). | 1 | ~12 weeks across phases |
 | **Wire sf to Singularity Memory remote-mode** | sf-side: change `memory-store.ts` provider chain from local-SQLite-only to remote-Singularity-Memory → embedded → local-only fallback. Once wired, ~80% of the "should sf instances interlink?" question (ADR-012) is answered for free. Depends on the platform itself being live. | 1 | 1 week post-platform |
 | **sf-worker SSH host** | Build the Go-based SSH worker host for distributed execution (SPEC §22, NEW): `wish` + `xpty`/`conpty` + `promwish`. Orchestrator dispatches over SSH; worker spawns the agent in a real pty per attempt; Prometheus metrics for free. See [ADR-013](docs/dev/ADR-013-network-and-remote-execution.md). | 2 | ~3 weeks |
--- a/SPEC.md
+++ b/SPEC.md
@@ -1784,7 +1784,11 @@ Deep analysis is default, not opt-in:
 ## 17. Persistent Agents
-> **Status: PARTIAL** — sf has subagents (test files: `subagent-agent-discovery`, `subagent-model-dispatch`, `agent-end-retry`; module: `bootstrap/agent-end-recovery.ts`). The spec's persistent-identity + memory-blocks + inbox-wake model is NEW.
+> **Status: PARTIAL** — sf has ephemeral subagents, including single,
+> parallel, chain, and bounded debate batches (`subagent({ mode: "debate",
+> rounds, tasks })`; tests include `subagent-agent-discovery`,
+> `subagent-model-dispatch`, `agent-end-retry`, `subagent-debate-mode`).
+> The spec's persistent-identity + memory-blocks + inbox-wake model is NEW.
 ### 17.1 Agent vs unit
--- a/docs/dev/ADR-011-swarm-chat-and-debate-mode.md
+++ b/docs/dev/ADR-011-swarm-chat-and-debate-mode.md
@@ -1,11 +1,11 @@
 # ADR-011: Swarm chat and debate mode for ephemeral subagents
 **Date**: 2026-04-29
-**Status**: proposed (deferred — capture for future implementation)
+**Status**: accepted (Option A implemented; full swarm chat deferred)
 ## Context
-sf's `subagent` tool today dispatches one or more subagents in **parallel fire-and-forget** mode (`subagent({ tasks: [...] })`). All tasks run concurrently; none see each other; the parent collects results and synthesises.
+sf's `subagent` tool originally dispatched one or more subagents in **parallel fire-and-forget** mode (`subagent({ tasks: [...] })`). All tasks ran concurrently; none saw each other; the parent collected results and synthesised.
 This is sufficient for many cases (parallel research, parallel gate evaluation), but it has a structural gap for **adversarial review** and **multi-stakeholder negotiation**:
@@ -17,11 +17,20 @@ The user asked whether agent-to-agent communication could happen inside ephemera
 ## Decision
-**Defer.** Capture the design in this ADR and a `BUILD_PLAN.md` row. Implement after the persistent-agent layer (`agents`, `agent_messages`, `agent_inbox`, `send_message` tool) lands as a NEW tier, since 90 % of the machinery is shared. Implement Option A (debate mode) first as a forcing function — once we see how much real debate improves outcomes, the case for full swarm-chat (Option C) writes itself.
+**Implement Option A now; defer Option C.**
+Round-robin debate mode is implemented on the existing `subagent` tool as
+`subagent({ mode: "debate", rounds: N, tasks: [...] })`. It gives each
+participant the prior rounds' transcript and keeps the parent as synthesiser.
+Full inbox-based swarm chat remains deferred until the persistent-agent layer
+(`agents`, `agent_messages`, `agent_inbox`, `send_message` tool) lands. That
+machinery is still shared with SPEC §17-18 and should not be rebuilt inside the
+ephemeral subagent extension.
 ## Alternatives Considered
-### Option A — Round-robin debate mode (RECOMMENDED first)
+### Option A — Round-robin debate mode (IMPLEMENTED)
 Add `mode: "debate"` and `rounds: N` to the `subagent` tool. Each round, every task sees the previous round's outputs.
@@ -30,8 +39,8 @@ subagent({
  mode: "debate",
  rounds: 3,
  tasks: [
-    { id: "advocate",   model_tier: "validation", prompt: "Make case for X. ..." },
+    { agent: "reviewer", task: "Make case for X. ..." },
-    { id: "challenger", model_tier: "validation", prompt: "Attack X. ..." }
+    { agent: "reviewer", task: "Attack X. ..." }
  ]
 })
 ```
@@ -42,7 +51,8 @@ subagent({
 - **Why not**: doesn't support free-form many-to-many messaging. Each task speaks once per round in a fixed order.
 - **Why first**: smallest change, biggest immediate quality win, reusable as a primitive.
-**Effort**: ~1 dev-week. Touches: `subagent` tool definition, dispatch path in pi-coding-agent, new test cases, `dispatching-subagents` skill section, possibly `advisory-partner` skill update.
+**Implementation:** `src/resources/extensions/subagent/index.ts`.
+**Regression test:** `src/tests/subagent-debate-mode.test.ts`.
 ### Option B — Shared scratchpad
@@ -78,12 +88,12 @@ swarm({
 **Positive**
 - **Higher-quality adversarial review** — the challenger actually engages the advocate's strongest defence, instead of issuing a parallel monologue.
-- **Multi-stakeholder negotiation** — the Vision Alignment Meeting becomes a real meeting, not a parallel survey.
+- **Multi-stakeholder pressure testing** — the Vision Alignment Meeting can use bounded debate rounds instead of only a parallel survey.
 - **Reusable primitive** — debate mode can be invoked from any skill that today does `subagent({ tasks: [advocate, challenger] })` (currently `advisory-partner`, `brainstorming`, `requesting-code-review`).
 **Negative**
-- **Cost grows linearly with rounds.** A 3-round debate is 3× the tokens. Budget gates need updating in `auto-budget.ts` so debate dispatches don't silently blow past the per-unit ceiling.
+- **Cost grows linearly with rounds.** A 3-round debate is roughly 3× the tokens. Callers should reserve budget accordingly.
 - **Determinism drops.** A fire-and-collect batch is reproducible from prompts alone; a debate is path-dependent. Trace recording becomes more important — `.sf/traces/` must capture each round.
 - **Synthesis complexity rises** — the parent must summarise a debate transcript, not just collect verdicts. The synthesis prompt itself becomes a tunable artefact.
@@ -102,39 +112,32 @@ swarm({
 - **Cross-session swarm replay** — a swarm session, once archived, is read-only. No "fork from round 2" support in v1.
 - **Human-in-the-loop debate** — swarms are agent-to-agent only. If the user wants to inject a turn, that's a different surface (the existing `discuss` flow).
-## Implementation Sketch (Option A first)
+## Implementation Notes (Option A)
-1. Extend `subagent` tool:
+1. `subagent` accepts `mode: "parallel" | "debate"` on `tasks` batches.
-   - Add `mode` field: `"parallel"` (default, current behaviour) | `"debate"`.
+2. `rounds` defaults to `2`, is capped at `5`, and is valid only with
-   - Add `rounds` field (required when `mode = "debate"`, default `2`, max `5`).
+   `mode: "debate"`.
-2. In the dispatch layer (pi-coding-agent / sf adapter):
+3. Debate requires at least two participants.
-   - For `mode = "debate"`: maintain an in-memory transcript per swarm. Each round, render `previous_rounds_transcript` as a context block and append it to each task's prompt.
+4. Each round runs the participant tasks, then appends their outputs to an
-   - Per-round trace span: `swarm.<id>.round.<n>.task.<id>` so `.sf/traces/` reflects the structure.
+   in-memory transcript.
-3. Synthesis prompt:
+5. Later rounds receive the transcript under `Debate transcript so far`.
-   - When all rounds complete, the parent receives the full transcript plus a synthesis directive: "summarise the strongest claim, the strongest objection, the convergence (if any), and the residual disagreement."
+6. The final round asks each participant to end with `FINAL_VERDICT`.
-4. Budget gate:
+7. The parent still owns synthesis and persistence; debate mode does not create
-   - `auto-budget.ts` needs to multiply the projected cost by `rounds` before approving the dispatch.
+   persistent agent messages.
-5. Tests:
-   - Unit test: a 2-round debate produces a transcript with 4 turns (2 tasks × 2 rounds).
-   - Integration test: an advocate/challenger pair on a known weak design — verify the falsifier surfaces by round 3 (vs. parallel mode where it doesn't).
-6. Skill updates:
-   - `advisory-partner` — add "for non-trivial reviews, consider `mode: 'debate'` over parallel fire".
-   - `brainstorming` Step 5 — same.
-   - `dispatching-subagents` — add a "debate mode" pattern between Pattern 2 and Pattern 3.
 ## Sequencing
 | When | Why |
 |---|---|
-| Persistent-agent layer scoped (`SPEC.md` §17 NEW → IN PROGRESS) | Most of Option A's machinery (transcript persistence, message scoping) overlaps. |
+| Now | Option A is available as bounded debate mode on `subagent`. |
-| Option A implemented | Forcing function — observe quality lift on adversarial reviews. |
+| Six months of Option A in production | Decide whether full swarm-chat with inbox is worth the build. |
-| Six months of Option A in production | Decide whether Option C (full swarm-chat with inbox) is worth the build. |
+| Persistent-agent layer scoped (`SPEC.md` §17 NEW → IN PROGRESS) | Revisit Option C because inbox/message persistence machinery will exist. |
 ## References
 - `docs/SPEC.md` §17 (Persistent Agents) — defines `agents`, `agent_memory_blocks`, `agent_messages`, `agent_inbox`.
 - `docs/SPEC.md` §18 (Inter-Agent Messaging) — defines `send_message` tool. Currently NEW (not implemented).
-- `src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md` — current parallel-only contract.
+- `src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md` — current single/parallel/debate/chain guidance.
 - `src/resources/extensions/sf/skills/advisory-partner/SKILL.md` — primary consumer of adversarial dispatch today.
 - `src/resources/extensions/sf/prompts/gate-evaluate.md` — pre-execution Q3/Q4 gates.
 - `src/resources/extensions/sf/prompts/validate-milestone.md` — post-execution 3-reviewer pattern.
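Reviewer note: the Option A implementation notes in the ADR-011 hunk above (default of 2 rounds, cap of 5, at least two participants, transcript passed to later rounds, `FINAL_VERDICT` on the last round) can be sketched as a small loop. This is a minimal illustration only, not the code in `src/resources/extensions/subagent/index.ts`. The names `DebateTask`, `Turn`, `RunTask`, and `runDebate` are hypothetical, the dispatch callback is synchronous for brevity, and clamping an out-of-range `rounds` value (rather than rejecting it) is an assumption.

```typescript
// Hypothetical types for illustration; the real tool's types may differ.
type DebateTask = { agent: string; task: string };
type Turn = { round: number; agent: string; output: string };
// Stand-in for the real subagent dispatch (assumed synchronous here).
type RunTask = (agent: string, prompt: string) => string;

const DEFAULT_ROUNDS = 2; // "rounds defaults to 2"
const MAX_ROUNDS = 5; // "is capped at 5"

function runDebate(
  tasks: DebateTask[],
  run: RunTask,
  rounds: number = DEFAULT_ROUNDS,
): Turn[] {
  if (tasks.length < 2) {
    throw new Error("debate requires at least two participants");
  }
  // Assumption: out-of-range values are clamped into 1..MAX_ROUNDS.
  const total = Math.min(Math.max(1, Math.floor(rounds)), MAX_ROUNDS);
  const transcript: Turn[] = [];

  for (let round = 1; round <= total; round++) {
    // Render prior rounds once, so every participant in this round
    // sees the same transcript and none sees same-round outputs.
    const prior = transcript
      .map((t) => `[round ${t.round}] ${t.agent}: ${t.output}`)
      .join("\n");

    const turns = tasks.map(({ agent, task }) => {
      let prompt = task;
      if (prior !== "") {
        prompt += `\n\nDebate transcript so far:\n${prior}`;
      }
      if (round === total) {
        prompt += "\n\nEnd your answer with FINAL_VERDICT.";
      }
      return { round, agent, output: run(agent, prompt) };
    });

    // Append this round's outputs to the in-memory transcript.
    transcript.push(...turns);
  }
  // The parent still owns synthesis; the sketch only returns the transcript.
  return transcript;
}
```

With two tasks and the default two rounds, the returned transcript holds four turns, and only the second-round prompts carry the `Debate transcript so far` block. Token cost scales with the turn count, which is why the ADR asks callers to reserve budget per round.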
--- a/docs/dev/ADR-018-repo-native-harness-evolution.md
+++ b/docs/dev/ADR-018-repo-native-harness-evolution.md
@@ -27,15 +27,35 @@ The system starts from template kits, then adapts them to the repository by read
 Add the contract to markdown now. Add runtime flow behavior later behind tests.
-The first implementation should not start by changing the worker prompt. It should add a pre-plan profile snapshot and a post-unit evidence retention hook, because those are observable and testable without changing every dispatch. Once those are stable, sf can inject harness/memory context into planning and verification prompts.
+The first implementation should not start by changing the worker prompt or
+writing repo-local harness files. It should add a pre-plan profile snapshot and
+a post-unit evidence retention hook, because those are observable and testable
+without changing every dispatch. Once those are stable, sf can inject
+harness/memory context into planning and verification prompts.
+Near-term repository-write boundary:
+- All repositories use the same sf built-in skills and harness behavior.
+- sf MUST NOT generate repo-local custom skill packs such as `.agents/skills/`
+  for project repos.
+- sf MUST NOT create tracked `harness/`, `gates/`, CI, or repo spec files as
+  part of normal initialization.
+- The only project-level file write allowed by this stream before the explicit
+  harness-writer phase is sf project preferences/config, such as
+  `.sf/PREFERENCES.md` or `.sf/preferences.md`, when the user asks for project
+  preferences.
+- `.sf/sf.db` may record ignored operational state, including repo profiles and
+  untracked-file observations. That is not repo ownership and must not be
+  staged by default.
 | When | Flow addition | Why |
 |---|---|---|
 | Now, in docs | Define repo profiling, untracked observation, harness planning, eval/judge rig, and memory retention contracts. | Gives implementation a stable target. |
-| First code slice | Add read-only repo profile snapshot before planning. | Lets sf understand repo shape without taking ownership. |
+| First code slice | Add read-only repo profile snapshot before planning. | Lets sf understand repo shape without taking ownership or writing tracked files. |
 | Second code slice | Add post-unit evidence retention into `.sf/sf.db` and Singularity Memory. | Converts gate results into future guidance. |
-| Third code slice | Add harness proposal generation as a planning artifact. | Keeps generated files reviewable before write. |
+| Third code slice | Add harness proposal generation as a planning artifact. | Produces dry-run proposals only; no tracked repo files are written. |
 | Later | Inject harness/memory context into runtime prompts and workflow templates. | This changes agent behavior and needs regression fixtures. |
+| Explicit opt-in later | Enable Harness Writer for reviewed diffs. | Allows tracked harness files only when a unit plan claims them and the user accepts the diff. |
 ### Files, database, and memory
@@ -43,7 +63,7 @@ Use all three layers, with separate responsibilities:
 | Layer | Role | Examples |
 |---|---|---|
-| Tracked repo files | Durable contract and executable harness | `SPEC.md`, `ARCHITECTURE.md`, `harness/manifest.json`, `harness/evals/*.jsonl`, `gates/*.sh`, CI workflow snippets |
+| Tracked repo files | Future durable contract and executable harness after explicit opt-in | `SPEC.md`, `ARCHITECTURE.md`, `harness/manifest.json`, `harness/evals/*.jsonl`, `gates/*.sh`, CI workflow snippets |
 | `.sf/sf.db` | Operational state and evidence ledger | repo profile snapshots, harness inventory, eval runs, gate results, drift events, untracked-file observations |
 | Singularity Memory | Cross-session knowledge | proven patterns, anti-patterns, recurring failures, repo-specific risk notes, judge calibration lessons |
@@ -152,10 +172,11 @@ Detailed design is in `repo-native-harness-architecture.md`.
 | Stage | Work | Result |
 |---|---|---|
 | 1 | Add repo profile snapshots and untracked observation model. | sf understands repo shape without taking ownership. |
-| 2 | Add template kit registry and harness manifest format. | sf can generate reviewable harness files. |
+| 2 | Add template kit registry and harness manifest format. | sf can generate dry-run harness proposals without writing repo files. |
 | 3 | Add judge rig and eval suite runner. | AI and agent behavior becomes measurable. |
 | 4 | Connect evidence to Singularity Memory. | Patterns and anti-patterns improve future dispatch. |
-| 5 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo. |
+| 5 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo as proposals. |
+| 6 | Add explicit opt-in Harness Writer. | Reviewed repo diffs can create tracked harness files; repo-local skills remain out of scope unless separately accepted. |
 ## References
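Reviewer note: the near-term repository-write boundary added to ADR-018 in this commit reads naturally as a path classifier. The sketch below is an illustration under stated assumptions, not sf code: `classifyWrite`, `WriteDecision`, and the `userAskedForPreferences` flag are hypothetical names, and it covers only the phase before the explicit Harness Writer opt-in.

```typescript
// Hypothetical guard for the pre-Harness-Writer phase; not part of sf.
type WriteDecision = "allowed" | "ignored-state" | "denied";

// The two preference paths named in the ADR text.
const PREFERENCE_FILES = new Set([".sf/PREFERENCES.md", ".sf/preferences.md"]);

function classifyWrite(
  relPath: string,
  userAskedForPreferences: boolean,
): WriteDecision {
  // Normalise separators and a leading "./" so comparisons are stable.
  const p = relPath.replace(/\\/g, "/").replace(/^\.\//, "");

  // Operational state may be written, but it is ignored and never staged.
  if (p === ".sf/sf.db") return "ignored-state";

  // Preferences are the only project-level write, and only on request.
  if (PREFERENCE_FILES.has(p)) {
    return userAskedForPreferences ? "allowed" : "denied";
  }

  // Everything else is out of bounds before the opt-in phase, including
  // `.agents/skills/`, `harness/`, `gates/`, CI files, and repo spec files.
  return "denied";
}
```

A deny-by-default shape keeps the rule auditable: a new file kind stays blocked until someone adds it on purpose, which matches the ADR's intent that tracked harness files appear only after an accepted diff.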