docs: clarify SF harness rollout boundaries

Mikael Hugo 2026-04-29 17:47:51 +02:00
parent d78c5ac198
commit b32fe7acd1
4 changed files with 67 additions and 39 deletions

@ -98,7 +98,7 @@ These came up during recent ports and refactor passes — tracked here so they d
 | **Pi-mono SDK sync** | We pull from pi-mono directly (separate from gsd-2 sync stance). Periodically check `pi-mono/main` for SDK improvements worth taking. The remote is set up; cadence is not. | 3 | recurring |
 | **Caveman input-side compression** (manual) | Caveman skill installed (output compression, ~75% fewer agent tokens). Input side — sf's own prompts (`execute-task.md`, `discuss.md`, `plan-*.md`, etc.) — is verbose: 10-step instruction lists, `runtimeContext`, `memoriesSection`, `taskPlanInline`, `slicePlanExcerpt`. Manually rewrite the heaviest sections in caveman style (preserve intent + nuance, drop fluff). Test against current to confirm no quality regression. | 2 | 1-2 days |
 | **Runtime input preprocessor** (caveman-compress) | Add a transformation step in dispatch that pipes sf's rendered prompt through `caveman-compress` (sub-skill in juliusbrussee/caveman repo, ~46% input-token reduction) before LLM call. Only enable when a `terse_prompts: true` preference is set. Adds a layer that can drift from authored intent — needs a comparison harness. | 3 | 3-4 days |
-| **Swarm chat / debate mode** for `subagent` tool | Today `subagent({ tasks: [...] })` runs parallel fire-and-forget — adversarial reviewers never engage each other's strongest defence. Add `mode: "debate"` + `rounds: N` so each task sees prior rounds' outputs. See [ADR-011](docs/dev/ADR-011-swarm-chat-and-debate-mode.md) — Option A (round-robin debate) first, Option C (full inbox-based swarm chat) after the persistent-agent layer (SPEC §1718) lands. | 2 | 1 week (Option A); ~3 weeks (Option C, depends on persistent-agent layer) |
+| **Full swarm chat for `subagent` tool** | Round-robin debate mode now exists as `subagent({ mode: "debate", rounds: N, tasks: [...] })`, so adversarial reviewers can engage prior-round arguments. Remaining work is Option C from [ADR-011](docs/dev/ADR-011-swarm-chat-and-debate-mode.md): full inbox-based swarm chat after the persistent-agent layer (SPEC §1718) lands. | 3 | ~3 weeks (depends on persistent-agent layer) |
 | **Singularity Knowledge + Agent Platform (Go re-platform)** | Re-platform Singularity Memory from Python+FastAPI+Postgres+vchord to Go on Charm: charm-server patterns for auth/identity, fantasy as agent runtime, same Postgres+vchord for retrieval, exact wire-contract preserved. Load-bearing for cross-instance knowledge federation AND future central persistent agents (sf SPEC §17). See [ADR-014](docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md) and [`singularity-memory/MIGRATION.md`](https://github.com/singularity-ng/singularity-memory/blob/main/MIGRATION.md). | 1 | ~12 weeks across phases |
 | **Wire sf to Singularity Memory remote-mode** | sf-side: change `memory-store.ts` provider chain from local-SQLite-only to remote-Singularity-Memory → embedded → local-only fallback. Once wired, ~80% of the "should sf instances interlink?" question (ADR-012) is answered for free. Depends on the platform itself being live. | 1 | 1 week post-platform |
 | **sf-worker SSH host** | Build the Go-based SSH worker host for distributed execution (SPEC §22, NEW): `wish` + `xpty`/`conpty` + `promwish`. Orchestrator dispatches over SSH; worker spawns the agent in a real pty per attempt; Prometheus metrics for free. See [ADR-013](docs/dev/ADR-013-network-and-remote-execution.md). | 2 | ~3 weeks |
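The remote → embedded → local-only fallback named in the `memory-store.ts` row could look roughly like the sketch below. This is only an illustration: `MemoryProvider`, `isAvailable`, and `pickProvider` are hypothetical names, the probe is synchronous for brevity, and the real provider interface in `memory-store.ts` is not shown in this diff.

```typescript
// Hypothetical provider shape; the real memory-store.ts interface is not shown here.
interface MemoryProvider {
  name: string;
  isAvailable(): boolean; // synchronous probe, for illustration only
}

// Return the first provider in priority order that reports itself available.
function pickProvider(chain: MemoryProvider[]): MemoryProvider {
  for (const provider of chain) {
    try {
      if (provider.isAvailable()) return provider;
    } catch (err) {
      // A failing probe counts as unavailable; fall through to the next provider.
    }
  }
  throw new Error("no memory provider available");
}
```

Passing the chain as an ordered array keeps the priority explicit at the call site, so a local-only deployment is simply a one-element chain.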

@ -1784,7 +1784,11 @@ Deep analysis is default, not opt-in:
 ## 17. Persistent Agents
-> **Status: PARTIAL** — sf has subagents (test files: `subagent-agent-discovery`, `subagent-model-dispatch`, `agent-end-retry`; module: `bootstrap/agent-end-recovery.ts`). The spec's persistent-identity + memory-blocks + inbox-wake model is NEW.
+> **Status: PARTIAL** — sf has ephemeral subagents, including single,
+> parallel, chain, and bounded debate batches (`subagent({ mode: "debate",
+> rounds, tasks })`; tests include `subagent-agent-discovery`,
+> `subagent-model-dispatch`, `agent-end-retry`, `subagent-debate-mode`).
+> The spec's persistent-identity + memory-blocks + inbox-wake model is NEW.
 ### 17.1 Agent vs unit

@ -1,11 +1,11 @@
 # ADR-011: Swarm chat and debate mode for ephemeral subagents
 **Date**: 2026-04-29
-**Status**: proposed (deferred — capture for future implementation)
+**Status**: accepted (Option A implemented; full swarm chat deferred)
 ## Context
-sf's `subagent` tool today dispatches one or more subagents in **parallel fire-and-forget** mode (`subagent({ tasks: [...] })`). All tasks run concurrently; none see each other; the parent collects results and synthesises.
+sf's `subagent` tool originally dispatched one or more subagents in **parallel fire-and-forget** mode (`subagent({ tasks: [...] })`). All tasks ran concurrently; none saw each other; the parent collected results and synthesised.
 This is sufficient for many cases (parallel research, parallel gate evaluation), but it has a structural gap for **adversarial review** and **multi-stakeholder negotiation**:
@ -17,11 +17,20 @@ The user asked whether agent-to-agent communication could happen inside ephemera
 ## Decision
-**Defer.** Capture the design in this ADR and a `BUILD_PLAN.md` row. Implement after the persistent-agent layer (`agents`, `agent_messages`, `agent_inbox`, `send_message` tool) lands as a NEW tier, since 90 % of the machinery is shared. Implement Option A (debate mode) first as a forcing function — once we see how much real debate improves outcomes, the case for full swarm-chat (Option C) writes itself.
+**Implement Option A now; defer Option C.**
+Round-robin debate mode is implemented on the existing `subagent` tool as
+`subagent({ mode: "debate", rounds: N, tasks: [...] })`. It gives each
+participant the prior rounds' transcript and keeps the parent as synthesiser.
+Full inbox-based swarm chat remains deferred until the persistent-agent layer
+(`agents`, `agent_messages`, `agent_inbox`, `send_message` tool) lands. That
+machinery is still shared with SPEC §17-18 and should not be rebuilt inside the
+ephemeral subagent extension.
 ## Alternatives Considered
-### Option A — Round-robin debate mode (RECOMMENDED first)
+### Option A — Round-robin debate mode (IMPLEMENTED)
 Add `mode: "debate"` and `rounds: N` to the `subagent` tool. Each round, every task sees the previous round's outputs.
@ -30,8 +39,8 @@ subagent({
 mode: "debate",
 rounds: 3,
 tasks: [
-{ id: "advocate", model_tier: "validation", prompt: "Make case for X. ..." },
-{ id: "challenger", model_tier: "validation", prompt: "Attack X. ..." }
+{ agent: "reviewer", task: "Make case for X. ..." },
+{ agent: "reviewer", task: "Attack X. ..." }
 ]
 })
``` ```
@ -42,7 +51,8 @@ subagent({
 - **Why not**: doesn't support free-form many-to-many messaging. Each task speaks once per round in a fixed order.
 - **Why first**: smallest change, biggest immediate quality win, reusable as a primitive.
-**Effort**: ~1 dev-week. Touches: `subagent` tool definition, dispatch path in pi-coding-agent, new test cases, `dispatching-subagents` skill section, possibly `advisory-partner` skill update.
+**Implementation:** `src/resources/extensions/subagent/index.ts`.
+**Regression test:** `src/tests/subagent-debate-mode.test.ts`.
 ### Option B — Shared scratchpad
@ -78,12 +88,12 @@ swarm({
 **Positive**
 - **Higher-quality adversarial review** — the challenger actually engages the advocate's strongest defence, instead of issuing a parallel monologue.
-- **Multi-stakeholder negotiation** — the Vision Alignment Meeting becomes a real meeting, not a parallel survey.
+- **Multi-stakeholder pressure testing** — the Vision Alignment Meeting can use bounded debate rounds instead of only a parallel survey.
 - **Reusable primitive** — debate mode can be invoked from any skill that today does `subagent({ tasks: [advocate, challenger] })` (currently `advisory-partner`, `brainstorming`, `requesting-code-review`).
 **Negative**
-- **Cost grows linearly with rounds.** A 3-round debate is 3× the tokens. Budget gates need updating in `auto-budget.ts` so debate dispatches don't silently blow past the per-unit ceiling.
+- **Cost grows linearly with rounds.** A 3-round debate is roughly 3× the tokens. Callers should reserve budget accordingly.
 - **Determinism drops.** A fire-and-collect batch is reproducible from prompts alone; a debate is path-dependent. Trace recording becomes more important — `.sf/traces/` must capture each round.
 - **Synthesis complexity rises** — the parent must summarise a debate transcript, not just collect verdicts. The synthesis prompt itself becomes a tunable artefact.
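To make the linear-cost point concrete, a caller could reserve budget with arithmetic along these lines. This is a rough sketch under a stated assumption: every turn costs about the same number of tokens, which ignores the transcript growing in later rounds. `estimateDebateTokens` and `fitsBudget` are hypothetical helper names, not part of sf.

```typescript
// Assumption: flat per-turn cost; transcript growth in later rounds is ignored.
function estimateDebateTokens(tokensPerTurn: number, participants: number, rounds: number): number {
  return tokensPerTurn * participants * rounds;
}

// True when the estimate stays within the caller's reserved ceiling.
function fitsBudget(estimate: number, ceiling: number): boolean {
  return estimate <= ceiling;
}
```

Under that assumption, two participants at 4,000 tokens per turn come to 8,000 tokens for a single round and 24,000 for three rounds, which is the "roughly 3×" relationship the bullet describes.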
@ -102,39 +112,32 @@ swarm({
 - **Cross-session swarm replay** — a swarm session, once archived, is read-only. No "fork from round 2" support in v1.
 - **Human-in-the-loop debate** — swarms are agent-to-agent only. If the user wants to inject a turn, that's a different surface (the existing `discuss` flow).
-## Implementation Sketch (Option A first)
+## Implementation Notes (Option A)
-1. Extend `subagent` tool:
-- Add `mode` field: `"parallel"` (default, current behaviour) | `"debate"`.
-- Add `rounds` field (required when `mode = "debate"`, default `2`, max `5`).
-2. In the dispatch layer (pi-coding-agent / sf adapter):
-- For `mode = "debate"`: maintain an in-memory transcript per swarm. Each round, render `previous_rounds_transcript` as a context block and append it to each task's prompt.
-- Per-round trace span: `swarm.<id>.round.<n>.task.<id>` so `.sf/traces/` reflects the structure.
-3. Synthesis prompt:
-- When all rounds complete, the parent receives the full transcript plus a synthesis directive: "summarise the strongest claim, the strongest objection, the convergence (if any), and the residual disagreement."
-4. Budget gate:
-- `auto-budget.ts` needs to multiply the projected cost by `rounds` before approving the dispatch.
-5. Tests:
-- Unit test: a 2-round debate produces a transcript with 4 turns (2 tasks × 2 rounds).
-- Integration test: an advocate/challenger pair on a known weak design — verify the falsifier surfaces by round 3 (vs. parallel mode where it doesn't).
-6. Skill updates:
-- `advisory-partner` — add "for non-trivial reviews, consider `mode: 'debate'` over parallel fire".
-- `brainstorming` Step 5 — same.
-- `dispatching-subagents` — add a "debate mode" pattern between Pattern 2 and Pattern 3.
+1. `subagent` accepts `mode: "parallel" | "debate"` on `tasks` batches.
+2. `rounds` defaults to `2`, is capped at `5`, and is valid only with
+`mode: "debate"`.
+3. Debate requires at least two participants.
+4. Each round runs the participant tasks, then appends their outputs to an
+in-memory transcript.
+5. Later rounds receive the transcript under `Debate transcript so far`.
+6. The final round asks each participant to end with `FINAL_VERDICT`.
+7. The parent still owns synthesis and persistence; debate mode does not create
+persistent agent messages.
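The bounded debate loop described for Option A (default of 2 rounds, cap of 5, at least two participants, prior transcript fed into later rounds, `FINAL_VERDICT` requested in the last round) can be sketched as follows. This is not the code in `src/resources/extensions/subagent/index.ts`: `DebateTask`, `RunTask`, and `runDebate` are hypothetical names, the runner is synchronous so the sketch stays self-contained, and treating the cap as a clamp rather than a validation error is an assumption.

```typescript
// Hypothetical types; the real tool schema is not shown in this diff.
interface DebateTask { agent: string; task: string; }
type RunTask = (agent: string, prompt: string) => string;

const DEFAULT_ROUNDS = 2;
const MAX_ROUNDS = 5;

function runDebate(tasks: DebateTask[], run: RunTask, rounds: number = DEFAULT_ROUNDS): string[] {
  if (tasks.length < 2) throw new Error("debate requires at least two participants");
  if (Math.floor(rounds) !== rounds || rounds < 1) throw new Error("rounds must be a positive integer");
  const total = Math.min(rounds, MAX_ROUNDS); // assumption: the cap clamps instead of rejecting
  const transcript: string[] = [];
  for (let round = 1; round <= total; round++) {
    // Snapshot first, so every participant in this round sees the same prior rounds.
    const prior = transcript.join("\n");
    const outputs = tasks.map((t, i) => {
      let prompt = t.task;
      if (prior.length > 0) prompt += "\n\nDebate transcript so far:\n" + prior;
      if (round === total) prompt += "\n\nEnd your answer with FINAL_VERDICT.";
      return `[round ${round}] participant ${i + 1} (${t.agent}): ${run(t.agent, prompt)}`;
    });
    transcript.push(...outputs);
  }
  return transcript; // the parent synthesises; nothing is persisted here
}
```

Returning the raw transcript rather than a summary keeps synthesis with the parent, which matches the note that debate mode creates no persistent agent messages.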
 ## Sequencing
 | When | Why |
 |---|---|
-| Persistent-agent layer scoped (`SPEC.md` §17 NEW → IN PROGRESS) | Most of Option A's machinery (transcript persistence, message scoping) overlaps. |
-| Option A implemented | Forcing function — observe quality lift on adversarial reviews. |
-| Six months of Option A in production | Decide whether Option C (full swarm-chat with inbox) is worth the build. |
+| Now | Option A is available as bounded debate mode on `subagent`. |
+| Six months of Option A in production | Decide whether full swarm-chat with inbox is worth the build. |
+| Persistent-agent layer scoped (`SPEC.md` §17 NEW → IN PROGRESS) | Revisit Option C because inbox/message persistence machinery will exist. |
 ## References
 - `docs/SPEC.md` §17 (Persistent Agents) — defines `agents`, `agent_memory_blocks`, `agent_messages`, `agent_inbox`.
 - `docs/SPEC.md` §18 (Inter-Agent Messaging) — defines `send_message` tool. Currently NEW (not implemented).
-- `src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md` — current parallel-only contract.
+- `src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md` — current single/parallel/debate/chain guidance.
 - `src/resources/extensions/sf/skills/advisory-partner/SKILL.md` — primary consumer of adversarial dispatch today.
 - `src/resources/extensions/sf/prompts/gate-evaluate.md` — pre-execution Q3/Q4 gates.
 - `src/resources/extensions/sf/prompts/validate-milestone.md` — post-execution 3-reviewer pattern.

@ -27,15 +27,35 @@ The system starts from template kits, then adapts them to the repository by read
 Add the contract to markdown now. Add runtime flow behavior later behind tests.
-The first implementation should not start by changing the worker prompt. It should add a pre-plan profile snapshot and a post-unit evidence retention hook, because those are observable and testable without changing every dispatch. Once those are stable, sf can inject harness/memory context into planning and verification prompts.
+The first implementation should not start by changing the worker prompt or
+writing repo-local harness files. It should add a pre-plan profile snapshot and
+a post-unit evidence retention hook, because those are observable and testable
+without changing every dispatch. Once those are stable, sf can inject
+harness/memory context into planning and verification prompts.
+Near-term repository-write boundary:
+- All repositories use the same sf built-in skills and harness behavior.
+- sf MUST NOT generate repo-local custom skill packs such as `.agents/skills/`
+for project repos.
+- sf MUST NOT create tracked `harness/`, `gates/`, CI, or repo spec files as
+part of normal initialization.
+- The only project-level file write allowed by this stream before the explicit
+harness-writer phase is sf project preferences/config, such as
+`.sf/PREFERENCES.md` or `.sf/preferences.md`, when the user asks for project
+preferences.
+- `.sf/sf.db` may record ignored operational state, including repo profiles and
+untracked-file observations. That is not repo ownership and must not be
+staged by default.
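The near-term write boundary above amounts to a default-deny rule with two narrow exceptions, which could be checked with a guard like the one below. This is a minimal sketch, not sf code: `checkProjectWrite`, `WriteDecision`, and both path lists are hypothetical, and the lists only mirror the paths named in the bullets.

```typescript
type WriteDecision = "allow" | "deny";

// Paths taken from the boundary bullets; everything else is denied by default.
const PREFERENCE_FILES = [".sf/PREFERENCES.md", ".sf/preferences.md"];
const OPERATIONAL_FILES = [".sf/sf.db"]; // ignored operational state, never staged

function checkProjectWrite(relativePath: string, userAskedForPreferences: boolean): WriteDecision {
  // Normalise separators and a leading "./" so comparisons are stable.
  const path = relativePath.replace(/\\/g, "/").replace(/^\.\//, "");
  if (OPERATIONAL_FILES.indexOf(path) !== -1) return "allow";
  if (PREFERENCE_FILES.indexOf(path) !== -1) return userAskedForPreferences ? "allow" : "deny";
  // Covers .agents/skills/, harness/, gates/, CI and repo spec files.
  return "deny";
}
```

A default-deny shape means a later Harness Writer phase would have to widen the allow-list explicitly, rather than rely on a block-list staying complete.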
 | When | Flow addition | Why |
 |---|---|---|
 | Now, in docs | Define repo profiling, untracked observation, harness planning, eval/judge rig, and memory retention contracts. | Gives implementation a stable target. |
-| First code slice | Add read-only repo profile snapshot before planning. | Lets sf understand repo shape without taking ownership. |
+| First code slice | Add read-only repo profile snapshot before planning. | Lets sf understand repo shape without taking ownership or writing tracked files. |
 | Second code slice | Add post-unit evidence retention into `.sf/sf.db` and Singularity Memory. | Converts gate results into future guidance. |
-| Third code slice | Add harness proposal generation as a planning artifact. | Keeps generated files reviewable before write. |
+| Third code slice | Add harness proposal generation as a planning artifact. | Produces dry-run proposals only; no tracked repo files are written. |
 | Later | Inject harness/memory context into runtime prompts and workflow templates. | This changes agent behavior and needs regression fixtures. |
+| Explicit opt-in later | Enable Harness Writer for reviewed diffs. | Allows tracked harness files only when a unit plan claims them and the user accepts the diff. |
 ### Files, database, and memory
@ -43,7 +63,7 @@ Use all three layers, with separate responsibilities:
 | Layer | Role | Examples |
 |---|---|---|
-| Tracked repo files | Durable contract and executable harness | `SPEC.md`, `ARCHITECTURE.md`, `harness/manifest.json`, `harness/evals/*.jsonl`, `gates/*.sh`, CI workflow snippets |
+| Tracked repo files | Future durable contract and executable harness after explicit opt-in | `SPEC.md`, `ARCHITECTURE.md`, `harness/manifest.json`, `harness/evals/*.jsonl`, `gates/*.sh`, CI workflow snippets |
 | `.sf/sf.db` | Operational state and evidence ledger | repo profile snapshots, harness inventory, eval runs, gate results, drift events, untracked-file observations |
 | Singularity Memory | Cross-session knowledge | proven patterns, anti-patterns, recurring failures, repo-specific risk notes, judge calibration lessons |
@ -152,10 +172,11 @@ Detailed design is in `repo-native-harness-architecture.md`.
 | Stage | Work | Result |
 |---|---|---|
 | 1 | Add repo profile snapshots and untracked observation model. | sf understands repo shape without taking ownership. |
-| 2 | Add template kit registry and harness manifest format. | sf can generate reviewable harness files. |
+| 2 | Add template kit registry and harness manifest format. | sf can generate dry-run harness proposals without writing repo files. |
 | 3 | Add judge rig and eval suite runner. | AI and agent behavior becomes measurable. |
 | 4 | Connect evidence to Singularity Memory. | Patterns and anti-patterns improve future dispatch. |
-| 5 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo. |
+| 5 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo as proposals. |
+| 6 | Add explicit opt-in Harness Writer. | Reviewed repo diffs can create tracked harness files; repo-local skills remain out of scope unless separately accepted. |
 ## References