feat: add SF skills and subagent debate mode

Mikael Hugo 2026-04-29 17:43:30 +02:00
parent d02d33aa70
commit d78c5ac198
24 changed files with 3443 additions and 11 deletions


@@ -98,6 +98,13 @@ These came up during recent ports and refactor passes — tracked here so they d
| **Pi-mono SDK sync** | We pull from pi-mono directly (separate from gsd-2 sync stance). Periodically check `pi-mono/main` for SDK improvements worth taking. The remote is set up; cadence is not. | 3 | recurring |
| **Caveman input-side compression** (manual) | Caveman skill installed (output compression, ~75% fewer agent tokens). Input side — sf's own prompts (`execute-task.md`, `discuss.md`, `plan-*.md`, etc.) — is verbose: 10-step instruction lists, `runtimeContext`, `memoriesSection`, `taskPlanInline`, `slicePlanExcerpt`. Manually rewrite the heaviest sections in caveman style (preserve intent + nuance, drop fluff). Test against current to confirm no quality regression. | 2 | 1-2 days |
| **Runtime input preprocessor** (caveman-compress) | Add a transformation step in dispatch that pipes sf's rendered prompt through `caveman-compress` (sub-skill in juliusbrussee/caveman repo, ~46% input-token reduction) before LLM call. Only enable when a `terse_prompts: true` preference is set. Adds a layer that can drift from authored intent — needs a comparison harness. | 3 | 3-4 days |
| **Swarm chat / debate mode** for `subagent` tool | Today `subagent({ tasks: [...] })` runs parallel fire-and-forget — adversarial reviewers never engage each other's strongest defence. Add `mode: "debate"` + `rounds: N` so each task sees prior rounds' outputs. See [ADR-011](docs/dev/ADR-011-swarm-chat-and-debate-mode.md) — Option A (round-robin debate) first, Option C (full inbox-based swarm chat) after the persistent-agent layer (SPEC §17-18) lands. | 2 | 1 week (Option A); ~3 weeks (Option C, depends on persistent-agent layer) |
| **Singularity Knowledge + Agent Platform (Go re-platform)** | Re-platform Singularity Memory from Python+FastAPI+Postgres+vchord to Go on Charm: charm-server patterns for auth/identity, fantasy as agent runtime, same Postgres+vchord for retrieval, exact wire-contract preserved. Load-bearing for cross-instance knowledge federation AND future central persistent agents (sf SPEC §17). See [ADR-014](docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md) and [`singularity-memory/MIGRATION.md`](https://github.com/singularity-ng/singularity-memory/blob/main/MIGRATION.md). | 1 | ~12 weeks across phases |
| **Wire sf to Singularity Memory remote-mode** | sf-side: change `memory-store.ts` provider chain from local-SQLite-only to remote-Singularity-Memory → embedded → local-only fallback. Once wired, ~80% of the "should sf instances interlink?" question (ADR-012) is answered for free. Depends on the platform itself being live. | 1 | 1 week post-platform |
| **sf-worker SSH host** | Build the Go-based SSH worker host for distributed execution (SPEC §22, NEW): `wish` + `xpty`/`conpty` + `promwish`. Orchestrator dispatches over SSH; worker spawns the agent in a real pty per attempt; Prometheus metrics for free. See [ADR-013](docs/dev/ADR-013-network-and-remote-execution.md). | 2 | ~3 weeks |
| **Charm TUI client (`sf-tui`)** | Build a new Go-based TUI client on `pony` + `ultraviolet` + `bubbles` + `lipgloss` + `glamour` + `huh` + `harmonica` + `x/mosaic`. Talks to sf daemon over RPC. Two-stage replacement of `pi-tui`: ship parallel as `sf --tui=charm`, reach parity, flip default, delete `pi-tui` (sheds ~10k LOC of TS from sf core). See [ADR-017](docs/dev/ADR-017-charm-tui-client.md). | 2 | ~12-16 weeks across stages |
| **Flight recorder** (`x/vcr`) | Frame-accurate session recording for sf auto-loop dispatches. Go service using `charmbracelet/x/vcr`. Records to `.sf/recordings/{unit-id}.vcr`; `sf replay <unit-id>` opens TUI player. Frame-level redaction parity with `event-log.jsonl`. See [ADR-015](docs/dev/ADR-015-flight-recorder.md). | 3 | ~3 weeks |
| **Multi-instance federation (other surfaces)** | Federated benchmarks, federated persistent agents, cross-repo unit graph — all deferred. Decide ride-Singularity-Memory vs separate service for benchmarks after §16 lands and we observe duplicated discovery cost. Cross-repo orch is out-of-scope for sf (meta-coordinator territory). Federated agents wait until concrete pain shows up. See [ADR-012](docs/dev/ADR-012-multi-instance-federation.md). | 3 | depends on which surface — re-scope after Singularity Memory lands |
It is opinionated. Each item has a tier and a one-line rationale. Reorder freely.

docs/SPEC_FIRST_TDD.md Normal file

@@ -0,0 +1,274 @@
# sf Spec-First TDD
The change-method constitution for sf. Terse and procedural — optimized for agent retrieval.
## Purpose
Every change in sf must:
1. solve a real system need
2. preserve or increase system value
3. clarify behavior before implementation
4. make tests define the contract
5. find and close gaps in what already exists
Priority: **purpose > value > contract > working code**.
If purpose and value are clear but implementation is uncertain, write contract tests first and align code to them.
## Iron Law
```
THE TEST IS THE SPEC. THE JSDOC IS THE PURPOSE. CODE EXISTS TO FULFILL PURPOSE.
NO BEHAVIOR CHANGE WITHOUT A FAILING TEST FIRST.
NO COMPLETION WITHOUT A REAL CONSUMER.
NO JUDGMENT CALL WITHOUT A CONFIDENCE AND FALSIFIER.
```
**The test is the spec** — not verification of the spec. Tests describe what the software MUST do, not what it happens to do. A test that mirrors implementation rubber-stamps bugs.
**The JSDoc is the purpose** — every exported function, type, and class opens with a one-line `Purpose:` statement. If you can't write the purpose before the code, you don't know what you're building. Purpose drives what the test asserts. Code without a stated purpose cannot be verified.
**Code exists to fulfill purpose** — not to compile, not to pass lint, not to look clean. Quality measure: does it satisfy the purpose (JSDoc) as verified by the spec (test)? Code that compiles but doesn't serve its stated purpose is a bug.
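A minimal sketch of the convention, using a hypothetical helper (not from sf's codebase):

```typescript
/**
 * Purpose: return the single live lease holder for a unit, or null when no
 * unexpired claim exists — callers must never see a stale or ambiguous holder.
 */
export function activeHolder(
  claims: { holder: string; claimUntil: number }[],
  now: number,
): string | null {
  // Drop expired claims; only an unambiguous single live claim yields a holder.
  const live = claims.filter((c) => c.claimUntil > now);
  return live.length === 1 ? live[0].holder : null;
}
```

The `Purpose:` line is written first; the test then asserts exactly that claim, not the filter call inside.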
### Purposeful tests vs. mechanical tests
| Kind | Asserts | Survives refactor? |
|---|---|---|
| **Purposeful** | "claim() returns rows_affected=1 only when the lease was free or expired" | yes |
| **Mechanical** | `mockDb.update.calls.length === 1` | no |
Write purposeful tests first. They are the spec. A different implementation that passes them is equally correct. Add mechanical tests only as labelled implementation guards for specific failure modes (resource leaks, infinite loops).
### Three-tier test organization
1. **Behaviour contracts** (primary) — what the consumer receives. The spec.
2. **Degradation contracts** — what happens when dependencies fail. Consumer must always get a useful response; failure must degrade, not crash.
3. **Implementation guards** (secondary, labelled) — protect against specific failure modes. A refactor that changes internals updates guards, not behaviour contracts.
## Decomposition Path
`Vision (SPEC.md / VISION.md) → Milestone → Slice → Task → contract test → code → evidence`
Reject: `prompt → files → hope`.
Every unit (milestone, slice, task) sits in one of those rows. If a piece of work doesn't, it is unspecified.
## Purpose Gate
Every artifact (slice plan, task plan, function, test, ADR) must answer:
- **why** this behaviour exists
- **what value** it creates or protects
- **who** uses it in production (real consumer, not just tests)
- **what breaks** if it returns the wrong answer
If any answer is missing: `BLOCKED: purpose unclear — [which field is missing]`. Do not invent a plausible purpose to proceed. Surfacing the gap is more valuable than rationalising past it.
Treat the contract as a **falsifiable hypothesis**: name the evidence that would prove it wrong before implementation locks in. A contract without a falsifier is half a contract.
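A filled-in gate might read (hypothetical example, reusing the `claim()` contract from the test tables in this doc):

```
WHY: claim() must reject takeover of a live lease, else two workers run the same unit.
VALUE: protects exactly-once dispatch; duplicate attempts waste tokens and corrupt evidence.
WHO: the auto-dispatch loop (verified via rg, not assumed).
BREAKS IF WRONG: concurrent attempts commit conflicting changes to the same unit.
FALSIFIER: this contract is wrong if two concurrent claim() calls both report rows_affected=1.
```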
## Workflow (mapped to sf's phase machine)
### Research phase — name the problem
Before any plan:
- Where does this sit in `SPEC.md` / `VISION.md` / `REQUIREMENTS.md`?
- Why is it useful, who needs it, what does it enable?
- What breaks if wrong, what is out of scope?
For brownfield changes, **consumer discovery precedes purpose articulation.** Use `rg` / `git grep` to find real callers — never assume. You cannot reason about "what breaks" until you know who calls the code.
```bash
rg -nF "functionName" src/ packages/ --type=ts
git grep -n "functionName"
```
If you can't name a real consumer, stop. Don't add code yet.
### Plan phase — clarify before deciding
Clarify highest-impact unknowns first: behaviour, acceptance criteria, data invariants, failure handling, security, integration boundaries.
For non-trivial contracts, pressure-test before locking the plan via the [`advisory-partner`](../src/resources/extensions/sf/skills/advisory-partner/SKILL.md) skill — this is sf's adversarial review surface, already wired into the Q3/Q4 gates and `validate-milestone`. It runs with the **validation** model, distinct from the planning/execution model — that's the point.
1. **Advocate pass** — strengthen the best version of the contract.
2. **Challenger pass** — attack assumptions AND propose an alternative. A challenger anchored to the advocate's framing is not adversarial.
3. **Falsifier (required gate, blocks Plan→Execute):** `FALSIFIER: this contract is wrong if [specific observable condition].` Generic falsifiers ("wrong if it doesn't work") are process failures.
**Find the devil and find the experts:**
- **Devil** — finds the specific failure that compounds silently: wrong assumption → wrong test → wrong code → wrong evidence, all passing.
- **Experts** — domain specialists who know what right looks like. Pick expertise matching the decision: SRE (reliability), security (trust boundary), distributed systems (consistency), API reviewer (ergonomics).
Both forces must act on the contract before it becomes tests. One strong pass each, unless concrete risk remains.
### Plan from contracts, not files
**Purpose re-check:** restate purpose from the Research step in one sentence. If the plan now serves a different purpose, the contract drifted — go back.
Each behaviour slice defines: consumer, contract, code path, validation, falsifier.
| Good | Bad |
|---|---|
| Add failing test proving `claim()` rejects expired-lease takeover when `claim_until > now()`. | Edit `src/resources/extensions/sf/auto-dispatch.ts`. |
### TDD phase — write the test first
1. Write the failing test.
2. Make it fail for the **right** reason (feature missing, not typo).
3. Only then write production code.
**Purpose re-check:** does this test prove behaviour serving the stated purpose?
Test types:
| Behaviour | Test type |
|---|---|
| Pure logic, local invariants | Unit |
| Interface/schema contracts | Contract |
| Storage, orchestration, multi-component | Integration |
| Existing behaviour you must preserve | Characterisation |
| State machines, routing, normalisation | Property/invariant |
Test naming: `test_<what>_<when>_<expected>` or describe-blocks structured the same way. The name **is** the contract claim.
```bash
node --test --experimental-test-isolation=process dist-test/path/to/file.test.js
```
If it passes immediately, you're testing existing behaviour. Fix the test.
### Execute phase — minimal production code
Smallest change that makes the spec (test) green while serving the purpose (JSDoc). Nothing more. No YAGNI violations, no surrounding cleanup.
Do not weaken the test to fit sloppy code — fix the code. Code that compiles and passes lint but doesn't fulfil its stated purpose is a bug.
### Verify phase — green, lint, type-check
```bash
npm run typecheck:extensions
npm test
```
All tests green. Zero lint/type errors. Then refactor while green.
### Review phase — verify usefulness
**Purpose re-check (final):** does the code serve a real production consumer?
Verify: who calls it (`rg` for usages), what production path depends on it, what signal would reveal breakage. **If only tests call it, it is not finished or not needed.**
**Falsifier follow-through:** re-check the falsifier from the Plan phase. If the falsifier is observable post-deploy, add it to monitoring or to the unit's verification commands. A falsifier that is never checked after deploy is half a contract.
**Zero callers ≠ zero purpose.** Before deleting: does it serve an unmet need (wire it in) or is it superseded (delete it)? Never test for absence of old code — test that new behaviour works.
### Confidence Gate (between phases)
After completing a step, state confidence as a number `0.0-1.0` and a one-line reason. The number forces a pause to assess rather than ploughing ahead on momentum.
| Step | Threshold | Below threshold |
|---|---|---|
| Purpose & consumer | 0.95 | Run an adversarial review wave (advisory-partner Q3/Q5). |
| Contract test | 0.90 | Adversarial review wave. |
| Implementation | 0.95 | Add a specialist reviewer for the touched boundary (e.g. provider/transport/security). |
| Final evidence | 0.97 | Full adversarial: advocate + challenger + specialist. |
Skip the gate for trivial steps (typo fix, exhaustive matches with full coverage). The gate earns its keep on I/O boundaries, async loading, protocol integration, and anything touching real backends or models.
LLM confidence numbers are poorly calibrated in absolute terms — the *relative* signal matters. If you write 0.7, you know you're guessing. Act on that.
## Tests Find Gaps
Testing existing code is one of the highest-value activities sf can do. A test that reveals an existing gap is more valuable than one validating new code — the gap was compounding in production.
High-value gap tests:
- **Purpose** — does this module do what its JSDoc claims?
- **Fallback** — does failure surface or get masked?
- **Persistence** — does state survive restart? (especially `.sf/sf.db`, `.sf/runtime/*.json`)
- **Boundary** — what happens at empty input, max value, network partition, expired claim?
- **Contract** — does the caller get what it expects?
When a test fails against existing code, fix the code. The test told you what was broken.
50 tested features > 500 untested ones.
## Test Rules
- **Test first.** Without it, you mirror implementation — bugs and all.
- **Bug = missing correct-behaviour test.** Write a test for the *correct* behaviour first; it must fail (RED) because the bug exists. If it passes immediately, the test is wrong (testing the broken behaviour) — fix the test, not the code.
- **Bug reports → failing regression test first.**
- **Behaviour change without tests is incomplete.**
- **Bad tests produce bad code.** A test validating silent failure is wrong — rewrite it.
- **Test through the public contract.** Don't expose `_helpers` for testability; assert through real callers.
- **Tests pin behaviour, not internal decomposition.** A test that breaks on refactor without behaviour change is mechanical, not purposeful.
- **Critical invariants may need property tests, not just examples** (e.g. ULID monotonicity, claim race, idempotent migrations).
- **Fix code to satisfy live-contract tests. Fix or delete tests encoding stale behaviour.**
- **Fallbacks must deliver working behaviour or not exist.** A fallback that silently returns nothing is worse than none.
## Test Boundaries
- Test through the public contract that production consumers use.
- Do not promote `_helper` to `helper` for testing convenience.
- Assert through public methods, not implementation detail.
- Tests pin behaviour, not internal decomposition.
- For Node.js native test runner: `async` test functions and `await`; never call `.then()`/`.catch()` chains in test bodies when `await` expresses the same contract.
## Self-Modification Boundary
sf modifies its own codebase via the auto-loop. Without a protected zone, constitutional drift is silent.
**Protected files (human approval required):**
`SPEC.md`, `BUILD_PLAN.md`, `UPSTREAM_PORT_GUIDE.md`, `AGENTS.md`, `CLAUDE.md`, `CONTRIBUTING.md`, `docs/SPEC_FIRST_TDD.md`, every `docs/dev/ADR-*.md`.
Autonomous agents may propose changes but must not merge to these without human review.
**Test infrastructure** (`tests/`, `*.test.ts`, `tsconfig*.json`, lint config) requires advocate/challenger/falsifier — a change to test infra can make all future tests pass vacuously. Treat test-infra changes as governance-adjacent: they alter the validity of every test that runs after them. A corrupted test runner is more dangerous than a corrupted test.
## Evidence
Required for production-impacting changes:
- failing test → passing test → type-check → lint
- advocate's strongest support, challenger's strongest opposition, falsifier + outcome
- runtime evidence: traces (`.sf/traces/`), event log (`.sf/event-log.jsonl`), gate results
- for non-trivial runtime/provider fixes: explicit repro before code, solved boundary after code
Persist learning: when a unit produces a gotcha or anti-pattern, write to sf's memory store (`memories` table) so the next unit sees it. Evidence that only lives in the conversation dies on restart.
## Degraded Operation
| Dependency down | Behaviour |
|---|---|
| Native engine (`forge_engine.node`) | Fall back to JS implementations; log degraded mode. Never silently proceed without confirming fallback path is wired. |
| `node:sqlite` and `better-sqlite3` both unavailable | Filesystem-derived state (no DB); log degraded discovery. Block any operation that requires durable state. |
| LLM provider | Try next allowed provider per `~/.sf/preferences.md`; if exhausted, halt unit with `ErrModelUnavailable` (no silent skip). |
| SOPS unavailable | Use already-exported env vars; log that secret refresh is unavailable. Block secret-touching commands. |
When a dependency is down: operate in defined degraded mode or stop. Never silently proceed.
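The LLM-provider row above can be sketched as a loop; the `Provider` and logging shapes here are illustrative, with only `ErrModelUnavailable` named by the table:

```typescript
// Sketch of "try next allowed provider; if exhausted, halt — no silent skip".
class ErrModelUnavailable extends Error {}

type Provider = { name: string; call: (prompt: string) => Promise<string> };

async function dispatchWithFallback(
  providers: Provider[], // already filtered to the allowed set from preferences
  prompt: string,
  log: (msg: string) => void,
): Promise<string> {
  for (const p of providers) {
    try {
      return await p.call(prompt);
    } catch (err) {
      // Degraded mode is logged, never swallowed silently.
      log(`provider ${p.name} failed, trying next: ${String(err)}`);
    }
  }
  // Exhausted: halt the unit rather than proceed without a model.
  throw new ErrModelUnavailable("all allowed providers exhausted");
}
```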
## Task Template
Each task:
**Purpose** (need + why) → **Consumer** (who depends) → **Contract** (test proving it) → **Implementation** (code changes) → **Evidence** (test + lint + runtime signal).
If a task cannot be described this way, it is underspecified.
## See Also
- [`AGENTS.md`](../AGENTS.md) — repo guidelines, build/test/lint commands.
- [`SPEC.md`](../SPEC.md) — sf v3 specification (what we're building).
- [`UPSTREAM_PORT_GUIDE.md`](../UPSTREAM_PORT_GUIDE.md) — porting from pi-mono / gsd-2.
- [`src/resources/extensions/sf/skills/advisory-partner/SKILL.md`](../src/resources/extensions/sf/skills/advisory-partner/SKILL.md) — adversarial review framework.
- [`src/resources/extensions/sf/skills/code-review/SKILL.md`](../src/resources/extensions/sf/skills/code-review/SKILL.md) — multi-lens review skill.
## References
- GitHub Spec Kit — spec-first authoring patterns.
- Ousterhout, *A Philosophy of Software Design* — deep modules, contract pattern.
- Trail of Bits — anti-rationalisation rules.
- ACE — original Iron Law / Purpose Gate framing this doc adapts.


@@ -0,0 +1,140 @@
# ADR-011: Swarm chat and debate mode for ephemeral subagents
**Date**: 2026-04-29
**Status**: proposed (deferred — capture for future implementation)
## Context
sf's `subagent` tool today dispatches one or more subagents in **parallel fire-and-forget** mode (`subagent({ tasks: [...] })`). All tasks run concurrently; none see each other; the parent collects results and synthesises.
This is sufficient for many cases (parallel research, parallel gate evaluation), but it has a structural gap for **adversarial review** and **multi-stakeholder negotiation**:
- An advocate's strongest defence never gets stress-tested by the challenger — they fire monologues in parallel.
- A multi-stakeholder swarm (the canonical Vision Alignment Meeting roles in `plan-milestone`: PM, User Advocate, Combatant, Architect, …) never actually negotiates; each issues a verdict the parent then weighs.
- The parent is the only synthesiser — there's no convergence dynamic among the subagents themselves.
The user asked whether agent-to-agent communication could happen inside ephemeral swarm tasks, sharing the chat machinery rather than waiting for the long-lived persistent-agent layer (SPEC §17-18) to land.
## Decision
**Defer.** Capture the design in this ADR and a `BUILD_PLAN.md` row. Implement after the persistent-agent layer (`agents`, `agent_messages`, `agent_inbox`, `send_message` tool) lands as a NEW tier, since 90% of the machinery is shared. Implement Option A (debate mode) first as a forcing function — once we see how much real debate improves outcomes, the case for full swarm-chat (Option C) writes itself.
## Alternatives Considered
### Option A — Round-robin debate mode (RECOMMENDED first)
Add `mode: "debate"` and `rounds: N` to the `subagent` tool. Each round, every task sees the previous round's outputs.
```
subagent({
mode: "debate",
rounds: 3,
tasks: [
{ id: "advocate", model_tier: "validation", prompt: "Make case for X. ..." },
{ id: "challenger", model_tier: "validation", prompt: "Attack X. ..." }
]
})
```
- **Cost**: `rounds × tasks` tokens.
- **Determinism**: still reasonable — outputs are sequenced deterministically per round.
- **Fit**: best for adversarial review where the challenger should engage with the advocate's strongest defence. Minor extension of the existing `subagent` contract.
- **Why not**: doesn't support free-form many-to-many messaging. Each task speaks once per round in a fixed order.
- **Why first**: smallest change, biggest immediate quality win, reusable as a primitive.
**Effort**: ~1 dev-week. Touches: `subagent` tool definition, dispatch path in pi-coding-agent, new test cases, `dispatching-subagents` skill section, possibly `advisory-partner` skill update.
### Option B — Shared scratchpad
Subagents share a JSON scratchpad written between turns. Each subagent reads what the others wrote, appends, hands off.
- **Pros**: state is explicit and auditable; low protocol complexity.
- **Cons**: feels mechanical — agents don't "talk", they write to a buffer. No spontaneous response.
- **Verdict**: rejected. If we're going to add inter-agent state, do it as messaging (Option A or C), not a buffer.
### Option C — Ephemeral swarm with inbox (long-term target)
Reuse the persistent-agent infrastructure from `SPEC.md` §17-18 (`agent_inbox`, `agent_messages`, `send_message` tool — currently NEW, not implemented) but scope each ephemeral swarm by `swarm_id` with a TTL. Swarm agents can `send_message` to each other freely during the task; on `synthesize()`, the swarm's rows get archived.
```
swarm({
ttl_ms: 600_000,
agents: [
{ id: "pm", model_tier: "planning", system: "..." },
{ id: "user", model_tier: "validation", system: "..." },
{ id: "combatant", model_tier: "validation", system: "..." },
{ id: "architect", model_tier: "validation", system: "..." }
],
initial: { from: "moderator", to: "all", content: "Roadmap proposal: ..." }
})
```
- **Pros**: open negotiation; most powerful for multi-stakeholder Vision Alignment Meeting; reuses persistent-agent machinery.
- **Cons**: path-dependent (harder to reproduce); harder to budget tokens; swarm convergence isn't guaranteed without a moderator. Depends on the persistent-agent layer landing first.
- **Verdict**: target end state; not first.
## Consequences
**Positive**
- **Higher-quality adversarial review** — the challenger actually engages the advocate's strongest defence, instead of issuing a parallel monologue.
- **Multi-stakeholder negotiation** — the Vision Alignment Meeting becomes a real meeting, not a parallel survey.
- **Reusable primitive** — debate mode can be invoked from any skill that today does `subagent({ tasks: [advocate, challenger] })` (currently `advisory-partner`, `brainstorming`, `requesting-code-review`).
**Negative**
- **Cost grows linearly with rounds.** A 3-round debate is 3× the tokens. Budget gates need updating in `auto-budget.ts` so debate dispatches don't silently blow past the per-unit ceiling.
- **Determinism drops.** A fire-and-collect batch is reproducible from prompts alone; a debate is path-dependent. Trace recording becomes more important — `.sf/traces/` must capture each round.
- **Synthesis complexity rises** — the parent must summarise a debate transcript, not just collect verdicts. The synthesis prompt itself becomes a tunable artefact.
**Risks and mitigations**
- *Risk:* runaway debate — agents loop without converging.
- *Mitigation:* hard `rounds` cap; convergence heuristic (stop when no new claims appear in a round).
- *Risk:* one agent dominates and silences the others.
- *Mitigation:* moderator role injects a turn-order constraint; per-agent token budget within a round.
- *Risk:* debate quality is only marginally better than parallel-fire-and-collect.
- *Mitigation:* A/B harness — run both modes on the same fixture set, compare verdict accuracy on a benchmark of known good/bad designs. If the lift is < 10% accuracy, defer Option A indefinitely.
## Out of Scope
- **Persistent inter-agent messaging across runs** — covered by SPEC §17-18 (`agent_inbox`, `agent_messages`); orthogonal to ephemeral swarms.
- **Cross-session swarm replay** — a swarm session, once archived, is read-only. No "fork from round 2" support in v1.
- **Human-in-the-loop debate** — swarms are agent-to-agent only. If the user wants to inject a turn, that's a different surface (the existing `discuss` flow).
## Implementation Sketch (Option A first)
1. Extend `subagent` tool:
- Add `mode` field: `"parallel"` (default, current behaviour) | `"debate"`.
- Add `rounds` field (required when `mode = "debate"`, default `2`, max `5`).
2. In the dispatch layer (pi-coding-agent / sf adapter):
- For `mode = "debate"`: maintain an in-memory transcript per swarm. Each round, render `previous_rounds_transcript` as a context block and append it to each task's prompt.
- Per-round trace span: `swarm.<id>.round.<n>.task.<id>` so `.sf/traces/` reflects the structure.
3. Synthesis prompt:
- When all rounds complete, the parent receives the full transcript plus a synthesis directive: "summarise the strongest claim, the strongest objection, the convergence (if any), and the residual disagreement."
4. Budget gate:
- `auto-budget.ts` needs to multiply the projected cost by `rounds` before approving the dispatch.
5. Tests:
- Unit test: a 2-round debate produces a transcript with 4 turns (2 tasks × 2 rounds).
- Integration test: an advocate/challenger pair on a known weak design — verify the falsifier surfaces by round 3 (vs. parallel mode where it doesn't).
6. Skill updates:
- `advisory-partner` — add "for non-trivial reviews, consider `mode: 'debate'` over parallel fire".
- `brainstorming` Step 5 — same.
- `dispatching-subagents` — add a "debate mode" pattern between Pattern 2 and Pattern 3.
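Steps 1-3 above can be sketched as a single round-robin loop; the task and model-call shapes here are illustrative, not the real `subagent` contract:

```typescript
// Minimal sketch of the debate dispatch loop: each round, every task sees a
// rendered transcript of all prior rounds appended to its prompt.
type DebateTask = { id: string; prompt: string };
type Turn = { round: number; taskId: string; output: string };

async function runDebate(
  tasks: DebateTask[],
  rounds: number, // required in debate mode; capped (e.g. max 5) by the tool schema
  callModel: (prompt: string) => Promise<string>,
): Promise<Turn[]> {
  const transcript: Turn[] = [];
  for (let round = 1; round <= rounds; round++) {
    // Render prior rounds as a context block (the previous_rounds_transcript).
    const context = transcript
      .map((t) => `[round ${t.round}] ${t.taskId}: ${t.output}`)
      .join("\n");
    for (const task of tasks) {
      const prompt = context
        ? `${task.prompt}\n\nPrior rounds:\n${context}`
        : task.prompt;
      transcript.push({ round, taskId: task.id, output: await callModel(prompt) });
    }
  }
  return transcript; // parent synthesises from this full transcript
}
```

A 2-task, 2-round run yields a 4-turn transcript, matching the unit-test contract in step 5; budget projection multiplies per-dispatch cost by `rounds × tasks` before approval.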
## Sequencing
| When | Why |
|---|---|
| Persistent-agent layer scoped (`SPEC.md` §17 NEW → IN PROGRESS) | Most of Option A's machinery (transcript persistence, message scoping) overlaps. |
| Option A implemented | Forcing function — observe quality lift on adversarial reviews. |
| Six months of Option A in production | Decide whether Option C (full swarm-chat with inbox) is worth the build. |
## References
- `docs/SPEC.md` §17 (Persistent Agents) — defines `agents`, `agent_memory_blocks`, `agent_messages`, `agent_inbox`.
- `docs/SPEC.md` §18 (Inter-Agent Messaging) — defines `send_message` tool. Currently NEW (not implemented).
- `src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md` — current parallel-only contract.
- `src/resources/extensions/sf/skills/advisory-partner/SKILL.md` — primary consumer of adversarial dispatch today.
- `src/resources/extensions/sf/prompts/gate-evaluate.md` — pre-execution Q3/Q4 gates.
- `src/resources/extensions/sf/prompts/validate-milestone.md` — post-execution 3-reviewer pattern.


@@ -0,0 +1,110 @@
# ADR-012: Multi-instance federation — when sf instances interlink
**Date**: 2026-04-29
**Status**: proposed (deferred — capture for future implementation)
## Context
sf today is **per-project**: each project has its own `.sf/sf.db`, and a single daemon (`packages/daemon`) on a host serves all projects under its scan roots. As deployment grows beyond one host (laptop, `mikki-bunker`, `aidev`), the question arises: should sf instances on different hosts (or different projects on the same host) interlink? And if so, on which surfaces?
Without thought-out federation, instances repeatedly re-learn the same lessons — anti-patterns, model outages, provider quirks — wasting tokens and duplicating mistakes. With over-eager federation, sf inherits cross-host trust, schema-version, and latency problems it doesn't need yet.
This ADR maps the federation surfaces, takes a position on each, and sequences the work.
## Decision
**Defer most federation. Wire Singularity Memory first as the single load-bearing federation primitive; defer federated benchmarks, cross-repo orchestration, and federated agents until the pain is concrete.**
## Federation Surfaces
### Surface 1 — Knowledge (anti-patterns, learnings, contracts)
**Status:** designed in `SPEC.md` §16 — Singularity Memory (`sm`) is the explicit cross-instance knowledge layer. HTTP + MCP server holding memories, learnings, anti-patterns. Runs embedded (single-user sf) or remote (shared service on tailnet, reachable from sf, Hermes, OpenClaw, Claude Code, Cursor).
**Code reality:** not yet wired. `src/resources/extensions/sf/memory-store.ts` and `memory-extractor.ts` write to a local SQLite `memories` table. The spec's "remote-mode" isn't connected.
**Decision:** **wire it.** Singularity Memory is the load-bearing federation primitive. If Mikki learns "Provider X drops requests at 03:00 UTC", that anti-pattern should be reachable from any sf instance on the tailnet without re-learning. Once wired, ~80% of the "should they interlink?" question answers itself.
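The remote → embedded → local-only provider chain this wiring implies might look like the following best-effort loop (store shapes are illustrative, not `memory-store.ts`'s real interface; the degrade-to-empty behaviour matches the risk mitigations later in this ADR):

```typescript
// Sketch: recall falls through the chain on failure and ultimately returns
// empty recall rather than blocking dispatch — memory is best-effort.
type MemoryStore = { name: string; recall: (query: string) => Promise<string[]> };

async function recallBestEffort(
  stores: MemoryStore[], // e.g. [remote Singularity Memory, embedded, local SQLite]
  query: string,
  log: (msg: string) => void,
): Promise<string[]> {
  for (const store of stores) {
    try {
      return await store.recall(query);
    } catch (err) {
      // Degraded mode is logged; the next tier is tried.
      log(`memory store ${store.name} degraded: ${String(err)}`);
    }
  }
  return []; // empty recall; local scheduler state stays authoritative
}
```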
### Surface 2 — Benchmarks and circuit breakers
**Status:** per-DB today. `benchmark_results` and `circuit_breakers` tables live in each project's `.sf/sf.db`. One instance trips a breaker on `kimi-coding/k2p5`; another instance has to independently rediscover the outage.
**Decision:** **defer; revisit after Singularity Memory lands.** Two clean options when we revisit:
- **Ride Singularity Memory** — store benchmark observations as a memory category, recall as needed. Cheap; semantically clean (benchmarks ARE learning).
- **Separate thin HTTP service** — purpose-built benchmark aggregator with statistical smoothing and a publish/subscribe channel for circuit-breaker events.
The pain ceiling is bounded today (per-instance discovery is at worst a few wasted dispatches). Only build when concrete cost emerges.
### Surface 3 — Cross-project unit dependencies
**Status:** not designed. sf has no concept of "milestone in repo A produces an artefact repo B depends on". The unit hierarchy (milestone → slice → task) is project-local.
**Decision:** **out of scope for sf.** Cross-repo orchestration is a different abstraction layer — it belongs in a meta-coordinator that consumes sf's MCP API, not in sf itself. Building it inside sf would conflate "agent that ships one project" with "fleet manager that ships an org's roadmap." Different products.
### Surface 4 — Federated persistent agents
**Status:** not designed. `SPEC.md` §17 (NEW) introduces persistent agents, but scopes them to a single project's DB.
**Decision:** **defer.** Per-instance for v3. If Mikki has a "code-reviewer" persistent agent, it lives in Mikki's DB. Federation requires:
- Cross-host auth (who can wake whose agents).
- Agent-state schema versioning (instances may run different sf versions).
- Leader-election story for shared-agent updates.
- A migration path from per-instance → federated.
None of this earns its keep until we have a concrete use case where one agent should genuinely serve multiple projects/hosts. Premature now.
### Surface 5 — Distributed execution (clarifying note, not federation)
**Status:** spec'd in `SPEC.md` §22 (NEW); not built. SSH workers — one daemon dispatches units to remote worker hosts.
**Decision:** **clarify that this is NOT federation.** Distributed execution = one daemon owns many workers (parallel scaling). Federation = many daemons share state across hosts (knowledge sharing). Different problems. The spec already separates them; this ADR just affirms the line.
## Consequences
**Positive (after Singularity Memory lands)**
- **Knowledge sharing without re-learning** — anti-patterns, gotchas, contract findings reachable across hosts and other agent products on the tailnet.
- **Lower per-instance cost** — fewer wasted dispatches re-discovering provider quirks.
- **Reusable for non-sf agents** — Hermes, Claude Code, Cursor can also read/write Singularity Memory, so the network effect grows beyond sf.
**Negative**
- **Tailnet dependency** — when remote-mode Singularity Memory is configured, tailnet outage degrades sf to local-only. Mitigation: spec already allows embedded (in-process) mode; remote is opt-in.
- **Cross-instance prompt-injection surface** — a malicious memory written by one instance could leak into another's recall. Mitigation: Singularity Memory MUST track provenance per memory and let consumers filter by trusted source. Capture as a sub-ADR if/when implemented.
- **Schema versioning across instances** — different sf versions accessing the same memory store. Mitigation: memory schema must be append-only and additive; new fields are optional reads.
**Risks and mitigations**
- *Risk:* Singularity Memory becomes a bottleneck — sf can't dispatch when memory is down.
- *Mitigation:* sf MUST treat memory as best-effort. A memory-fetch failure logs degraded-mode and proceeds with empty recall. Local SQLite stays as the authoritative scheduler state (per `SPEC.md` §3).
- *Risk:* federated benchmarks make sf overconfident in stale data.
- *Mitigation:* every benchmark observation carries `recorded_at` and `host`. Consumers weight by recency and reject stale data older than `circuit_breaker_resets_at + N`.
- *Risk:* cross-instance attacker plants poisoned anti-patterns to steer agent behaviour.
- *Mitigation:* same as the prompt-injection mitigation above — provenance + trusted-source filter, plus rate-limiting per writer.
## Out of Scope
- **Cross-repo unit graph** — meta-coordinator territory.
- **Federated persistent-agent fleets** — defer until concrete pain.
- **Multi-tenant Singularity Memory** — current design assumes a single-user-or-team trust domain. Multi-tenant is a separate product.
- **Auto-sharding sf instances** — sf is one daemon per host; we don't horizontally split a single host's daemon.
## Sequencing
| When | Action |
|---|---|
| Tier 1+ (next 1-3 months) | Wire Singularity Memory remote-mode in `memory-store.ts`. Provider chain fallback: remote → embedded → local-only. Update `SPEC.md` §16 status from PARTIAL to EXISTS once landed. |
| After Singularity Memory in production for 1+ month | Decide whether to ride it for benchmarks (Surface 2) or build a separate service. Decision driven by observed cost of duplicated benchmark discovery. |
| If/when concrete cross-instance agent pain shows up | Reopen Surface 4 (federated persistent agents). Don't pre-build. |
| Never in sf | Surface 3 (cross-repo unit deps) — that's a separate product. |
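The remote → embedded → local-only provider chain in the Tier 1+ row can be sketched as a best-effort fallback. Names (`MemoryProvider`, `recallWithFallback`) are illustrative; the real chain lives in `memory-store.ts` (TypeScript), and Go is used here only for consistency with the other sketches in these ADRs.

```go
package main

import (
	"errors"
	"fmt"
)

// MemoryProvider abstracts one recall backend (remote, embedded,
// or local-only); the name is illustrative.
type MemoryProvider interface {
	Recall(query string) ([]string, error)
}

type failing struct{}

func (failing) Recall(string) ([]string, error) { return nil, errors.New("tailnet down") }

type static struct{ hits []string }

func (s static) Recall(string) ([]string, error) { return s.hits, nil }

// recallWithFallback walks the chain and treats every failure as
// best-effort: the last resort is an empty recall, never a blocked
// dispatch (local SQLite stays the authoritative scheduler state).
func recallWithFallback(chain []MemoryProvider, query string) []string {
	for _, p := range chain {
		if hits, err := p.Recall(query); err == nil {
			return hits
		}
		// a real implementation would log degraded-mode here
	}
	return nil // degraded mode: proceed with empty recall
}

func main() {
	chain := []MemoryProvider{failing{}, static{hits: []string{"anti-pattern: k2p5 truncates JSON"}}}
	fmt.Println(recallWithFallback(chain, "k2p5"))
}
```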
## References
- `docs/SPEC.md` §16 — Singularity Memory (Knowledge Layer).
- `docs/SPEC.md` §17-18 — Persistent agents and inter-agent messaging (single-instance scope).
- `docs/SPEC.md` §22 — Distributed Execution (SSH workers — *not* federation).
- `src/resources/extensions/sf/memory-store.ts` — current local-only memory store.
- `packages/daemon/src/daemon.ts` — single-host daemon process.
- `docs/dev/ADR-011-swarm-chat-and-debate-mode.md` — related: ephemeral swarms within a single instance.

# ADR-013: Network and remote-execution layer
**Date**: 2026-04-29
**Status**: proposed (deferred — capture for staged execution)
## Context
sf today runs as a single daemon per host. Three forces push it toward a multi-host topology:
- **SSH workers** (`SPEC.md` §22, NEW): the orchestrator dispatches unit attempts to remote hosts (GPU, Windows, parallel scaling) — needs an SSH-served worker process.
- **Singularity Memory remote-mode** (ADR-012, ADR-014, sf SPEC §16): the cross-instance knowledge layer runs as a service on the tailnet, reachable from sf, Hermes, OpenClaw, Claude Code, Cursor.
- **Multi-instance federation** (ADR-012): future federated agents and benchmarks ride the same network substrate.
This ADR fixes the network and SSH-execution layer the above all depend on.
## Decision
- **Network substrate: tailnet** — Tailscale wire protocol with **Headscale** as the self-hosted control plane (the user already runs Headscale at `mikki-bunker`). sf core is wire-agnostic; it assumes addressable, authenticated peers.
- **SSH worker host stack: Go + `charmbracelet/wish` + `charmbracelet/x/xpty`** (Linux/macOS) and **`charmbracelet/x/conpty`** (Windows). One thin Go shim per worker host; orchestrator (TS) talks SSH stdio to it.
- **Worker observability: `charmbracelet/promwish`** — Prometheus middleware mounted on Wish gives `/metrics` for free.
- **Worker identity: `charmbracelet/x/sshkey` + `charmbracelet/melt`** — auto-provisioning + Ed25519-with-seed-words backup.
## Alternatives Considered
### Network substrate
- **Public internet + sshd + manual key management** — works, but key sprawl is a real problem (each new host adds N×M keys), and dynamic IPs break stable hostnames. Tailnet's MagicDNS + ACLs replace both. Rejected.
- **Plain WireGuard mesh** — no control plane; manual peer config. Higher ops overhead than Headscale. Rejected.
- **Tailscale-the-service** — fine, but Headscale is already running and self-hosted means full ownership. Rejected.
- **ZeroTier / Netbird** — viable alternatives. Rejected because the user already has Headscale and switching would gain nothing.
### SSH worker stack
- **Node-based SSH server (`ssh2` lib)** — keeps everything TS but reinvents what Wish gives for free; no battle-tested middleware patterns. Rejected.
- **OpenSSH `sshd` with `ForceCommand`** — works for simple cases, terrible for multiplexed agent dispatch with per-connection state. Rejected.
- **Plain Go `crypto/ssh`** — lower-level than Wish, no middleware, no built-in metrics. Rejected — Wish wraps the right primitives.
## Consequences
**Positive**
- sf's network model is **explicit**: tailnet first, ACLs in Headscale's admin, no per-service auth invention.
- SSH worker host inherits Wish's mature middleware (`wish/logging`, `wish/elapsed`, etc.) and `promwish` observability.
- Cross-platform pty support (`xpty` Linux/macOS, `conpty` Windows) lets workers spawn real ttys for the agent — load-bearing for Windows-only test runs on `mikki-bunker-windows`.
- Stable hostnames via Headscale's MagicDNS — `mikki-bunker.tailnet.ts.hugo.dk` resolves regardless of network changes.
- Identity story is clean: each worker host has its own Ed25519 keypair (`sshkey`), backed up via `melt` seed words.
**Negative**
- Tailnet dependency: when Headscale is down, *new* connections can't auth (existing connections survive). Mitigation: Headscale on a stable host with monitoring.
- Polyglot deployment: TS orchestrator + Go worker. One clean SSH-stdio boundary, but two languages to keep in CI. Acceptable per ADR-016 (parallel build).
- ACL drift: if Headscale ACLs forbid a worker host, sf degrades silently. Doctor-check should detect and surface this explicitly (see "Implementation Sketch" below).
**Risks and mitigations**
- *Risk:* SSH disconnect mid-turn produces zombie agent processes (SPEC §22.3).
- *Mitigation:* spec-mandated remote-cleanup script on disconnect; `--sf-run-id=<id>` marker on the agent process for `pgrep` / `kill`.
- *Risk:* `wish` API churn pre-1.0.
- *Mitigation:* pin a version; planned upgrade window once per quarter.
- *Risk:* `xpty` / `conpty` edge cases on niche shells.
- *Mitigation:* worker has a flag to fall back to non-pty stdio; logged loudly.
## Out of Scope
- **Multi-tenant network isolation** (one tailnet, multiple users with separate ACL domains) — defer until concrete need.
- **Public-internet exposure** — sf is tailnet-only by deployment recommendation. If a use case needs a public endpoint, it goes through `tailscale funnel` or a dedicated reverse proxy outside sf.
- **Cross-tailnet federation** — out of scope; one tailnet per deployment.
## Sequencing
| When | Action |
|---|---|
| Now | Capture this ADR as the deployment assumption. |
| Tier 1 (next 1-3 months) | Build sf-worker (Go + Wish + xpty/conpty + promwish) as a separate package or repo. Orchestrator-side dispatch path in TS already plans for `worker_host` per SPEC §22 — just point it at the SSH endpoint. |
| Tier 2 | Doctor check: validate tailnet ACL allows the orchestrator → all configured worker hosts. Surface failures in `sf doctor`. |
| Tier 3 | Worker auto-provisioning script: `sf worker bootstrap <host>` generates a key, registers with Headscale, drops the worker binary. |
## Implementation Sketch
```
[sf orchestrator (TS)] on the daemon host
│ ssh user@worker.tailnet.ts.hugo.dk -- carries sf-rpc envelope
[sf-worker (Go)] on each worker tailnet node
├── wish.Server with logging + elapsed + promwish middleware
├── per-connection handler spawns the agent via xpty/conpty
├── /metrics via promwish — scraped by your Prometheus
└── /healthz, /readyz simple HTTP for orchestrator health checks
```
The worker is **stateless** — claim, lease, retry, persistence are all the orchestrator's job (per SPEC §22). The worker just executes one attempt at a time and streams output.
## References
- `SPEC.md` §22 — Distributed Execution.
- `ADR-012` — Multi-instance federation (this ADR provides the substrate).
- `ADR-014` — Singularity Knowledge + Agent Platform (deploys onto this substrate).
- `ADR-016` — Charm AI stack adoption strategy (frames why Go for new services).
- `charmbracelet/wish` — SSH server framework.
- `charmbracelet/x/xpty`, `charmbracelet/x/conpty` — pty primitives.
- `charmbracelet/promwish` — Prometheus middleware for Wish.
- Headscale — open-source Tailscale control plane.

# ADR-014: Singularity Knowledge + Agent Platform stack
**Date**: 2026-04-29
**Status**: proposed (deferred — capture for staged execution)
## Context
`SPEC.md` §16 defines a cross-instance knowledge layer (Singularity Memory). `SPEC.md` §17-18 defines persistent agents and inter-agent messaging (status NEW). sf instances today carry their own local memory store (`memory-store.ts`); persistent agents are not implemented at all.
Two trajectories converge:
- **Knowledge federates** — anti-patterns, learnings, contracts should be reachable across sf instances and across other agent products on the tailnet (Hermes, OpenClaw, Claude Code, Cursor).
- **Persistent agents centralise** — long-lived cross-project agents (code-reviewer with cross-project memory, memory-curator, security-auditor, build-watch) are too heavy and too cross-cutting to live per-project.
These two needs collapse into one service: the **Singularity Knowledge + Agent Platform** — a single Go server hosting the federated memory store *and* the central persistent-agent runtime.
This ADR fixes the stack.
The implementation arm of this ADR lives in [`singularity-memory/MIGRATION.md`](https://github.com/singularity-ng/singularity-memory/blob/main/MIGRATION.md).
## Decision
- **Language: Go.**
- **Storage backbone: Postgres + vchord** (existing) — accessed from Go via `pgx`. No data migration; same schema, same vchord index.
- **Identity / auth / sync layer: `charmbracelet/charm`-server patterns** — SSH-key identity, JWT issuance, encrypted KV for user-level prefs and config. Adopted as ported library code; not run as a sidecar.
- **Agent runtime: `charmbracelet/fantasy`** — multi-provider LLM access (Anthropic, OpenAI, Google, Bedrock, OpenRouter, etc. via `catwalk`). Used for embeddings/summarisation today; for full central persistent agents tomorrow.
- **HTTP API: Go `net/http` + chi or echo router**, serving the *exact* current OpenAPI contract.
- **MCP server: same wire protocol** as today's Python implementation. Clients (sf, Hermes, OpenClaw, Claude Code, Cursor) keep working unchanged.
- **CLI scaffolding: `charmbracelet/fang`.**
- **Observability: `promwish`-style Prometheus metrics**, scraped from a shared metrics endpoint.
- **Admin UI (Phase 3): `pony` + `ultraviolet`** for the view layer (reversed from earlier deferral; now adopted as a deliberate foundation bet — admin UI tolerates churn better than user-facing surfaces). Served over SSH via `wish`.
## Alternatives Considered
### Stack
- **Stay Python + FastAPI + Postgres.** Status quo. Works today.
- *Rejected:* misses the foundation bet for central persistent agents (sf SPEC §17). Building those on Python + raw OpenAI/Anthropic SDK calls means retrofitting fantasy-style agent semantics later — real refactor cost. The trigger to migrate isn't pain in the current server; it's foundation laying for what comes next.
- **Rust + axum + Postgres.** Uniformly fast, but Charm's agentic ecosystem (fantasy, catwalk, wish, charm-server, the entire Bubble Tea family) is Go-native. Rust on the server side would mean reimplementing those abstractions or shelling out. Rejected — wrong ecosystem.
- **TypeScript + Node + Postgres.** Keeps language alignment with sf core. But sf is moving toward parallel-build (ADR-016): TS in sf core, Go in new services. The Node ecosystem doesn't have an equivalent to fantasy + charm-server + Wish. Rejected.
### Storage backbone
- **Replace Postgres + vchord with `charm-server`'s native KV.** `charm-server` is a personal/team encrypted KV; it's not a vector DB or BM25 index. We'd lose retrieval sophistication. Rejected.
- **Replace Postgres with `sqlite-vec`.** Embeddable single-binary deployment is appealing, but BM25 quality on `tsvector` is hard to match without a full re-tune, and we'd be redoing data migration on top. Rejected for v1; revisit in a v2 retrieval ADR if the Go server needs to ship without Postgres.
- **Keep Postgres + vchord, connect via Go `pgx`.** ← chosen. Battle-tested retrieval, zero data migration, focus the migration on language/runtime/agent-platform changes only.
### Agent runtime
- **Direct SDK calls (`anthropic-sdk-go`, `openai-go`, `go-genai`).** Simplest for today's narrow LLM use (embeddings + summarisation). But future central persistent agents need agent-loop semantics (multi-turn, tool calls); building those on raw SDKs reinvents fantasy's abstractions. Rejected — foundation bet.
- **Build our own agent runtime in Go.** Pure NIH. Rejected.
- **`charmbracelet/fantasy`.** ← chosen. 730 stars, actively developed, clean API, multi-provider via `catwalk`.
## Consequences
**Positive**
- **Foundation is right** for central persistent agents (sf SPEC §17). Adding new agents means defining their tools and system prompt, not rebuilding the runtime.
- **Single static Go binary** is operationally simpler than Python uv/venv + Alembic + worker on each deployment host.
- **Charm ecosystem alignment** with sf-worker (ADR-013), flight recorder (ADR-015), Charm TUI client (ADR-017). One language for the new-services tier.
- **Wire contract preserved** — clients are zero-touch.
**Negative**
- **Migration is a real undertaking** — ~12 weeks total, with the recall endpoint as the critical parity gate. See `MIGRATION.md`.
- **Polyglot deployment grows** — Python (during transition) + Go (new) + TS (sf core) + Rust (sf native). Bounded; once Python retires, three languages with clear boundaries.
- **`fantasy` and `pony` are pre-1.0** — API churn is real.
**Risks and mitigations**
- *Risk:* recall quality regression between Python and Go.
- *Mitigation:* held-out evaluation set; ±2% recall@k threshold enforced in CI before flipping traffic.
- *Risk:* `pgx` + vchord custom-type decoder edge cases.
- *Mitigation:* prove out in Phase 1 against a small endpoint; engage vchord author if blocked.
- *Risk:* `fantasy` API churn during the migration.
- *Mitigation:* pin a version; one planned upgrade midway through the migration.
- *Risk:* central agents prove unworkable as a model and we've over-built the foundation.
- *Mitigation:* the foundation cost is incremental (fantasy ≈ raw SDK + a thin abstraction). Worst case we use fantasy for embeddings only and never grow it. No wasted bet.
## Out of Scope
- **Cross-tenant Singularity Memory** — single trust domain per deployment.
- **Retrieval-pipeline redesign** — BM25 + vector + RRF + reranker semantics are preserved exactly.
- **DB migration** — Postgres + vchord stay.
- **Public-internet endpoint** — tailnet only per ADR-013.
## Sequencing
| Phase | What | Cost |
|---|---|---|
| 0 | Prep: commit OpenAPI spec, build test suite, set up CI (per existing `TODO.md`) | 1-2 weeks |
| 1 | Greenfield Go scaffold parallel to Python; first endpoint (`GET /v1/banks`) | 2-3 weeks |
| 2 | Endpoint parity (recall is the critical gate) | 4-8 weeks |
| 3 | Worker + admin UI (`pony` + `ultraviolet` on `wish`) | 2-3 weeks |
| 4 | Central persistent-agent host (depends on sf SPEC §17 scoping) | variable |
| 5 | Python deprecation | 1 week |
Total: ~12 weeks for Phases 0-3 + Phase 5; Phase 4 lands when the sf-side agent layer is scoped.
## References
- `MIGRATION.md` (singularity-memory repo) — implementation arm.
- `SPEC.md` §16 — Knowledge Layer.
- `SPEC.md` §17-18 — Persistent Agents and Inter-Agent Messaging.
- `ADR-012` — Multi-instance federation (this is one of its surfaces).
- `ADR-013` — Network and remote-execution (deployment substrate).
- `ADR-016` — Charm AI stack adoption (frames the polyglot decision).
- `charmbracelet/charm` — KV with sync (auth/identity patterns ported here).
- `charmbracelet/fantasy` — agent runtime.
- `charmbracelet/catwalk` — provider/model registry.

# ADR-015: Flight recorder via `charmbracelet/x/vcr`
**Date**: 2026-04-29
**Status**: proposed (deferred — capture for staged execution)
## Context
sf today writes:
- `.sf/event-log.jsonl` — structured event stream (phase changes, tool calls, errors).
- `.sf/traces/*.jsonl` — per-unit trace spans.
- `.sf/audit/` — historical state snapshots.
These are all *structured event streams*. They're great for programmatic analysis but they don't record what the auto-loop *looked like* on the operator's terminal — the actual TUI frames, the stream of tool output, the agent's thinking, the live progress indicators.
When something goes wrong in production (the auto-loop appears to hang, an agent generates surprising output, a hook misbehaves), the operator wants to **replay the session** — see what was on screen at minute 14 — not reconstruct it from JSON.
`charmbracelet/x/vcr` records terminal output as a sequence of frames and replays them deterministically. It's the right substrate for a flight recorder.
## Decision
- **Language: Go.** Standalone service or library; integrates with sf via shared filesystem (writes recordings to `.sf/recordings/`).
- **Recording substrate: `charmbracelet/x/vcr`** — captures ANSI/VT frames into a portable file format with timestamps.
- **Trigger: every auto-loop unit dispatch records by default.** Recording is opt-out per project via `.sf/config.toml` (`[telemetry] flight_recorder = false`).
- **Storage: `.sf/recordings/{unit-id}.vcr`**, with a retention policy (default 30 days, configurable). Old recordings auto-expire on the next sweep.
- **Replay: `sf replay <unit-id>`** — opens the recording in a TUI player; supports pause, scrub, frame-step, search-by-text.
- **Format: vcr-native.** No reinventing.
## Alternatives Considered
- **`asciinema`** — well-known terminal recorder, mature tooling, JSON-based format.
- *Rejected:* asciinema runs as a subprocess wrapping the shell. Integrating with sf's auto-loop (which is the *driver*, not a child of the recorder) requires inverting the model. `vcr` is library-shaped — sf calls into it.
- **`vhs`** — Charm's CLI video recorder, used for demos.
- *Rejected:* `vhs` is for scripted demos, not live capture. Wrong tool.
- **Re-render from `.sf/event-log.jsonl`** — replay events through pi-tui to reproduce the frames.
- *Rejected:* requires keeping pi-tui forever, and rendering depends on terminal geometry that may differ from the original. Frame-accurate replay is not the same as event replay; both have value but they're different products.
- **Build a custom recorder.**
- *Rejected:* `vcr` exists. NIH-don't.
## Consequences
**Positive**
- **Frame-accurate post-mortem** — when a unit fails or the auto-loop hangs, the operator sees exactly what was on screen, including timing.
- **Onboarding artefact** — recordings of "what does sf do?" become shareable demos without scripting.
- **Audit trail for destructive ops** — admin actions in the future Charm TUI client (ADR-017) and Singularity Memory admin UI (ADR-014) can be recorded for security audit.
- **Light coupling**`vcr` is a Go library; sf's TS core invokes a small Go recorder process per unit dispatch. No tight integration with the agent loop.
**Negative**
- **Disk usage** — recordings are bigger than event logs (frame data vs. structured records). Mitigated by retention policy. Estimate: ~1MB per 10-minute unit at typical TUI density.
- **Operator-only** — frame replay isn't useful in headless contexts. Headless dispatches should disable recording (`SF_FLIGHT_RECORDER=0` env).
- **Polyglot crosses one more boundary** — sf core (TS) writes recordings via a Go subprocess. Same shape as ADR-013 (TS↔Go via stdio); manageable.
**Risks and mitigations**
- *Risk:* `vcr` API churn — it's in `charmbracelet/x` (experimental).
- *Mitigation:* pin a version; abstract recording behind an interface so a future swap is contained.
- *Risk:* Recording overhead measurably slows the auto-loop.
- *Mitigation:* benchmark before enabling-by-default. If overhead > 5%, ship as opt-in only.
- *Risk:* Sensitive data (tokens, paths, secrets) leaks into recordings.
- *Mitigation:* same redaction layer as `event-log.jsonl`, enforced as a filter over the VT stream at the frame level before write.
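The frame-level redaction mitigation can be sketched as a pattern filter applied to each frame before it is written. The patterns below are illustrative, not the real redaction set shared with `event-log.jsonl`.

```go
package main

import (
	"fmt"
	"regexp"
)

// secretPatterns is an illustrative stand-in for the shared redaction
// set; the real list lives with the event-log redaction layer.
var secretPatterns = []*regexp.Regexp{
	regexp.MustCompile(`sk-[A-Za-z0-9]{8,}`),  // API-key-shaped tokens
	regexp.MustCompile(`ghp_[A-Za-z0-9]{8,}`), // GitHub-token-shaped strings
}

// redact masks every match in a VT frame before the frame reaches
// the .vcr file on disk.
func redact(frame string) string {
	for _, p := range secretPatterns {
		frame = p.ReplaceAllString(frame, "[REDACTED]")
	}
	return frame
}

func main() {
	fmt.Println(redact("export API_KEY=sk-abcdef123456 done"))
}
```

One caveat the sketch ignores: a secret split across two frames by a partial write would evade per-frame matching, so a real filter needs a small carry-over buffer at frame boundaries.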
## Out of Scope
- **Audio recording.** Terminal frames only.
- **Cross-host recording** — each host records its own units; flight-recorder doesn't try to stitch SSH-worker output onto orchestrator-side replay. (Each unit attempt has a `worker_host`; replay is per-host.)
- **Live remote viewing** of an in-progress recording — that's a different feature (could be Wish + Bubble Tea showing a "live" view of the auto-loop). Track separately if wanted.
## Sequencing
| When | Action |
|---|---|
| Tier 2/3 — after federation primitives land | Build a thin Go recorder process; sf core spawns one per unit dispatch. |
| Tier 3 | `sf replay <unit-id>` command — TUI player using Bubble Tea. |
| Tier 3 | Redaction filter parity with `event-log.jsonl`. |
| Tier 4 (nice-to-have) | Retention policy auto-sweep; recording bundle export (`sf recording export <unit-id>``.vcr.tar.gz` for sharing). |
## Out of Scope (continued — feature-creep guardrails)
- AI-assisted summarisation of recordings ("show me what failed in the last 5 unit attempts") — possible later via fantasy + recording metadata, but explicitly not v1.
- Web-based replay UI — server-rendered replay is a separate product surface; v1 is local TUI only.
## References
- `charmbracelet/x/vcr` — terminal recording library.
- `SPEC.md` §19 — Observability (where structured event logs and traces live).
- `ADR-016` — Charm AI stack adoption (frames why Go for new services).
- `ADR-017` — Charm TUI client (future replay UI consumer).

# ADR-016: Charm AI stack adoption strategy
**Date**: 2026-04-29
**Status**: accepted (strategic frame; concrete decisions in ADR-013/014/015/017)
## Context
`SPEC.md` §1 retargeted sf v3 from Go-on-Crush to TypeScript-on-pi-mono. That decision still stands for sf core. But over the past year the Charm ecosystem has matured to a point that *adjacent* services would be better served by it:
- **`fantasy`** (730 stars, pushed today) — multi-provider AI agent SDK in Go. Equivalent of `pi-ai`. Wasn't this complete at retarget time.
- **`catwalk`** (688 stars) — provider/model registry used by `crush`.
- **`crush`** (23,641 stars) — Charm's agentic coding CLI.
- **`charm`** (2,491 stars) — encrypted KV with sync, self-hostable as `charm-server`. Foundation for cross-instance state.
- **`wish`** (5,158 stars) + `wishlist` + `promwish` — full SSH-served service stack with built-in metrics.
- **`bubbletea`** (41,946 stars) + `bubbles` + `lipgloss` + `glamour` + `huh` + `harmonica` — production TUI stack.
- **`x/vt`, `x/ansi`, `x/cellbuf`, `x/mosaic`, `x/vcr`, `x/xpty`, `x/conpty`, `x/editor`, `x/sshkey`, `x/term`** — bleeding-edge primitives, all actively maintained.
- **`pony` + `ultraviolet`** — next-gen declarative TUI markup. Pre-1.0 / experimental.
- **`anthropic-sdk-go`, `openai-go`, `go-genai`** — Charm-maintained Go LLM SDKs.
The question the SPEC retarget didn't have to answer: **now that this much is here, do we migrate?**
## Decision
**Option A — Parallel build, no core migration.**
- **sf core (TypeScript on pi-mono): unchanged.** SPEC §1 retarget rationale stands. Pi-mono SDK alignment, MCP-server story, ~200+ TS files, real production users — none of it justifies a 3-6 month rewrite.
- **New services: Go on Charm, comprehensively.** sf-worker (ADR-013), Singularity Knowledge + Agent Platform (ADR-014), flight recorder (ADR-015), Charm TUI client (ADR-017) — all in Go using the Charm ecosystem.
- **Native engine (Rust): permanent.** ~11k LOC in `native/` (git, text, forge_parser, grep, highlight, ast, diff, etc.) is best-of-breed and not re-implementable in Go without losing performance. Bindings (napi-rs from TS today; cgo from Go for new services if needed) flex per consumer.
- **Pony adoption: now, not deferred.** Reversed from initial conservative stance. Adopting pony from day one in Phase-3 admin surfaces (Singularity Memory admin UI, future audit dashboards) — admin tolerates churn better than user-facing surfaces, and the foundation bet pays back if pony stabilises.
- **Other `charmbracelet/x/*` packages: adopted comprehensively.** When a new Go service needs a primitive (image rendering, session recording, pty, editor, input handling), use the `x/*` package. Don't reinvent.
- **Re-evaluation trigger: 12 months from first Go service in production.** If >50% of *new* sf code lands in Go services, the question of consolidating sf core becomes worth re-asking. Until then, polyglot is the right cost shape.
## Alternatives Considered
- **Option B — Soft migration, gradual rewrite.** Use Charm for new code AND opportunistically rewrite TS modules in Go when they need substantial work anyway. Eventually sf core drifts to Go.
- *Rejected:* rolling polyglot across the same logical layer is harder to reason about than per-service polyglot at clean boundaries. Some PRs would bridge languages mid-feature; CI complexity grows.
- **Option C — Big-bang migration.** Re-fork from `crush`, port sf's auto-loop, gates, planner, harness, skills doctrine into Go.
- *Rejected:* 3-6 months of no feature shipping; production users disrupted; loss of pi-mono SDK upstream alignment. The retarget rationale isn't fully invalidated — the only argument it relied on that has weakened is "70% of Crush is duplicated in pi-mono", and even that remains true; the cost of rewriting still outweighs the duplication tax.
- **Status quo** — keep everything in TS, including new services.
- *Rejected:* Node ecosystem doesn't have equivalents for `wish`, `fantasy`, `charm`-server, the comprehensive TUI/SSH/AI stack Charm provides. Building these in TS would be reinventing maturer Go libraries. New services in Charm are just *easier*.
## Architectural Picture
```
[Charm TUI client] Go ← ADR-017: pony + ultraviolet + bubbles + lipgloss +
glamour + huh + harmonica + x/mosaic
[Singularity Knowledge + Agent Platform] Go ← ADR-014: charm-server + fantasy +
Postgres+vchord + pony admin
[sf-worker SSH host] Go ← ADR-013: wish + xpty/conpty + promwish
[Flight recorder] Go ← ADR-015: x/vcr
│ RPC / MCP / SSH / HTTP
[sf daemon + core] TS ← unchanged, pi-mono SDK aligned
│ napi-rs
[native engine] Rust ← permanent, ~11k LOC
```
Three languages, three clean boundaries, each layer using the stack that fits.
## Consequences
**Positive**
- **No 3-6 month feature freeze** — sf core ships normally during the new-service build-out.
- **Right tool for each layer** — Go's ecosystem advantages (Wish, fantasy, charm-server) accrue without disrupting what already works in TS.
- **Strategic optionality** — pony and ultraviolet bets are localised to admin surfaces; if they fail, only those views need swapping.
- **Comprehensive adoption** beats piecemeal — using Charm's stack across multiple new services means we develop deep ecosystem familiarity, can share patterns across services, and contribute back upstream where useful.
**Negative**
- **Polyglot deployment** — TS + Go + Rust + (transitional Python during Singularity Memory migration). Three or four runtimes on a single host. Operationally manageable; not free.
- **Pi-mono SDK alignment is one-way** — Charm's stack improvements don't flow to sf core. We get pi-mono updates upstream; we don't get fantasy updates upstream-of-sf.
- **Cross-language refactors** are harder — when an interface between TS and Go needs to change, both sides need a coordinated PR. Mitigated by stable RPC/MCP/SSH-stdio contracts.
**Risks and mitigations**
- *Risk:* `fantasy` or `pony` API churn breaks builds repeatedly.
- *Mitigation:* pin versions; planned upgrade windows; pony swappable via clean view-layer separation.
- *Risk:* Charm pivots away from one of these libraries.
- *Mitigation:* Charm's stack is large and self-reinforcing; abandonment of a single piece (e.g., pony, which is experimental) is recoverable. Foundation libs (`bubbletea`, `wish`, `lipgloss`, `glamour`) are mature with strong commit cadence and unlikely to be abandoned.
- *Risk:* Re-evaluation in 12 months says "actually we should consolidate to Go" and we've now got 12 months of TS-only work to throw out.
- *Mitigation:* sf core code from now until then stays useful even if a future migration happens — it documents requirements and behaviour. Worst case it becomes the *spec* the Go rewrite implements.
## Out of Scope (explicit non-decisions to keep them from re-emerging)
- **Migrating pi-mono SDK to Go.** No.
- **Replacing `pi-tui` in sf core with a Charm TUI in-process.** No — Charm TUI is a separate client (ADR-017), pi-tui stays in core until that client reaches parity, then deprecates.
- **Adopting Crush as the agent loop.** No — pi-coding-agent stays.
- **Migrating native Rust to Go.** No — Rust is best-of-breed for what it does.
- **Self-hosting a Charm Cloud account / `charm-server` as a separate sidecar.** No — port `charm-server` patterns (auth/identity) as library code into our Go services.
## Sequencing
| When | Action |
|---|---|
| Now | This ADR captures the strategic frame. Concrete service builds tracked in 013/014/015/017. |
| 12 months from first Go service in production | Re-evaluate. Audit polyglot deployment costs vs. consolidation benefit. If >50% of new sf code is Go AND ops cost of polyglot is non-trivial AND TS sf core has shrunk substantially (post-pi-tui-deprecation), open a successor ADR proposing Option C (big-bang). |
## References
- `SPEC.md` §1 — original retarget rationale (TS-on-pi-mono over Go-on-Crush).
- `ADR-013` — Network + remote execution (concrete: sf-worker).
- `ADR-014` — Singularity Knowledge + Agent Platform (concrete: SM rewrite).
- `ADR-015` — Flight recorder (concrete: x/vcr-based).
- `ADR-017` — Charm TUI client (concrete: pi-tui replacement).
- `BUILD_PLAN.md` — tier-based execution tracking.
- Charm org: https://github.com/orgs/charmbracelet — full ecosystem inventory.

# ADR-017: Charm TUI client — extracting `pi-tui` out of sf core
**Date**: 2026-04-29
**Status**: proposed (deferred — capture for staged execution)
## Context
sf today bundles its TUI directly in core: `pi-tui` (~10.5k LOC of TypeScript) is loaded whenever the user interacts with sf. The TUI lives at the same architectural layer as the agent loop, the auto-loop, and the planner. This couples *what sf does* to *how it presents*.
Three forces argue for extracting the TUI:
1. **sf is becoming truly headless-first**: `packages/daemon`, `packages/rpc-client`, `packages/mcp-server` already exist. CLI invocations talk to the daemon. sf can be called as an MCP backend by Claude Code, Cursor, Hermes — they're TUI-agnostic. The user-facing TUI is *one client*; it shouldn't be *baked into the engine*.
2. **The Charm TUI stack is dramatically more capable than what `pi-tui` builds today.** `bubbletea` + `bubbles` + `lipgloss` + `glamour` + `huh` + `harmonica` + `x/mosaic` (image rendering) + `x/vcr` (recording) + `pony` + `ultraviolet` (declarative markup) compose into far better UX than we could reproduce in TS.
3. **Removing `pi-tui` from sf core deletes ~10k LOC of TS** — leaner core, fewer TUI-coupled assumptions in `pi-coding-agent`, cleaner test surface.
This ADR plans the extraction.
## Decision
- **Build a new `sf-tui` client in Go** using the Charm stack. Talks to the sf daemon over the existing RPC (per `packages/rpc-client`).
- **View layer: `pony` (declarative TUI markup) + `ultraviolet` (its base).** Adopted now, not deferred. Other view primitives where pony lacks coverage: `bubbles` components, `lipgloss` styling, `glamour` markdown, `huh` forms, `harmonica` animations, `x/mosaic` for inline images.
- **Two-stage replacement of `pi-tui`:**
- **Stage 1:** new `sf-tui` ships parallel to `pi-tui`. Users opt-in via `sf --tui=charm`. `pi-tui` remains the default. Both clients connect to the same daemon — they're peer clients, not replacements yet.
- **Stage 2:** when `sf-tui` reaches parity (every screen `pi-tui` has, plus the new ones the Charm stack enables), flip the default. Deprecate `pi-tui` with a warning. After two minor releases, **delete `pi-tui` entirely** — ~10k LOC of TS dropped from sf core.
- **No migration of in-flight `pi-tui` work.** Anything in `pi-tui` that hasn't shipped doesn't get backported to `sf-tui`. The new client is a clean slate.
- **Architecture: clean separation between view rendering and state/data layer.** State models live in their own package; view components consume them. If `pony` proves unworkable, the swap to plain `bubbletea` is a view-layer-only refactor.
## Alternatives Considered
- **Replace `pi-tui` in-place with a TS port of Bubble Tea.** No mature TS port exists. Even if one were started, Charm's TUI ecosystem (Bubbles, Lipgloss, Glamour, Huh, etc.) wouldn't follow.
- *Rejected:* equivalent to "rebuild the Charm stack in TS." Years of work for no advantage.
- **Embed Bubble Tea inside `pi-coding-agent` via cgo / WebAssembly.**
- *Rejected:* fragile FFI; defeats the architectural goal of separating engine from UI.
- **Keep `pi-tui` indefinitely; only build Charm TUI as an alternative for SSH access.**
- *Rejected:* leaves ~10k LOC of TS in sf core *forever* as a maintenance burden. The whole point is to delete it.
- **Don't build a new TUI; expose the daemon over MCP/HTTP and rely on third-party clients (Claude Code, Cursor) to render.**
- *Rejected:* sf's user-facing surface is the TUI when working interactively. Outsourcing it removes a major UX touchpoint we own.
## Consequences
**Positive**
- **sf core gets ~10k LOC leaner** after Stage 2.
- **Charm stack quality** comes for free — animations (`harmonica`), inline images (`x/mosaic`), markdown (`glamour`), forms (`huh`), recording (`x/vcr`).
- **Headless / API-first architecture** is cleanly visible: daemon + RPC + MCP + clients. No TUI coupled to engine.
- **Remote TUI for free** — once the client is Wish-served (could be a v3.x extension), `tailscale ssh aidev sf` opens a full TUI session over SSH. Today's `pi-tui` is local-process only.
- **Recordings of TUI sessions** — flight recorder (ADR-015) integrates with the Charm TUI naturally; `pi-tui` would need separate work to support this.
**Negative**
- **Two-language UI work during Stage 1** — bug fixes touching both `pi-tui` (TS) and `sf-tui` (Go). Bounded duration; one client retires at Stage 2.
- **Pony is pre-1.0** — API churn during the build. Acceptable per the "view layer swappable" architecture.
- **User-facing transition** — users have to relearn keybindings or layouts if `sf-tui` differs from `pi-tui`. Mitigated by explicit parity gate: `sf-tui` must match `pi-tui`'s primary views before Stage 2 flip.
- **Daemon RPC contract becomes load-bearing** — what was previously an in-process call (TS → TS) is now a cross-process call (Go → TS via RPC). Requires the RPC contract to be stable and complete; missing methods become blockers. Acceptable; this is the right architectural pressure.
**Risks and mitigations**
- *Risk:* parity gate is moved unilaterally (Stage 2 flips default before parity is real).
- *Mitigation:* parity defined explicitly as a checklist of `pi-tui` screens with their `sf-tui` equivalents and end-to-end tests passing. CI gate.
- *Risk:* `pony` proves unstable; we hit the swap-to-`bubbletea` fallback halfway.
- *Mitigation:* view layer is architected to be swappable (pony components implement an interface; bubbletea components implement the same interface). Swap is a refactor, not a rewrite.
- *Risk:* Daemon RPC has gaps that `pi-tui` papers over via in-process state access.
- *Mitigation:* audit `pi-tui`'s direct daemon-state access at the start of Stage 1; promote any in-process patterns to RPC methods.
- *Risk:* User keybindings / muscle memory breaks.
- *Mitigation:* `sf-tui` mirrors `pi-tui`'s keybindings 1:1 for the parity surface; new keybindings only for new features.
## Out of Scope
- **Web-based UI.** Could be a separate v4 project.
- **Multi-user TUI sessions** (two operators watching the same auto-loop).
- **Theme customisation.** v1 ships one theme; user theming is later.
- **Internationalisation.** v1 is English only; same posture as today.
## Sequencing
| Stage | Action | Cost | Result |
|---|---|---|---|
| Pre-stage | Audit `pi-tui` screens; produce a parity checklist. | 1 week | List of screens + features `sf-tui` must cover. |
| Stage 1 | Build `sf-tui` parallel to `pi-tui`. View on pony+ultraviolet+bubbles, state separate. Daemon RPC fills any gaps. Ships as opt-in via `sf --tui=charm`. | ~6-10 weeks | Two TUIs coexist. Users pick. |
| Stage 1.5 | Parity verification — every checklist item works in `sf-tui`; CI gate. | 2 weeks | `sf-tui` ready to flip default. |
| Stage 2 | Flip default to `sf-tui`. Deprecate `pi-tui` with warning on use. | 1 week + soak | `sf-tui` is canonical; `pi-tui` is legacy. |
| Stage 3 | Delete `pi-tui` after two minor releases. | 1 week cleanup | sf core sheds ~10k LOC of TS. |
Total: ~12-16 weeks across stages.
## References
- `packages/daemon`, `packages/rpc-client`, `packages/mcp-server` — already exist; this ADR makes them load-bearing for clients.
- `packages/pi-tui` — the existing TUI being deprecated.
- `ADR-013` — Network: future SSH-served TUI via `wish` rides the same substrate.
- `ADR-015` — Flight recorder: `sf-tui` records its sessions naturally.
- `ADR-016` — Charm AI stack adoption (this is one of its concrete arms).
- `charmbracelet/bubbletea`, `charmbracelet/bubbles`, `charmbracelet/lipgloss`, `charmbracelet/glamour`, `charmbracelet/huh`, `charmbracelet/harmonica`.
- `charmbracelet/x/mosaic`, `charmbracelet/x/vcr`, `charmbracelet/x/editor`, `charmbracelet/x/input`.
- `charmbracelet/pony` + `charmbracelet/ultraviolet` — adopted as the view-layer foundation.

---
name: acquiring-skills
description: Safely discover and install skills from external repositories or other local sf projects. Use when a user asks for something where a specialised skill likely exists (browser testing, PDF processing, infra automation, etc.) and you want to bootstrap rather than start from scratch. Always verify untrusted sources with the user.
---
# Acquiring New Skills
This skill teaches how to safely discover and install skills from external sources into sf.
## SAFETY — READ THIS FIRST
Skills can contain:
- **Markdown files** — risk: prompt injection, misleading instructions.
- **Scripts** (TypeScript, Python, Bash) — risk: arbitrary code execution.
### Trusted sources (no user approval needed for download)
| Source | Why trusted |
|---|---|
| `https://github.com/anthropics/skills` | Anthropic's official Agent Skills. |
| `https://github.com/singularity-ng/singularity-forge` | sf's own repo (never download to overwrite — only as reference). |
| Local sister repos under `/home/mhugo/code/` (ace-coder, letta-workspace, dr-repo, etc.) | User-owned local code; treat as trusted but still inspect scripts. |
| `mikki-bunker:~/code/` | The user's bunker host; trusted but still inspect. |
### Untrusted sources (ALWAYS verify with user)
For ANY source other than the above:
1. Ask the user before downloading.
2. State where the skill comes from (URL, repo, author).
3. Get explicit approval.
### Script safety
Even from trusted sources, ALWAYS:
1. Read and inspect every script before executing it.
2. Understand what it does — especially network calls, file operations, system commands.
3. If it uses `curl | bash`, refuse to run it without the user explicitly inspecting and approving the URL.
## When to Use This Skill
### DO use when
- The user asks for something where a skill likely exists ("test this webapp", "generate a PDF report", "deploy with terraform").
- You think "there's probably a skill that would bootstrap my understanding".
- The user explicitly asks about available skills or extending sf's capabilities.
### DON'T use for
- General coding tasks you can already handle.
- Simple bug fixes or feature implementations.
- Tasks where you have sufficient knowledge.
- Anything urgent — discovery takes time; sometimes "just code it" is faster.
## Ask Before Searching (Interactive Mode)
If you recognise a task that might have an associated skill, ask first:
> "This sounds like something where a community skill might help (e.g., webapp testing with Playwright). Want me to look in `anthropics/skills` first, or start coding right away?"
The user may prefer to start immediately rather than wait.
Only proceed with skill acquisition if the user agrees.
## Skill Repositories
| Repository | Description |
|---|---|
| `https://github.com/anthropics/skills` | Anthropic's official Agent Skills. |
| `~/code/ace-coder/.claude/skills/` (local) | ACE skills the user has on disk. |
| `~/code/letta-workspace/.agents/skills/` (local) | Letta workspace skills. |
| `mikki-bunker:~/code/letta-workspace/letta-code/skills/` (remote) | Letta Code skills on the bunker host. |
| `mikki-bunker:~/code/singularity-package-intelligence/.claude/skills/` (remote) | Generic agent skills on bunker. |
Browse repo READMEs for skill listings. Don't download blind — pick the one that matches.
## Installation Locations in sf
| Location | Path | When to use |
|---|---|---|
| **Project (sf core)** | `src/resources/extensions/sf/skills/<skill>/` | Skills bundled with sf, available to every sf install. Default for general-purpose skills. |
| **Per-project bundled** | `<other-repo>/.sf/skills/<skill>/` | Skills useful only inside a specific project. |
| **User-local** | `~/.sf/skills/<skill>/` | User-only skills not committed to a repo. |
**Default**: Project (sf core) for skills that benefit anyone running sf. Per-project for things only that project's contributors need.
## Naming Conventions
Before installing, ensure the skill follows sf naming:
- Lowercase kebab-case directory name.
- Match the directory name exactly to the `name:` field in frontmatter.
- No prefixes like `dr-`, `ace-`, `gsd-` — strip them. (`dr-spec-first-tdd``spec-first-tdd`.)
- See [`creating-skills`](../creating-skills/SKILL.md) for the full convention.
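The prefix-strip and name-match check can be sketched in shell. This is a minimal sketch with a hypothetical skill name and a self-created fixture; `sed -i` is assumed to be GNU-style:

```shell
# Demo setup: a freshly fetched skill with a foreign prefix (hypothetical name)
mkdir -p /tmp/skills-temp/dr-spec-first-tdd
printf -- '---\nname: dr-spec-first-tdd\ndescription: demo\n---\n' \
  > /tmp/skills-temp/dr-spec-first-tdd/SKILL.md

# Strip the prefix: rename the directory AND rewrite the frontmatter name
mv /tmp/skills-temp/dr-spec-first-tdd /tmp/skills-temp/spec-first-tdd
sed -i 's/^name: dr-spec-first-tdd$/name: spec-first-tdd/' \
  /tmp/skills-temp/spec-first-tdd/SKILL.md

# Sanity check: directory name and frontmatter name must agree
dir_name=spec-first-tdd
fm_name=$(sed -n 's/^name: //p' /tmp/skills-temp/spec-first-tdd/SKILL.md)
[ "$dir_name" = "$fm_name" ] && echo "name check OK"
```

Skipping either half of the rename (directory or frontmatter) leaves the skill undiscoverable or mislabelled, which is why the check compares both.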
## How to Acquire
### Method 1 — Clone to `/tmp`, inspect, copy
```bash
# 1. Clone the repo (shallow)
git clone --depth 1 https://github.com/anthropics/skills /tmp/skills-temp
# 2. Inspect the skill you want
cat /tmp/skills-temp/skills/webapp-testing/SKILL.md
ls /tmp/skills-temp/skills/webapp-testing/scripts/ # if any
# Read every script before running anything
# 3. Copy to sf (default location)
cp -r /tmp/skills-temp/skills/webapp-testing \
/home/mhugo/code/singularity-forge/src/resources/extensions/sf/skills/
# 4. Cleanup
rm -rf /tmp/skills-temp
```
### Method 2 — rsync from another local repo
```bash
rsync -av ~/code/ace-coder/.claude/skills/ace-systematic-debugging/ \
  /tmp/local-skills/systematic-debugging/
# Inspect, then port (drop the ace- prefix, adapt tooling references for sf)
```
### Method 3 — rsync from bunker (over SSH)
```bash
mkdir -p /tmp/bunker-skills
rsync -av -e ssh \
mikki-bunker:'~/code/letta-workspace/letta-code/skills/<skill>/' \
/tmp/bunker-skills/<skill>/
```
After fetching, **adapt for sf**:
- Strip foreign prefixes (`dr-`, `ace-`, `gsd-`, `letta-`).
- Replace foreign tooling references (Letta MCP tool calls, claude-flow CLIs) with sf-native equivalents (`rg`, `npm test`, `sf_*` tools, `advisory-partner` skill, etc.).
- Drop bootstrap gates that don't apply (`onboarding()`, `IN_NIX_SHELL`, etc.).
- Cite sf doctrine: `AGENTS.md`, `docs/SPEC_FIRST_TDD.md`, the relevant sister skill.
See [`creating-skills`](../creating-skills/SKILL.md) for the conventions adapted skills must follow.
## Registering the New Skill
Skills under `src/resources/extensions/sf/skills/` are auto-discovered on the next sf launch — no manual registration.
For per-project skills under `<repo>/.sf/skills/`, check `auto-loop`/`bootstrap` logs to confirm discovery.
## Complete Example
User asks: "Can you help me test my React app's UI?"
1. **Recognise opportunity**: webapp testing — likely has an Anthropic skill.
2. **Ask user**: "Want me to look for a webapp-testing skill in `anthropics/skills`, or start coding now?"
3. **If user agrees, fetch**:
```bash
git clone --depth 1 https://github.com/anthropics/skills /tmp/skills-temp
cat /tmp/skills-temp/skills/webapp-testing/SKILL.md
ls /tmp/skills-temp/skills/webapp-testing/scripts/
# Read each script
```
4. **Adapt for sf**: rename if needed, strip foreign tooling, point doctrine references at sf docs.
5. **Install**:
```bash
cp -r /tmp/skills-temp/skills/webapp-testing \
/home/mhugo/code/singularity-forge/src/resources/extensions/sf/skills/
rm -rf /tmp/skills-temp
```
6. **Use**: `Skill(skill: "webapp-testing")`.
## Rules
- **Read every script before executing it.** No exceptions, even from trusted sources.
- **Don't `curl | bash`** unless the user has personally inspected and approved the URL.
- **Untrusted sources require explicit user approval** before download.
- **Strip foreign prefixes** when porting (`dr-`, `ace-`, `gsd-`, `letta-`).
- **Adapt tooling references** to sf-native equivalents.
- **Cite sf doctrine** — link `AGENTS.md` and `docs/SPEC_FIRST_TDD.md` rather than restating their rules.
- **Don't overwrite an existing sf skill** without diffing first; if names collide, decide whether to merge, supersede, or rename.
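The diff-before-overwrite rule can be made concrete; this sketch uses throwaway demo directories, whereas in practice the existing skill lives under `src/resources/extensions/sf/skills/`:

```shell
# Demo fixtures standing in for an existing skill and an incoming one
mkdir -p /tmp/skill-existing /tmp/skill-incoming
printf 'shared line\n' > /tmp/skill-existing/SKILL.md
printf 'shared line\nnew guidance\n' > /tmp/skill-incoming/SKILL.md

# Non-empty output means the skills differ; review the diff before deciding
# whether to merge, supersede, or rename
diff -ru /tmp/skill-existing /tmp/skill-incoming || true
```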

---
name: brainstorming
description: Use before any sf feature, fix, or behaviour change. Anchors the idea in the SPEC → milestone → slice → task path, checks what already exists in the codebase and memory, and produces an approved design before any code is written.
---
# Brainstorming
## Purpose
The first thinking step before code. Anchor the change in sf's planning hierarchy, find what already exists, debate with adversarial review, and emerge with an approved design that `plan-slice` or `spec-first-tdd` can act on.
If you skip this and jump to code, you risk: rebuilding what exists, missing a real consumer, locking in the wrong contract, or shipping a "professional" suggestion with no caller.
## When to Run
- A non-trivial feature or behaviour change is requested.
- A milestone is being proposed and the strategic frame is unclear.
- An upstream port has architectural ambiguity that `clarify-spec` couldn't resolve.
For trivial changes (typo fix, dependency bump, lint cleanup), skip this skill and go straight to `spec-first-tdd`.
## Skill Chain
```
← prev: (entry point — user request, sf auto-mode trigger, or new milestone)
→ next: clarify-spec (if underspecified) → plan-slice → spec-first-tdd
```
Side-chain: invoke `systematic-debugging` on any bug/failure during the design process.
## What This Skill Produces
An approved design covering:
- **Purpose** — why this exists.
- **Consumer** — production code path that depends on it.
- **Contract** — observable behaviour the test will pin.
- **Implementation sketch** — where it lives, what it touches.
- **Test strategy** — what kind of test, what falsifier.
- **Evidence plan** — how we'll know it's actually working post-deploy.
- **Scope defence** — what tempting expansion this slice refuses.
## What This Skill Refuses
- Writing code, scaffolding files, or invoking implementation skills.
- Designing without first checking what already exists in sf.
- Skipping advocate/challenger on non-trivial decisions.
## Hard Gate
Do NOT advance past this skill until:
1. The idea is anchored in sf's planning hierarchy (SPEC → milestone → slice).
2. You've checked what exists — code search, memory recall, requirements scan.
3. You've identified a real production consumer.
4. A design has been presented and approved (or, in auto-mode, satisfies the portfolio-approved envelope).
## Step 1 — Anchor in sf's Planning Hierarchy
Every change connects to a real system need. Establish:
- Which `SPEC.md` section, `REQUIREMENTS.md` Active item, or `BUILD_PLAN.md` row does this serve?
- Is there an existing milestone/slice tracking this? (`.sf/milestones/`, `.sf/active/`)
- Who is the **real consumer** — what production code path depends on this?
```bash
rg -nF "<symbol or feature name>" src/ packages/
ls .sf/milestones/ 2>/dev/null
sf_milestone_status # if running inside sf
```
Search prior memory:
```
sf_search_memories(query="<topic of work>", limit=8)
```
If no consumer exists in production code, stop. You'd be building for nobody.
## Step 2 — Check What Already Exists
Discovery happens before design.
```bash
# Code surface
rg -nF "<feature name|symbol|route>" src/ packages/
rg -ln "<concept-keyword>" src/resources/extensions/sf/
# Sibling implementations / patterns to reuse
rg "function <similar>" src/resources/extensions/sf/
# Skill registry
ls src/resources/extensions/sf/skills/
```
Use `Explore` subagents only when discovery legitimately fans out into 3+ independent search angles. For one targeted question, do it inline.
Collect 2+ concrete repo facts before debate. Label:
- `Observed:` directly from code, tests, traces, or memory.
- `Inferred:` conclusion supported by observed evidence.
- `Proposed:` design choice not yet validated.
## Step 3 — Clarify (one question at a time)
If meaningful ambiguity remains, ask the highest-impact unknowns first. Multiple-choice preferred.
If many things are unclear, hand off to `clarify-spec` instead of repeatedly bouncing questions in this skill.
## Step 4 — Propose 2-3 Approaches
For each: SPEC fit, trade-offs, existing sf machinery reused. Lead with the recommended option and state why.
For non-trivial bugs / runtime fixes, name one repro/debug path before code, one after, and explain why traces alone are insufficient.
## Step 5 — Advocate and Challenger
For non-trivial decisions, run an adversarial pass via [`advisory-partner`](../advisory-partner/SKILL.md):
1. **Advocate** — strengthen the best version of the design. Argue for it.
2. **Challenger** — attack the design AND propose an alternative. A challenger anchored to the advocate's framing is not adversarial.
3. **Falsifier** (required gate, blocks Step 6): `FALSIFIER: this design is wrong if [specific observable condition]`. Generic falsifiers ("wrong if it doesn't work") are process failures.
Stop the loop when:
- The preferred design has a clear falsifier and survived it.
- The challenger objection is answered or accepted as residual risk.
- Another loop would only restate the same arguments.
This is the default for non-trivial decisions — do not ask whether to use it.
## Step 6 — Present Design
Cover: purpose, consumer, contract, implementation sketch, test strategy, evidence plan, scope defence.
When approved, persist to memory so the next session can find it:
```
sf_save_memory(
category="design-decision",
content="design: <what> for <consumer> — approach: <key decision> — refused: <scope defence>",
confidence=0.9
)
```
## Execution-Ready Checklist (gate to plan-slice / spec-first-tdd)
- [ ] Consumer identified — production caller named.
- [ ] Advocate + challenger done (or trivial — explicitly waived).
- [ ] Files and boundaries concrete — specific paths, not "edit the router".
- [ ] Blast radius estimated (callers count, modules touched).
- [ ] Test strategy named (test type + which behaviour the contract pins).
- [ ] Falsifier specified.
- [ ] Scope defence stated — what this slice explicitly refuses to do.
All items must pass before transitioning out of this skill.
## Key Rules
- **Purpose before artefacts** — name the consumer before naming files.
- **Prefer existing sf machinery**`auto-loop`, `verification-gate`, `advisory-partner`, `memory-store` — over inventing new layers.
- **Runtime evidence contradicts docs → trust runtime, fix docs.**
- **YAGNI** — remove hypothetical features from every design. If three real consumers don't exist, the abstraction shouldn't either.

---
name: clarify-spec
description: Use before plan-milestone or plan-slice when a feature or change request is underspecified. Resolve high-impact ambiguity around scope, consumers, security, failure handling, and acceptance criteria before writing the implementation plan.
---
# Clarify Spec
## Purpose
Use after the rough feature idea exists but before technical planning starts.
The job: reduce ambiguity that would otherwise cause bad plans, wrong tests, or rework. A wrong plan based on confident-but-wrong assumptions costs more than the few minutes to clarify.
## When to Run
- A user request lands without a clear consumer.
- A milestone goal is "make it better" or "robust" or "fast" — vague verbs that aren't testable.
- A slice plan is being drafted but key boundaries are unstated.
- A change touches a security/auth surface and the threat model isn't named.
- An upstream port (pi-mono / gsd-2) leaves architectural intent ambiguous after reading the commit.
If the request is concrete and the consumer is obvious, skip this skill — go straight to `brainstorming` or `spec-first-tdd`.
## Load First
- [`docs/SPEC_FIRST_TDD.md`](../../../../../../docs/SPEC_FIRST_TDD.md) — the constitution.
- [`SPEC.md`](../../../../../../SPEC.md) — sf v3 specification, if relevant.
- `.sf/REQUIREMENTS.md` — current Active requirements.
- `.sf/DECISIONS.md` — locked decisions that constrain the answer.
- `.sf/PROJECT.md` — project intent.
## Clarification Priorities
Resolve in this order — highest-impact first:
1. **Primary user / operator** — who initiates this in production?
2. **Production consumer** — which code path depends on it? (`rg` for callers if unsure)
3. **In scope vs out of scope** — what is this change *not* covering?
4. **Failure expectations** — what happens when dependencies fail; is degradation visible to the user?
5. **Security expectations** — auth, tokens, command injection, secret exposure surface.
6. **Measurable acceptance criteria** — observable behaviour, not vague verbs.
If the change touches secrets, auth, or sandbox/permission boundaries: clarify the threat model explicitly. If it touches the auto-loop or worktree management: clarify recovery semantics on crash.
## Question Rules
- **One question per turn.** Avoid serial questioning if the user signals impatience.
- **Prefer multiple-choice** over open-ended when you can enumerate plausible answers — easier to answer, easier to record.
- **Highest-impact unknowns first.** Don't ask about technical stack choices unless they affect correctness.
- **Don't ask low-value style questions.** Style is a follow-up; correctness comes first.
- **Make reasonable assumptions on low-impact details** and state them — `assuming X unless you say otherwise`. The user can correct cheaply.
## Output
Once enough clarity exists for safe planning, produce:
```markdown
## Clarified Spec — <feature/change name>
- **Goal**:
- **Primary user/operator**:
- **Production consumer**:
- **In scope**:
- **Out of scope**:
- **Security expectations**:
- **Failure handling expectations**:
- **Acceptance criteria** (testable):
- [ ] ...
- **Open questions still deferred** (with default assumption):
- ...
```
For substantial existing-feature changes, also write a change proposal:
`.sf/active/<unit-id>/proposal.md` (or, before a milestone exists, `.sf/proposals/YYYY-MM-DD-<slug>.md`).
The proposal lists requirement deltas:
- `ADDED:` new requirement or capability.
- `MODIFIED:` existing requirement contract changed.
- `REMOVED:` previously required behaviour deleted.
- `RENAMED:` symbol/file/route renamed (with old → new mapping).
## Stop Conditions
Stop clarifying once:
- The next planning step is safe — `plan-milestone` or `plan-slice` will not produce a wrong plan from these answers.
- All remaining unknowns are low-impact and can be defaulted.
- Continued questioning would drift into solution design (that belongs in `brainstorming`).
## Rules
- Do not guess where ambiguity changes scope or safety.
- Do make reasonable assumptions on low-impact details — and state them.
- Stop clarifying once the next planning step is safe.
- Record the clarified spec into the unit's context before starting `plan-milestone` so the planner builds on the right contract.

---
name: context-doctor
description: Identify and repair degradation in sf's context — the persistent knowledge base under `.sf/`, the system prompt, the loaded skills, and the memory store. Use when the agent feels confused, instructions seem ignored, or `.sf/` files have grown bloated, redundant, or contradictory. Complements sf's runtime doctor (`doctor-history.jsonl`).
---
# Context Doctor
Over time, sf's persistent context degrades. `.sf/CODEBASE.md` accumulates stale references; `.sf/KNOWLEDGE.md` and the `memories` table fill with redundant or contradictory entries; skills overlap. This skill diagnoses and repairs.
This is the persistent-knowledge counterpart to sf's existing **runtime doctor** (`doctor-history.jsonl`, `bootstrap/doctor`). The runtime doctor checks live process health (DB, native engine, providers); context-doctor checks the *knowledge* sf carries between sessions.
## When to Run
- The agent repeatedly contradicts itself across sessions.
- `.sf/` files have grown beyond useful or contain obvious duplicates.
- The system prompt feels bloated (too many always-loaded files).
- A specific skill seems to confuse rather than help.
- After a major refactor, when paths and structures in `.sf/` no longer match the code.
## Operating Procedure
### Step 1 — Identify Issues
Read each persistent context file and judge:
| File | Question to ask |
|---|---|
| `.sf/PROJECT.md` | Does it still match what the project actually is? |
| `.sf/CODEBASE.md` | Are file paths and module references still valid? |
| `.sf/KNOWLEDGE.md` | Any duplicates, stale conclusions, or contradictions? |
| `.sf/PREFERENCES.md` | Still reflective of how the user wants to work? |
| `.sf/DECISIONS.md` | Any superseded decisions still present? |
| `.sf/REQUIREMENTS.md` | Active requirements still active? Anything done not marked done? |
| `.sf/PM-STRATEGY.md` | Still aligned with current direction? |
| `MEMORY.md` (root) | Index lines still pointing at extant files? |
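The `MEMORY.md` row can be spot-checked mechanically. A minimal sketch using a self-created fixture directory (real usage runs against the repo root's `MEMORY.md`; the path regex is a crude heuristic, not a full markdown parser):

```shell
# Fixture: an index file referencing one extant and one missing file
cd "$(mktemp -d)"
printf 'See .sf/PROJECT.md and docs/GONE.md\n' > MEMORY.md
mkdir -p .sf && touch .sf/PROJECT.md

# Flag any referenced .md file that no longer exists on disk
grep -oE '[A-Za-z0-9_./-]+\.md' MEMORY.md | sort -u | while read -r ref; do
  [ "$ref" = "MEMORY.md" ] && continue   # skip self-references
  [ -e "$ref" ] || echo "stale reference: $ref"
done
```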
For the memory store (the `memories` table in `.sf/sf.db`):
```bash
sqlite3 .sf/sf.db "SELECT category, content, confidence, hit_count FROM memories ORDER BY confidence DESC LIMIT 20"
sqlite3 .sf/sf.db "SELECT COUNT(*) FROM memories WHERE confidence < 0.5"
```
Look for:
- Low-confidence rows (`< 0.5`) that haven't been hit in N days — candidates for archival.
- Multiple memories saying the same thing in different words — dedupe.
- Memories whose `content` references a path that no longer exists.
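The last check can be sketched against a scratch database. Only the documented columns (`category`, `content`, `confidence`, `hit_count`) are assumed, and the path-token regex is a rough heuristic:

```shell
# Scratch copy standing in for .sf/sf.db (never run cleanup logic on the
# real db without the plan/approval steps below)
db=/tmp/memories-stale-demo.db; rm -f "$db"
sqlite3 "$db" "
CREATE TABLE memories (category TEXT, content TEXT, confidence REAL, hit_count INTEGER);
INSERT INTO memories VALUES
  ('codebase', 'router lives in src/extensions/gsd/router.ts', 0.8, 2);"

# Extract path-like tokens and flag those that no longer exist on disk
sqlite3 "$db" "SELECT content FROM memories" |
  grep -oE '[A-Za-z0-9_./-]+\.[a-z]+' | sort -u |
  while read -r path; do
    [ -e "$path" ] || echo "stale path: $path"
  done
```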
### Step 2 — Categorise the Decay
| Decay type | Symptoms | Fix |
|---|---|---|
| **Bloat** | `.sf/CODEBASE.md` is 5x its useful size; same fact stated 4 times. | Compress: keep one canonical statement, delete the rest. |
| **Stale** | A file references `extensions/gsd/` (renamed to `extensions/sf/`). | Update; or, if the fact is now self-evident from the code, delete. |
| **Contradiction** | `.sf/DECISIONS.md` says "use bun" but `AGENTS.md` says "npm canonical". | Find the canonical source (usually `AGENTS.md` for sf), fix the other. |
| **Orphaned** | A reference points to a file that was deleted. | Delete the reference, or restore the file if it should still exist. |
| **Skill overlap** | Two skills try to do the same job. | Either merge them or scope each to its distinct sub-case. |
| **Always-loaded bloat** | Files imported into the system prompt blow the budget. | Move stable facts out of always-loaded; rely on memory recall. |
### Step 3 — Plan the Fixes
Before editing, list:
- What you're keeping.
- What you're consolidating (which two/three things become one).
- What you're deleting outright.
- What you're moving (e.g. from always-loaded to recall-on-demand).
For non-trivial cleanups, present this plan to the user before executing — sf's persistent context is sensitive and silent edits to `DECISIONS.md` or `REQUIREMENTS.md` can erase intent.
### Step 4 — Execute Repairs
Apply edits with `Edit`. For the `memories` table, prefer `sf_save_memory` / `sf_delete_memory` (or whatever the current sf memory tools are) over direct `sqlite3` writes — those bypass tool-level invariants.
**Scope rules:**
- **You may** refine, tighten, restructure to improve signal — but do not change the *intended semantics*.
- **You may not** alter user-identity facts (the human's name, stated goals) without explicit ask.
- **You may not** rewrite locked decisions in `.sf/DECISIONS.md` without their `superseded-by` link.
- Protected files (`SPEC.md`, `BUILD_PLAN.md`, `AGENTS.md`, `CLAUDE.md`, `docs/SPEC_FIRST_TDD.md`, ADRs) require human approval before context-doctor edits them. List proposed edits, don't apply.
### Step 5 — Verify
After repair:
- [ ] Read each touched file end-to-end. Does it now read cleanly?
- [ ] Cross-references still resolve.
- [ ] No two files claim authority over the same fact.
- [ ] The system prompt token budget is healthy (target ~10% of context).
- [ ] No semantic changes to persona, user identity, or behavioural instructions slipped in.
### Step 6 — Commit
```bash
cd /home/mhugo/code/singularity-forge
git status --short
git diff
git add .sf/<files-changed> # only what you actually changed
git commit -m "doctor(context): <summary of repair>"
```
Do not push automatically — let the user review the diff first.
## Common Issues and Fixes
### System-prompt bloat
If the always-loaded files (system prompt + always-loaded skills) push past 10% of the context window:
- Move stable, low-frequency facts into the memory store.
- Move long examples into `references/` of the relevant skill.
- Demote low-value always-loaded skills to on-demand only.
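A rough budget check, assuming the common chars/4 token heuristic and a 200k-token context window (both are assumptions, not measured values), shown here with fixture files in place of the real always-loaded set:

```shell
# Fixture files standing in for the always-loaded .sf/ set
mkdir -p /tmp/ctx-demo
head -c 40000 /dev/zero | tr '\0' 'a' > /tmp/ctx-demo/PROJECT.md
head -c 80000 /dev/zero | tr '\0' 'b' > /tmp/ctx-demo/CODEBASE.md

total_chars=$(cat /tmp/ctx-demo/*.md | wc -c)
tokens=$((total_chars / 4))       # chars/4: crude token estimate
budget=$((200000 / 10))           # 10% target from the verify checklist
echo "~${tokens} tokens (budget ${budget})"
[ "$tokens" -le "$budget" ] && echo "within budget" || echo "over budget"
```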
### Skill overlap
If two skills' descriptions overlap so much the agent can't decide which to use:
- Compare their workflows. If 80% identical, merge. Keep the better name.
- If they target different *phases* (e.g. one is pre-execution, one post-), make the descriptions explicit about phase.
### Memory-store overgrowth
If the `memories` table has thousands of rows:
- Archive low-confidence, never-hit rows older than 30 days.
- Dedupe by content similarity (cheap heuristic: identical first 80 chars).
- Reset `hit_count` on rows that haven't been recalled in 90 days — let them earn their slot.
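The three heuristics above can be sketched as pure functions. The row shape and the 0.5 "low-confidence" cut-off are assumptions for illustration, not the real `memories` schema:

```typescript
// Assumed row shape; the real `memories` table will differ.
interface MemoryRow {
  id: number;
  content: string;
  confidence: number;       // 0.0-1.0
  hit_count: number;
  created_days_ago: number;
  last_recall_days_ago: number;
}

// Dedupe by the cheap heuristic: identical first 80 chars.
function dedupe(rows: MemoryRow[]): MemoryRow[] {
  const seen = new Set<string>();
  return rows.filter((r) => {
    const key = r.content.slice(0, 80);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Archive low-confidence (assumed < 0.5), never-hit rows older than 30 days.
function toArchive(rows: MemoryRow[]): MemoryRow[] {
  return rows.filter(
    (r) => r.confidence < 0.5 && r.hit_count === 0 && r.created_days_ago > 30,
  );
}

// Reset hit_count on rows not recalled in 90 days; they earn their slot back.
function decay(rows: MemoryRow[]): MemoryRow[] {
  return rows.map((r) =>
    r.last_recall_days_ago > 90 ? { ...r, hit_count: 0 } : r,
  );
}
```

Prefer running these through the sf memory tools rather than raw SQL, for the same invariant reasons as Step 4.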
### Decision drift
If `.sf/DECISIONS.md` and `docs/dev/ADR-*.md` diverge:
- ADRs are the human-readable trail; DECISIONS.md is sf's tool-managed copy.
- Use `sf_decision_save` to update DECISIONS.md (it regenerates the file). Update the matching ADR by hand.
- If they're already drifted, write a synthesis ADR pointing at both and supersede the older entries.
## Rules
- **Ask the user about goals, not implementation.** sf's context is for sf — don't ask the user about structural preferences for `.sf/` files. Ask how they want sf to *behave*.
- **Don't blind-apply edits to protected files.** List, propose, wait.
- **Verify before committing.** Read each touched file end-to-end after edits.
- **Don't auto-push.** The user reviews context-doctor diffs before they go to remote.
- **Repair, don't rewrite.** A 50-line edit to a 500-line file beats a full rewrite — the diff is the audit trail.

---
name: creating-skills
description: Guide for creating effective skills inside sf. Use when writing a new skill or significantly updating an existing one. Covers concise design, progressive disclosure, sf-specific conventions, and what NOT to include.
---
# Creating Skills
This skill teaches how to write effective skills under `src/resources/extensions/sf/skills/`. Skills are modular, self-contained packages that extend the agent's capabilities with specialised workflows, domain knowledge, and procedural patterns.
For the broader Agent Skills specification, see [agentskills.io/specification](https://agentskills.io/specification). This skill covers sf-specific conventions and design patterns on top of that.
## Core Principles
### Concise is Key
The context window is a public good. Skills share it with everything else: system prompt, conversation history, the actual user request, *other* skills' metadata.
**Default assumption: the agent is already very capable.** Only add context the agent doesn't already have. Challenge each piece of information:
- "Does the agent really need this explanation?"
- "Does this paragraph justify its token cost?"
Prefer concise examples over verbose explanations. A skill that reads like a textbook is bloating the context.
### Set Appropriate Degrees of Freedom
Match specificity to the task's fragility and variability:
| Freedom | Use when | Example |
|---|---|---|
| **High** (text instructions) | Multiple approaches valid; decisions depend on context | "Choose the test type that matches the behaviour." |
| **Medium** (pseudocode / parameterised scripts) | A preferred pattern exists; some variation OK | "Use this template; adjust the model based on cost." |
| **Low** (specific scripts, fixed parameters) | Operations are fragile; consistency is critical | "Run exactly: `npm run typecheck:extensions`." |
Think of the agent walking a path: a narrow bridge with cliffs needs guardrails (low freedom); an open field allows many routes (high freedom).
## Anatomy of a Skill
```
src/resources/extensions/sf/skills/<skill-name>/
├── SKILL.md # Required: metadata + instructions
├── scripts/ # Optional: executable code (TS/bash)
├── references/ # Optional: docs loaded into context as needed
└── assets/ # Optional: templates, fixtures used in output
```
### `SKILL.md` (required)
Contains:
- **Frontmatter (YAML)**: `name` and `description`. These two fields are *all* the agent sees to decide whether to invoke the skill, so be specific and comprehensive.
- **Body (Markdown)** — instructions and guidance. Loaded only after the skill triggers (if at all).
### Scripts (`scripts/`)
Executable code for tasks needing deterministic reliability or repeatedly rewritten logic.
- **When to include**: same code is being rewritten repeatedly; deterministic reliability needed.
- **Example**: `scripts/score-spec.ts` for computing a numeric quality score on a markdown document.
- **Benefits**: token-efficient (script is *executed*, not read into context); deterministic; consistent.
- **Note**: scripts may still need to be read by the agent for patching or environment-specific tweaks.
### References (`references/`)
Docs loaded into context as needed.
- **When to include**: documentation the agent should reference while working on a specific sub-task.
- **Examples**: `references/phase-machine.md`, `references/provider-routing.md`, `references/sops-secrets.md`.
- **Benefits**: keeps `SKILL.md` lean; loaded only when the agent decides it's needed.
- **Best practice**: if a file is large (>10k words), include grep search patterns in `SKILL.md` so the agent can pinpoint sections without loading the whole file.
### Assets (`assets/`)
Files not loaded into context, but used in the output the skill produces.
- **Examples**: `assets/spec-template.md`, `assets/jsdoc-purpose-snippet.ts`.
- **Benefits**: separates output resources from documentation; copied or modified into the result.
## Progressive Disclosure
Skills use a three-level loading system:
1. **Metadata (name + description)** — always in context (~100 words).
2. **`SKILL.md` body** — loaded when the skill triggers (target <500 lines, <5k words).
3. **Bundled resources** — loaded as needed; scripts can execute without reading.
### Patterns
**Pattern 1 — high-level guide with references**
```markdown
# Doctrine
## Quick start
<core workflow>
## Advanced
- **Subsystem A**: see [REFERENCE-A.md](references/A.md)
- **Subsystem B**: see [REFERENCE-B.md](references/B.md)
```
**Pattern 2 — domain-specific organisation**
For multi-domain skills, split by domain to avoid loading irrelevant context:
```
porting-from-upstream/
├── SKILL.md (overview + which-upstream selection)
└── references/
├── pi-mono.md (cherry-pick patterns)
├── gsd-2.md (manual port + naming translation)
└── bunker.md (skill harvest from remote host)
```
**Pattern 3 — conditional details**
Show core content; link to advanced:
```markdown
## Standard slice
<core flow>
**For multi-file slices:** see [WAVE.md](references/WAVE.md).
**For runtime/provider slices:** see [PROVIDER.md](references/PROVIDER.md).
```
### Important guidelines
- **Avoid deeply nested references.** Keep references one level deep from `SKILL.md`. Linking from `references/A.md` to `references/B.md` is fine; further is hard to discover.
- **Structure long reference files.** Files >100 lines should have a table of contents at the top so the agent can preview scope.
- **Avoid duplication.** Information lives in either `SKILL.md` or a reference file, not both. Prefer references for detail; keep `SKILL.md` for procedure.
## What NOT to Include
A skill should contain only essential files that directly support its functionality. Do **not** create:
- `README.md`
- `INSTALLATION_GUIDE.md`
- `QUICK_REFERENCE.md`
- `CHANGELOG.md`
- Any meta-documentation about the skill-creation process.
The skill exists for the agent to do the job. Auxiliary context about how the skill was made adds clutter.
## Frontmatter Conventions
```yaml
---
name: <kebab-case-skill-name>
description: <one-sentence what + when. Include trigger keywords.>
---
```
### `name`
- Lowercase letters, digits, hyphens.
- Must match the directory name exactly.
- Use *gerund* form for action skills: `creating-skills`, `dispatching-coding-agents`, `clarifying-spec`.
- Use *noun* form for doctrine skills: `code-review`, `advisory-partner`, `spec-first-tdd`.
- Max 64 chars.
### `description`
- Third person: "Dispatches stateless coding agents…", not "I help you dispatch…".
- State both *what* it does AND *when* to use it.
- Include trigger keywords the agent will recognise — phrases the user is likely to say.
- Max 1024 chars.
Example:
```yaml
---
name: dispatching-coding-agents
description: Dispatch stateless coding agents (Claude Code, Codex) via Bash as subagents. Use when stuck on a hard problem, want a second opinion, or need parallel research. They have no memory — provide all context.
---
```
## Body Writing Guidelines
- **Imperative or infinitive form.** "Use git worktrees" not "You should use git worktrees".
- **Tables for choices**, prose for procedure. Tables compress; prose expresses sequence.
- **Concrete file paths** in examples. `src/resources/extensions/sf/auto/loop.ts:140` beats "the auto-loop file".
- **Cross-link siblings.** When this skill hands off to another, link to it: `[finish-and-verify](../finish-and-verify/SKILL.md)`.
- **Cite the doctrine.** When a rule comes from `docs/SPEC_FIRST_TDD.md` or `AGENTS.md`, link those — don't restate them.
## Skill Creation Process
### 1. Understand with concrete examples
Before drafting, collect 2-3 concrete examples of how the skill will be used.
- "What user phrases should trigger this?"
- "What does the agent need to do that it can't already?"
- "What's the worst-case failure if this skill is wrong?"
If you can't name 2 concrete examples, the skill is probably premature.
### 2. Plan reusable contents
For each example, identify what scripts / references / assets would help repeated execution. The "is this code rewritten every time?" test is a good signal a `script` is worth bundling.
### 3. Initialise the skill
```bash
mkdir -p src/resources/extensions/sf/skills/<name>/{scripts,references,assets}
$EDITOR src/resources/extensions/sf/skills/<name>/SKILL.md
```
Delete unused subdirs — most skills only need `SKILL.md`.
### 4. Write the body
Open with one paragraph stating purpose. Then *when to run*. Then procedure (numbered steps if sequential, sections if not).
End with rules / red flags / cross-links to siblings — not a "summary" or "conclusion" section.
### 5. Test the skill
Trigger it on a real task. Notice what's confusing, what's missing, what's redundant. Iterate.
### 6. Iterate
Most skills get 3-5 iterations after first use. Common revisions:
- Tighten the description (the trigger keywords were wrong).
- Move detail from `SKILL.md` into `references/` (body is bloating).
- Add an example you wish you'd had on first use.
## Rules
- **Tools the skill assumes** must exist in sf. Don't reference Letta-MCP tools, claude-flow CLIs, or anything else not in this repo.
- **Cite real paths.** Every file path in a skill should resolve.
- **Sibling links must work.** When a skill says "see X-skill", that skill must exist.
- **Rule of three.** Don't write a skill until you've done the same task three times. The first two are how you learn what's stable.
- **Don't write a skill the agent can derive from existing context.** A skill that only lists files in `src/` is worse than `ls`.

---
name: dispatching-subagents
description: Dispatch sf's internal subagents — single, parallel, debate, or chain. Use when stuck on a hard problem, need parallel research from multiple lenses, or want adversarial / role-based debate before locking a decision. Stays inside sf's own agent fabric — no shelling out to external CLIs.
---
# Dispatching Subagents
sf has a built-in subagent fabric. Use the `subagent` tool to spawn sub-agents inside the same sf session: single delegation, parallel research, bounded debate rounds, chain pipelines, and swarm-style dispatch with parent synthesis. They share sf's model routing, allowed providers, memory store, and MCP tools — they are *sf agents*, not external coding-agent CLIs.
This skill is sf-internal only. **Do not** shell out to external `claude`, `codex`, or other coding-agent CLIs from inside sf — that breaks the harness boundary, bypasses sf's model routing, and loses traceability into `.sf/traces/`.
## When to Dispatch
| Need | Pattern |
|---|---|
| Hard problem, want fresh eyes | Single subagent (validation tier). |
| Multiple hypotheses to explore | Parallel batch — one subagent per hypothesis. |
| Adversarial review of a plan | Debate mode with advocate + challenger, or [`advisory-partner`](../advisory-partner/SKILL.md). |
| **Multi-stakeholder critique of a milestone roadmap** | Debate mode or parallel swarm: PM / User / Combatant / Architect / Specialist (5-8 subagents). |
| Pre-execution gate evaluation | sf's built-in Q3 / Q4 gates — already wired in `gate-evaluate.md`. |
| Post-execution milestone review | sf's `validate-milestone` — already 3 parallel reviewers. |
Don't dispatch a subagent for tasks the parent agent can do in 2-3 tool calls. Dispatching pays for its overhead only when the task is large enough or the parallelism actually buys something.
## The `subagent` Tool
sf's `subagent` tool dispatches one or more sub-agents that share the parent session's allowed providers, memory store, and tool surface, but run with their own context and model selection.
### Single subagent
```
subagent({
agent: "worker",
task: "<task instructions, full context, expected output>"
})
```
The subagent runs to completion and returns its final output as a string. No context inheritance from the parent — provide everything in the prompt.
### Parallel batch (swarm)
```
subagent({
tasks: [
{ agent: "reviewer", task: "Advocate for the design. Cite repo evidence. ..." },
{ agent: "reviewer", task: "Challenge the design. Cite repo evidence. ..." },
{ agent: "security", task: "Audit the design for security failure modes. ..." },
{ agent: "tester", task: "Find missing proof and test coverage. ..." }
]
})
```
All tasks run concurrently. The tool returns one result per task, preserving task order and agent names. Use `tasks` whenever you can — sf's auto-loop already accounts for parallel subagent budgets.
### Debate batch
```
subagent({
mode: "debate",
rounds: 3,
tasks: [
{ agent: "reviewer", task: "Advocate for <design>. Cite repo evidence." },
{ agent: "reviewer", task: "Attack <design>'s strongest assumption and propose an alternative." },
{ agent: "planner", task: "Moderate from architecture and delivery risk. End with a recommendation." }
]
})
```
Debate mode runs bounded rounds. In each round, every participant sees the previous rounds' transcript, so the challenger can engage the advocate's strongest defence instead of firing a parallel monologue.
Use debate mode for high-stakes decisions, plan review, architecture review, and migrations where the cost of a weak plan is high. Keep `rounds` between 2 and 3 by default; max is 5.
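A minimal sketch of what round-robin debate implies, assuming a `runAgent` stand-in for whatever actually executes one subagent (it is not a real sf API):

```typescript
interface DebateTask { agent: string; task: string; }

// Stand-in for sf's real subagent execution; assumed signature.
type RunAgent = (agent: string, prompt: string) => Promise<string>;

async function debate(
  tasks: DebateTask[],
  rounds: number,
  runAgent: RunAgent,
): Promise<string[]> {
  const transcript: string[] = [];
  for (let round = 1; round <= rounds; round++) {
    for (const t of tasks) {
      // Every participant sees all prior turns, so round N can engage
      // round N-1's strongest defence instead of monologuing in parallel.
      const prompt =
        transcript.length === 0
          ? t.task
          : `${t.task}\n\nPRIOR TRANSCRIPT:\n${transcript.join("\n")}`;
      const reply = await runAgent(t.agent, prompt);
      transcript.push(`[round ${round}] ${t.agent}: ${reply}`);
    }
  }
  return transcript;
}
```

The important property is the prompt construction: only the first turn runs context-free; every later turn carries the full transcript.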
### Agent selection and model overrides
sf routes subagents through agent definitions in `src/resources/agents/`, `~/.sf/agent/agents/`, or project `.sf/agents/`. The actual tool schema uses `agent`, `task`, optional per-task `model`, optional `cwd`, plus batch-level `mode`/`rounds` for debates.
- `planner` — architecture and implementation planning; conflicts with active sf planning phases.
- `scout` — fast codebase recon.
- `researcher` — web/current-info research.
- `reviewer` — independent code/design review.
- `tester` — tests and coverage gaps.
- `security` — security audit.
- `worker` — general-purpose execution.
Use `model` only when you need an explicit override:
```
{ agent: "reviewer", task: "...", model: "claude-sonnet-4-5" }
```
## Patterns
### Pattern 1 — Parallel research
Three subagents investigate three hypotheses simultaneously; the parent synthesises:
```
subagent({
tasks: [
{ agent: "scout", task: "Investigate whether the claim() race could leak past the conditional UPDATE under SQLite's WAL semantics. Cite files and line numbers. ..." },
{ agent: "scout", task: "Investigate whether claim_until expiry handles clock skew between hosts. ..." },
{ agent: "scout", task: "Investigate whether a crashed worker's claim can be resurrected after a stale-lock cleanup. ..." }
]
})
```
Each returns a paragraph; the parent picks the highest-confidence finding.
### Pattern 2 — Adversarial review (advocate + challenger)
For non-trivial decisions, pressure-test before locking:
```
subagent({
mode: "debate",
rounds: 3,
tasks: [
{ agent: "reviewer", task: "Argue the strongest case FOR <design>. Cite repo evidence. End with: ADVOCATE_VERDICT: <PROCEED | CAVEAT>." },
{ agent: "reviewer", task: "Attack <design>'s strongest assumption AND propose an alternative. Anchored attacks don't count — propose a different framing. End with: CHALLENGER_VERDICT: <design wrong if [observable condition]>." }
]
})
```
The parent synthesises the advocate's strongest support, the challenger's strongest objection, and either answers the objection or accepts it as residual risk. Debate mode gives later rounds the prior transcript.
This is the pattern [`advisory-partner`](../advisory-partner/SKILL.md) wraps. Use that skill rather than reinventing if the subject is a plan or decision being reviewed.
### Pattern 3 — Multi-role swarm
For high-stakes milestone planning, spawn a stakeholder swarm (the same roles as the Vision Alignment Meeting in `plan-milestone`):
```
subagent({
mode: "debate",
rounds: 2,
tasks: [
{ agent: "planner", task: "Product Manager view: what is the real product move? What should the roadmap prove? ..." },
{ agent: "reviewer", task: "User Advocate view: what must matter for UX and trust? ..." },
{ agent: "reviewer", task: "Combatant view: why is this roadmap wrong, overbuilt, or solving the wrong thing? ..." },
{ agent: "planner", task: "Architect view: system fit and sequencing synthesis. ..." },
{ agent: "researcher", task: "Researcher view: comparable products, OSS tools, market expectations. ..." },
{ agent: "planner", task: "Delivery Lead view: smallest credible milestone sequence and scope cuts. ..." }
]
})
```
The parent does **weighted synthesis**, not majority vote. Confidence-by-area trumps headcount.
### Pattern 4 — Code review at slice close
Already wired into `validate-milestone` (3 parallel reviewers: Requirements Coverage, Cross-Slice Integration, Acceptance Criteria) — invoke the prompt rather than rolling your own.
For non-milestone changes, use [`requesting-code-review`](../requesting-code-review/SKILL.md) which dispatches advocate + challenger-A + challenger-B before submitting.
## Prompt Template (per subagent)
```
TASK: <one-sentence summary>
CONTEXT:
- Repo: /home/mhugo/code/singularity-forge
- Key files: <list specific paths and what they contain>
- Architecture: <brief relevant context (phase machine, harness boundary, etc.)>
- Doctrine: AGENTS.md and docs/SPEC_FIRST_TDD.md
WHAT TO DO:
<what you need done; be precise about scope, let them choose method>
CONSTRAINTS:
- Follow the Iron Law (failing test first, no completion without a real consumer).
- <preferences, patterns, things to avoid>
- <what the parent has already tried, if dispatching because stuck>
OUTPUT:
<expected structure: JSON, bullet list, or paragraph; end with a verdict line if applicable>
```
Things to make explicit because the subagent has no context inheritance:
- **Specific files** (paths, line ranges) — "look at `src/resources/extensions/sf/auto/loop.ts:140-180`", not "the auto-loop".
- **Output format** — don't leave it open-ended.
- **Constraints** — sf-specific rules (`AGENTS.md`, `.sf/DECISIONS.md`).
- **Falsifier** for review tasks — what observable condition would change the verdict.
## Synthesis
After a parallel or debate batch returns, the parent agent **must** synthesise. The synthesis goes back into the unit's artefacts so the next phase sees it:
- For research swarms: pick the highest-confidence finding; document why the others were rejected.
- For adversarial reviews: state the strongest support and strongest objection; either answer the objection or accept it as residual risk in writing.
- For stakeholder swarms: weighted synthesis with confidence-by-area; do not majority-vote.
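For the stakeholder case, weighted synthesis can be sketched as follows. The verdict shape and the area grouping are illustrative, not a real sf structure:

```typescript
interface SwarmVerdict {
  agent: string;
  area: string;        // the lens this voice actually covers
  verdict: "proceed" | "revise";
  confidence: number;  // self-reported, 0.0-1.0
}

function weightedSynthesis(results: SwarmVerdict[]): "proceed" | "revise" {
  // Confidence-by-area trumps headcount: each area contributes only its
  // single most confident voice, then verdicts are weighed by confidence.
  const byArea = new Map<string, SwarmVerdict>();
  for (const r of results) {
    const best = byArea.get(r.area);
    if (!best || r.confidence > best.confidence) byArea.set(r.area, r);
  }
  let proceed = 0;
  let revise = 0;
  for (const r of byArea.values()) {
    if (r.verdict === "proceed") proceed += r.confidence;
    else revise += r.confidence;
  }
  return proceed >= revise ? "proceed" : "revise";
}
```

Note the failure mode this avoids: three lukewarm "proceed" voices do not outvote one high-confidence "revise" from the lens that actually owns the risk.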
Persist non-trivial syntheses to memory:
```
sf_save_memory(
category="design-synthesis",
content="<one-line synthesis of the swarm result> — slice <id>",
confidence=<0.0-1.0>
)
```
## Hard Rules
- **Stay inside sf.** Do not shell out to `claude`, `codex`, or any other external coding-agent CLI. sf's own subagent fabric handles all of this with proper trace integration.
- **Use the `validation` tier** for adversarial / advisory subagents. The whole point of an advisory-partner is that it's *not* the planning model.
- **Don't reuse the same lens twice in one swarm.** If two challengers attack from the same angle, you have one challenger with two voices — not two perspectives.
- **Pass the doctrine paths** in every subagent prompt: `AGENTS.md`, `docs/SPEC_FIRST_TDD.md`, the relevant `.sf/DECISIONS.md` row. The subagent has no context inheritance.
- **Synthesise — don't just collect.** A swarm result without synthesis is a list of opinions. The parent owns the decision.
- **If all subagents report low confidence, gather better evidence** rather than spawning more opinions. More voices saying "I don't know" is not signal.
## Failure Modes
- **Garbage subagent output** — prompt was too vague. Rewrite with specific paths and clearer expected output.
- **Subagent contradicts the parent's framing** — your framing was probably wrong. Re-verify facts before synthesis.
- **All subagents agree quickly** — check whether you accidentally framed the prompts toward a single conclusion. Adversarial pairs that always agree aren't adversarial.
- **Subagent timeout** — break the task into smaller dispatches; for research, prefer a `research` tier model with a longer context window.
## Cross-References
- [`advisory-partner`](../advisory-partner/SKILL.md) — the canonical adversarial-review framework. Wraps the advocate / challenger / falsifier pattern.
- [`brainstorming`](../brainstorming/SKILL.md) — invokes parallel research and adversarial subagents during the design step.
- [`code-review`](../code-review/SKILL.md) — the multi-lens review skill; can be parallelised by dispatching one subagent per lens.
- [`gate-evaluate.md`](../../prompts/gate-evaluate.md) (prompt) — pre-execution Q3/Q4 gates dispatched as parallel subagents.
- [`validate-milestone.md`](../../prompts/validate-milestone.md) (prompt) — post-execution milestone validation with 3 parallel reviewers.
## Future: Full Swarm Chat
Round-robin debate mode exists now: `subagent({ mode: "debate", rounds: 3, tasks: [...] })`.
Still deferred:
- **Full swarm chat** — agent-to-agent `send_message` during an ephemeral swarm, scoped by `swarm_id` with a TTL. Reuses the persistent-agent inbox machinery from `SPEC.md` §17-18. Best fit for open-ended multi-stakeholder Vision Alignment Meeting use cases.
Full design, sequencing, risks, and implementation sketch in **[ADR-011](../../../../../../docs/dev/ADR-011-swarm-chat-and-debate-mode.md)**. Tracked in [`BUILD_PLAN.md`](../../../../../../BUILD_PLAN.md) (Tier 1+ active follow-ups).

---
name: finish-and-verify
description: Use for the last mile of a slice or task. Rerun verification, inspect the diff, persist evidence, optionally commit/push. The slice-done gate that prevents premature completion claims.
---
# Finish And Verify
## Purpose
The closing gate for a slice or task. Stops "looks green to me" from becoming "marked complete" without fresh verification, without consumer-path proof, and without recorded evidence.
Use this skill to:
- Rerun final verification (typecheck, tests, lint).
- Inspect the actual diff before declaring done.
- Persist evidence to the unit artefacts (`.sf/active/{unit-id}/`).
- Commit / push when the user asks.
This skill does NOT decide whether a slice is shippable in isolation — that's `validate-milestone` or `complete-slice`. This skill ensures the artefact is honest before those gates run.
## When to Run
- After `spec-first-tdd` reaches GREEN on the contract test.
- After a debugging fix from `systematic-debugging`.
- Before calling `sf_complete_task` or `sf_complete_slice`.
- Inside an autonomous iteration loop, between slices.
If used inside an autonomous iteration loop and the user goal is still in progress:
1. Verify the current slice.
2. Record evidence.
3. Refresh `.sf/active/{unit-id}/active-slice.md` if one exists.
4. **Return control to the loop. Do not stop just because the slice is green.**
## Slice Done Gate
Do not declare a slice done until **all** are true:
- [ ] Contract test is green.
- [ ] Required component verification is green (`npm run typecheck:extensions`, `npm test`, lint clean for touched files).
- [ ] Consumer-path check is explicit: `rg` confirms a real caller still depends on the changed symbol.
- [ ] Active-slice artefact (`.sf/active/{unit-id}/active-slice.md` or equivalent) is updated with current state.
- [ ] Falsifier from the plan was either checked or admitted as residual risk.
- [ ] No slice-local next step remains undocumented.
If any item fails: do not mark done. Either fix the gap or capture it as a follow-up unit.
## Final Verification Matrix
```bash
# State
git status --short
git diff --stat
git diff # actually look at it
# Compile + check
npm run build:core
npm run typecheck:extensions
# Test
npm run test:unit # full unit suite
npm run test:integration # integration if relevant
# Lint (when configured)
npx eslint <touched-files>
```
For changes in `packages/<pkg>/`:
```bash
cd packages/<pkg> && npx tsc --noEmit
```
For native (Rust) changes:
```bash
npm run build:native
ldd native/npm/linux-x64-gnu/forge_engine.node | grep -E "not found" || echo "OK"
```
## Git Workflow
Stage only task-relevant files in a dirty worktree. Never `git add -A` blindly.
```bash
git add path/to/changed-file.ts path/to/changed-file.test.ts
git commit -m "type(scope): short description"
```
Commit message: Conventional Commits format (`feat:`, `fix:`, `refactor:`, etc.) — required by the commit-msg hook in this repo. Reference the unit/issue when relevant.
Push only on explicit user request:
```bash
git push origin HEAD
```
Never `--force` to a shared branch unless the user specifically asks.
## Evidence to Persist
Before declaring done, capture in `.sf/active/{unit-id}/`:
- The failing test that motivated the change (test name + commit SHA before fix).
- The passing test result (test name + commit SHA after fix).
- Lint result.
- Build result when relevant.
- The consumer path (one line: "called by `<file>:<symbol>`").
- The value at risk (one line: "if wrong, `<consequence>`").
For non-trivial slices:
```markdown
## Evidence Markers
- **Observed**: <facts from runtime/test output>
- **Inferred**: <intended contract these facts support>
- **Proposed**: <change applied>
- **Confidence**: <0.0-1.0>, <one-line reason>
- **Falsifier**: <observable condition that would prove this wrong>
- **Reflection**: <weakest assumption, next verification step>
```
For runtime/provider/transport changes, also capture explicit before/after repro evidence — traces alone are insufficient when boundary behaviour changed.
## Rules
- Do not claim completion without fresh verification — even if you ran tests 5 minutes ago, run them again after the last edit.
- Do not claim deployment success from a `docker-compose up` or `npm run build` alone — verify a real consumer path executes.
- Stage only task-relevant files in a dirty worktree.
- Never bypass the commit-msg hook (`--no-verify`) without explicit user authorisation.
- Never `--force-push` to a shared branch without authorisation.
- If verification fails, do not mark done — either fix it or escalate via `systematic-debugging`.

---
name: purpose-driven-development
description: Purpose-Driven Development (PDD) for non-trivial sf changes. Use when implementing or reviewing behaviour changes, bug fixes, or refactors where purpose, consumer, contract, failure boundary, and evidence should be explicit before any code is written. Companion to spec-first-tdd — PDD scopes the change; TDD pins it with a test.
---
# Purpose-Driven Development
Use this skill for non-trivial work where implementation can drift away from user-visible behaviour. PDD makes the work explicit before code so that the test, the implementation, and the evidence all answer the same question.
This is the lightweight scoping companion to [`spec-first-tdd`](../spec-first-tdd/SKILL.md). PDD answers *what is this change?*; spec-first-tdd answers *how do we prove it?*. Use PDD first when scope is unclear; jump straight to spec-first-tdd when the contract is already obvious.
## What PDD Means
Before coding, name each of the following explicitly. If any one is missing, the change is underspecified — go back.
| Field | Question it answers |
|---|---|
| **Purpose** | What outcome must become true? |
| **Consumer** | Who depends on that outcome? (production caller, not a test) |
| **Contract** | What observable behaviour proves success? |
| **Failure boundary** | What does *correct failure* look like if the purpose can't be fulfilled? |
| **Evidence** | What test, repro path, or smoke check proves the contract? |
| **Non-goals** | What is this change *not* solving? |
| **Invariants** | What must remain true while making the change? |
These are the same fields the **Purpose Gate** in [`docs/SPEC_FIRST_TDD.md`](../../../../../../docs/SPEC_FIRST_TDD.md) requires of every artefact. PDD is how you fill them in before you start.
## Workflow
1. State the **purpose** in one or two sentences.
2. Name the **consumer** precisely — file, function, route, command. If you can't, stop.
3. Define the observable **contract** — what the consumer receives, not what the implementation does internally.
4. Define the **failure boundary** — degradation is not crash; surface, don't swallow.
5. State **non-goals** and **invariants**.
6. Choose the **evidence** before changing code — what test or repro will prove the contract is met.
7. Write or identify the failing behaviour test or repro for non-trivial work (hand off to `spec-first-tdd`).
8. Implement the *minimum* change that satisfies the contract.
9. Verify using the evidence chosen in step 6 — not a different one chosen post-hoc.
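The fields from steps 1-6 can be sketched as a completeness gate. The packet type and camel-cased field names are illustrative:

```typescript
// Illustrative packet shape; field names mirror the PDD table above.
interface PddPacket {
  purpose: string;
  consumer: string;
  contract: string;
  failureBoundary: string;
  evidence: string;
  nonGoals: string;
  invariants: string;
}

// If any field is missing or blank, the change is underspecified: go back.
function missingFields(p: Partial<PddPacket>): string[] {
  const required: (keyof PddPacket)[] = [
    "purpose", "consumer", "contract", "failureBoundary",
    "evidence", "nonGoals", "invariants",
  ];
  return required.filter((k) => !p[k] || p[k]!.trim() === "");
}
```

An empty `missingFields` result is necessary but not sufficient; it checks presence, not whether the contract is observable or the consumer real.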
## Rules
- **Behaviour is the spec; mechanics are secondary.** A test that asserts how the implementation works internally is a rubber stamp.
- **Do not write tests that mirror the current implementation.** They lock in bugs and break on refactor.
- **Prefer observable outcomes over call counts and internal wiring assertions.** `mockFn.calls.length === 2` is mechanical, not purposeful.
- **If a bug blocks the purpose, fix the bug — do not route around it.** A workaround that leaves the bug in place is an anti-pattern.
- **If the purpose is unclear, stop and clarify it before coding** (use `clarify-spec`).
- **Failure boundary is part of the contract.** A function that "works in the happy path" but corrupts state on error has not satisfied PDD.
## Good Contracts
Observable, consumer-facing, falsifiable:
- "Startup stays in the normal flow when the SOPS secret file is reachable; degrades to env-var-only mode (with a single warning) when it isn't."
- "Cancelling a follow-up removes the scheduled task from `.sf/runtime/tasks.json` and emits one cancellation event."
- "Background task completion emits exactly one user-visible notification."
- "`sf headless --output-format json` writes a single newline-terminated JSON object to stdout — never partial JSON, never multiple objects."
## Weak Contracts
Mechanical, internal-wiring, refactor-fragile — rewrite these:
- `validateCredentials()` should call `fetch()` twice.
- The cache layer should set `Map.size === 3` after warm-up.
- `claim()` should invoke `db.update()` with `where.id === unitId` exactly once.
These can be useful as labelled implementation guards — but they are not the primary contract.
## Failure-Boundary Examples
| Purpose | Failure boundary |
|---|---|
| Native engine handles grep at full speed. | If `forge_engine.node` fails to load, fall back to JS implementation, log degraded mode once, do not crash. |
| Provider lookup returns a model. | If no provider in `allowed_providers` has a working API key, raise `ErrModelUnavailable` with the list of providers tried — never silently route to a disallowed model. |
| Auto-loop dispatches the next eligible unit. | If no unit is eligible AND no unit is waiting on a recoverable blocker, terminate the loop with verdict `idle` — never spin. |
A function without a stated failure boundary either doesn't fail (very rare) or fails in undefined ways (very common).
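The native-engine row above, sketched as code (all names hypothetical): the shape that matters is fall back, warn once, never crash.

```typescript
type GrepFn = (pattern: string, text: string) => string[];

// JS fallback used when the native binding fails to load.
const jsGrep: GrepFn = (pattern, text) =>
  text.split("\n").filter((line) => line.includes(pattern));

let warnedDegraded = false;

function loadGrepEngine(loadNative: () => GrepFn): GrepFn {
  try {
    return loadNative();
  } catch {
    if (!warnedDegraded) {
      warnedDegraded = true; // log degraded mode once, not per call
      console.warn("forge_engine.node unavailable; falling back to JS grep");
    }
    return jsGrep; // degrade, don't crash
  }
}

const grep = loadGrepEngine(() => {
  throw new Error("forge_engine.node: not found");
});
const hits = grep("err", "ok\nerr: boom\nerr: again"); // still functional
```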
## When to Skip PDD
Skip this skill for trivial changes:
- Typo fixes.
- Lint/format-only changes.
- Dependency bumps that don't change behaviour.
- Renames that don't cross module boundaries.
For everything else — feature work, bug fixes, refactors that touch behaviour, anything user-visible — PDD takes 5 minutes and prevents 50 minutes of wrong implementation.
## Output
Persist the PDD packet to the unit's artefacts so the next phase (TDD) and the reviewer (`requesting-code-review`) start from the same frame:
```markdown
## PDD — <slice or change name>
- **Purpose**:
- **Consumer**:
- **Contract**:
- **Failure boundary**:
- **Evidence**:
- **Non-goals**:
- **Invariants**:
```
Save to `.sf/active/{unit-id}/pdd.md` (or inline at the top of the slice plan). When the slice completes, this packet feeds the `Evidence` block of `requesting-code-review`.

View file

@ -0,0 +1,173 @@
---
name: receiving-code-review
description: Use when receiving code review feedback for sf. Verify before implementing, push back with evidence when wrong, no performative agreement. Actions over words.
---
# Receiving Code Review
## Core Principle
Verify before implementing. Technical correctness over social comfort. Actions over words.
A reviewer's job is to find real issues. Your job is to verify each point and either fix it or push back with evidence — not to agree quickly so the conversation ends.
## Skill Chain
Side-chain skill — activates when feedback arrives on already-delivered work.
```
← prev: (invoked when review feedback arrives)
→ next: return to spec-first-tdd / wave-implementation if the fix needs new work,
or finish-and-verify if it's a small in-scope fix
```
If the fix is trivial (lint, style, naming): apply it inline and return to `finish-and-verify` without stopping.
If the fix requires new contract work: invoke `spec-first-tdd` (write a failing regression test first; the reviewer found a real behaviour gap → that's a missing test).
Do not pause for approval mid-fix unless the review widens scope or changes the approved plan materially.
## The Response Pattern
```
1. READ: Complete feedback without reacting.
2. UNDERSTAND: Restate the requirement — or ask for clarification.
3. VERIFY: Check against the codebase. Tests, callers, history.
4. EVALUATE: Is this technically sound for sf's contract?
5. RESPOND: Technical acknowledgment, or reasoned pushback.
6. IMPLEMENT: One item at a time. Test each. Verify each.
```
## Forbidden Responses
Never:
- "You're absolutely right!"
- "Great point!"
- "Let me implement that now" (before verification).
- Any thanks/gratitude expression.
Instead:
- Restate the requirement.
- Ask for clarification.
- Push back with evidence.
- Or just fix it and let the diff speak.
## Verify Before Implementing
Before changing any code based on review feedback:
```bash
# Does the suggested pattern exist elsewhere?
rg -n "<suggested pattern>" src/ packages/
# Who depends on the symbol the reviewer wants changed?
rg -nF "<symbol>" src/ packages/
# Why is this written this way? (look at history before disagreeing)
git log --follow -p -- <file>
git blame -L <range> <file>
# Do existing tests cover the contract the reviewer wants?
rg -n "<test-name-pattern>" src/ packages/
```
Search memory for prior context the reviewer might be unaware of:
```
sf_search_memories(query="<topic of feedback>", limit=5)
```
If you can't verify, say so: *"I can't verify this without checking <X>. Investigating first."*
Label triage claims:
- `Observed:` directly from code, tests, logs, traces, or tool output.
- `Inferred:` conclusion supported by observed evidence.
- `Proposed:` rebuttal or implementation not yet validated.
For non-trivial runtime/provider review items, verification must include:
- An explicit repro/debug pass for the failing boundary.
- The same repro/debug pass after the fix.
- Not just trace reading or code inspection.
## sf-Specific Checks Before Implementing a Suggestion
Before changing code based on a suggestion, verify it doesn't violate sf's invariants:
- **Iron Law**: Does the suggestion ask you to skip the failing test → fix → green cycle? Push back. The bug is a missing test; write it first.
- **Purpose Gate**: Does the suggestion add an exported symbol without a Purpose / Consumer / Value-at-risk? Push back.
- **YAGNI**: Does the suggestion add abstraction with zero current callers? Run `rg` for callers — if zero, push back: *"Nothing calls this in production. Adds dead code. Remove per YAGNI?"*
- **Recent decisions**: If the suggestion contradicts a `.sf/DECISIONS.md` entry or a recent ADR (`docs/dev/ADR-*.md`), check the history before implementing or pushing back.
- **Self-modification boundary**: If the suggestion edits a protected file (`SPEC.md`, `BUILD_PLAN.md`, `AGENTS.md`, etc.) without explicit human approval, push back — those need human sign-off.
## When to Push Back
Push back when:
- The suggestion violates the Iron Law or Purpose Gate.
- The suggestion introduces dead code (YAGNI).
- The suggestion breaks existing behaviour (`npm test` fails after applying).
- The reviewer lacks full context (callers count, blast radius, prior decision).
- The suggestion conflicts with a deliberate architectural decision in `DECISIONS.md` or an ADR.
How: with evidence, not opinion.
> "I checked — `claim()` has 47 callers across `auto-dispatch.ts` and `worktree-manager.ts`. Renaming would break all of them; I'd suggest deprecating with a re-export instead."
> "ADR-009 chose the orchestration kernel split deliberately to keep `auto-loop` and `auto-dispatch` separable. Merging them as suggested would re-couple the boundary."
## Review Debate and Synthesis
For architectural, boundary, or behaviour-preservation review comments, run an adversarial pass via [`advisory-partner`](../advisory-partner/SKILL.md): advocate strengthens the reviewer's position; challenger strengthens the original design's position. Record a **Review Synthesis** when the review changes a contract or boundary.
If the unit's `proposal.md` exists, validate review comments against the declared requirement deltas before implementing them.
## Implementing Feedback
Review fixes follow the same TDD contract as any other change. **A reviewer finding a bug = a missing test.** For non-trivial review fixes, write the failing test first via `spec-first-tdd` before changing code. The test proves the reviewer's concern was real and prevents regression.
```
For multi-item feedback:
1. Clarify anything unclear FIRST — ask before touching code.
2. Order: blocking issues → simple fixes → complex changes.
3. For non-trivial fixes: failing test FIRST, then fix.
4. Run tests after each item — not all at the end.
5. typecheck + lint on changed files before claiming done.
```
Don't batch everything and run the gate once at the end — test each change independently. A single gate at the end can hide which fix introduced a regression.
Keep scope tight: fix the validated review issue, not every adjacent redesign the comment makes tempting.
After resolving significant review comments:
```
sf_save_memory(
category="review-learning",
content="review caught: <what> in <component> — prevent by <design principle or test gap>",
confidence=0.9
)
```
Skip for trivial lint/style fixes — only record when the reviewer found a real behaviour gap.
## Acknowledging Correct Feedback
✅ "Fixed — <one sentence what changed>."
✅ Just fix it; the diff speaks.
❌ "You're absolutely right!"
❌ "Thanks for catching that!"
❌ Any gratitude expression.
## If You Pushed Back and Were Wrong
✅ "Verified — you're correct. <What changed>. Fixed."
❌ Long apology.
❌ Defending the pushback.
Factual. Move on.

View file

@ -0,0 +1,159 @@
---
name: requesting-code-review
description: Use when submitting work for review (PR creation or peer review). Prepares a structured summary with evidence — test results, typecheck, blast radius, consumer verification — so the reviewer has everything needed to decide.
---
# Requesting Code Review
## Purpose
Make the reviewer's job easy. Provide evidence, not claims. Test output, not "I think it's clean."
A review request without evidence is a waste of the reviewer's time. Either you didn't run the gates, or you did but didn't trust them enough to share — both signal the change isn't ready.
## When to Run
- A slice is GREEN and `finish-and-verify` has passed.
- Opening a PR for human or peer review.
- Inside the auto-loop when a unit reaches `PhaseReview`.
## Skill Chain
```
← prev: finish-and-verify (verification is already green)
→ next: receiving-code-review (when reviewer feedback arrives)
```
This skill is the optional terminal step of the delivery chain — invoked after verification confirms the change is honest.
## Before Requesting Review
All of these must pass. If any fails, **fix first; don't request review.**
```bash
npm run typecheck:extensions # zero errors
npm run test:unit -- <pattern> # all relevant tests pass
npm run test:integration # if integration boundaries touched
rg -nF "<changed symbol>" src/ packages/ # production caller exists
```
For non-trivial runtime/provider fixes, also capture:
- Explicit repro/debug output **before** the fix.
- Same repro/debug output **after** the fix.
- Proof that the solved boundary matches the original failure (not just "tests pass now").
Search prior memory for related context the reviewer might need:
```
sf_search_memories(query="<area of change>", limit=5)
```
## Review Summary Template
```markdown
## What changed
<1-3 sentences: what behaviour changed and why>
## Files changed
- `src/path/to/file.ts` — <one-line description>
- `src/path/to/file.test.ts` — <one-line description>
## Linked work
- Slice: <unit-id> / Milestone: <unit-id>
- Requirement: <Active R-id from REQUIREMENTS.md, if applicable>
- Decision: <DECISIONS.md entry, if applicable>
## Purpose & Consumer
- **Purpose**: <why this exists>
- **Consumer**: <production code path that depends on this file:symbol>
- **Value at risk**: <what breaks if it returns the wrong answer>
## Evidence
- Typecheck: ✅ `npm run typecheck:extensions`
- Tests: ✅ <X passed, 0 failed, Y skipped>
- Lint: ✅ <touched files clean>
- Repro before fix (non-trivial only): <command + output snippet>
- Repro after fix (non-trivial only): <command + output snippet>
## Blast radius
<callers count, modules touched, transitive surface>
## Risk
<low | medium | high based on blast radius and change complexity>
## Falsifier
<the observable condition that would prove this contract wrong, and whether it's been checked>
## Scope defence
<one sentence what tempting adjacent work this slice refused>
```
For architecture-heavy or boundary-heavy changes, append:
```markdown
## Review Synthesis
- Advocate strengthened: <strongest case for the design>
- Challenger argued: <strongest case against>
- Falsifier outcome: <was it checked? how?>
- Why the boundary still stands:
```
## Tools for Building the Summary
```bash
git diff --stat HEAD~1 # what actually changed
rg -nF "<changed symbol>" src/ packages/ # callers
git log -p -- <changed-file> # recent history for context
git blame -L <range> <file> # who wrote the touched lines
```
## Local Review Loop (optional, for non-trivial changes)
Before requesting external review on a high-risk change, run an internal adversarial pass via [`advisory-partner`](../advisory-partner/SKILL.md):
1. Assemble evidence (typecheck, tests, blast radius, consumer).
2. Write the review summary above.
3. Spawn the default review bundle:
- **Advocate** — strengthens the design.
- **Challenger-A** — attacks the design's strongest assumption.
- **Challenger-B** — attacks from a different lens (e.g. security if A was correctness).
4. Collect each reviewer's structured output, including confidence.
5. Synthesise: fill the gap each reviewer exposed; answer or admit each objection in the summary.
6. Escalate only if disagreement remains material after Step 5.
Hard rules for the local loop:
- Do not let all reviewers share the same lens.
- The advocate must have repo access (frame as: "investigate whether X is safe; argue your side").
- If the advocate contradicts your framing, your framing was wrong — re-verify before synthesis.
- If all reviewers report low confidence, gather better evidence rather than adding more opinions.
Stop when:
- The reviewer can see purpose, evidence, and risk without guessing.
- The strongest objection is either answered or admitted as residual risk.
- Another review pass would not change the decision materially.
## After Submitting
Persist the delivery context so future units can trace what was built and why:
```
sf_save_memory(
category="delivery",
content="delivered: <what changed> — slice: <id> — consumer: <caller> — insight: <what was learned>",
confidence=0.9
)
```
## Rules
- Never request review with failing typecheck or tests.
- Include evidence in the summary — not "I think it's clean."
- List every changed file — don't hide changes.
- If blast radius is high (>10 transitive callers), flag it explicitly.
- For non-trivial runtime/provider changes, include the debug/repro evidence — not just trace summaries.
- For architecture-heavy changes, include disagreement evidence — what advocate strengthened, what challenger attacked.
- Mark major claims as `Observed`, `Inferred`, or `Proposed`.
- Include the strongest reason the change could still be wrong even if tests pass.

View file

@ -0,0 +1,189 @@
---
name: spec-first-tdd
description: Use for any sf behaviour change. Purpose and consumer first, failing test before code, then lint/typecheck/build, then evidence. Operationalises the constitution in `docs/SPEC_FIRST_TDD.md`. Hand off to wave implementation only when the slice is too broad for one focused change.
---
# Spec-First TDD
This skill operationalises the constitution in [`docs/SPEC_FIRST_TDD.md`](../../../../../../docs/SPEC_FIRST_TDD.md). Read it first if you haven't. The doc owns the philosophy; this skill owns the procedure.
## Iron Law
```
THE TEST IS THE SPEC. THE JSDOC IS THE PURPOSE. CODE EXISTS TO FULFILL PURPOSE.
NO BEHAVIOUR CHANGE WITHOUT A FAILING TEST FIRST.
NO COMPLETION WITHOUT A REAL CONSUMER.
NO JUDGMENT CALL WITHOUT A CONFIDENCE AND FALSIFIER.
```
## When to Run
- A slice has been planned, the user has approved it (`go`, or auto-mode portfolio approval), and the contract is decision-complete.
- A bug fix needs a regression test.
- A reviewer found a real behaviour gap (invoke from `receiving-code-review`).
## Skill Chain
```
brainstorming → clarify-spec (if underspecified) → spec-first-tdd → [wave-implementation if needed] → finish-and-verify
```
Default expectation:
- Write the contract test first.
- Make the smallest passing implementation locally.
- Hand off to multi-file implementation only when the slice is genuinely wave-shaped.
Side-chain (any bug/failure during TDD): invoke `systematic-debugging`. When the fix is verified, return here and continue.
## Workflow
### 1. Purpose, Consumer, Value
Before any test, answer:
- **Purpose:** why this behaviour exists.
- **Consumer:** which production code path depends on it (`rg -n "fnName" src/ packages/`). If you can't name a real caller, stop.
- **Value at risk:** what breaks (and who notices) if it returns the wrong answer.
- **Out of scope:** what this slice explicitly does *not* cover.
Label evidence:
- `Observed:` current behaviour, failing output, caller fact.
- `Inferred:` intended contract supported by those facts.
- `Proposed:` minimal change to satisfy the contract.
### 2. Choose the Test Type
| Behaviour | Test type |
|---|---|
| Pure logic, local invariants | Unit |
| Interface/schema contracts | Contract |
| Storage, orchestration, cross-module | Integration |
| Existing behaviour you must preserve | Characterisation |
| State machines, routing, normalisation | Property/invariant |
### 3. Write the Failing Test (this is the spec)
```ts
import { test } from "node:test";
import assert from "node:assert/strict";
// behaviour contract: claim() rejects takeover when claim_until > now()
test("claim_when_active_lease_returns_false", () => {
const now = 1000;
const result = claim({ unitId: "u1", leaseMs: 60_000, now, holder: "w1", existingClaimUntil: now + 30_000 });
assert.equal(result.acquired, false);
});
```
Rules:
- Name = `<what>_<when>_<expected>`. The name **is** the contract claim.
- Test through the public contract production callers use. Don't expose `_helpers`.
- Assert what the consumer receives, not internal wiring.
### 4. Verify RED
```bash
npm run test:compile && node --test --experimental-test-isolation=process \
dist-test/path/to/file.test.js
```
Confirm: fails for the **right** reason (feature missing, not a typo). If it passes immediately, you're testing existing behaviour — fix the test.
### 5. Smallest Production Change
Smallest edit that makes the test green while serving the purpose. No surrounding cleanup, no YAGNI features. Do not weaken the test.
### 6. Verify GREEN
```bash
npm run typecheck:extensions
npm run test:unit -- <pattern>
```
All tests green, zero TS errors. Then refactor while green: tighten types, names, JSDoc.
### 7. Confidence Gate
After each major step, state confidence as `0.0-1.0` plus a one-line reason.
| Step | Threshold | Below threshold |
|---|---|---|
| Purpose & consumer | 0.95 | Run `advisory-partner` Q3 (abuse) + Q5 (strongest objection). |
| Contract test | 0.90 | Adversarial review wave. |
| Implementation | 0.95 | Add a specialist reviewer for the boundary touched. |
| Evidence | 0.97 | Full advocate + challenger + specialist. |
LLM confidence is poorly calibrated in absolute terms — the relative signal matters. If you write 0.7, you're guessing. Act on it.
### 8. Verify Consumer and Record Evidence
- Re-run `rg` for the changed symbol; confirm a real caller still depends on it.
- For non-trivial runtime/provider changes, capture a **before** repro and an **after** repro.
- For non-trivial slices, persist the contract to sf memory:
```
sf_save_memory(
category="contract",
content="<symbol> — <what the test proved> — slice <sliceId>",
confidence=0.9
)
```
This feeds the next unit's `research` phase via the memory recall step.
## Authorship Boundary
This skill owns:
- The contract test.
- The first minimal implementation.
- The RED → GREEN proof for one focused slice.
Escalate to wave-style multi-file implementation when:
- The change spans 4+ files with independent test surfaces.
- Sub-parts need independent review gates.
- A bounded refactor remains after the contract is green.
If the change is already green and verified here, go directly to `finish-and-verify`.
## Brownfield Rules
For existing behaviour:
1. Find the real consumer first (`rg`, `git log -- <path>`).
2. Write a characterisation test for the current contract.
3. Decide: correct, accidental, or obsolete?
4. Change code to satisfy the intended contract.
5. Delete tests that only preserve dead compatibility paths.
## Red Flags — STOP
- Code written before a failing test.
- Test passes immediately without seeing it fail (you're testing existing behaviour, or the test is wrong).
- Acceptance criteria like "make it better" or "clean this up".
- No production caller identified.
- Test adjusted to match bad code — fix the code; tests are the spec.
- JSDoc missing `Purpose:` line — code without stated purpose cannot be verified.
- Test asserts call counts or mock interactions — that's mechanical, not purposeful. Label it as a guard or rewrite.
## Bug Fixes
Bug = missing correct-behaviour test.
1. Write a test for the **correct** behaviour. It must fail (RED) because the bug exists.
2. If it passes immediately → the test is testing the broken behaviour. Fix the test first.
3. Fix the code to make the correct-behaviour test GREEN.
Never fix code without a failing test first. The test is the proof that the bug existed and is now gone.
## Evidence (mandatory before declaring done)
- Failing test → passing test result.
- `npm run typecheck:extensions` clean.
- For non-trivial runtime/provider fixes: explicit repro before code, solved boundary after.
- Updated active-slice artifact when one exists in `.sf/active/{unit-id}/`.

View file

@ -0,0 +1,147 @@
---
name: systematic-debugging
description: Use for any bug, failing test, unexpected behaviour, or production regression in sf. Evidence first — logs, traces, event log, code search — before proposing any fix. Side-chain skill that returns to the caller after the fix is verified.
---
# Systematic Debugging
## Iron Law
```
NO FIX WITHOUT ROOT-CAUSE INVESTIGATION FIRST.
NO NON-TRIVIAL FIX WITHOUT AN EXPLICIT REPRO.
```
If you haven't completed Phase 1, you cannot propose fixes.
## When to Run
- A test fails, RED happens for the wrong reason, GREEN regresses, build fails, lint fails.
- An sf auto-loop unit hits an unexpected error, abandons, or stalls.
- Production behaviour does not match the contract.
- Reviewer reports a real bug.
## Skill Chain
Side-chain skill. Interrupts the main delivery chain, fixes the issue, returns.
```
← prev: any skill in the main chain (invoked on bug/failure)
→ next: return to the interrupted skill (auto-continues after fix verified)
```
Debugging is part of delivery, not a separate approval gate. Do not pause for permission; resume the main chain after the fix is verified.
## Phase 1 — Gather Evidence
Smallest reliable reproduction first. Be specific about what fails, where it fails, and what changed recently.
```bash
# Code/git context
rg -n "<symbol>|<error message>|<route>" src/ packages/
git log -- <path>
git diff -- <path>
# sf runtime evidence
ls -lat .sf/event-log.jsonl .sf/traces/ 2>/dev/null
tail -200 .sf/event-log.jsonl | jq 'select(.kind=="error" or .level=="error")'
ls .sf/active/ # in-flight units
ls .sf/audit/ # historical audit dumps
```
For LLM/provider/transport bugs, capture all of:
- canonical `provider/model` requested
- which provider transport handled the call
- whether the failure was pre-resolution, in-transport, or inside the provider
- whether the trace can reproduce it
For native-engine bugs (`forge_engine.node`):
```bash
ldd native/npm/linux-x64-gnu/forge_engine.node 2>&1 | grep -E "not found|missing"
```
When evidence collection splits into independent streams, fan out with parallel subagents (`Explore` for repo search, `Plan` for design analysis). Keep the root-cause + repro doctrine in this skill.
## Phase 2 — Identify Root Cause
1. Read error messages completely — stack traces, file paths, line numbers.
2. Recent changes: `git log --since="2 days ago"` — what shipped before this broke?
3. Find a working example: `rg` for similar usages that work.
4. Compare working vs broken — list every difference.
5. State the hypothesis: "I think X is the root cause because Y."
Label all claims:
- `Observed:` directly from logs, diffs, tests, traces.
- `Inferred:` explanation supported by those facts.
- `Proposed:` fix or next experiment.
- `Confidence:` 0.0-1.0 with one-line reason.
- `Falsifier:` what result would disprove this diagnosis.
- `Reflection:` weakest assumption, next diagnostic check.
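A filled-in example for a hypothetical lease-expiry bug, using the labels above:
```
Observed: `npm run test:unit -- claim` fails; claim() returns acquired=true while a lease is active.
Inferred: claim_until is stored in seconds but compared against millisecond timestamps, so every lease reads as expired.
Proposed: normalise claim_until to milliseconds at the storage boundary; write the regression test first.
Confidence: 0.8, the unit mismatch is visible in the diff but the repro has not been re-run yet.
Falsifier: the repro still fails after the unit normalisation lands.
Reflection: weakest assumption is that only one call site parses claim_until.
```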
### Confidence Gate (Phase 2 → Phase 3)
| Gate | Threshold | Below threshold |
|---|---|---|
| Root cause identified | 0.90 | Run `advisory-partner` adversarial wave (advocate + challenger). |
| Fix verified (Phase 4) | 0.95 | Re-run repro path. If still below, expand the wave. |
Skip the gate for trivial failures (typo, missing import, obvious one-line stack trace).
## Phase 3 — Test Hypothesis Minimally
**Bug = missing correct-behaviour test.**
1. Write the failing test specifying the correct behaviour. It must fail (RED) because the bug exists.
2. If it passes immediately → the test is testing the broken behaviour. Fix the test first.
3. Run via `npm run test:unit -- <pattern>` or the relevant compiled test target.
4. If hypothesis validated → Phase 4. If not → new hypothesis, back to Phase 2.
Use `spec-first-tdd` for the contract test if the bug crosses a non-trivial boundary.
## Phase 4 — Fix and Verify
1. Fix the **root cause**, not the symptom.
2. `npm run typecheck:extensions` — clean.
3. `npm run test:unit -- <pattern>` — regression test green.
4. For non-trivial fixes: re-run the same repro/debug path; capture the **solved** boundary output.
5. For runtime fixes: tail the event log and confirm the error class disappears.
Persist the pattern to memory so future units don't re-hit it:
```
sf_save_memory(
category="anti-pattern",
content="<symptom> in <component> — root cause: <one line> — fix: <approach> — test: <name>",
confidence=0.9
)
```
## Subagent Patterns
Use only when evidence collection runs in parallel:
- `Explore` agent — map a failing code path quickly without burning the main context.
- `Plan` agent — design analysis when the failure crosses an architectural boundary.
- An adversarial reviewer via `advisory-partner` when diagnosis confidence is below threshold.
Do not delegate the fix — the main thread owns it. Subagents return paragraphs; you apply the change.
## Rules
- Fix root cause, not symptom.
- Write or update the regression test first.
- Keep before/after repro for non-trivial bugs.
- Three failed fix attempts → stop. Question the architecture. Discuss before attempt 4.
- Never claim "fixed" without re-running the original repro path.
## Red Flags — STOP
- Proposing a fix before checking logs/traces/event log.
- Proposing a non-trivial fix from traces alone when repro is feasible.
- "Probably X, let me fix that" without verification.
- Diagnosis stated without `Observed:` evidence.
- Third fix attempt without questioning architecture.
- Test deleted to make a build green — that's encoding the bug, not fixing it.

View file

@ -0,0 +1,105 @@
---
name: working-in-parallel
description: Use git worktrees to work in parallel with other sf agents or alongside the user. Each worktree is an isolated checkout of a branch; changes in one don't affect others. sf already manages worktrees for slice parallelism — this skill is the manual workflow when you need a parallel branch outside the auto-loop.
---
# Working in Parallel
Use **git worktrees** to work in parallel when another agent is in the same directory or when you need an isolated branch for an out-of-band change.
Git worktrees let you check out multiple branches into separate directories. Each worktree has its own isolated files while sharing the same `.git` and remote connections. Changes in one worktree won't affect others, so parallel agents can't interfere with each other.
sf already uses worktrees internally for slice parallelism (see `auto-worktree.ts`, `worktree-manager.ts`). This skill is the manual workflow for the cases where you need a parallel branch outside that auto-managed flow — a hot-fix, an exploration, or a side-chain debug.
Reference: [Git worktree documentation](https://git-scm.com/docs/git-worktree).
## Before Running Any Command
1. **Read the project's setup notes.** `AGENTS.md`, `CLAUDE.md`, `CONTRIBUTING.md`, `README.md` — in that order. Each may name the canonical commands.
2. **Don't assume the package manager.** sf is `npm` (canonical — never commit `bun.lock` or `pnpm-lock.yaml` per `AGENTS.md`). Other repos differ.
3. **Check for protected files.** sf protects `SPEC.md`, `BUILD_PLAN.md`, `AGENTS.md`, `CLAUDE.md`, `docs/SPEC_FIRST_TDD.md`, `docs/dev/ADR-*.md` from autonomous mutation — those need human approval even on a side branch.
## Quick Start
```bash
# Create worktree with new branch (from the main repo dir)
git worktree add -b fix/my-feature ../singularity-forge-my-feature main
# Work in the worktree
cd ../singularity-forge-my-feature
# Install dependencies (sf is npm)
npm install
# Make changes; verify
npm run typecheck:extensions
npm run test:unit -- <pattern>
# Commit and push
git add <files>
git commit -m "fix(scope): description"
git push -u origin fix/my-feature
# Open a PR
gh pr create --title "fix(scope): description" --body "..."
# Clean up when done (from main repo dir)
git worktree remove ../singularity-forge-my-feature
```
## Key Commands
| Command | Purpose |
|---|---|
| `git worktree add -b <branch> <path> main` | Create a new worktree on a new branch off main |
| `git worktree add <path> <existing-branch>` | Check out an existing branch into a worktree |
| `git worktree list` | Show all worktrees and their branches |
| `git worktree remove <path>` | Remove a worktree (branch is preserved) |
| `git worktree prune` | Clean up references to deleted worktree dirs |
## When to Use
- Another agent (sf auto-loop, another Claude session, a teammate) is working in the current directory.
- A long-running build or test is in flight in one terminal and you need a parallel branch.
- You're exploring a refactor that you may abandon — keep main clean.
- You need to apply an upstream cherry-pick from `pi-mono` while a separate `gsd-2` port is in progress.
## When NOT to Use
- A trivial single-file change — the overhead is not worth it.
- When sf's auto-loop is already managing the worktree for you (don't fight it; check `git worktree list` first).
- When the change touches *protected* files — those need human approval, not a parallel branch.
## Pre-commit Hooks in Worktrees
Worktrees share `.git`, but hook installation may need re-running depending on the project's setup. After creating a worktree and installing dependencies, verify hooks are active before committing:
```bash
ls .git/hooks/ # in a worktree, .git is a file pointing to the main worktree
ls $(git rev-parse --git-common-dir)/hooks/ # the actual hooks dir
```
In sf, install via:
```bash
npm run secret-scan:install-hook
```
## Tips
- **Name worktree directories clearly**: `../singularity-forge-fix-claim-race`, `../singularity-forge-port-pi-mono-3851`.
- **Push before removing.** `git worktree remove` deletes uncommitted work.
- **Don't share `node_modules` between worktrees** — sf has native bindings (`forge_engine.node`) that need a worktree-local install.
- **Run `git worktree prune` periodically** if you delete worktree directories without `git worktree remove`.
## Alternative: Repo Clones
Some users prefer cloning the repo multiple times (`gh repo clone singularity-ng/singularity-forge sf-feature-auth`) for a simpler mental model. This uses more disk space but provides complete isolation. If you find worktrees confusing or hit edge cases (e.g. shared hooks misbehaving), a fresh clone is a fine alternative.
## sf-Specific: Auto-Loop Awareness
If sf's auto-loop is running in the main repo dir, **do not** start a worktree on the same branch — the auto-loop's commits and yours will race. Either:
- Pause the loop (`/sf pause` or `Ctrl-C`).
- Work on a different branch in the worktree.
- Or wait for the current slice to merge.

View file

@ -4,9 +4,10 @@
* Spawns a separate `pi` process for each subagent invocation,
* giving it an isolated context window.
*
* Supports three modes:
* Supports four modes:
* - Single: { agent: "name", task: "..." }
* - Parallel: { tasks: [{ agent: "name", task: "..." }, ...] }
* - Debate: { mode: "debate", rounds: 3, tasks: [{ agent: "name", task: "..." }, ...] }
* - Chain: { chain: [{ agent: "name", task: "... {previous} ..." }, ...] }
*
* Uses JSON mode to capture structured output from subagents.
@ -213,8 +214,10 @@ interface SingleResult {
step?: number;
}
type SubagentMode = "single" | "parallel" | "debate" | "chain";
interface SubagentDetails {
mode: "single" | "parallel" | "chain";
mode: SubagentMode;
agentScope: AgentScope;
projectAgentsDir: string | null;
results: SingleResult[];
@ -249,6 +252,9 @@ function summarizeBackgroundInvocation(
): string {
if (params.chain && params.chain.length > 0)
return `chain:${params.chain.map((step) => step.agent).join("→")}`;
if (params.tasks && params.tasks.length > 0)
if (params.mode === "debate")
return `debate:${params.tasks.map((task) => task.agent).join(",")}`;
if (params.tasks && params.tasks.length > 0)
return `parallel:${params.tasks.map((task) => task.agent).join(",")}`;
if (params.agent) return `single:${params.agent}`;
@ -268,7 +274,7 @@ async function executeSubagentInvocation({
useIsolation,
}: SubagentExecutionContext): Promise<SubagentToolResult> {
const makeDetails =
- (mode: "single" | "parallel" | "chain") =>
+ (mode: SubagentMode) =>
(results: SingleResult[]): SubagentDetails => ({
mode,
agentScope,
@ -362,6 +368,222 @@ async function executeSubagentInvocation({
};
}
const batchTasks = params.tasks;
const taskMode = params.mode ?? "parallel";
if (taskMode === "debate") {
const rounds = params.rounds ?? 2;
if (!Number.isInteger(rounds) || rounds < 1 || rounds > 5) {
return {
content: [
{
type: "text",
text: "Invalid debate rounds. Use an integer from 1 to 5.",
},
],
details: makeDetails("debate")([]),
isError: true,
};
}
if (batchTasks.length < 2) {
return {
content: [
{
type: "text",
text: "Debate mode requires at least two tasks/participants.",
},
],
details: makeDetails("debate")([]),
isError: true,
};
}
const debateResults: SingleResult[] = new Array(
rounds * batchTasks.length,
);
const transcriptEntries: string[] = [];
const emitDebateUpdate = () => {
if (!onUpdate) return;
const knownResults = debateResults.filter(Boolean);
const running = knownResults.filter((r) => r.exitCode === -1).length;
const done = knownResults.filter((r) => r.exitCode !== -1).length;
onUpdate({
content: [
{
type: "text",
text: `Debate: ${done}/${debateResults.length} turns done, ${running} running...`,
},
],
details: makeDetails("debate")([...knownResults]),
});
};
const buildDebatePrompt = (
task: Static<typeof TaskItem>,
round: number,
transcript: string,
): string =>
[
`You are participant "${task.agent}" in a structured multi-agent debate.`,
`Round ${round} of ${rounds}.`,
"Original assignment:",
task.task,
"Debate transcript so far:",
transcript.trim() || "(no prior rounds)",
round === rounds
? "This is the final round. Engage the strongest opposing claims, state what changed your mind if anything did, and end with FINAL_VERDICT: <PROCEED | CHANGE | BLOCK> plus one sentence."
: "Engage the strongest opposing claims directly. Add new evidence or a sharper objection; do not repeat prior points.",
].join("\n\n");
for (let round = 1; round <= rounds; round++) {
for (let i = 0; i < batchTasks.length; i++) {
const resultIndex = (round - 1) * batchTasks.length + i;
const task = batchTasks[i];
debateResults[resultIndex] = {
agent: task.agent,
agentSource: "unknown",
task: task.task,
exitCode: -1,
messages: [],
stderr: "",
usage: {
input: 0,
output: 0,
cacheRead: 0,
cacheWrite: 0,
cost: 0,
contextTokens: 0,
turns: 0,
},
step: round,
};
}
emitDebateUpdate();
const transcript = transcriptEntries.join("\n\n");
const roundResults = await mapWithConcurrencyLimit(
batchTasks,
MAX_CONCURRENCY,
async (t, index) => {
const resultIndex = (round - 1) * batchTasks.length + index;
const prompt = buildDebatePrompt(t, round, transcript);
const taskModelOverride = t.model ?? params.model;
const result = cmuxSplitsEnabled
? await runSingleAgentInCmuxSplit(
cmuxClient,
index % 2 === 0 ? "right" : "down",
defaultCwd,
agents,
t.agent,
prompt,
t.cwd,
round,
signal,
(partial) => {
const currentResult = partial.details?.results[0];
if (!currentResult) return;
currentResult.task = t.task;
currentResult.step = round;
debateResults[resultIndex] = currentResult;
emitDebateUpdate();
},
makeDetails("debate"),
taskModelOverride,
)
: await runSingleAgent(
defaultCwd,
agents,
t.agent,
prompt,
t.cwd,
round,
signal,
(partial) => {
const currentResult = partial.details?.results[0];
if (!currentResult) return;
currentResult.task = t.task;
currentResult.step = round;
debateResults[resultIndex] = currentResult;
emitDebateUpdate();
},
makeDetails("debate"),
taskModelOverride,
);
result.task = t.task;
result.step = round;
debateResults[resultIndex] = result;
emitDebateUpdate();
return result;
},
);
const failed = roundResults.find(
(r) =>
r.exitCode !== 0 ||
r.stopReason === "error" ||
r.stopReason === "aborted",
);
transcriptEntries.push(
...roundResults.map((r) => {
const output =
getFinalOutput(r.messages) ||
r.errorMessage ||
r.stderr ||
"(no output)";
return `## Round ${round}: ${r.agent}\n\n${output}`;
}),
);
if (failed) {
const errorMsg =
failed.errorMessage ||
failed.stderr ||
getFinalOutput(failed.messages) ||
"(no output)";
return {
content: [
{
type: "text",
text: `Debate stopped in round ${round} (${failed.agent}): ${errorMsg}`,
},
],
details: makeDetails("debate")(debateResults.filter(Boolean)),
isError: true,
};
}
}
const finalResults = debateResults.filter(Boolean);
const summaries = finalResults.map((r) => {
const output =
getFinalOutput(r.messages) ||
r.errorMessage ||
r.stderr ||
"(no output)";
return `[round ${r.step}] [${r.agent}] ${r.exitCode === 0 ? "completed" : `failed (exit ${r.exitCode})`}: ${output}`;
});
return {
content: [
{
type: "text",
text: `Debate: ${finalResults.length}/${debateResults.length} turns succeeded over ${rounds} rounds\n\n${summaries.join("\n\n")}`,
},
],
details: makeDetails("debate")(finalResults),
};
}
if (params.rounds !== undefined) {
return {
content: [
{
type: "text",
text: '`rounds` is only valid with `mode: "debate"`.',
},
],
details: makeDetails("parallel")([]),
isError: true,
};
}
const allResults: SingleResult[] = new Array(params.tasks.length);
for (let i = 0; i < params.tasks.length; i++) {
allResults[i] = {
@ -1137,6 +1359,12 @@ const AgentScopeSchema = StringEnum(["user", "project", "both"] as const, {
default: "both",
});
const TaskBatchModeSchema = StringEnum(["parallel", "debate"] as const, {
description:
'How to execute `tasks`: "parallel" runs all tasks independently; "debate" runs bounded rounds where each task sees prior-round outputs.',
default: "parallel",
});
const SubagentParams = Type.Object({
agent: Type.Optional(
Type.String({
@ -1162,7 +1390,18 @@ const SubagentParams = Type.Object({
),
tasks: Type.Optional(
Type.Array(TaskItem, {
description: "Array of {agent, task} for parallel execution",
description:
'Array of {agent, task} for task-batch execution. Defaults to parallel; set `mode: "debate"` for debate rounds.',
}),
),
mode: Type.Optional(TaskBatchModeSchema),
rounds: Type.Optional(
Type.Integer({
description:
'Number of debate rounds when `mode` is "debate". Default: 2; max: 5.',
minimum: 1,
maximum: 5,
default: 2,
}),
),
chain: Type.Optional(
@ -1279,7 +1518,7 @@ export default function (pi: ExtensionAPI) {
description: [
"Delegate tasks to specialized subagents with isolated context windows.",
"Each subagent is a separate pi process with its own tools, model, and system prompt.",
"Modes: single ({ agent, task }), parallel ({ tasks: [{agent, task},...] }), chain ({ chain: [{agent, task},...] } with {previous} placeholder).",
"Modes: single ({ agent, task }), parallel ({ tasks: [{agent, task},...] }), debate ({ mode: 'debate', rounds, tasks: [...] }), chain ({ chain: [{agent, task},...] } with {previous} placeholder).",
"Agents are defined as .md files in ~/.sf/agent/agents/ (user) or .sf/agents/ (project).",
"Use the /subagent command to list available agents and their descriptions.",
"Use chain mode to pipeline: scout finds context, planner designs, worker implements.",
@ -1289,6 +1528,7 @@ export default function (pi: ExtensionAPI) {
"Use scout agent first when you need codebase context before implementing.",
"Use chain mode for scout→planner→worker or worker→reviewer→worker pipelines.",
"Use parallel mode when tasks are independent and don't need each other's output.",
"Use debate mode for bounded advocate/challenger or multi-role reviews where participants should respond to prior-round outputs.",
"Always check available agents with /subagent before choosing one.",
],
parameters: SubagentParams,
@ -1335,6 +1575,21 @@ export default function (pi: ExtensionAPI) {
};
}
if ((params.mode || params.rounds !== undefined) && !hasTasks) {
return {
content: [
{
type: "text",
text: "`mode` and `rounds` are only valid with parallel task batches (`tasks: [...]`).",
},
],
details: makeDetails(
hasChain ? "chain" : hasSingle ? "single" : "parallel",
)([]),
isError: true,
};
}
if (
(agentScope === "project" || agentScope === "both") &&
confirmProjectAgents &&
@ -1721,7 +1976,8 @@ export default function (pi: ExtensionAPI) {
return new Text(text, 0, 0);
}
- if (details.mode === "parallel") {
+ if (details.mode === "parallel" || details.mode === "debate") {
const modeLabel = details.mode;
const running = details.results.filter((r) => r.exitCode === -1).length;
const successCount = details.results.filter(
(r) => r.exitCode === 0,
@ -1735,13 +1991,15 @@ export default function (pi: ExtensionAPI) {
: theme.fg("success", "✓");
const status = isRunning
? `${successCount + failCount}/${details.results.length} done, ${running} running`
- : `${successCount}/${details.results.length} tasks`;
+ : details.mode === "debate"
+   ? `${successCount}/${details.results.length} turns`
+   : `${successCount}/${details.results.length} tasks`;
if (expanded && !isRunning) {
const container = new Container();
container.addChild(
new Text(
- `${icon} ${theme.fg("toolTitle", theme.bold("parallel "))}${theme.fg("accent", status)}`,
+ `${icon} ${theme.fg("toolTitle", theme.bold(`${modeLabel} `))}${theme.fg("accent", status)}`,
0,
0,
),
@ -1758,7 +2016,7 @@ export default function (pi: ExtensionAPI) {
container.addChild(new Spacer(1));
container.addChild(
new Text(
- `${theme.fg("muted", "─── ") + theme.fg("accent", r.agent)} ${rIcon}`,
+ `${theme.fg("muted", details.mode === "debate" ? `─── Round ${r.step}: ` : "─── ") + theme.fg("accent", r.agent)} ${rIcon}`,
0,
0,
),
@ -1813,7 +2071,7 @@ export default function (pi: ExtensionAPI) {
}
// Collapsed view (or still running)
- let text = `${icon} ${theme.fg("toolTitle", theme.bold("parallel "))}${theme.fg("accent", status)}`;
+ let text = `${icon} ${theme.fg("toolTitle", theme.bold(`${modeLabel} `))}${theme.fg("accent", status)}`;
for (const r of details.results) {
const rIcon =
r.exitCode === -1
@ -1822,7 +2080,9 @@ export default function (pi: ExtensionAPI) {
? theme.fg("success", "✓")
: theme.fg("error", "✗");
const displayItems = getDisplayItems(r.messages);
- text += `\n\n${theme.fg("muted", "─── ")}${theme.fg("accent", r.agent)} ${rIcon}`;
+ const prefix =
+   details.mode === "debate" ? `─── Round ${r.step}: ` : "─── ";
+ text += `\n\n${theme.fg("muted", prefix)}${theme.fg("accent", r.agent)} ${rIcon}`;
if (displayItems.length === 0)
text += `\n${theme.fg("muted", r.exitCode === -1 ? "(running...)" : "(no output)")}`;
else text += `\n${renderDisplayItems(displayItems, 5)}`;


@ -0,0 +1,59 @@
import assert from "node:assert/strict";
import { readFileSync } from "node:fs";
import { dirname, join } from "node:path";
import test from "node:test";
import { fileURLToPath } from "node:url";
const __dirname = dirname(fileURLToPath(import.meta.url));
const subagentSrc = readFileSync(
join(__dirname, "../resources/extensions/subagent/index.ts"),
"utf-8",
);
test("subagent schema declares debate mode and bounded rounds", () => {
const paramsStart = subagentSrc.indexOf(
"const SubagentParams = Type.Object({",
);
const paramsEnd = subagentSrc.indexOf("});", paramsStart);
const paramsBlock = subagentSrc.slice(paramsStart, paramsEnd);
assert.match(paramsBlock, /mode:\s*Type\.Optional\(TaskBatchModeSchema\)/);
assert.match(paramsBlock, /rounds:\s*Type\.Optional\(\s*Type\.Integer/);
assert.match(paramsBlock, /minimum:\s*1/);
assert.match(paramsBlock, /maximum:\s*5/);
});
test("subagent debate mode injects prior-round transcript", () => {
assert.match(
subagentSrc,
/params\.mode\s*===\s*"debate"/,
"dispatch should branch on mode: debate",
);
assert.match(
subagentSrc,
/const\s+transcriptEntries:\s*string\[\]\s*=\s*\[\]/,
"debate should maintain a transcript across rounds",
);
assert.match(
subagentSrc,
/Debate transcript so far:/,
"debate prompt should include the transcript",
);
assert.match(
subagentSrc,
/final round/i,
"final debate round should ask for a final verdict",
);
});
test("subagent details includes debate as a first-class mode", () => {
assert.match(
subagentSrc,
/type\s+SubagentMode\s*=\s*"single"\s*\|\s*"parallel"\s*\|\s*"debate"\s*\|\s*"chain"/,
);
assert.match(
subagentSrc,
/details\.mode\s*===\s*"parallel"\s*\|\|\s*details\.mode\s*===\s*"debate"/,
"renderer should handle debate mode alongside parallel batches",
);
});
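As a usage sketch of the new task-batch modes: the agent names ("advocate", "challenger") and task text below are hypothetical, and `validateDebate` is a standalone illustration that mirrors the bounds the dispatch code enforces (integer rounds in [1, 5], at least two participants) — not the tool's actual entry point.

```typescript
// Hypothetical shapes for the subagent tool's task-batch modes.
type TaskItem = { agent: string; task: string; model?: string; cwd?: string };

interface SubagentParams {
  tasks?: TaskItem[];
  mode?: "parallel" | "debate"; // defaults to "parallel"
  rounds?: number; // only valid with mode: "debate"; integer in [1, 5]
}

const participants: TaskItem[] = [
  { agent: "advocate", task: "Argue the migration is safe to ship." },
  { agent: "challenger", task: "Find the strongest reason to block it." },
];

// Each round, every participant sees the prior rounds' transcript.
const debate: SubagentParams = { mode: "debate", rounds: 3, tasks: participants };

// Mirrors the dispatch-side validation: bounded rounds, >= 2 participants.
function validateDebate(p: SubagentParams): string | null {
  if (p.mode !== "debate") return null;
  const rounds = p.rounds ?? 2;
  if (!Number.isInteger(rounds) || rounds < 1 || rounds > 5)
    return "Invalid debate rounds. Use an integer from 1 to 5.";
  if (!p.tasks || p.tasks.length < 2)
    return "Debate mode requires at least two tasks/participants.";
  return null;
}

console.log(validateDebate(debate)); // null → the params are accepted
```

Omitting `mode` keeps today's fire-and-forget parallel behaviour, so existing callers are unaffected.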