# sf Spec-First TDD

The change-method constitution for sf. Terse and procedural — optimized for agent retrieval. It operationalizes [ADR-0000: SF Is a Purpose-to-Software Compiler](./adr/0000-purpose-to-software-compiler.md).

## Purpose

Every change in sf must:

1. solve a real system need
2. preserve or increase system value
3. clarify behavior before implementation
4. make tests define the contract
5. find and close gaps in what already exists

Priority: **purpose > value > contract > working code**. If purpose and value are clear but implementation is uncertain, write contract tests first and align code to them.

## Iron Law

```
THE TEST IS THE SPEC.
THE JSDOC IS THE PURPOSE.
CODE EXISTS TO FULFILL PURPOSE.
NO BEHAVIOR CHANGE WITHOUT A FAILING TEST FIRST.
NO COMPLETION WITHOUT A REAL CONSUMER.
NO JUDGMENT CALL WITHOUT A CONFIDENCE AND FALSIFIER.
```

**The test is the spec** — not verification of the spec. Tests describe what the software MUST do, not what it happens to do. A test that mirrors implementation rubber-stamps bugs.

**The JSDoc is the purpose** — every exported function, type, and class opens with a one-line `Purpose:` statement. If you can't write the purpose before the code, you don't know what you're building. Purpose drives what the test asserts. Code without a stated purpose cannot be verified.

**Code exists to fulfill purpose** — not to compile, not to pass lint, not to look clean. Quality measure: does it satisfy the purpose (JSDoc) as verified by the spec (test)? Code that compiles but doesn't serve its stated purpose is a bug.

### Purposeful tests vs. mechanical tests

| Kind | Asserts | Survives refactor? |
|---|---|---|
| **Purposeful** | "claim() returns rows_affected=1 only when the lease was free or expired" | yes |
| **Mechanical** | `mockDb.update.calls.length === 1` | no |

Write purposeful tests first. They are the spec. A different implementation that passes them is equally correct.
Add mechanical tests only as labelled implementation guards for specific failure modes (resource leaks, infinite loops).

### Three-tier test organization

1. **Behaviour contracts** (primary) — what the consumer receives. The spec.
2. **Degradation contracts** — what happens when dependencies fail. The consumer must always get a useful response; failure must degrade, not crash.
3. **Implementation guards** (secondary, labelled) — protect against specific failure modes. A refactor that changes internals updates guards, not behaviour contracts.

## Decomposition Path

`Vision (SPEC.md / VISION.md) → Milestone → Slice → Task → contract test → code → evidence`

Reject: `prompt → files → hope`. Every unit (milestone, slice, task) sits in one of those rows. If a piece of work doesn't, it is unspecified.

## Purpose Gate

Every artifact (slice plan, task plan, function, test, ADR) must answer the same 8 PDD fields captured by the [`purpose-driven-development`](../src/resources/extensions/sf/skills/purpose-driven-development/SKILL.md) skill — these fields ARE the Purpose Gate:

- **Purpose**: why this behaviour exists.
- **Consumer**: who depends on the outcome in production (real caller, not just tests).
- **Contract**: what observable behaviour proves success — what the consumer receives, not how the implementation works internally.
- **Failure boundary**: what *correct failure* looks like if the purpose can't be fulfilled — degrade, surface, do not swallow.
- **Evidence**: the test, metric, or repro that proves the contract. Each criterion must be machine-executable (named test, queryable metric, runnable command) OR explicitly tagged `[MANUAL: reviewer + scenario]`. Prose-only evidence is unfalsifiable and rejected.
- **Non-goals**: what this is *not* solving.
- **Invariants**: what must remain true. If the change touches async, queues, timers, or state machines, split into safety ("X never happens") + liveness ("Y eventually happens"). Pure synchronous code may use safety-only.
- **Assumptions**: conditions about the world that MUST be true for this spec to be valid — locking protocols, API stability, caller invariants, deployment context, data shape. World-side failures (assumption violated) are invisible to internal tests and are the most expensive failure class.

If any field is missing: `BLOCKED: purpose unclear — [which field is missing]`. Do not invent a plausible answer to proceed. Surfacing the gap is more valuable than rationalising past it.

Treat the contract as a **falsifiable hypothesis**: name the evidence that would prove it wrong before implementation locks in. A contract without a falsifier is half a contract.

## Workflow (mapped to sf's phase machine)

### Research phase — name the problem

Before any plan:

- Where does this sit in `SPEC.md` / `VISION.md` / `REQUIREMENTS.md`?
- Why is it useful, who needs it, what does it enable?
- What breaks if wrong, what is out of scope?

For brownfield changes, **consumer discovery precedes purpose articulation.** Use `rg` / `git grep` to find real callers — never assume. You cannot reason about "what breaks" until you know who calls the code.

```bash
rg -nF "functionName" src/ packages/ --type=ts
git grep -n "functionName"
```

If you can't name a real consumer, stop. Don't add code yet.

### Plan phase — clarify before deciding

Clarify highest-impact unknowns first: behaviour, acceptance criteria, data invariants, failure handling, security, integration boundaries.

For non-trivial contracts, pressure-test before locking the plan via the [`advisory-partner`](../src/resources/extensions/sf/skills/advisory-partner/SKILL.md) skill — this is sf's adversarial review surface, already wired into the Q3/Q4 gates and `validate-milestone`. It runs with the **validation** model, distinct from the planning/execution model — that's the point.

1. **Advocate pass** — strengthen the best version of the contract.
2. **Challenger pass** — attack assumptions AND propose an alternative.
   A challenger anchored to the advocate's framing is not adversarial.
3. **Falsifier (required gate, blocks Plan→Execute):** `FALSIFIER: this contract is wrong if [specific observable condition].` Generic falsifiers ("wrong if it doesn't work") are process failures.

**Find the devil and find the experts:**

- **Devil** — finds the specific failure that compounds silently: wrong assumption → wrong test → wrong code → wrong evidence, all passing.
- **Experts** — domain specialists who know what right looks like. Pick expertise matching the decision: SRE (reliability), security (trust boundary), distributed systems (consistency), API reviewer (ergonomics).

Both forces must act on the contract before it becomes tests. One strong pass each, unless concrete risk remains.

### Plan from contracts, not files

**Purpose re-check:** restate the purpose from the Research step in one sentence. If the plan now serves a different purpose, the contract drifted — go back.

Each behaviour slice defines: consumer, contract, code path, validation, falsifier.

| Good | Bad |
|---|---|
| Add failing test proving `claim()` rejects expired-lease takeover when `claim_until > now()`. | Edit `src/resources/extensions/sf/auto-dispatch.ts`. |

### TDD phase — write the test first

1. Write the failing test.
2. Make it fail for the **right** reason (feature missing, not typo).
3. Only then write production code.

**Purpose re-check:** does this test prove behaviour serving the stated purpose?

Test types:

| Behaviour | Test type |
|---|---|
| Pure logic, local invariants | Unit |
| Interface/schema contracts | Contract |
| Storage, orchestration, multi-component | Integration |
| Existing behaviour you must preserve | Characterisation |
| State machines, routing, normalisation | Property/invariant |

Test naming: `test___` or describe-blocks structured the same way. The name **is** the contract claim.
```bash
npm run test:unit -- path/to/file.test.ts
```

If it passes immediately, you're testing existing behaviour. Fix the test.

### Execute phase — minimal production code

Smallest change that makes the spec (test) green while serving the purpose (JSDoc). Nothing more. No YAGNI violations, no surrounding cleanup. Do not weaken the test to fit sloppy code — fix the code. Code that compiles and passes lint but doesn't fulfil its stated purpose is a bug.

### Verify phase — green, lint, type-check

```bash
npm run typecheck:extensions
npm test
```

All tests green. Zero lint/type errors. Then refactor while green.

### Review phase — verify usefulness

**Purpose re-check (final):** does the code serve a real production consumer? Verify: who calls it (`rg` for usages), what production path depends on it, what signal would reveal breakage. **If only tests call it, it is not finished or not needed.**

**Falsifier follow-through:** re-check the falsifier from the Plan phase. If the falsifier is observable post-deploy, add it to monitoring or to the unit's verification commands. A falsifier that is never checked after deploy is half a contract.

**Zero callers ≠ zero purpose.** Before deleting: does it serve an unmet need (wire it in) or is it superseded (delete it)? Never test for absence of old code — test that new behaviour works.

### Confidence Gate (between phases)

After completing a step, state confidence as a number `0.0–1.0` and a one-line reason. The number forces a pause to assess rather than plowing ahead on momentum.

| Step | Threshold | Below threshold |
|---|---|---|
| Purpose & consumer | 0.95 | Run an adversarial review wave (advisory-partner Q3/Q5). |
| Contract test | 0.90 | Adversarial review wave. |
| Implementation | 0.95 | Add a specialist reviewer for the touched boundary (e.g. provider/transport/security). |
| Final evidence | 0.97 | Full adversarial: advocate + challenger + specialist. |
Skip the gate for trivial steps (typo fix, exhaustive matches with full coverage). The gate earns its keep on I/O boundaries, async loading, protocol integration, and anything touching real backends or models.

LLM confidence numbers are poorly calibrated in absolute terms — the *relative* signal matters. If you write 0.7, you know you're guessing. Act on that.

## Tests Find Gaps

Testing existing code is one of the highest-value activities sf can do. A test that reveals an existing gap is more valuable than one validating new code — the gap was compounding in production.

High-value gap tests:

- **Purpose** — does this module do what its JSDoc claims?
- **Fallback** — does failure surface or get masked?
- **Persistence** — does state survive restart? (especially `.sf/sf.db`, `.sf/runtime/*.json`)
- **Boundary** — what happens at empty input, max value, network partition, expired claim?
- **Contract** — does the caller get what it expects?

When a test fails against existing code, fix the code. The test told you what was broken. 50 tested features > 500 untested ones.

## Test Rules

- **Test first.** Without it, you mirror implementation — bugs and all.
- **Bug = missing correct-behaviour test.** Write a test for the *correct* behaviour first; it must fail (RED) because the bug exists. If it passes immediately, the test is wrong (testing the broken behaviour) — fix the test, not the code.
- **Bug reports → failing regression test first.**
- **Behaviour change without tests is incomplete.**
- **Bad tests produce bad code.** A test validating silent failure is wrong — rewrite it.
- **Test through the public contract.** Don't expose `_helpers` for testability; assert through real callers.
- **Tests pin behaviour, not internal decomposition.** A test that breaks on refactor without behaviour change is mechanical, not purposeful.
- **Critical invariants may need property tests, not just examples** (e.g. ULID monotonicity, claim race, idempotent migrations).
- **Fix code to satisfy live-contract tests. Fix or delete tests encoding stale behaviour.**
- **Fallbacks must deliver working behaviour or not exist.** A fallback that silently returns nothing is worse than none.

## Test Boundaries

- Test through the public contract that production consumers use.
- Do not promote `_helper` to `helper` for testing convenience.
- Assert through public methods, not implementation detail.
- Tests pin behaviour, not internal decomposition.
- For the Node.js native test runner: use `async` test functions and `await`; never call `.then()`/`.catch()` chains in test bodies when `await` expresses the same contract.

## Self-Modification Boundary

sf modifies its own codebase via the auto-loop. Without a protected zone, constitutional drift is silent.

**Protected files (human approval required):** `SPEC.md`, `BUILD_PLAN.md`, `UPSTREAM_PORT_GUIDE.md`, `AGENTS.md`, `CLAUDE.md`, `CONTRIBUTING.md`, `docs/SPEC_FIRST_TDD.md`, every `docs/dev/ADR-*.md`. Autonomous agents may propose changes but must not merge to these without human review.

**Test infrastructure** (`tests/`, `*.test.ts`, `tsconfig*.json`, lint config) requires advocate/challenger/falsifier review — a change to test infra can make all future tests pass vacuously. Treat test-infra changes as governance-adjacent: they alter the validity of every test that runs after them. A corrupted test runner is more dangerous than a corrupted test.

## Evidence

Required for production-impacting changes:

- failing test → passing test → type-check → lint
- advocate's strongest support, challenger's strongest opposition, falsifier + outcome
- runtime evidence: traces (`.sf/traces/`), event log (`.sf/event-log.jsonl`), gate results
- for non-trivial runtime/provider fixes: explicit repro before code, solved boundary after code

Persist learning: when a unit produces a gotcha or anti-pattern, write it to sf's memory store (`memories` table) so the next unit sees it.
Evidence that only lives in the conversation dies on restart.

## Degraded Operation

| Dependency down | Behaviour |
|---|---|
| Native engine (`forge_engine.node`) | Fall back to JS implementations; log degraded mode. Never silently proceed without confirming the fallback path is wired. |
| `node:sqlite` and `better-sqlite3` both unavailable | Filesystem-derived state (no DB); log degraded discovery. Block any operation that requires durable state. |
| LLM provider | Try the next allowed provider per `~/.sf/preferences.md`; if exhausted, halt the unit with `ErrModelUnavailable` (no silent skip). |
| SOPS unavailable | Use already-exported env vars; log that secret refresh is unavailable. Block secret-touching commands. |

When a dependency is down: operate in the defined degraded mode or stop. Never silently proceed.

## Task Template

Each task: **Purpose** (need + why) → **Consumer** (who depends) → **Contract** (test proving it) → **Implementation** (code changes) → **Evidence** (test + lint + runtime signal).

If a task cannot be described this way, it is underspecified.

## See Also

- [`AGENTS.md`](../AGENTS.md) — repo guidelines, build/test/lint commands.
- [`SPEC.md`](../SPEC.md) — sf v3 specification (what we're building).
- [`UPSTREAM_PORT_GUIDE.md`](../UPSTREAM_PORT_GUIDE.md) — porting from the pi-mono legacy port.
- [`src/resources/extensions/sf/skills/advisory-partner/SKILL.md`](../src/resources/extensions/sf/skills/advisory-partner/SKILL.md) — adversarial review framework.
- [`src/resources/extensions/sf/skills/code-review/SKILL.md`](../src/resources/extensions/sf/skills/code-review/SKILL.md) — multi-lens review skill.

## References

- GitHub Spec Kit — spec-first authoring patterns.
- Ousterhout, *A Philosophy of Software Design* — deep modules, contract pattern.
- Trail of Bits — anti-rationalisation rules.
- ACE — original Iron Law / Purpose Gate framing this doc adapts.