# sf Spec-First TDD

The change-method constitution for sf. Terse and procedural — optimized for agent retrieval.

## Purpose

Every change in sf must:

1. solve a real system need
2. preserve or increase system value
3. clarify behavior before implementation
4. make tests define the contract
5. find and close gaps in what already exists

Priority: **purpose > value > contract > working code**.

If purpose and value are clear but implementation is uncertain, write contract tests first and align code to them.

## Iron Law

```
THE TEST IS THE SPEC. THE JSDOC IS THE PURPOSE. CODE EXISTS TO FULFILL PURPOSE.

NO BEHAVIOR CHANGE WITHOUT A FAILING TEST FIRST.
NO COMPLETION WITHOUT A REAL CONSUMER.
NO JUDGMENT CALL WITHOUT A CONFIDENCE AND FALSIFIER.
```
**The test is the spec** — not verification of the spec. Tests describe what the software MUST do, not what it happens to do. A test that mirrors implementation rubber-stamps bugs.
**The JSDoc is the purpose** — every exported function, type, and class opens with a one-line `Purpose:` statement. If you can't write the purpose before the code, you don't know what you're building. Purpose drives what the test asserts. Code without a stated purpose cannot be verified.
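
A minimal sketch of the convention. `isClaimable` and its lease shape are illustrative, not sf API:

```typescript
/**
 * Purpose: decide whether a lease can be claimed. True only when it is free or expired.
 */
export function isClaimable(
  lease: { claimUntil: number } | null,
  now: number,
): boolean {
  // Free (no lease row) or expired (claim window has passed); otherwise held.
  return lease === null || lease.claimUntil <= now;
}
```

The `Purpose:` line is what the contract test asserts against; if it can't be written before the code, the function isn't specified yet.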

**Code exists to fulfill purpose** — not to compile, not to pass lint, not to look clean. Quality measure: does it satisfy the purpose (JSDoc) as verified by the spec (test)? Code that compiles but doesn't serve its stated purpose is a bug.

### Purposeful tests vs. mechanical tests

| Kind | Asserts | Survives refactor? |
|---|---|---|
| **Purposeful** | "claim() returns rows_affected=1 only when the lease was free or expired" | yes |
| **Mechanical** | `mockDb.update.calls.length === 1` | no |

Write purposeful tests first. They are the spec. A different implementation that passes them is equally correct. Add mechanical tests only as labelled implementation guards for specific failure modes (resource leaks, infinite loops).

### Three-tier test organization

1. **Behaviour contracts** (primary) — what the consumer receives. The spec.
2. **Degradation contracts** — what happens when dependencies fail. Consumer must always get a useful response; failure must degrade, not crash.
3. **Implementation guards** (secondary, labelled) — protect against specific failure modes. A refactor that changes internals updates guards, not behaviour contracts.

## Decomposition Path

`Vision (SPEC.md / VISION.md) → Milestone → Slice → Task → contract test → code → evidence`

Reject: `prompt → files → hope`.

Every unit of work (milestone, slice, task) sits at one of those levels. If a piece of work doesn't, it is unspecified.

## Purpose Gate

Every artifact (slice plan, task plan, function, test, ADR) must answer:

- **why** this behaviour exists
- **what value** it creates or protects
- **who** uses it in production (real consumer, not just tests)
- **what breaks** if it returns the wrong answer

If any answer is missing: `BLOCKED: purpose unclear — [which field is missing]`. Do not invent a plausible purpose to proceed. Surfacing the gap is more valuable than rationalising past it.

Treat the contract as a **falsifiable hypothesis**: name the evidence that would prove it wrong before implementation locks in. A contract without a falsifier is half a contract.

## Workflow (mapped to sf's phase machine)

### Research phase — name the problem

Before any plan:

- Where does this sit in `SPEC.md` / `VISION.md` / `REQUIREMENTS.md`?
- Why is it useful, who needs it, what does it enable?
- What breaks if wrong, what is out of scope?

For brownfield changes, **consumer discovery precedes purpose articulation.** Use `rg` / `git grep` to find real callers — never assume. You cannot reason about "what breaks" until you know who calls the code.

```bash
rg -nF "functionName" src/ packages/ --type=ts
git grep -n "functionName"
```

If you can't name a real consumer, stop. Don't add code yet.

### Plan phase — clarify before deciding

Clarify highest-impact unknowns first: behaviour, acceptance criteria, data invariants, failure handling, security, integration boundaries.

For non-trivial contracts, pressure-test before locking the plan via the [`advisory-partner`](../src/resources/extensions/sf/skills/advisory-partner/SKILL.md) skill — this is sf's adversarial review surface, already wired into the Q3/Q4 gates and `validate-milestone`. It runs with the **validation** model, distinct from the planning/execution model — that's the point.

1. **Advocate pass** — strengthen the best version of the contract.
2. **Challenger pass** — attack assumptions AND propose an alternative. A challenger anchored to the advocate's framing is not adversarial.
3. **Falsifier (required gate, blocks Plan→Execute):** `FALSIFIER: this contract is wrong if [specific observable condition].` Generic falsifiers ("wrong if it doesn't work") are process failures.

**Find the devil and find the experts:**

- **Devil** — finds the specific failure that compounds silently: wrong assumption → wrong test → wrong code → wrong evidence, all passing.
- **Experts** — domain specialists who know what right looks like. Pick expertise matching the decision: SRE (reliability), security (trust boundary), distributed systems (consistency), API reviewer (ergonomics).

Both forces must act on the contract before it becomes tests. One strong pass each, unless concrete risk remains.

### Plan from contracts, not files

**Purpose re-check:** restate purpose from the Research step in one sentence. If the plan now serves a different purpose, the contract drifted — go back.

Each behaviour slice defines: consumer, contract, code path, validation, falsifier.

| Good | Bad |
|---|---|
| Add failing test proving `claim()` rejects expired-lease takeover when `claim_until > now()`. | Edit `src/resources/extensions/sf/auto-dispatch.ts`. |

### TDD phase — write the test first

1. Write the failing test.
2. Make it fail for the **right** reason (feature missing, not typo).
3. Only then write production code.

**Purpose re-check:** does this test prove behaviour serving the stated purpose?

Test types:

| Behaviour | Test type |
|---|---|
| Pure logic, local invariants | Unit |
| Interface/schema contracts | Contract |
| Storage, orchestration, multi-component | Integration |
| Existing behaviour you must preserve | Characterisation |
| State machines, routing, normalisation | Property/invariant |

Test naming: `test_<what>_<when>_<expected>` or describe-blocks structured the same way. The name **is** the contract claim.

```bash
node --test --experimental-test-isolation=process dist-test/path/to/file.test.js
```

If it passes immediately, you're testing existing behaviour. Fix the test.

### Execute phase — minimal production code

Smallest change that makes the spec (test) green while serving the purpose (JSDoc). Nothing more. No YAGNI violations, no surrounding cleanup.

Do not weaken the test to fit sloppy code — fix the code. Code that compiles and passes lint but doesn't fulfil its stated purpose is a bug.

### Verify phase — green, lint, type-check

```bash
npm run typecheck:extensions
npm test
```

All tests green. Zero lint/type errors. Then refactor while green.

### Review phase — verify usefulness

**Purpose re-check (final):** does the code serve a real production consumer?

Verify: who calls it (`rg` for usages), what production path depends on it, what signal would reveal breakage. **If only tests call it, it is not finished or not needed.**

**Falsifier follow-through:** re-check the falsifier from the Plan phase. If the falsifier is observable post-deploy, add it to monitoring or to the unit's verification commands. A falsifier that is never checked after deploy is half a contract.

**Zero callers ≠ zero purpose.** Before deleting: does it serve an unmet need (wire it in) or is it superseded (delete it)? Never test for absence of old code — test that new behaviour works.

### Confidence Gate (between phases)

After completing a step, state confidence as a number `0.0–1.0` and a one-line reason. The number forces a pause to assess rather than plowing ahead on momentum.

| Step | Threshold | Below threshold |
|---|---|---|
| Purpose & consumer | 0.95 | Run an adversarial review wave (advisory-partner Q3/Q5). |
| Contract test | 0.90 | Adversarial review wave. |
| Implementation | 0.95 | Add a specialist reviewer for the touched boundary (e.g. provider/transport/security). |
| Final evidence | 0.97 | Full adversarial: advocate + challenger + specialist. |

Skip the gate for trivial steps (typo fix, exhaustive matches with full coverage). The gate earns its keep on I/O boundaries, async loading, protocol integration, and anything touching real backends or models.

LLM confidence numbers are poorly calibrated in absolute terms — the *relative* signal matters. If you write 0.7, you know you're guessing. Act on that.

## Tests Find Gaps

Testing existing code is one of the highest-value activities sf can do. A test that reveals an existing gap is more valuable than one validating new code — the gap was compounding in production.

High-value gap tests:

- **Purpose** — does this module do what its JSDoc claims?
- **Fallback** — does failure surface or get masked?
- **Persistence** — does state survive restart? (especially `.sf/sf.db`, `.sf/runtime/*.json`)
- **Boundary** — what happens at empty input, max value, network partition, expired claim?
- **Contract** — does the caller get what it expects?

When a test fails against existing code, fix the code. The test told you what was broken.

50 tested features > 500 untested ones.

## Test Rules

- **Test first.** Without it, you mirror implementation — bugs and all.
- **Bug = missing correct-behaviour test.** Write a test for the *correct* behaviour first; it must fail (RED) because the bug exists. If it passes immediately, the test is wrong (testing the broken behaviour) — fix the test, not the code.
- **Bug reports → failing regression test first.**
- **Behaviour change without tests is incomplete.**
- **Bad tests produce bad code.** A test validating silent failure is wrong — rewrite it.
- **Test through the public contract.** Don't expose `_helpers` for testability; assert through real callers.
- **Tests pin behaviour, not internal decomposition.** A test that breaks on refactor without behaviour change is mechanical, not purposeful.
- **Critical invariants may need property tests, not just examples** (e.g. ULID monotonicity, claim race, idempotent migrations).
- **Fix code to satisfy live-contract tests. Fix or delete tests encoding stale behaviour.**
- **Fallbacks must deliver working behaviour or not exist.** A fallback that silently returns nothing is worse than none.

## Test Boundaries

- Test through the public contract that production consumers use.
- Do not promote `_helper` to `helper` for testing convenience.
- Assert through public methods, not implementation detail.
- Tests pin behaviour, not internal decomposition.
- For Node.js native test runner: `async` test functions and `await`; never call `.then()`/`.catch()` chains in test bodies when `await` expresses the same contract.

## Self-Modification Boundary

sf modifies its own codebase via the auto-loop. Without a protected zone, constitutional drift is silent.

**Protected files (human approval required):**
`SPEC.md`, `BUILD_PLAN.md`, `UPSTREAM_PORT_GUIDE.md`, `AGENTS.md`, `CLAUDE.md`, `CONTRIBUTING.md`, `docs/SPEC_FIRST_TDD.md`, every `docs/dev/ADR-*.md`.

Autonomous agents may propose changes but must not merge to these without human review.

**Test infrastructure** (`tests/`, `*.test.ts`, `tsconfig*.json`, lint config) requires advocate/challenger/falsifier — a change to test infra can make all future tests pass vacuously. Treat test-infra changes as governance-adjacent: they alter the validity of every test that runs after them. A corrupted test runner is more dangerous than a corrupted test.

## Evidence

Required for production-impacting changes:

- failing test → passing test → type-check → lint
- advocate's strongest support, challenger's strongest opposition, falsifier + outcome
- runtime evidence: traces (`.sf/traces/`), event log (`.sf/event-log.jsonl`), gate results
- for non-trivial runtime/provider fixes: explicit repro before code, solved boundary after code

Persist learning: when a unit produces a gotcha or anti-pattern, write to sf's memory store (`memories` table) so the next unit sees it. Evidence that only lives in the conversation dies on restart.

## Degraded Operation

| Dependency down | Behaviour |
|---|---|
| Native engine (`forge_engine.node`) | Fall back to JS implementations; log degraded mode. Never silently proceed without confirming fallback path is wired. |
| `node:sqlite` and `better-sqlite3` both unavailable | Filesystem-derived state (no DB); log degraded discovery. Block any operation that requires durable state. |
| LLM provider | Try next allowed provider per `~/.sf/preferences.md`; if exhausted, halt unit with `ErrModelUnavailable` (no silent skip). |
| SOPS unavailable | Use already-exported env vars; log that secret refresh is unavailable. Block secret-touching commands. |

When a dependency is down: operate in defined degraded mode or stop. Never silently proceed.
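
The LLM-provider row, sketched with hypothetical synchronous provider signatures for brevity; only the `ErrModelUnavailable` name comes from the table above:

```typescript
class ErrModelUnavailable extends Error {}

// Try each allowed provider in order; a failure degrades to the next provider,
// never to a silent skip. Exhaustion halts the unit with an explicit error.
function completeWithFallback(
  providers: Array<(prompt: string) => string>,
  prompt: string,
): string {
  for (const call of providers) {
    try {
      return call(prompt);
    } catch {
      // log degraded mode here, then fall through to the next allowed provider
    }
  }
  throw new ErrModelUnavailable("all allowed providers exhausted");
}
```

The defined degraded mode is the fallthrough; the defined stop is the thrown error. There is no third, silent path.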

## Task Template

Each task:

**Purpose** (need + why) → **Consumer** (who depends) → **Contract** (test proving it) → **Implementation** (code changes) → **Evidence** (test + lint + runtime signal).

If a task cannot be described this way, it is underspecified.

## See Also

- [`AGENTS.md`](../AGENTS.md) — repo guidelines, build/test/lint commands.
- [`SPEC.md`](../SPEC.md) — sf v3 specification (what we're building).
- [`UPSTREAM_PORT_GUIDE.md`](../UPSTREAM_PORT_GUIDE.md) — porting from pi-mono legacy port.
- [`src/resources/extensions/sf/skills/advisory-partner/SKILL.md`](../src/resources/extensions/sf/skills/advisory-partner/SKILL.md) — adversarial review framework.
- [`src/resources/extensions/sf/skills/code-review/SKILL.md`](../src/resources/extensions/sf/skills/code-review/SKILL.md) — multi-lens review skill.

## References

- GitHub Spec Kit — spec-first authoring patterns.
- Ousterhout, *A Philosophy of Software Design* — deep modules, contract pattern.
- Trail of Bits — anti-rationalisation rules.
- ACE — original Iron Law / Purpose Gate framing this doc adapts.