diff --git a/src/resources/extensions/sf/prompts/gate-evaluate.md b/src/resources/extensions/sf/prompts/gate-evaluate.md index 8fc934087..e10e56bb7 100644 --- a/src/resources/extensions/sf/prompts/gate-evaluate.md +++ b/src/resources/extensions/sf/prompts/gate-evaluate.md @@ -25,7 +25,17 @@ You are evaluating **quality gates in parallel** for this slice. Each gate is an 3. **Verify each gate wrote its result** by checking that `sf_save_gate_result` was called for each gate ID. 4. **Report the batch outcome** — which gates passed, which flagged concerns, and which were omitted as not applicable. -Gate agents may return `verdict: "omitted"` if the gate question is not applicable to this slice (e.g., no auth surface for Q3, no existing requirements touched for Q4). This is expected for simple slices. +## Verdict Discipline + +Gate agents return one of `passed`, `failed`, or `omitted`. + +- `passed` — the gate question was answerable for this slice and the answer is "no concern". +- `failed` — the gate question was answerable for this slice and the answer is "concern that must be addressed before execution". +- `omitted` — the gate question is *not applicable* to this slice (e.g., no auth surface for an auth gate; no existing requirements touched for a requirements gate). + +**`omitted` is not a hedge.** It exists for "the question does not apply here", not for "I am not sure". If the gate ran and the question was answerable, the agent must decide `passed` or `failed`. Each `omitted` verdict must include a one-line reason naming what is absent (e.g., "no network calls in this slice"). + +When the batch returns, scan for `omitted` verdicts without a reason. Treat any unexplained `omitted` as failed-to-decide and re-dispatch that gate with an explicit instruction to pick `passed` or `failed`. 
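The closing scan for unexplained `omitted` verdicts can be mechanised. A minimal TypeScript sketch, assuming a hypothetical `GateResult` shape — the real `sf_save_gate_result` payload may differ:

```typescript
// Hypothetical result shape; adjust to the actual sf_save_gate_result payload.
interface GateResult {
  gateId: string;
  verdict: "passed" | "failed" | "omitted";
  reason?: string; // expected to be present when verdict is "omitted"
}

// Return the gate IDs whose `omitted` verdict has no substantive reason —
// these are failed-to-decide and should be re-dispatched.
function findFailedToDecide(results: GateResult[]): string[] {
  return results
    .filter((r) => r.verdict === "omitted" && !(r.reason ?? "").trim())
    .map((r) => r.gateId);
}
```

Each returned gate ID gets re-dispatched with an explicit instruction to pick `passed` or `failed`.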
## Subagent Prompts diff --git a/src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md b/src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md index 728ebbd5b..4f75f6ee0 100644 --- a/src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md +++ b/src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md @@ -254,6 +254,23 @@ sf_save_memory( ) ``` +## Subagent Prompt Audit + +Before dispatching, audit each subagent prompt. The subagent inherits sf's tool surface but **does not** inherit the parent's mode constraints (autonomous vs auto, ask-gate state) — the prompt is the only thing constraining its behaviour. + +Check the prompt for each of: + +- **Smuggled user-question.** In autonomous mode, the parent is forbidden from `ask_user_questions` (see `bootstrap/ask-gate.ts`). A subagent dispatched with "ask the user about X" sidesteps this. Reject prompts that delegate a user-facing question the parent itself isn't allowed to ask. +- **Action-class delegation.** A prompt that instructs the subagent to perform a destructive action (force-push, drop schema, delete worktree) inherits that action's risk regardless of which agent runs it. If the action would be blocked for the parent, block it at dispatch — don't route around the gate by spawning a child. +- **Scope creep.** A research subagent ("scout") with a prompt that asks it to *also* fix what it finds is two roles in one. Split the roles: the scout reports, and the parent decides whether to dispatch a worker. Mixing roles in one prompt produces unauditable side effects. +- **Tool surface vs prompt mismatch.** A subagent whose agent definition sets `disallowedTools: [Edit, Write]` (a read-only role) cannot satisfy a prompt that asks it to "fix the bug". Either the agent or the prompt is wrong; reject the dispatch and pick the right pairing. 
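Parts of this audit can be front-loaded as a mechanical red-flag scan before the judgment call. A TypeScript sketch — the patterns are illustrative placeholders, not sf's actual dispatch checks:

```typescript
// Illustrative red-flag patterns only — tune to your real prompt conventions.
const AUDIT_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "smuggled user-question", pattern: /ask(_user_questions| the user)/i },
  { label: "action-class delegation", pattern: /force-push|drop schema|delete .*worktree|rm -rf/i },
  { label: "scope creep", pattern: /\balso fix\b|\bfix what you find\b/i },
];

// Returns the labels of every pattern the prompt trips; empty means no red flags.
function auditSubagentPrompt(prompt: string): string[] {
  return AUDIT_PATTERNS.filter((p) => p.pattern.test(prompt)).map((p) => p.label);
}
```

A tripped pattern is a cue to re-read the prompt, not an automatic rejection — the judgment calls above still apply.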
+ +When the subagent returns, scan the result for these patterns before treating it as evidence: + +- **Hedge words** ("should be fine", "probably", "looks correct based on my reading"). The subagent rationalised instead of verifying. Re-dispatch with explicit verification requirements. +- **Tool-result errors glossed over.** Search the subagent's reported actions for `Error:`, `failed`, non-zero exit codes that the subagent narrated past. If the subagent treated a tool failure as success, its conclusion is unreliable. +- **Self-reports without traces.** "I ran the tests and they pass" without a Command/Output block is the agentic equivalent of a hand-wave. Trust traces, not summaries. + ## Hard Rules - **Stay inside sf.** Do not shell out to `claude`, `codex`, or any other external coding-agent CLI. sf's own subagent fabric handles all of this with proper trace integration. diff --git a/src/resources/extensions/sf/skills/researcher/SKILL.md b/src/resources/extensions/sf/skills/researcher/SKILL.md index 6cf6e57d0..69a5c52c7 100644 --- a/src/resources/extensions/sf/skills/researcher/SKILL.md +++ b/src/resources/extensions/sf/skills/researcher/SKILL.md @@ -13,6 +13,25 @@ Research a topic using four complementary information sources, in priority order This skill is the first step before planning — it produces the evidence base that drives good decisions. Without research, agents plan from assumptions; with this skill, they plan from evidence. + + +This is a **read-only** skill. Research produces a report; it does not modify the repo, the database, or external state. + +Strictly prohibited while running this skill: + +- File creation, modification, or deletion (no `Write`, `Edit`, `NotebookEdit`, `touch`, `rm`, `mv`, `cp`). +- Bash redirects or heredocs that write files: `>`, `>>`, `tee`, `cat <<EOF > file`, `python -c "open(...).write(...)"`. The shell back-door does not bypass the read-only contract. +- DB writes: `sqlite3 .sf/sf.db "INSERT|UPDATE|DELETE|DROP|CREATE ..."`. 
Use `SELECT` only. +- Git write operations: `add`, `commit`, `push`, `merge`, `rebase`, `checkout -b`, `branch -d`. +- Package installs: `npm install`, `pip install`, `cargo add`. +- Spawning subagents that perform any of the above on the researcher's behalf — the prohibition applies to delegated actions as well. + +The single allowed write is the research report itself, written via the parent agent's tools after this skill returns its findings — not by the researcher mid-skill. Communicate findings as a structured report in the response, not by writing intermediate scratch files. + +If a research question genuinely requires a write to answer (e.g., "does X actually work?" requires running a script), surface that as a finding labelled "requires execution" so the parent can dispatch a `worker` subagent — do not perform the write yourself. + + + **Serena MCP (code intelligence — USE FIRST for code exploration):** diff --git a/src/resources/extensions/sf/skills/spec-first-tdd/SKILL.md b/src/resources/extensions/sf/skills/spec-first-tdd/SKILL.md index 2f982e387..780fb1b5a 100644 --- a/src/resources/extensions/sf/skills/spec-first-tdd/SKILL.md +++ b/src/resources/extensions/sf/skills/spec-first-tdd/SKILL.md @@ -39,14 +39,22 @@ Side-chain (any bug/failure during TDD): invoke `systematic-debugging`. When the ## Workflow -### 1. Purpose, Consumer, Value +### 1. PDD packet (all 8 fields) -Before any test, answer: +Before any test, fill the PDD packet — same 8 fields as [`purpose-driven-development`](../purpose-driven-development/SKILL.md). Save to `.sf/active/{unit-id}/pdd.md`. The reviewer (`requesting-code-review`) reads from this same packet, so what you write here is what they review. - **Purpose:** why this behaviour exists. - **Consumer:** which production code path depends on it (`rg -n "fnName" src/ packages/`). If you can't name a real caller, stop. -- **Value at risk:** what breaks (and who notices) if it returns the wrong answer. 
-- **Out of scope:** what this slice explicitly does *not* cover. +- **Contract:** what observable behaviour proves success — what the consumer receives, not the internal mechanics. +- **Failure boundary:** what *correct failure* looks like if the purpose can't be fulfilled — degrade, surface, do not swallow. +- **Evidence:** the test or metric that proves the contract. Must be machine-executable (named test file, queryable metric, runnable command) or `[MANUAL: reviewer + scenario]`. Prose-only is rejected. +- **Non-goals:** what this slice explicitly does *not* cover. +- **Invariants:** what must remain true. If the change touches async / queues / timers / state machines: split safety ("X never happens") + liveness ("Y eventually happens"). Pure synchronous code may use safety-only. +- **Assumptions:** what conditions about the world MUST be true for this spec to be valid — locking protocols, API stability, caller invariants, deployment context, data shape. + +Also useful to flag for downstream review (not part of the PDD packet, but used by reviewers to calibrate scrutiny): + +- **Value at risk:** what breaks (and who notices) if the contract returns the wrong answer. Label evidence: @@ -187,3 +195,14 @@ Never fix code without a failing test first. The test is the proof that the bug - `npm run typecheck:extensions` clean. - For non-trivial runtime/provider fixes: explicit repro before code, solved boundary after. - Updated active-slice artifact when one exists in `.sf/active/{unit-id}/`. + +Each evidence claim must include a Command/Output trace: + +``` +### Check: +**Command run:** +**Output observed:** +**Result: PASS** (or FAIL — with Expected vs Actual) +``` + +Reading test source and asserting "this should pass" is not evidence. Run it; paste the output; then claim PASS. 
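Filled in, a passing check might look like this — the test file name and output line are illustrative, not artefacts of this repo:

```
### Check: contract test for the verdict parser
**Command run:** npm run test -- gate-verdict.test.ts
**Output observed:** Tests: 3 passed, 3 total
**Result: PASS**
```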
diff --git a/src/resources/extensions/sf/skills/systematic-debugging/SKILL.md b/src/resources/extensions/sf/skills/systematic-debugging/SKILL.md index de3a6dc14..1eeecf1bd 100644 --- a/src/resources/extensions/sf/skills/systematic-debugging/SKILL.md +++ b/src/resources/extensions/sf/skills/systematic-debugging/SKILL.md @@ -14,6 +14,18 @@ NO NON-TRIVIAL FIX WITHOUT AN EXPLICIT REPRO. If you haven't completed Phase 1, you cannot propose fixes. +## Recognize Your Own Rationalizations + +Debugging is where the LLM's verification failure modes hit hardest. You will feel the urge to skip evidence collection and jump to a fix. These are the exact excuses you reach for — recognize and invert each: + +- "I bet I know what this is — let me just try the fix." → No. The fix you have in mind is a hypothesis. Phase 1 (Evidence) and Phase 2 (Root Cause) are not optional. +- "The error message obviously says it's X." → Error messages name the symptom. They rarely name the root cause. Read the stack trace; find the line; verify the call site. +- "The test was probably flaky — let me re-run it." → Re-running a test that failed is not a fix; it is hoping the bug went away. If a re-run greens, the failure was not flaky — it was timing-dependent or order-dependent, which is a real bug. +- "The diff is small, so the change can't be the cause." → Small diffs ship the most subtle regressions. Confirm via `git log` / `git bisect`, not by ruling the change out on size alone. +- "It works on my machine." → "Your machine" is one execution. If the failure is reproducible, reproduce it; if not, capture the environment difference *as a finding*, not as a dismissal. + +If you catch yourself proposing a fix before Phase 1 evidence is in writing — stop. Read the logs, the trace, the event log. Then decide. + ## When to Run - A test fails, RED happens for the wrong reason, GREEN regresses, build fails, lint fails. 
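For the small-diff excuse, the mechanical confirmation can be a bisect run — a sketch only: `<good-sha>` is a placeholder for your last known-good commit, and `npm test` stands in for whatever command reproduces the failure:

```
git bisect start
git bisect bad HEAD
git bisect good <good-sha>
# bisect drives the search: exit 0 marks a commit good, non-zero marks it bad
git bisect run npm test
git bisect reset
```

The first bad commit bisect reports is evidence for Phase 2; the diff's size never was.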
@@ -109,6 +121,19 @@ Use `spec-first-tdd` for the contract test if the bug crosses a non-trivial boun 4. For non-trivial fixes: re-run the same repro/debug path; capture the **solved** boundary output. 5. For runtime fixes: tail the event log and confirm the error class disappears. +### Trace format for verification + +Every claimed PASS in the slice artefact must be backed by a trace — paraphrase is not evidence: + +``` +### Check: +**Command run:** +**Output observed:** +**Result: PASS** (or FAIL — with Expected vs Actual) +``` + +A check without a `Command run` block is a skip. "I re-ran the repro and it works now" without the command and the output is a self-report, not verification. + Persist the pattern to memory so future units don't re-hit it: ```