diff --git a/src/resources/extensions/sf/prompts/gate-evaluate.md b/src/resources/extensions/sf/prompts/gate-evaluate.md
index 8fc934087..e10e56bb7 100644
--- a/src/resources/extensions/sf/prompts/gate-evaluate.md
+++ b/src/resources/extensions/sf/prompts/gate-evaluate.md
@@ -25,7 +25,17 @@ You are evaluating **quality gates in parallel** for this slice. Each gate is an
3. **Verify each gate wrote its result** by checking that `sf_save_gate_result` was called for each gate ID.
4. **Report the batch outcome** — which gates passed, which flagged concerns, and which were omitted as not applicable.
-Gate agents may return `verdict: "omitted"` if the gate question is not applicable to this slice (e.g., no auth surface for Q3, no existing requirements touched for Q4). This is expected for simple slices.
+## Verdict Discipline
+
+Gate agents return one of `passed`, `failed`, or `omitted`.
+
+- `passed` — the gate question was answerable for this slice and the answer is "no concern".
+- `failed` — the gate question was answerable for this slice and the answer is "concern that must be addressed before execution".
+- `omitted` — the gate question is *not applicable* to this slice (e.g., no auth surface for an auth gate; no existing requirements touched for a requirements gate).
+
+**`omitted` is not a hedge.** It exists for "the question does not apply here", not for "I am not sure". If the gate ran and the question was answerable, the agent must decide `passed` or `failed`. Each `omitted` verdict must include a one-line reason naming what is absent (e.g., "no network calls in this slice").
+
+When the batch returns, scan for `omitted` verdicts without a reason. Treat any unexplained `omitted` as failed-to-decide and re-dispatch that gate with an explicit instruction to pick `passed` or `failed`.
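+
+The post-batch scan can be sketched as follows (illustrative Python; the result-dict shape is an assumption, not the actual `sf_save_gate_result` schema):
+
+```python
+# Assumed shape: {"gate": "Q3", "verdict": "omitted", "reason": "no auth surface"}
+def gates_to_redispatch(results):
+    """Gate IDs whose `omitted` verdict carries no stated reason."""
+    return [
+        r["gate"]
+        for r in results
+        if r["verdict"] == "omitted" and not r.get("reason", "").strip()
+    ]
+```
+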
## Subagent Prompts
diff --git a/src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md b/src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md
index 728ebbd5b..4f75f6ee0 100644
--- a/src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md
+++ b/src/resources/extensions/sf/skills/dispatching-subagents/SKILL.md
@@ -254,6 +254,23 @@ sf_save_memory(
)
```
+## Subagent Prompt Audit
+
+Before dispatching, audit each subagent prompt. The subagent inherits sf's tool surface but **does not** inherit the parent's mode constraints (autonomous vs auto, ask-gate state) — the prompt is the only thing constraining its behaviour.
+
+Check the prompt for each of:
+
+- **Smuggled user-question.** In autonomous mode, the parent is forbidden from `ask_user_questions` (see `bootstrap/ask-gate.ts`). A subagent dispatched with "ask the user about X" sidesteps this. Reject prompts that delegate a user question the parent itself isn't allowed to ask.
+- **Action-class delegation.** A prompt that instructs the subagent to perform a destructive action (force-push, drop schema, delete worktree) inherits that action's risk regardless of which agent runs it. If the action would be blocked for the parent, block it at dispatch — don't route around the gate by spawning a child.
+- **Scope creep.** A research subagent ("scout") with a prompt that asks it to *also* fix what it finds is two roles in one. Split: scout reports, parent decides whether to dispatch a worker. Mixing roles in one prompt produces unauditable side effects.
+- **Tool surface vs prompt mismatch.** A subagent agent definition with `disallowedTools: [Edit, Write]` (read-only role) cannot satisfy a prompt that asks for "fix the bug". Either the agent or the prompt is wrong; reject the dispatch and pick the right pairing.
+
+When the subagent returns, scan the result for these patterns before treating it as evidence:
+
+- **Hedge words** ("should be fine", "probably", "looks correct based on my reading"). The subagent rationalised instead of verifying. Re-dispatch with explicit verification requirements.
+- **Tool-result errors glossed over.** Search the subagent's reported actions for `Error:`, `failed`, or non-zero exit codes that the subagent narrated past. If the subagent treated a tool failure as success, its conclusion is unreliable.
+- **Self-reports without traces.** "I ran the tests and they pass" without a Command/Output block is the agentic equivalent of a hand-wave. Trust traces, not summaries.
+
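+The same scan can be sketched in code (illustrative Python; the hedge list and error markers are examples, not an exhaustive set):
+
+```python
+import re
+
+HEDGES = ("should be fine", "probably", "looks correct")
+ERROR_MARKERS = re.compile(r"Error:|\bfailed\b|exit code [1-9]")
+
+def result_concerns(transcript: str) -> list[str]:
+    """Flag hedges, glossed tool errors, and trace-free self-reports."""
+    text = transcript.lower()
+    concerns = [f"hedge: {h!r}" for h in HEDGES if h in text]
+    if ERROR_MARKERS.search(transcript):
+        concerns.append("tool error in transcript; confirm it was handled")
+    if "pass" in text and "command run" not in text:
+        concerns.append("claims success without a Command/Output trace")
+    return concerns
+```
+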
## Hard Rules
- **Stay inside sf.** Do not shell out to `claude`, `codex`, or any other external coding-agent CLI. sf's own subagent fabric handles all of this with proper trace integration.
diff --git a/src/resources/extensions/sf/skills/researcher/SKILL.md b/src/resources/extensions/sf/skills/researcher/SKILL.md
index 6cf6e57d0..69a5c52c7 100644
--- a/src/resources/extensions/sf/skills/researcher/SKILL.md
+++ b/src/resources/extensions/sf/skills/researcher/SKILL.md
@@ -13,6 +13,25 @@ Research a topic using four complementary information sources, in priority order
This skill is the first step before planning — it produces the evidence base that drives good decisions. Without research, agents plan from assumptions; with this skill, they plan from evidence.
+
+## Read-Only Contract
+
+This is a **read-only** skill. Research produces a report; it does not modify the repo, the database, or external state.
+
+Strictly prohibited while running this skill:
+
+- File creation, modification, or deletion (no `Write`, `Edit`, `NotebookEdit`, `touch`, `rm`, `mv`, `cp`).
+- Bash redirects or heredocs that write files: `>`, `>>`, `tee`, `cat <<'EOF' > file`, `python -c "open(...).write(...)"`. The shell back-door does not bypass the read-only contract.
+- DB writes: `sqlite3 .sf/sf.db "INSERT|UPDATE|DELETE|DROP|CREATE ..."`. Use `SELECT` only.
+- Git write operations: `add`, `commit`, `push`, `merge`, `rebase`, `checkout -b`, `branch -d`.
+- Package installs: `npm install`, `pip install`, `cargo add`.
+- Spawning subagents that perform any of the above on the researcher's behalf — the prohibition applies to delegated actions as well.
+
+The single allowed write is the research report itself, written via the parent agent's tools after this skill returns its findings — not by the researcher mid-skill. Communicate findings as a structured report in the response, not by writing intermediate scratch files.
+
+If a research question genuinely requires a write to answer (e.g., "does X actually work?" requires running a script), surface that as a finding labelled "requires execution" so the parent can dispatch a `worker` subagent — do not perform the write yourself.
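+
+A dispatch-time audit of a proposed shell command can be sketched as (illustrative Python; the pattern list mirrors the bullets above and is deliberately coarse, so false positives are acceptable for a read-only audit):
+
+```python
+import re
+
+# Coarse write-detection; errs on the side of flagging.
+WRITE_PATTERNS = re.compile(
+    r">>?|\btee\b|\b(rm|mv|cp|touch)\b"
+    r"|\bgit (add|commit|push|merge|rebase)\b"
+    r"|\b(npm|pip) install\b|\bcargo add\b"
+    r"|\b(INSERT|UPDATE|DELETE|DROP|CREATE)\b"
+)
+
+def is_read_only(command: str) -> bool:
+    """True only if the command matches no known write pattern."""
+    return not WRITE_PATTERNS.search(command)
+```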
+
**Serena MCP (code intelligence — USE FIRST for code exploration):**
diff --git a/src/resources/extensions/sf/skills/spec-first-tdd/SKILL.md b/src/resources/extensions/sf/skills/spec-first-tdd/SKILL.md
index 2f982e387..780fb1b5a 100644
--- a/src/resources/extensions/sf/skills/spec-first-tdd/SKILL.md
+++ b/src/resources/extensions/sf/skills/spec-first-tdd/SKILL.md
@@ -39,14 +39,22 @@ Side-chain (any bug/failure during TDD): invoke `systematic-debugging`. When the
## Workflow
-### 1. Purpose, Consumer, Value
+### 1. PDD packet (all 8 fields)
-Before any test, answer:
+Before any test, fill the PDD packet — same 8 fields as [`purpose-driven-development`](../purpose-driven-development/SKILL.md). Save to `.sf/active/{unit-id}/pdd.md`. The reviewer (`requesting-code-review`) reads from this same packet, so what you write here is what they review.
- **Purpose:** why this behaviour exists.
- **Consumer:** which production code path depends on it (`rg -n "fnName" src/ packages/`). If you can't name a real caller, stop.
-- **Value at risk:** what breaks (and who notices) if it returns the wrong answer.
-- **Out of scope:** what this slice explicitly does *not* cover.
+- **Contract:** what observable behaviour proves success — what the consumer receives, not the internal mechanics.
+- **Failure boundary:** what *correct failure* looks like if the purpose can't be fulfilled — degrade, surface, do not swallow.
+- **Evidence:** the test or metric that proves the contract. Must be machine-executable (named test file, queryable metric, runnable command) or `[MANUAL: reviewer + scenario]`. Prose-only is rejected.
+- **Non-goals:** what this slice explicitly does *not* cover.
+- **Invariants:** what must remain true. If the change touches async / queues / timers / state machines: split safety ("X never happens") + liveness ("Y eventually happens"). Pure synchronous code may use safety-only.
+- **Assumptions:** what conditions about the world MUST be true for this spec to be valid — locking protocols, API stability, caller invariants, deployment context, data shape.
+
+Also useful to flag for downstream review (not part of the PDD packet, but used by reviewers to calibrate scrutiny):
+
+- **Value at risk:** what breaks (and who notices) if the contract returns the wrong answer.
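+
+A filled packet might look like this (hypothetical slice; every name and value below is illustrative):
+
+```
+Purpose: retry transient fetch failures so a sync run does not abort on one network blip.
+Consumer: `syncLoop` in `src/sync/loop.ts`.
+Contract: a fetch that fails once and succeeds on retry yields the same result as a clean fetch.
+Failure boundary: after 3 attempts, surface the last error to the caller; never return stale data silently.
+Evidence: `src/sync/__tests__/retry.test.ts`.
+Non-goals: backoff tuning; non-network error classes.
+Invariants: safety: no duplicate writes on retry; liveness: every attempt eventually resolves or errors.
+Assumptions: the fetched endpoint is idempotent for GETs; callers tolerate up to 3x latency.
+```
+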
Label evidence:
@@ -187,3 +195,14 @@ Never fix code without a failing test first. The test is the proof that the bug
- `npm run typecheck:extensions` clean.
- For non-trivial runtime/provider fixes: explicit repro before code, solved boundary after.
- Updated active-slice artifact when one exists in `.sf/active/{unit-id}/`.
+
+Each evidence claim must include a Command/Output trace:
+
+```
+### Check:
+**Command run:**
+**Output observed:**
+**Result: PASS** (or FAIL — with Expected vs Actual)
+```
+
+Reading test source and asserting "this should pass" is not evidence. Run it; paste the output; then claim PASS.
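+
+A filled trace might look like this (the check and its output are illustrative; only the command is taken from the checklist above):
+
+```
+### Check: typecheck is clean
+**Command run:** npm run typecheck:extensions
+**Output observed:** tsc exited 0, no errors reported
+**Result: PASS**
+```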
diff --git a/src/resources/extensions/sf/skills/systematic-debugging/SKILL.md b/src/resources/extensions/sf/skills/systematic-debugging/SKILL.md
index de3a6dc14..1eeecf1bd 100644
--- a/src/resources/extensions/sf/skills/systematic-debugging/SKILL.md
+++ b/src/resources/extensions/sf/skills/systematic-debugging/SKILL.md
@@ -14,6 +14,18 @@ NO NON-TRIVIAL FIX WITHOUT AN EXPLICIT REPRO.
If you haven't completed Phase 1, you cannot propose fixes.
+## Recognize Your Own Rationalizations
+
+Debugging is where the LLM's verification failure modes hit hardest. You will feel the urge to skip evidence collection and jump to a fix. These are the exact excuses you reach for — recognize and invert each:
+
+- "I bet I know what this is — let me just try the fix." → No. The fix you have in mind is a hypothesis. Phase 1 (Evidence) and Phase 2 (Root Cause) are not optional.
+- "The error message obviously says it's X." → Error messages name the symptom. They rarely name the root cause. Read the stack trace; find the line; verify the call site.
+- "The test was probably flaky — let me re-run it." → Re-running a test that failed is not a fix; it is hoping the bug went away. If a re-run greens, the bug has not vanished: the failure is timing-dependent or order-dependent, and that nondeterminism is itself a real bug.
+- "The diff is small, so the change can't be the cause." → Small diffs ship the most subtle regressions. Confirm via `git log` / `git bisect`, not by ruling out by size.
+- "It works on my machine." → "Your machine" is one execution. If the failure is reproducible, reproduce it; if not, capture the environment difference *as a finding*, not as a dismissal.
+
+If you catch yourself proposing a fix before Phase 1 evidence is in writing — stop. Read the logs, the trace, the event log. Then decide.
+
## When to Run
- A test fails, RED happens for the wrong reason, GREEN regresses, build fails, lint fails.
@@ -109,6 +121,19 @@ Use `spec-first-tdd` for the contract test if the bug crosses a non-trivial boun
4. For non-trivial fixes: re-run the same repro/debug path; capture the **solved** boundary output.
5. For runtime fixes: tail the event log and confirm the error class disappears.
+### Trace format for verification
+
+Every claimed PASS in the slice artefact must be backed by a trace — paraphrase is not evidence:
+
+```
+### Check:
+**Command run:**
+**Output observed:**
+**Result: PASS** (or FAIL — with Expected vs Actual)
+```
+
+A check without a `Command run` block is a skip. "I re-ran the repro and it works now" without the command and the output is a self-report, not verification.
+
Persist the pattern to memory so future units don't re-hit it:
```