docs(sf): self-awareness + adversarial probe + trace format in verify/review skills
Adds three patterns from Piebald-AI/claude-code-system-prompts (extracted from
the public Claude Code npm bundle) to SF's two completion-gate skills:
- "You are bad at this" self-awareness sections at the top of finish-and-verify
and code-review — naming the LLM-specific failure modes (read-don't-run,
trust-self-reports, hedge-when-uncertain, fooled-by-AI-slop) instead of the
generic "be thorough" framing.
- Rationalization-callouts that name the exact excuses the agent reaches for
("probably fine", "tests already pass", "looks correct based on my reading")
and invert each with a counter-instruction.
- Mandatory adversarial probe before slice-done / Lens 1 APPROVE: at least one
boundary / idempotency / concurrency / orphan-reference probe with documented
result, even when behaviour was correct.
- Command/Output/Result trace format for verification evidence — paraphrase is
not evidence; a check without a Command-run block is a skip.
- Anti-hedge guard on code-review verdicts: APPROVE_WITH_FIXES is not for "I'm
not sure"; findings without traces drop to Medium.
PDD packet (inline since prose-only edit, no code):
- Purpose: when these skills load, the agent reads its own failure-mode catalogue
- Consumer: every slice close (finish-and-verify) and every review (code-review)
- Contract: SKILL.md text contains rationalizations + adversarial probe + trace format
- Evidence: grep finds ≥3 keyword matches per file; typecheck:extensions exits 0; dist parity
- Non-goals: no edits to gate-evaluate.md, dispatching-subagents, ask-gate.ts (deferred)
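The Evidence line above can be rehearsed as a runnable sketch. The fixture file and the keyword regex below are illustrative stand-ins, not the real SKILL.md paths or the exact keywords the gate greps for:

```shell
# Sketch of the grep-based Evidence check from the PDD packet.
# The fixture file stands in for a real SKILL.md; keywords are illustrative.
f=$(mktemp)
cat > "$f" <<'EOF'
## Recognize Your Own Rationalizations
**Mandatory adversarial probe.** Run at least one boundary probe.
**Command run:** npm run typecheck:extensions
EOF
hits=$(grep -c -i -E 'rationaliz|adversarial probe|command run' "$f")
echo "keyword hits: $hits"
if [ "$hits" -ge 3 ]; then echo "evidence: PASS"; else echo "evidence: FAIL"; fi
```

The real check would point at the two shipped SKILL.md files and additionally require `npm run typecheck:extensions` to exit 0.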
This commit is contained in:
parent
6ee31e83f4
commit
ef325d7b49
2 changed files with 92 additions and 4 deletions
@ -9,6 +9,32 @@ Run structured code review before declaring work complete. Uses specialized revi

---

## Self-Awareness — Why this skill exists

You are an LLM-driven reviewer, and you are bad at adversarial review. This is documented and persistent:

- You read a diff, see polished code that compiles, and feel inclined to APPROVE. The first 80% is the easy part. Your value is finding the last 20% the implementer didn't see.
- You're easily fooled by AI slop. The implementer is also an LLM — its tests may be circular, mocked to meaninglessness, or assert what the code does instead of what the contract requires. Volume of green output is not evidence of correctness.
- You trust the implementer's self-reports. "All tests pass." Did *you* re-read the test bodies and decide whether they actually prove the contract?
- When uncertain, you hedge with "looks reasonable" instead of taking a position. Hedging is not review.

Your mission in this skill is to catch yourself reaching for those defaults and do the opposite.

---

## Recognize Your Own Rationalizations

You will feel the urge to skip the hard parts. These are the excuses you reach for — recognize and invert each:

- "The code looks correct based on my reading." → Reading is not review. For at least one Lens 1 finding, exercise the change — run it, observe the output, write down what happened.
- "The implementer's tests already pass." → Read each test the implementer added. Is it asserting consumer-visible behaviour, or is it a rubber-stamp of the implementation? `mockFn.calls.length === 2` is not a contract.
- "This is probably fine." → Probably is not review. Either find the falsifier or write down what you tried and what would have changed your mind.
- "The diff is small, so the risk is small." → Small diffs ship the most subtle regressions. Run the same lenses regardless.

If you catch yourself writing "looks good" without a Command/Output trace, a falsifier you tried, or a finding — stop. The review is not done.

---

## When to Run

- Before marking a slice as done

@ -48,6 +74,8 @@ Apply each lens in sequence. For each finding, record:

- Logic inversions, missing edge cases (empty input, max values)
- Unchecked return values, swallowed errors

**Mandatory adversarial probe.** Before this lens reports APPROVE, run at least one of: boundary value (0, -1, empty, very long, unicode, MAX_INT), idempotency (same call twice — duplicate? error? correct no-op?), concurrency (parallel callers on a create-if-not-exists path), orphan reference (input ID that doesn't exist). Record the command, the observed output, and the result — even when behaviour was correct. A Bug Hunter pass with zero adversarial probes is a happy-path confirmation, not review.
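As a concrete sketch of the idempotency probe: apply the same operation twice and record what the second application does. `mkdir` here is a trivial stand-in for the real create-if-not-exists path under review:

```shell
# Idempotency probe: apply the same "create" operation twice and observe.
# mkdir is a stand-in; substitute the actual create-if-not-exists call.
d=$(mktemp -d)/resource
mkdir "$d" && echo "first call: created"
if mkdir "$d" 2>/dev/null; then
  echo "second call: created again — duplicate state, likely a bug"
else
  echo "second call: errored — is an error the documented contract, or should this be a no-op?"
fi
```

Whatever the second call does, write it down — the probe's value is the documented result, not a green run.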

### Lens 2: Security Auditor

- User input not validated at system boundaries
- SQL not parameterized, template injection risks

@ -98,20 +126,32 @@ Filter: only report findings with Confidence ≥ 70. Below that, mention as "low

## Phase 4: Verdict

End with a clear verdict on its own line, in the literal form below — no markdown bold, no punctuation, no variation:

```
VERDICT: BLOCK
```

or `VERDICT: APPROVE_WITH_FIXES` or `VERDICT: APPROVE`.
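Because the verdict line is literal, a downstream gate can extract it with a strict match. The parser below is a sketch of how a consumer might do this — it is not part of the skill text, and the sample review is invented:

```shell
# Strict verdict extraction: only the three literal forms count.
review='Critical: 0 issues
High: 1 issue (unchecked return in gate.ts)
VERDICT: APPROVE_WITH_FIXES'
verdict=$(printf '%s\n' "$review" | grep -E -o '^VERDICT: (BLOCK|APPROVE_WITH_FIXES|APPROVE)$' | tail -n 1)
echo "${verdict:-NO VERDICT FOUND}"
```

A bolded, reworded, or punctuated verdict would fail this match — which is exactly why the literal form matters.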

Then the body:

```
Critical: N issues (list them)
High: N issues (list them)
Action required: [what must be fixed before merge]
```

- **BLOCK**: any Critical issue
- **APPROVE_WITH_FIXES**: High issues only — can merge after fixes
- **APPROVE**: Medium/Low only — can merge, fix in follow-up

### Verdict discipline

- **APPROVE_WITH_FIXES is not a hedge.** It exists for "I found issues that must be fixed before merge", not for "I'm not sure". If you ran the lenses and decided nothing must be fixed, the verdict is APPROVE. If something must be fixed, list it and the verdict is APPROVE_WITH_FIXES or BLOCK.
- **No verdict without traces.** Each Critical or High finding must cite a file:line *and* the trace (command run, observed output, expected vs actual) that proves the issue is real, not theoretical. Findings without traces drop to Medium.
- **Don't downgrade an issue you can't reproduce.** If you suspect a real bug but can't reproduce it, list it as a separate "low-confidence observation" with the falsifier you tried — don't quietly omit it.

---

## Rules

@ -32,13 +32,38 @@ If used inside an autonomous iteration loop and the user goal is still in progre
3. Refresh `.sf/active/{unit-id}/active-slice.md` if one exists.
4. **Return control to the loop. Do not stop just because the slice is green.**

## Self-Awareness — Why this skill exists

You are an LLM-driven agent, and you are bad at verification. This is documented and persistent:

- You read code and write "looks good" instead of running it.
- You see the first 80% — a polished diff, a test name that matches the contract — and feel inclined to declare done. The first 80% is the easy part. Your value is the last 20%.
- You trust your own self-reports. "Tests pass." Did *you* run them, just now, after the last edit?
- You hedge when uncertain instead of deciding. "Probably fine" is not a verification result.

Knowing this, your job in this skill is to catch yourself doing those things and do the opposite.

## Recognize Your Own Rationalizations

You will feel the urge to skip checks. These are the exact excuses you reach for — recognize them and invert each:

- "The code looks correct based on my reading." → Reading is not verification. Run it.
- "The implementer's tests already pass." → The implementer is also an LLM. Verify independently — re-run, and read what each test actually asserts.
- "This is probably fine." → Probably is not verified. Run it.
- "Let me start the server and check the code." → No. Start the server and hit the endpoint.
- "This would take too long to actually exercise." → Not your call. The unit is not done until exercised.
- "The diff is small, so the risk is small." → Small diffs ship the most subtle regressions. Run the same checks regardless of diff size.

If you catch yourself writing an explanation in place of a command — stop. Run the command, paste the output, then write the explanation.

## Slice Done Gate

Do not declare a slice done until **all** are true:

- [ ] Contract test is green — and you ran it now, not 5 minutes ago.
- [ ] Required component verification is green (`npm run typecheck:extensions`, `npm test`, lint clean for touched files).
- [ ] Consumer-path check is explicit: `rg` confirms a real caller still depends on the changed symbol.
- [ ] **At least one adversarial probe was run.** Pick whichever fits the change: boundary value (0, -1, empty, very long, unicode), idempotency (apply twice — duplicate? error? correct no-op?), concurrency (parallel callers on a create-if-not-exists path), orphan reference (input ID that doesn't exist). Document the result even when "handled correctly" — a green run with no probe is happy-path confirmation, not verification.
- [ ] Active-slice artefact (`.sf/active/{unit-id}/active-slice.md` or equivalent) is updated with current state.
- [ ] Falsifier from the plan was either checked or admitted as residual risk.
- [ ] No slice-local next step remains undocumented.
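The consumer-path item can be made mechanical with `rg` (or plain `grep` where `rg` is unavailable). The symbol name `applySliceGate` and the fixture tree below are hypothetical stand-ins for a real changed symbol and the real source directory:

```shell
# Consumer-path check: a changed symbol must still have a real (non-import) caller.
# applySliceGate and the fixture file are hypothetical stand-ins.
src=$(mktemp -d)
printf 'import { applySliceGate } from "./gate";\napplySliceGate(unit);\n' > "$src/consumer.ts"
callers=$(grep -rn 'applySliceGate(' "$src" | grep -vc 'import')
if [ "$callers" -ge 1 ]; then
  echo "consumer-path: PASS ($callers real caller(s))"
else
  echo "consumer-path: FAIL — symbol is orphaned"
fi
```

Filtering out import lines matters: an import with no call site is exactly the orphaned-symbol case this check exists to catch.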

@ -123,6 +148,29 @@ For non-trivial slices:

For runtime/provider/transport changes, also capture explicit before/after repro evidence — traces alone are insufficient when boundary behaviour changed.

### Trace Format for Verification Evidence

Every claimed PASS in the slice artefact must be backed by a trace of this shape — paraphrase is not evidence:

```
### Check: <what was verified>
**Command run:** <exact command, copy-pasteable>
**Output observed:** <actual stdout/stderr — copy-paste, not paraphrase; truncate long output but keep the relevant part>
**Result: PASS** (or FAIL — with Expected vs Actual)
```

Rejected (do not use):

```
### Check: typecheck
**Result: PASS**
Evidence: I reviewed the touched files and the types look consistent.
```

(No command run. Reading types is not running tsc.)

A check without a `Command run` block is not a PASS — it's a skip.
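One way to keep traces honest is to generate them from the actual command run instead of writing them by hand. A minimal sketch — the command here is a trivial stand-in for a real check such as `npm run typecheck:extensions`:

```shell
# Emit a Command/Output/Result trace block from a real command run.
# The command is a trivial stand-in; substitute the real check.
cmd='printf "0 errors\n"'
out=$(eval "$cmd" 2>&1); status=$?
printf '### Check: typecheck\n'
printf '**Command run:** %s\n' "$cmd"
printf '**Output observed:** %s\n' "$out"
if [ "$status" -eq 0 ]; then
  printf '**Result: PASS**\n'
else
  printf '**Result: FAIL** (exit %s)\n' "$status"
fi
```

Because the output and exit status come from the run itself, a skipped command cannot silently become a PASS.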

## Rules

- Do not claim completion without fresh verification — even if you ran tests 5 minutes ago, run them again after the last edit.