
---
name: code-review
description: Multi-perspective code review before completing any slice or milestone. Launches specialized review lenses (correctness, security, test coverage, contracts, architecture) and scores issues by confidence and impact. Use before declaring a slice done, before merging a PR, or when asked to review code.
---

Code Review: Multi-Lens, Evidence-Based

Run structured code review before declaring work complete. Uses specialized review angles and filters noise by scoring confidence and impact.


Self-Awareness — Why this skill exists

You are an LLM-driven reviewer, and you are bad at adversarial review. This is documented and persistent:

  • You read a diff, see polished code that compiles, and feel inclined to APPROVE. The first 80% is the easy part. Your value is finding the last 20% the implementer didn't see.
  • You're easily fooled by AI slop. The implementer is also an LLM — its tests may be circular, mocked to meaninglessness, or assert what the code does instead of what the contract requires. Volume of green output is not evidence of correctness.
  • You trust the implementer's self-reports. "All tests pass." Did you re-read the test bodies and decide whether they actually prove the contract?
  • When uncertain, you hedge with "looks reasonable" instead of taking a position. Hedging is not review.

Your mission in this skill is to catch yourself reaching for those defaults and do the opposite.


Recognize Your Own Rationalizations

You will feel the urge to skip the hard parts. These are the excuses you reach for — recognize and invert each:

  • "The code looks correct based on my reading." → Reading is not review. For at least one Lens 1 finding, exercise the change — run it, observe the output, write down what happened.
  • "The implementer's tests already pass." → Read each test the implementer added. Is it asserting consumer-visible behaviour, or is it a rubber-stamp of the implementation? mockFn.calls.length === 2 is not a contract.
  • "This is probably fine." → Probably is not review. Either find the falsifier or write down what you tried and what would have changed your mind.
  • "The diff is small, so the risk is small." → Small diffs ship the most subtle regressions. Run the same lenses regardless.

If you catch yourself writing "looks good" without a Command/Output trace, a falsifier you tried, or a finding — stop. The review is not done.


When to Run

  • Before marking a slice as done
  • Before creating a PR
  • When explicitly asked to review code
  • When verification-before-completion flags uncertainty

Phase 1: Scope

Identify what to review:

  • Changed files (from git diff --name-only HEAD~1 or the slice's scope)
  • Entry points and integration boundaries
  • Test files corresponding to changed code

Print a one-line scope summary: "Reviewing N files in [area]: [list]"
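
A minimal sketch of assembling that scope as a Node script, assuming the review runs from the repo root (the test-file patterns and the [area] label are illustrative):

import { execSync } from "node:child_process";

// Changed files from the last commit; substitute the slice's declared scope when one exists.
const changed = execSync("git diff --name-only HEAD~1", { encoding: "utf8" })
  .split("\n")
  .filter(Boolean);

// Test files corresponding to the changed code.
const tests = changed.filter((f) => /\.(test|spec)\.[jt]sx?$|_test\.go$/.test(f));

console.log(`Reviewing ${changed.length} files in [area]: ${changed.join(", ")} (${tests.length} test files)`);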


Phase 2: Specialized Lenses

Run the lenses as a parallel review swarm when the reviewed change is non-trivial: dispatch one subagent per lens (a reviewer, security, or tester agent, as appropriate) to cover correctness, security, coverage, contracts, and architecture, then synthesize the findings rather than majority-voting. For small diffs, review inline.

When dispatching review subagents, pass the implementer's tool-call trace as parentTrace so each reviewer audits what the implementer actually did, not just the prose summary the implementer would otherwise hand them. The reviewer's prompt then automatically inherits the verifier instructions (scan for hedge words, glossed-over tool errors, claims without Command/Output traces).

subagent({
  parentTrace: "<bullet summary of THIS turn's tool calls — files edited, " +
               "tests run with their pass/fail status, commands and their " +
               "exit codes, any error output the implementer narrated past>",
  tasks: [
    { agent: "reviewer", task: "Lens 1 — Bug Hunter. <full prompt>" },
    { agent: "security", task: "Lens 2 — Security Auditor. <full prompt>" },
    { agent: "tester",   task: "Lens 3 — Test Coverage. <full prompt>" },
    { agent: "reviewer", task: "Lens 4 — Contract Reviewer. <full prompt>" },
    { agent: "reviewer", task: "Lens 5 — Architecture. <full prompt>" }
  ]
})

The trace is the parent's own context summary — assemble it before the call. Don't include the implementer's plan or the original issue: the reviewer already has that. Include what the implementer did in this turn so the reviewer can compare action against claim.
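
As a sketch, an assembled trace might read like this (the file names and commands are illustrative; use whatever this turn's tool calls actually were):

parentTrace:
  "- Edited src/auth/session.ts (refresh-token rotation)\n" +
  "- Edited src/auth/session.test.ts (2 new tests)\n" +
  "- Ran npm test -- session: 14 passed, 0 failed\n" +
  "- Ran npm run typecheck: exit 0\n" +
  "- One earlier npm test run timed out; the rerun passed and was not investigated"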

Apply each lens in sequence. For each finding, record:

  • Location: file:line
  • Description: what the issue is
  • Confidence: 0–100 (how certain is this a real problem?)
  • Impact: 0–100 (how bad if it ships?)

Lens 1: Bug Hunter (Correctness)

  • Off-by-one errors, nil/null dereferences, integer overflow
  • Race conditions, uninitialized state, incorrect error handling
  • Logic inversions, missing edge cases (empty input, max values)
  • Unchecked return values, swallowed errors

Mandatory adversarial probe. Before this lens reports APPROVE, run at least one of: boundary value (0, -1, empty, very long, unicode, MAX_INT), idempotency (same call twice — duplicate? error? correct no-op?), concurrency (parallel callers on a create-if-not-exists path), orphan reference (input ID that doesn't exist). Record the command, the observed output, and the result — even when behaviour was correct. A Bug Hunter pass with zero adversarial probes is a happy-path confirmation, not review.
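
A minimal sketch of one such probe, here idempotency, assuming the change under review exposes something like a createIfMissing helper (the module and function names are hypothetical):

import { createIfMissing } from "./store"; // hypothetical module under review

// Idempotency probe: same call twice. Duplicate row? Error? Correct no-op?
async function probe() {
  const first = await createIfMissing("user-42");
  const second = await createIfMissing("user-42");
  // Record this output in the finding, even when the behaviour turns out to be correct.
  console.log({ first, second });
}

probe();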

Lens 2: Security Auditor

  • User input not validated at system boundaries
  • SQL not parameterized, template injection risks
  • Secrets or credentials in code or logs
  • Auth checks missing on sensitive paths
  • Unsafe deserialization, path traversal
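
An illustration of the kind of finding this lens targets, in hypothetical code rather than anything from the reviewed repo: the first query interpolates user input, the second parameterizes it.

type Db = { query: (sql: string, params?: unknown[]) => Promise<unknown[]> };

// Finding: user-supplied email interpolated straight into the SQL string (injection risk).
async function findUserUnsafe(db: Db, email: string) {
  return db.query(`SELECT * FROM users WHERE email = '${email}'`);
}

// Expected shape: parameterized query; the driver handles escaping.
async function findUserSafe(db: Db, email: string) {
  return db.query("SELECT * FROM users WHERE email = $1", [email]);
}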

Lens 3: Test Coverage Reviewer

  • Missing test for the changed behavior
  • Happy-path only — no error cases, no edge cases
  • Tests that mock everything (false confidence)
  • Test name doesn't describe what it proves
  • Missing integration test when components cross a boundary
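
A sketch of the difference this lens hunts for, using hypothetical vitest tests for a hypothetical syncUsers function: the first test only rubber-stamps the implementation, the second asserts the behaviour a consumer relies on.

import { describe, expect, it, vi } from "vitest"; // assumed test runner
import { syncUsers } from "./sync";                // hypothetical module under review

describe("syncUsers", () => {
  it("rubber-stamp: only proves the mock was called", async () => {
    const push = vi.fn(async () => {});
    await syncUsers(["a", "b"], push);
    expect(push).toHaveBeenCalledTimes(2); // asserts what the code does, not what the contract requires
  });

  it("contract: proves consumer-visible behaviour", async () => {
    const sent: string[] = [];
    await syncUsers(["a", "a", "b"], async (id: string) => { sent.push(id); });
    expect(sent).toEqual(["a", "b"]); // duplicates deduplicated: the behaviour callers depend on
  });
});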

Lens 4: Contract Reviewer

  • Function signature changed but callers not updated
  • Return type inconsistency (sometimes error, sometimes not)
  • Silent breaking change to an interface
  • Missing documentation of preconditions or invariants
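
A sketch of the silent contract break this lens looks for, with before/after versions side by side (names hypothetical):

interface Session { id: string; userId: string }

// Before: "not found" is a null return, and callers branch on it.
function findSessionV1(sessions: Map<string, Session>, id: string): Session | null {
  return sessions.get(id) ?? null;
}

// After: "not found" now throws. Old call sites still compile, but their `=== null`
// checks are dead and the failure surfaces somewhere else entirely: a silent breaking
// change to the contract unless every caller was updated.
function findSessionV2(sessions: Map<string, Session>, id: string): Session {
  const session = sessions.get(id);
  if (!session) throw new Error(`session ${id} not found`);
  return session;
}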

Lens 5: Architecture Reviewer

  • New coupling introduced between modules that shouldn't know about each other
  • Business logic leaking into HTTP handlers or storage layer
  • Duplicate logic that should be shared
  • Shallow module added where a deeper one was possible
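
A sketch of the kind of coupling this lens flags, business logic and storage living in an HTTP handler, with minimal hypothetical types standing in for the real framework:

type Req = { body: { items: { price: number }[] } };
type Res = { json: (body: unknown) => void };
type Db = { query: (sql: string, params: unknown[]) => Promise<unknown> };

// Smell: the handler owns the pricing rule and talks to storage directly,
// coupling business logic to the transport layer.
async function createOrderCoupled(req: Req, res: Res, db: Db) {
  const total = req.body.items.reduce((sum, i) => sum + i.price, 0);
  const discounted = total > 100 ? total * 0.9 : total;                   // business rule in the handler
  await db.query("INSERT INTO orders (total) VALUES ($1)", [discounted]); // storage call in the handler
  res.json({ total: discounted });
}

// Preferred shape: the handler only translates HTTP into a domain call.
async function createOrder(req: Req, res: Res, orders: { create(items: Req["body"]["items"]): Promise<unknown> }) {
  res.json(await orders.create(req.body.items));
}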

Phase 3: Score and Filter

After all lenses, produce a findings table:

| # | Lens | Location     | Description                  | Confidence | Impact | Severity |
|---|------|--------------|------------------------------|------------|--------|----------|
| 1 | Bug  | file:42      | nil pointer on empty result  | 90         | 85     | Critical |
| 2 | Test | file_test.go | no test for error path       | 95         | 60     | High     |
| 3 | Arch | service.go   | handler contains DB query    | 80         | 55     | High     |

Severity mapping:

  • Critical: Impact 81–100 (blocks merge)
  • High: Impact 61–80 (fix before merge)
  • Medium: Impact 41–60 (fix in follow-up)
  • Low: Impact < 40 (advisory only)

Filter: only report findings with Confidence ≥ 70. Below that, mention as "low-confidence observation" at the end.


Phase 4: Verdict

End with a clear verdict on its own line, in the literal form below — no markdown bold, no punctuation, no variation:

VERDICT: BLOCK

or VERDICT: APPROVE_WITH_FIXES or VERDICT: APPROVE.

Then the body:

Critical: N issues (list them)
High: N issues (list them)
Action required: [what must be fixed before merge]

Verdict criteria:
  • BLOCK: any Critical issue
  • APPROVE_WITH_FIXES: High issues only — can merge after fixes
  • APPROVE: Medium/Low only — can merge, fix in follow-up
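
For shape only, an illustrative verdict (the findings here are invented):

VERDICT: APPROVE_WITH_FIXES

Critical: 0 issues
High: 2 issues
  • store.ts:88: error from close() is swallowed, so a failed flush looks like success (Lens 1)
  • store.test.ts: no test exercises the flush error path (Lens 3)
Action required: propagate the close() error and add the error-path test before merge.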

Verdict discipline

  • APPROVE_WITH_FIXES is not a hedge. It exists for "I found issues that must be fixed before merge", not for "I'm not sure". If you ran the lenses and decided nothing must be fixed, the verdict is APPROVE. If something must be fixed, list it and the verdict is APPROVE_WITH_FIXES or BLOCK.
  • No verdict without traces. Each Critical or High finding must cite a file:line and the trace (command run, observed output, expected vs actual) that proves the issue is real, not theoretical. Findings without traces drop to Medium.
  • Don't downgrade an issue you can't reproduce. If you suspect a real bug but can't reproduce it, list it as a separate "low-confidence observation" with the falsifier you tried — don't quietly omit it.

Rules

  • Evidence required — every finding needs a file:line citation
  • No vague complaints — "this could be better" is not a finding
  • Confidence gates noise — if you're not 70%+ sure, say so explicitly
  • Don't restate the code — describe the risk, not what the line does
  • Critical means critical — don't inflate severity; save it for things that will cause real failures

Sources

  • NeoLabHQ/context-engineering-kit — multi-agent review, confidence+impact scoring
  • Trail of Bits — audit-context-building, anti-rationalization rules
  • Addy Osmani — 5-axis code review framework