Add 6 SF skills: pm-planning, codebase-analysis, architecture-planning, feature-gap-analysis, code-review, advisory-partner
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 6fc286a888, commit 254fba36c0
6 changed files with 1064 additions and 0 deletions

src/resources/extensions/sf/skills/advisory-partner/SKILL.md (new file, 111 lines)

---
name: advisory-partner
description: Framework for independent advisory review of plans and decisions. Runs as a separate task with the validation model — NOT self-review by the planning model. In SF, this is the framework used by gate-evaluate (Q3/Q4) and validate-milestone (MV01-MV04). Use when dispatching a subagent to review a plan before committing to it.
---

# Advisory Partner: Independent Review with a Different Model

This skill is for **independent review**, not self-review. It is meant to run as a separate agent dispatch using the `validation` model, giving a genuine second opinion on plans and decisions.

In SF, this pattern is already built into the pipeline:
- **`gate-evaluate`** runs Q3 (security abuse surface) and Q4 (broken promises) with the `validation` model before slice execution
- **`validate-milestone`** runs MV01-MV04 with the `validation` model after milestone execution
- Both use a different model from the planning/execution model — that's the point

Do NOT add this to `always_use_skills` — that would make the planning model self-review, which misses the point. The advisory value comes from a different model challenging the plan.

---

## When to Dispatch an Advisory Review

Use `subagent` to dispatch an advisory review (with the `validation` model) when:
- A milestone plan has high risk or novel architecture — dispatch advisory before `plan-milestone` commits
- A slice plan crosses multiple subsystems — dispatch advisory before `execute-task` starts
- A significant architectural decision needs challenge — dispatch advisory, then write the ADR
- The planning model is uncertain and needs a second opinion

---

## Advisory Review Protocol

When running as the advisory agent, apply this framework:

### 1. State the Decision Under Review

Read the artifact being reviewed (CONTEXT.md, ROADMAP.md, or slice plan). Summarize:
> "The plan proposes [X]. The core claim is [Y]. The alternative not taken is [Z]."

If you can't fill this in, the plan is incomplete.

---

### 2. Five Challenger Questions

Answer each from the artifact — not from general knowledge:

**Q1: What problem does this actually solve?**
Which specific struggling moment (TODO, FIXME, missing test, user complaint) does this plan address? If none, flag it.

**Q2: What assumptions are baked in?**
List ≥3 assumptions. For each: is it tested or untested? What happens if it's false?

**Q3: What's the failure mode in 6 months?**
Name the specific thing that breaks, not "it doesn't work." Who notices first?

**Q4: What's the simplest proof point?**
What's the smallest deliverable that would confirm the plan is on the right track? Does the plan include it early enough?

**Q5: What's the strongest objection?**
Write the objection. Then answer it. If the answer is weak, that's a flag.

---

### 3. Trap Scan

| Trap | Warning Sign |
|------|-------------|
| Shiny object | New tech without a diagnosed struggling moment |
| Scope creep | "While we're at it..." additions |
| Premature abstraction | Generic infrastructure with <3 real use cases |
| Missing consumer | Module with no real callers (only tests) |
| Axle not scooter | Infrastructure layer with no standalone demo value |

---

### 4. Verdict

```
ADVISORY VERDICT: [PROCEED / PROCEED WITH CAVEAT / RECONSIDER]

Solid: [what's well-grounded]
Gaps: [what needs resolution]
Action: [one sentence]
```

- **PROCEED**: plan is grounded, assumptions tested, clear proof point exists
- **PROCEED WITH CAVEAT**: one specific thing must be resolved first (state it)
- **RECONSIDER**: a core assumption is untested or a better approach exists (state it)

---

## Integration with SF Gate System

The existing SF gates are specific instances of this advisory framework:

| Gate | Question | Owner Turn | Model |
|------|----------|-----------|-------|
| Q3 | How can this be exploited? | `gate-evaluate` | `validation` |
| Q4 | What existing promises does this break? | `gate-evaluate` | `validation` |
| Q5 | What breaks when dependencies fail? | `execute-task` | `execution` |
| Q8 | How will ops know this is healthy? | `complete-slice` | `completion` |
| MV01-MV04 | Requirements coverage, integration, acceptance criteria | `validate-milestone` | `validation` |

If you want advisory review for planning decisions (not yet covered by gates), dispatch a `subagent` explicitly with the advisory-partner prompt and request that it use the validation model configuration.

---

## Sources

- Richard Rumelt, *Good Strategy/Bad Strategy* — diagnosis before policy
- Ryan Singer, *Shape Up* — proof points, appetite-based scoping
- Teresa Torres, *Continuous Discovery Habits* — assumption testing

src/resources/extensions/sf/skills/architecture-planning/SKILL.md (new file, 218 lines)

---
name: architecture-planning
description: Plan and document software architecture decisions. Use when designing systems, choosing between architectural approaches, refactoring modules, evaluating coupling patterns, or recording decisions in ADRs. Combines C4 model visualization, Architecture Decision Records, deep-module refactoring (Ousterhout), and Rumelt strategy structure. Use alongside codebase-analysis and pm-planning skills.
---

# Architecture Planning: Design Before You Build

Apply structured architectural thinking before implementing anything significant. The three non-negotiables: (1) understand what exists before proposing changes, (2) record every significant decision as an ADR, (3) design for depth, not breadth.

---

## Part 1: Architecture Mapping (C4 Model)

Visualize the system at the right level of abstraction before planning changes.

### Four levels (use the right one for your audience)

**Level 1 — System Context** (zoom out, 1 diagram)
Who uses the system? What external systems does it depend on?
```
[User] → [This System] → [External API]
                       → [Database]
                       → [Message Queue]
```

**Level 2 — Container** (zoom in, 1 diagram)
What are the deployable units? How do they communicate?
```
[Web App] → [API Server] → [PostgreSQL]
                         → [Redis Cache]
                         → [Worker Process]
```

**Level 3 — Component** (zoom in more, per-container)
What are the major logical components inside a container?
- Use only when a container is complex enough to need it

**Level 4 — Code** (zoom in furthest, per-component)
Classes, interfaces, functions — only when truly needed for communication

### C4 rules
- Every element needs: name, type, technology, one-line description
- Limit diagrams to 20 elements maximum (beyond that, split or zoom out)
- Show relationships as one-directional arrows with a verb label
- Write diagrams as Mermaid in `docs/architecture/`

### Mermaid template (Level 2 container)
```mermaid
graph LR
    user["User<br/>Web Browser"] -->|HTTPS| api["API Server<br/>Go/gin"]
    api -->|SQL| db[("PostgreSQL<br/>Data store")]
    api -->|pub/sub| queue["RabbitMQ<br/>Event bus"]
    queue -->|consume| worker["Worker<br/>Go service"]
    worker -->|SQL| db
```

---

## Part 2: Architecture Decision Records (ADRs)

Every significant architectural decision gets an ADR. "Significant" means: choosing a framework, data store, deployment approach, inter-service protocol, auth mechanism, or any decision you might second-guess in six months.

### When to write an ADR
- Choosing between two significant alternatives
- Deciding "we won't do X even though it seems obvious"
- Any decision you wouldn't want to re-debate in three months
- Any decision a new engineer would be confused by

### ADR format (Michael Nygard + ECC extensions)

Store in `docs/adr/ADR-NNNN-title-slug.md`:

```markdown
# ADR-NNNN: [Decision Title]

**Date**: YYYY-MM-DD
**Status**: proposed | accepted | deprecated | superseded by ADR-NNNN

## Context

[2-5 sentences: the situation, constraints, forces at play. What made this a decision worth recording?]

## Decision

[1-3 sentences: what are we doing?]

## Alternatives Considered

### Alternative: [Name]
- **Pros**: ...
- **Cons**: ...
- **Why not**: [specific reason rejected]

## Consequences

**Positive**: what gets easier
**Negative**: what gets harder or is sacrificed
**Risks**: what could go wrong and how we'll mitigate
```

### ADR anti-patterns
- Don't record trivial decisions ("we used a for-loop")
- Don't write ADRs without alternatives considered — otherwise it's just a log entry
- Don't let ADRs go stale — update the status when superseded
- Don't duplicate what's already in DECISIONS.md — keep them in sync (sf uses DECISIONS.md; ADRs go in docs/adr/ for human navigation)

---

## Part 3: Deep vs Shallow Modules (Ousterhout)

Before refactoring or adding abstractions, evaluate whether modules are deep or shallow.

**Deep module:** small interface, large implementation
- Interface complexity << implementation complexity
- Easy to test at the boundary
- AI-navigable (fewer entry points to understand)
- Example: `os.File` — simple open/read/write/close hides massive OS complexity

**Shallow module:** interface almost as complex as implementation
- Provides little abstraction value
- Forces callers to know too much
- Example: a "wrapper" that just delegates every method with no added logic
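
To make the contrast concrete, a minimal Go sketch with hypothetical names (`KVStore`, `mapWrapper`). The deep interface could be backed by a map, an LRU cache, or a disk store without callers changing; the shallow wrapper saves its callers nothing:

```go
package store

// Deep: three entry points hide storage, locking, and eviction policy.
// Interface complexity stays far below implementation complexity.
type KVStore interface {
	Get(key string) (value string, ok bool)
	Set(key, value string)
	Delete(key string)
}

// Shallow: every method mirrors the wrapped map one-to-one and adds no
// abstraction, so callers must still understand the inner representation.
type mapWrapper struct{ m map[string]string }

func (w *mapWrapper) MapGet(k string) (string, bool) { v, ok := w.m[k]; return v, ok }
func (w *mapWrapper) MapSet(k, v string)             { w.m[k] = v }
func (w *mapWrapper) MapDelete(k string)             { delete(w.m, k) }
func (w *mapWrapper) Inner() map[string]string       { return w.m } // leaks the implementation
```
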
### Detecting shallow modules
```bash
# Find files with many exported symbols but little internal logic
rg "^func |^type |^var |^const " --count | sort -t: -k2 -nr | head -20

# Find files where the interface is large relative to implementation
wc -l *.go | sort -n # small files that are heavily imported = shallow
```

### Refactoring to depth — parallel design process

When you've identified a shallow module or coupling problem:

1. **Write the constraint spec** — what must any new interface do? What dependencies does it have?
2. **Design 3+ radically different interfaces** (can be sub-agents in parallel):
   - Agent 1: "Minimize interface — aim for 1-3 entry points max"
   - Agent 2: "Maximize flexibility — support many use cases"
   - Agent 3: "Optimize for the most common caller — make the default trivial"
3. **Compare designs** — show trade-offs for each
4. **Pick or hybrid** — be opinionated; the goal is a strong recommendation, not a menu
5. **Document as ADR** — record what was chosen and why alternatives were rejected

### Dependency categories (what's being hidden)

| Category | Description | Design implication |
|---|---|---|
| **I/O** | File, network, database | Inject interface, mock in tests |
| **State** | In-memory, cache | Pass explicitly or use value objects |
| **Cross-cutting** | Logging, metrics, auth | Pass as context/middleware, not global |
| **Domain logic** | Business rules | Keep pure; no I/O dependencies |
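
A minimal sketch of the I/O row in practice, with hypothetical names (`ReportStore`, `Archiver`): the domain rule depends only on a narrow interface, so tests substitute an in-memory fake and never touch a database.

```go
package report

import "errors"

var ErrEmptyReport = errors.New("empty report")

// ReportStore is the narrow seam for the I/O dependency.
type ReportStore interface {
	Save(id string, body []byte) error
}

// Archiver keeps the domain rule pure; the store is injected.
type Archiver struct{ Store ReportStore }

func (a *Archiver) Archive(id string, body []byte) error {
	if len(body) == 0 {
		return ErrEmptyReport // business rule, testable with no database
	}
	return a.Store.Save(id, body)
}

// In tests, a map-backed fake stands in for the real store.
type fakeStore map[string][]byte

func (f fakeStore) Save(id string, b []byte) error { f[id] = b; return nil }
```

The same seam is what makes the testability checks in Part 4 pass by construction.
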
---

## Part 4: Architectural Integrity Checks

Run these before approving any milestone plan:

**Coupling check**
```bash
# Detect potential circular imports (Go)
go build ./... 2>&1 | grep "import cycle"

# Find most-depended-on files (high coupling risk)
rg "import|require|from" --count | sort -t: -k2 -rn | head -20
```

**Boundary violations**
- Does any module import from a module it shouldn't know about?
- Does any layer (e.g. HTTP handler) contain business logic?
- Does any data store module leak implementation details to callers?

**Abstraction level consistency**
- Is this module operating at a consistent level of abstraction throughout?
- Are there places where high-level flow is mixed with low-level implementation detail?

**Testability**
- Can this component be tested without starting the database?
- Can this component be tested without real network calls?
- Can this component be tested without the full application running?
- If the answer is "no" to any: it has a hidden dependency that needs to be injected

---

## Part 5: Architecture Memory in `.sf/`

After any architecture analysis or significant decision, update:

**`.sf/CODEBASE-ANALYSIS.md`** (from codebase-analysis skill)
- Architecture map (C4 Level 1-2 in text)
- Module coupling findings

**`.sf/DECISIONS.md`** (via `sf_decision_save` tool)
- All significant architectural decisions
- Automatically regenerated by the tool

**`docs/adr/`** (human-readable ADR trail)
- Detailed decision records with alternatives
- Links from DECISIONS.md entries

**`.sf/PM-STRATEGY.md`** (from pm-planning skill)
- Guiding Policies section: architectural principles that constrain future decisions

These four files together give any future agent (or engineer) a complete architectural picture without rereading the full codebase.

---

## Sources

- [softaworks/agent-toolkit](https://github.com/softaworks/agent-toolkit) — C4 architecture skill
- [affaan-m/everything-claude-code](https://github.com/affaan-m/everything-claude-code) — ADR skill (Michael Nygard format)
- [mattpocock/skills](https://github.com/mattpocock/skills) — improve-codebase-architecture, parallel interface design
- [trailofbits/skills](https://github.com/trailofbits/skills) — audit-context-building, phase structure
- John Ousterhout, *A Philosophy of Software Design* — deep vs shallow modules
- Richard Rumelt, *Good Strategy/Bad Strategy* — diagnosis and guiding policies
- Simon Brown — C4 model for visualizing software architecture

src/resources/extensions/sf/skills/code-review/SKILL.md (new file, 126 lines)

---
name: code-review
description: Multi-perspective code review before completing any slice or milestone. Launches specialized review lenses (correctness, security, test coverage, contracts, architecture) and scores issues by confidence and impact. Use before declaring a slice done, before merging a PR, or when asked to review code.
---

# Code Review: Multi-Lens, Evidence-Based

Run structured code review before declaring work complete. Uses specialized review angles and filters noise by scoring confidence and impact.

---

## When to Run

- Before marking a slice as done
- Before creating a PR
- When explicitly asked to review code
- When `verification-before-completion` flags uncertainty

---

## Phase 1: Scope

Identify what to review:
- Changed files (from `git diff --name-only HEAD~1` or the slice's scope)
- Entry points and integration boundaries
- Test files corresponding to changed code

Print a one-line scope summary: "Reviewing N files in [area]: [list]"

---

## Phase 2: Specialized Lenses

Apply each lens in sequence. For each finding, record:
- **Location**: file:line
- **Description**: what the issue is
- **Confidence**: 0–100 (how certain is this a real problem?)
- **Impact**: 0–100 (how bad if it ships?)

### Lens 1: Bug Hunter (Correctness)
- Off-by-one errors, nil/null dereferences, integer overflow
- Race conditions, uninitialized state, incorrect error handling
- Logic inversions, missing edge cases (empty input, max values)
- Unchecked return values, swallowed errors
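
A minimal Go sketch (hypothetical `findUser`) of two findings this lens typically produces, a swallowed error and the nil dereference it sets up:

```go
package example

import "database/sql"

type User struct{ Name string }

// findUser drops the error, so callers cannot tell "no such user"
// from "database down" (finding: swallowed error).
func findUser(db *sql.DB, id int) *User {
	u := &User{}
	if err := db.QueryRow("SELECT name FROM users WHERE id = ?", id).Scan(&u.Name); err != nil {
		return nil
	}
	return u
}

func greet(db *sql.DB, id int) string {
	// finding: nil dereference whenever the user is missing
	return "hello " + findUser(db, id).Name
}
```
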
### Lens 2: Security Auditor
- User input not validated at system boundaries
- SQL not parameterized, template injection risks
- Secrets or credentials in code or logs
- Auth checks missing on sensitive paths
- Unsafe deserialization, path traversal
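
A minimal Go sketch of the SQL check, with hypothetical names: the first query is injectable, the second is parameterized:

```go
package example

import (
	"database/sql"
	"fmt"
)

// finding: user input concatenated into SQL, classic injection
func searchUnsafe(db *sql.DB, title string) (*sql.Rows, error) {
	q := fmt.Sprintf("SELECT id FROM docs WHERE title = '%s'", title)
	return db.Query(q)
}

// fixed: parameterized query; the driver handles escaping
func searchSafe(db *sql.DB, title string) (*sql.Rows, error) {
	return db.Query("SELECT id FROM docs WHERE title = ?", title)
}
```
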
### Lens 3: Test Coverage Reviewer
- Missing test for the changed behavior
- Happy-path only — no error cases, no edge cases
- Tests that mock everything (false confidence)
- Test name doesn't describe what it proves
- Missing integration test when components cross a boundary

### Lens 4: Contract Reviewer
- Function signature changed but callers not updated
- Return type inconsistency (sometimes error, sometimes not)
- Silent breaking change to an interface
- Missing documentation of preconditions or invariants

### Lens 5: Architecture Reviewer
- New coupling introduced between modules that shouldn't know about each other
- Business logic leaking into HTTP handlers or storage layer
- Duplicate logic that should be shared
- Shallow module added where a deeper one was possible

---

## Phase 3: Score and Filter

After all lenses, produce a findings table:

| # | Lens | Location | Description | Confidence | Impact | Severity |
|---|------|----------|-------------|-----------|--------|----------|
| 1 | Bug | file:42 | nil pointer on empty result | 90 | 85 | Critical |
| 2 | Test | file_test.go | no test for error path | 95 | 60 | Medium |
| 3 | Arch | service.go | handler contains DB query | 80 | 55 | Medium |

Severity mapping:
- **Critical**: Impact 81–100 (blocks merge)
- **High**: Impact 61–80 (fix before merge)
- **Medium**: Impact 41–60 (fix in follow-up)
- **Low**: Impact 0–40 (advisory only)

Filter: only report findings with Confidence ≥ 70. Below that, mention as a "low-confidence observation" at the end.
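
A minimal Go sketch of the severity bands and the confidence filter above (the `Finding` type is hypothetical):

```go
package review

// Finding is one lens observation; both scores are 0-100.
type Finding struct {
	Lens, Location, Description string
	Confidence, Impact          int
}

// severity applies the impact bands from the severity mapping.
func severity(impact int) string {
	switch {
	case impact > 80:
		return "Critical"
	case impact > 60:
		return "High"
	case impact > 40:
		return "Medium"
	default:
		return "Low"
	}
}

// filter keeps actionable findings (Confidence >= 70) and sets the
// rest aside as low-confidence observations.
func filter(all []Finding) (actionable, observations []Finding) {
	for _, f := range all {
		if f.Confidence >= 70 {
			actionable = append(actionable, f)
		} else {
			observations = append(observations, f)
		}
	}
	return
}
```
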
---

## Phase 4: Verdict

End with a clear verdict:

```
VERDICT: [BLOCK / APPROVE WITH FIXES / APPROVE]

Critical: N issues (list them)
High: N issues (list them)
Action required: [what must be fixed before merge]
```

- **BLOCK**: any Critical issue
- **APPROVE WITH FIXES**: High issues only — can merge after fixes
- **APPROVE**: Medium/Low only — can merge, fix in follow-up

---

## Rules

- **Evidence required** — every finding needs a file:line citation
- **No vague complaints** — "this could be better" is not a finding
- **Confidence gates noise** — if you're not 70%+ sure, say so explicitly
- **Don't restate the code** — describe the risk, not what the line does
- **Critical means critical** — don't inflate severity; save it for things that will cause real failures

---

## Sources

- NeoLabHQ/context-engineering-kit — multi-agent review, confidence+impact scoring
- Trail of Bits — audit-context-building, anti-rationalization rules
- Addy Osmani — 5-axis code review framework

src/resources/extensions/sf/skills/codebase-analysis/SKILL.md (new file, 217 lines)

---
name: codebase-analysis
description: Deep, structured codebase analysis before planning or implementing anything. Use at the start of any autonomous session, research phase, or when analyzing an unfamiliar codebase. Produces a reliable mental model — architecture map, technical debt inventory, test coverage gaps, and struggling moments — that feeds directly into milestone planning. Based on Trail of Bits audit-context-building, Anthropic tech-debt, and Addy Osmani code-review frameworks.
---

# Codebase Analysis: Build Understanding Before Planning

Produce a deep, evidence-based understanding of the codebase before any planning or implementation. This skill prevents the most common autonomous agent failure: planning based on assumptions rather than what's actually in the code.

**Core principle:** Slow is fast. Every hour of thorough analysis prevents days of building the wrong thing.

---

## Phase 1: Orientation (Bottom-Up Mapping)

Start with a minimal structural map. Do NOT assume behavior from file names alone.

```bash
rg --files | head -100                # what's here
find . \( -name "*.go" -o -name "*.ts" -o -name "*.rs" \) | head -50
ls -la                                # top-level structure
cat go.mod package.json Cargo.toml 2>/dev/null   # dependencies and versions (whichever exists)
git log --oneline -20                 # recent history
git log --all --oneline --stat --since="30 days ago" | head -50   # recent churn
```

Build a preliminary map:
1. **Modules / packages** — major units of organization
2. **Entry points** — main(), HTTP handlers, CLI commands, exported APIs
3. **Actors** — who calls what (users, crons, external services, other services)
4. **Storage** — databases, files, caches, queues touched
5. **External dependencies** — APIs, third-party services called
6. **Build / test / deploy** — how it's built, how tests run, how it's deployed

Do NOT proceed to Phase 2 until this map exists in writing.

---

## Phase 2: Ultra-Granular Function Analysis

For every non-trivial function on the critical path, apply full micro-analysis.

### Per-function checklist

For each function:

**Purpose** — why it exists, its role in the system

**Inputs & Assumptions**
- Parameters and their types
- Implicit inputs: global state, environment variables, context
- Preconditions: what must be true for this to work?
- What happens if a precondition is violated?

**Outputs & Effects**
- Return value and its meaning
- State/storage mutations
- Side effects: logs, metrics, events emitted, external calls

**Invariants** (≥3 per function)
- What is always true before this runs?
- What is always true after this runs?
- What must never happen inside this function?

**Assumptions** (≥3 per function)
- What does this code assume about its callers?
- What does it assume about external systems?
- What would break if these assumptions were wrong?

**Risk considerations** (≥3 per function)
- What happens under concurrent access?
- What happens if an external call fails partway through?
- What input values would cause unexpected behavior?
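
What the write-down can look like in practice, sketched on a hypothetical `Withdraw`: invariants and assumptions recorded next to the code they constrain.

```go
package ledger

import "errors"

var ErrInsufficientFunds = errors.New("insufficient funds")

type Account struct{ balance int64 } // balance in cents

// Withdraw is the single mutation point for balance decreases.
//
// Preconditions: amount > 0; caller holds the account's lock.
// Invariants: balance never goes negative; balance is unchanged on any
// error; no I/O happens inside this function.
// Assumptions: amount fits in int64; callers treat the returned error
// as authoritative rather than retrying blindly.
func (a *Account) Withdraw(amount int64) error {
	if amount <= 0 {
		return errors.New("amount must be positive")
	}
	if a.balance < amount {
		return ErrInsufficientFunds
	}
	a.balance -= amount
	return nil
}
```
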
### Anti-rationalization rules (from Trail of Bits)

| Temptation | Why it's wrong | Required action |
|---|---|---|
| "I get the gist" | Gist-level misses edge cases | Line-by-line anyway |
| "This function is simple" | Simple functions compose into complex bugs | Apply analysis anyway |
| "I'll remember this invariant" | You won't — context degrades | Write it down |
| "External call is probably fine" | External = adversarial until proven | Trace the full call chain |
| "I can skip this helper" | Helpers carry assumptions that propagate | Trace it |
| "This is taking too long" | Rushed context = wrong plans | Slow is fast |

---

## Phase 3: Technical Debt Inventory

Scan for struggling moments encoded in the codebase. Classify by type and score with the priority formula.

### Six debt categories (from Anthropic tech-debt skill)

| Type | What to scan for | Risk if ignored |
|---|---|---|
| **Code debt** | `TODO/FIXME/HACK/XXX`, duplicated logic, magic numbers, functions >50 lines, cyclomatic complexity >10 | Bugs, slow feature work |
| **Architecture debt** | God objects, circular dependencies, wrong data store (file used as DB), tight coupling across boundaries | Scaling limits, impossible to test |
| **Test debt** | Missing test files on critical paths, `// TODO test this`, test files with only happy paths, no integration tests | Regressions ship undetected |
| **Dependency debt** | `go.sum`/`package-lock.json` outdated packages, CVE-flagged versions, unmaintained libraries (last commit >2 years) | Security vulnerabilities |
| **Documentation debt** | Missing README, no runbook, tribal knowledge in comments, outdated architecture docs | Onboarding failure, incident recovery pain |
| **Infrastructure debt** | No metrics, no structured logging, manual deploy steps, no health checks, no graceful shutdown | Silent failures, 3am incidents |

### Priority formula

```
Priority = (Impact + Risk) × (6 - Effort)

Impact: 1-5 (how much does it slow development?)
Risk: 1-5 (what happens if we don't fix it? data loss=5, UX glitch=1)
Effort: 1-5 (how hard? hours=1, months=5)
```

High-priority items (score >20) are milestone candidates. Low-priority items (score <10) are lead bullets.
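
Worked example: missing error recovery on a write path (Impact 4, Risk 5, Effort 2) scores (4+5) × (6-2) = 36, a clear milestone candidate; a style cleanup (Impact 1, Risk 1, Effort 1) scores (1+1) × (6-1) = 10, a lead bullet. As a one-liner:

```go
package debt

// Priority implements (Impact + Risk) x (6 - Effort); all inputs are 1-5.
func Priority(impact, risk, effort int) int {
	return (impact + risk) * (6 - effort)
}
```
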
---

## Phase 4: Test Coverage Analysis

Run the test suite. Measure. Don't assume.

```bash
# pick whichever command matches the project:
go test ./... -cover 2>&1 | tee test-output.txt
cargo test 2>&1 | tee test-output.txt
npm test -- --coverage 2>&1 | tee test-output.txt
pytest --cov=. --cov-report=term-missing 2>&1 | tee test-output.txt
```

Identify:
- **Critical paths with zero test coverage** — cannonball debt
- **Flaky tests** (tests that sometimes fail) — reliability debt
- **Happy-path-only tests** (no error cases, no edge cases) — false confidence
- **Missing integration tests** — components tested in isolation but never together
- **Test files that mock everything** — tests that don't catch real failures

---

## Phase 5: Code Quality Scan (5-axis review, from Addy Osmani)

Apply to the most-changed files (`git log --stat` to find them):

**1. Correctness**
- Edge cases handled? (null, empty, boundary values)
- Error paths handled beyond the happy path?
- Off-by-one errors, race conditions, state inconsistencies?

**2. Readability & Simplicity**
- Names descriptive and consistent?
- Control flow straightforward (no nested ternaries, deep callbacks)?
- Could this be done in 10x fewer lines? (1000 lines where 100 suffice is a failure)
- Are abstractions earning their complexity? (don't generalize until the 3rd use case)
- Dead code: no-op variables, backwards-compat shims, `// removed` comments?

**3. Architecture**
- Clean module boundaries?
- Circular dependencies? (`rg "import.*moduleA" moduleA/` to detect)
- Code duplication that should be shared?
- Dependencies flowing in the right direction?
- **Deep vs shallow modules?** (Ousterhout: shallow = interface as complex as implementation = bad)

**4. Security**
- User input validated at boundaries?
- Secrets out of code and logs?
- Auth checked where needed?
- SQL parameterized?
- External data treated as untrusted?

**5. Performance**
- N+1 query patterns?
- Unbounded loops or unconstrained fetches?
- Blocking operations that should be async?
- Missing pagination on list endpoints?

---

## Output: Codebase Analysis Report

Write findings to `.sf/CODEBASE-ANALYSIS.md`:

```markdown
# Codebase Analysis

_Date: YYYY-MM-DD | Analyzed by: <unit-type>_

## Architecture Map
<!-- Module diagram, entry points, actors, storage, external deps -->

## Critical Path Summary
<!-- The 3-5 most important flows in the system -->

## Technical Debt Inventory
| Item | Type | Impact | Risk | Effort | Score | Milestone candidate? |
|---|---|---|---|---|---|---|

## Test Coverage Gaps
<!-- Critical paths with no tests, flaky tests, false-confidence tests -->

## Code Quality Findings
<!-- Top issues by 5-axis review, with file:line references -->

## Struggling Moments
<!-- TODO/FIXME/HACK count by file, most critical entries verbatim -->

## Diagnosis
<!-- One paragraph: what is the core challenge this codebase faces? -->
```

This file feeds directly into PM-STRATEGY.md (opportunity map) and milestone CONTEXT.md (diagnosis section). Write it before planning any milestones.

---

## Sources

- [trailofbits/skills](https://github.com/trailofbits/skills) — audit-context-building, ultra-granular analysis phases
- [anthropics/knowledge-work-plugins](https://github.com/anthropics/knowledge-work-plugins) — tech-debt categories and priority formula
- [addyosmani/agent-skills](https://github.com/addyosmani/agent-skills) — five-axis code review framework
- [OthmanAdi/codebase-knowledge-builder](https://github.com/OthmanAdi/codebase-knowledge-builder) — four-phase methodology
- John Ousterhout, *A Philosophy of Software Design* — deep vs shallow modules

src/resources/extensions/sf/skills/feature-gap-analysis/SKILL.md (new file, 140 lines)

---
name: feature-gap-analysis
description: Compare a vision or spec against the actual codebase. Extracts a feature map from the vision, scans the codebase for each feature, and produces a prioritized gap list. Use at the start of milestone planning when you have a vision document but are unsure what's already built.
---

# Feature Gap Analysis: Vision → Feature Map → Codebase Diff

When you have a vision, spec, or existing `PROJECT.md` / `CONTEXT.md`, run this skill before planning milestones. It prevents building what's already there and surfaces the real gaps.

---

## Step 1: Extract the Feature Map

Read the vision inputs (in priority order, use all that exist):
- `.sf/PROJECT.md` — high-level project description
- `.sf/milestones/*/CONTEXT.md` — milestone intent
- `README.md` / `VISION.md` — top-level product description
- User-provided spec document

For each input, extract a flat list of **capabilities** the vision describes. A capability is a user-visible behavior or system property, not an implementation detail.

Format as a table:

| # | Capability | Source | Category |
|---|-----------|--------|----------|
| 1 | Users can log in with email+password | README.md | Auth |
| 2 | DR agent runs as a Windows service | VISION.md | Agent |
| 3 | Portal shows restore job status | PROJECT.md | UI |

Categories help with grouping: Auth, API, UI, Agent, Storage, Observability, Testing, Ops.

Keep entries atomic — one testable behavior per row. If a sentence describes 3 things, make 3 rows.

---

## Step 2: Scan the Codebase

For each capability, find evidence of implementation. Use `rg`, `find`, and file reads.

Scan strategy per category:
- **Auth**: look for login handlers, session/token code, middleware
- **API**: look for route definitions, handler files
- **UI**: look for page components, templates, routes
- **Agent**: look for service registration, main loop, config parsing
- **Storage**: look for DB schema, migration files, model structs
- **Observability**: look for metrics, structured logging, health endpoints
- **Testing**: look for test files, coverage data

Classify each capability:

| Status | Meaning |
|--------|---------|
| **Implemented** | Code exists, tests exist, it works |
| **Partial** | Code skeleton exists but incomplete — missing tests, edge cases, or integration |
| **Missing** | No evidence in codebase |
| **Unclear** | Code exists but purpose is ambiguous — needs investigation |

Quick scan commands:
```bash
# Find by keyword
rg -l "<feature keyword>" --type go

# Find test files for a component
find . -name "*_test.go" | xargs grep -l "<component>"

# Check migration coverage
ls migrations/

# Find handler registrations
rg "router\.|mux\.|r\.(GET|POST)" --type go | head -30
```

---

## Step 3: Produce the Gap List

Build the combined table:

| # | Capability | Status | Evidence | Gap Notes |
|---|-----------|--------|----------|-----------|
| 1 | Email+password login | Implemented | portal/auth/login.go | — |
| 2 | DR agent Windows service | Partial | dr-agent/main.go | No service installer, no graceful stop |
| 3 | Portal restore job status | Missing | — | No UI component, no API endpoint |

Then produce a prioritized gap list — **only Missing and Partial** entries, scored by impact:

```
HIGH (blocks core value prop):
- [capability] — [what's missing]

MEDIUM (degrades experience but workaround exists):
- [capability] — [what's missing]

LOW (polish, nice-to-have):
- [capability] — [what's missing]
```

Score impact by asking: does this gap prevent the product from being usable at all (HIGH), does it leave the experience poor but functional (MEDIUM), or is it an enhancement (LOW)?

---

## Step 4: Update PM Memory

Append findings to `.sf/PM-STRATEGY.md` under an "## Feature Gap Analysis" section:

```markdown
## Feature Gap Analysis

_Date: YYYY-MM-DD_

### Feature Map Summary
- Total capabilities identified: N
- Implemented: N
- Partial: N
- Missing: N

### High-Priority Gaps
- [capability]: [gap description]

### Recommendation
[1-2 sentences: which gaps should become the next milestone(s)]
```

---

## Rules

- **One capability per row** — never bundle multiple behaviors into one entry
- **Evidence required for Implemented** — "code exists" is not enough without a test
- **Partial ≠ Done** — if tests are missing or integration is broken, it's Partial
- **Don't scan every file** — target likely locations by category, not a full crawl
- **Gap list feeds milestone planning** — HIGH gaps are cannonball candidates for the next milestone

---

## Sources

- Teresa Torres, *Continuous Discovery Habits* — opportunity mapping, feature inventory
- Bob Moesta — struggling moments as requirements encoded in code
- Ryan Singer — appetite-based scoping, scope cutting discipline

src/resources/extensions/sf/skills/pm-planning/SKILL.md (new file, 252 lines)

---
name: pm-planning
description: Apply product management thinking when analyzing a codebase to plan milestones. Use during autonomous planning (sf auto, discuss-headless) to discover what needs to be built, prioritize work, and define done. Synthesizes Working Backwards, JTBD, Opportunity-Solution Tree, RICE prioritization, and scoping/cutting frameworks adapted for software development planning.
---

# PM Planning: Autonomous Codebase-to-Roadmap Thinking

Apply rigorous product management frameworks when analyzing a codebase autonomously to decide what to build next. These frameworks were designed for product managers at Stripe, Figma, Duolingo, Airbnb — adapt them for software planning from evidence in code.

---

## Phase 1: Diagnose Before You Plan

### The Rumelt Framework (Technical Diagnosis)
Structure your analysis as:
1. **Diagnosis** — What is the core challenge? What is broken, incomplete, or risky?
2. **Guiding Policies** — What principles will govern decisions? (e.g., "boring tech over novelty", "test coverage before features")
3. **Actions** — Specific milestones and slices that follow from the policies

Never start planning without completing the diagnosis. A plan without a diagnosis is just a wish list.

### Struggling Moments = Demand
Bob Moesta: *"A struggling moment causes demand. Nobody creates a product and then demand follows — the struggling moment exists first."*

When analyzing code, look for struggling moments encoded in the codebase:
- `TODO`, `FIXME`, `HACK`, `XXX` comments — places devs knew something was wrong
- Error handling with `panic`, `log.Fatal`, or bare `catch (e) {}` — brittle paths
- Missing test files or `// test this` comments — unvalidated logic
- Hardcoded values, magic numbers, missing configuration — future pain
- Dead code, commented-out features — abandoned attempts
- Long functions, high cyclomatic complexity — cognitive overload

These are your users' (developers') struggling moments. They are the real requirements.

---

## Phase 2: Working Backwards (Define Done First)

Before writing a single slice, answer: **What does the finished milestone look like?**

Write a mental press release for each milestone:
- **Headline:** What is now true that wasn't before?
- **Customer benefit:** Who benefits and how? (for internal tools: who uses this and what can they now do?)
- **Before/After:** What was the experience before? What is it after?
- **Evidence of done:** What test, metric, or demo proves it?

Ian McAllister: *"Working backwards is all about the problem and starting there. Teams that do it wrong don't work backwards — they have something they want to build."*

Apply to milestones: don't plan implementation until you can answer "why does this exist?" and "what proves it's complete?"

---

## Phase 3: Opportunity-Solution Tree (Problem Space First)

Teresa Torres framework — always discover the problem space before the solution space.

```
Desired Outcome (what changes for the better)
│
├── Opportunity 1 (problem/gap discovered in codebase)
│   ├── Solution A
│   └── Solution B
│
├── Opportunity 2 (problem/gap)
│   ├── Solution A
│   └── Solution B
│
└── Opportunity 3 (problem/gap)
    └── Solution A
```

**In codebase analysis, opportunities come from:**
- Missing test coverage on critical paths
- Error handling gaps (unhandled edge cases)
- Performance bottlenecks (slow queries, N+1 patterns)
- Security gaps (unvalidated inputs, missing auth checks)
- Missing observability (no metrics, no structured logging)
- Incomplete features (half-built flows, stub implementations)
- Integration failures (APIs called but not validated)
- Documentation gaps that block onboarding

**Anti-patterns:**
- Do NOT start with solutions ("we need a cache layer") — start with problems ("response times degrade under load")
- Do NOT confuse tech debt with opportunity — ask whether fixing it changes the outcome
- Do NOT skip the tree — going straight from diagnosis to implementation skips the problem/solution mapping

---

## Phase 4: JTBD Analysis — Who Is This For?

Even for internal/technical software, apply Jobs-to-be-Done:

**Functional jobs** (what task must be accomplished):
- "Deploy reliably without fear of breaking production"
- "Understand what the system is doing when something goes wrong"
- "Add a new feature without breaking existing ones"
- "Onboard a new engineer in less than a day"

**Social jobs** (how stakeholders want to be perceived):
- "Demonstrate to leadership that the system is stable and measurable"
- "Earn trust from the ops team that deployments are safe"

**Emotional jobs** (what anxiety is removed):
- "Not be woken up at 3am by a silent failure"
- "Feel confident that the test suite catches regressions"
- "Trust that the data in dashboards is accurate"

**Map discovered gaps to jobs.** A gap that addresses no job is low priority. A gap that addresses multiple intense jobs is the highest priority.

---

## Phase 5: Prioritization — Cannonballs vs Lead Bullets

Adriel Frederick: *"Have some cannonballs — high-investment, high-impact bets — and lead bullets — incremental improvements. 80% energy on cannonballs, 20% on lead bullets."*

Score opportunities with RICE:
- **Reach** — how many users/paths/features affected
- **Impact** — severity if unaddressed (data loss > UX glitch)
- **Confidence** — how certain are we this matters (evidence in code = high confidence)
- **Effort** — estimated slice count to address
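
The skill lists the factors without the combining rule; assuming the standard RICE combination, score = (Reach × Impact × Confidence) / Effort, a minimal sketch:

```go
package plan

// Score assumes the standard RICE combination: higher means more urgent.
// Reach and Impact use whatever consistent scale you pick (e.g. 1-5),
// Confidence is 0.0-1.0, Effort is the estimated slice count.
func Score(reach, impact, confidence, effort float64) float64 {
	return reach * impact * confidence / effort
}
```
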
**Cannonballs** in codebase terms:
- No test coverage on the main business flow
- No error recovery in critical path (data loss risk)
- Missing observability (blind operations)
- Core feature incomplete (can't demo the product)

**Lead bullets:**
- Code style inconsistencies
- Minor performance improvements
- Documentation completeness
- Nice-to-have features

**Prioritize cannonballs first. They have asymmetric upside.**

---

## Phase 6: MVP Thinking — Build the Scooter

Eeke de Milliano: *"If you're building an MVP for a car, don't build just the axle — build a scooter. A scooter is a complete, functional, smaller value proposition."*

Each milestone should be a scooter: **a complete, functional, demonstrable thing** — not a half-finished larger thing.

Apply when defining slice boundaries:
- Does this slice deliver something runnable/testable/demoable by itself?
- If we stopped after this slice, would we have something of value?
- Is the slice end-to-end (vertical slice) or horizontal infrastructure with no immediate user value?

**Appetite-based scoping** (Ryan Singer):
- Fix the time budget first: "This milestone should complete in 2-3 weeks of AI work"
- Vary scope to fit the appetite — never extend deadlines, cut scope
- If a feature can't be scoped to fit, it's probably two milestones

**Cut aggressively:**
- Remove anything that isn't in the critical path to done
- Remove anything where manual testing can substitute for automation (for now)
- Remove anything the job doesn't require

---

## Phase 7: Technical Strategy Principles

Will Larson: *"A common strategy that's really good but boring: we only use the tools we already have. Engineers want to introduce new languages and databases. A great strategy for most companies is: use the standard kit."*

When planning milestones:
- Prefer fixing existing code over introducing new abstractions
- Prefer tests over documentation (tests are executable documentation)
- Prefer removing code over adding code when the job can be done either way
- Prefer simple over clever — if you can't explain it in one sentence, it's too complex

---

## Synthesis: What to Put in CONTEXT.md

A strong CONTEXT.md answers all of these:
1. **Diagnosis** — What is the core challenge this milestone addresses?
2. **Desired outcome** — What is different after this milestone? Who benefits?
3. **Evidence** — What in the codebase (struggling moments, test failures, TODOs) confirms this is real?
4. **Job addressed** — Whose functional/emotional job does this serve?
5. **Scooter definition** — What is the complete, demonstrable end state?
6. **Appetite** — Rough estimate (slices, sessions)
7. **Cannonball or lead bullet?** — Is this high-leverage or incremental?
8. **Assumptions** — What did we decide autonomously and why?

---

## Anti-Patterns to Avoid

- **Shiny object trap** (Marily Nika): Don't plan AI features, new frameworks, or architectural rewrites unless there's a diagnosed struggling moment that demands it
- **Solution-first** (Ian McAllister): Don't start with "we should add X" — start with "users struggle with Y"
- **All incremental, no cannonballs** (Jackie Bavaro): A roadmap that's only bug fixes and tech debt never moves the product forward
- **Feature factory** (Teresa Torres): Planning without problem discovery — shipping things without evidence they address real pain
- **Never killing scope**: If scope grows in investigation, cut elsewhere — appetite is fixed
- **Unwritten strategy** (Will Larson): Plans that exist only in prompts can't be debugged. Write CONTEXT.md before planning slices.

---

## Persistent Memory: `.sf/PM-STRATEGY.md`

After each planning session (bootstrap, discuss-milestone, research-milestone), write or update `.sf/PM-STRATEGY.md` with your PM analysis. This file is the project's product strategy memory — it persists across all sessions and milestones.

**Write after every planning unit.** Future agents read this to understand what's been decided strategically and why.

### File format

```markdown
# Product Strategy

_Last updated: YYYY-MM-DD by <unit-type>_

## Diagnosis
<!-- Rumelt: what is the core challenge? Updated as new struggles discovered. -->

## Opportunity Map
<!-- OST: top opportunities with evidence, RICE scores, cannonball/lead-bullet classification -->
| Opportunity | Evidence | Reach | Impact | Confidence | Effort | Score | Class |
|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | ... | ... | ... | cannonball/lead-bullet |

## Jobs Analysis
<!-- Whose functional/emotional jobs does this product serve? What anxieties are removed? -->

## Guiding Policies
<!-- Rumelt: principles governing decisions. e.g. "tests before features", "no new dependencies without diagnosis" -->

## Strategic Decisions
<!-- What was decided and why — supplement to DECISIONS.md with product-level rationale -->

## What Was Deferred and Why
<!-- Opportunities/features explicitly scoped out. Prevents relitigating the same decisions. -->

## Milestone Sequencing Rationale
<!-- Why milestones are in this order — what job each unlocks -->
```

**Rules:**
- Append new findings, don't overwrite old ones — this is a running log
- Always update the Opportunity Map when new struggling moments are discovered
- Always update Guiding Policies when a new principle emerges from analysis
- Mark deferred items explicitly — "out of scope for now" is valuable signal

---

## Sources

Synthesized from:
- [lenny-skills (RefoundAI)](https://github.com/RefoundAI/lenny-skills) — 86 PM skills from Lenny's Podcast
- [Product-Manager-Skills (deanpeters)](https://github.com/deanpeters/Product-Manager-Skills) — 47 skills, Teresa Torres, Geoffrey Moore frameworks
- Teresa Torres, *Continuous Discovery Habits* — Opportunity Solution Tree
- Richard Rumelt, *Good Strategy/Bad Strategy* — Diagnosis/Policies/Actions
- Bob Moesta, *Demand-Side Sales* — Struggling moments and JTBD
- Ryan Singer, *Shape Up* — Appetite-based scoping