diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 007e517d3..7680d868e 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -50,7 +50,7 @@ Singularity Forge (SF) is an autonomous agent orchestration system. It runs long .sf/worktrees/ — git worktree working directories .sf/auto.lock — crash detection sentinel .sf/metrics.json — token/cost accumulator -.sf/sf.db* — SQLite cache (rebuilt from markdown by importers) +.sf/sf.db* — SQLite canonical structured state, priority order, validation/gate state, and UOK ledgers .sf/STATE.md — derived state cache .sf/notifications.jsonl, .sf/routing-history.json, .sf/self-feedback.jsonl, .sf/repo-meta.json ``` @@ -72,10 +72,89 @@ The symlink case uses a blanket `.sf` gitignore pattern (git cannot traverse sym **Write gate** (`bootstrap/write-gate.ts`): All file writes in autonomous mode pass through a gate. Protected files (CLAUDE.md, CODEBASE.md, certain spec files) require explicit override. +## UOK Dispatch State Machine (Five-Phase Loop) + +UOK orchestrates work through a deterministic five-phase state machine: + +``` +PhaseDiscuss → PhasePlan → PhaseExecute → PhaseMerge → PhaseComplete + ↓ ↓ ↓ ↓ ↓ + (discuss) (plan) (execute) (merge) (finalize) + ↓ ↓ ↓ ↓ ↓ + gates gates gates gates validation + ↓ ↓ ↓ ↓ ↓ + (continue or remediate) +``` + +**Phase details:** + +| Phase | Purpose | Exit Conditions | Failure Path | +|-------|---------|-----------------|--------------| +| **PhaseDiscuss** | Gather project context, requirements, scope | Gates pass (discussion-close gate) | Loop back for more context or escalate | +| **PhasePlan** | Create milestone/slice plans with success criteria | Gates pass (planning-approval gate) | Add remediation slices or replan | +| **PhaseExecute** | Implement tasks through the dispatch sequence | Gates pass (code-quality, test gates) | Isolate failed task, add recovery slices | +| **PhaseMerge** | Integrate slices, run end-to-end tests, merge branches | Gates pass (integration gate) | Add 
integration-fix slices, retry |
+| **PhaseComplete** | Final validation, audit trail, summary, gate completion | Validation passes (acceptance gate) | Add remediation milestone or escalate |
+
+**Error recovery:**
+
+- If a gate fails, UOK records the verdict and routes through phase-specific handlers
+- Failed gates can trigger automatic remediation slices (new plan → execute loop)
+- Stuck-loop detection: if the same unit repeats without progress after N attempts, invoke recovery protocol (timeout, manual review, or skip)
+- Crash recovery: `.sf/auto.lock` sentinel + `sf.db` WAL enable recovery from an agent crash mid-phase
+
+## Gate Verdict Semantics
+
+Gates run in parallel; each returns one of three verdicts:
+
+| Verdict | Meaning | Next Action |
+|---------|---------|-------------|
+| **passed** | Gate question answerable; no concern blocking this phase | Proceed to next phase |
+| **failed** | Gate question answerable; concern blocks phase progression | Record failure, optionally add remediation slice(s) |
+| **omitted** | Gate question not applicable to this unit (e.g., no auth work → auth gate omitted) | Proceed (gate doesn't apply) |
+
+**Critical rule:** `omitted` must have a one-line reason (e.g., "no auth surface"). Unexplained omitted verdicts are treated as failures and re-dispatched with explicit instruction to pick `passed` or `failed`.
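The verdict rules above can be sketched as a small normalization step. This is illustrative only — the type and function names are not the actual gate-runner API:

```typescript
// Illustrative types — the real gate-runner lives elsewhere in the codebase.
type GateVerdict = "passed" | "failed" | "omitted";

interface GateResult {
  gate: string;
  verdict: GateVerdict;
  reason?: string; // required when verdict is "omitted"
}

// Critical rule: an unexplained "omitted" is treated as "failed" and flagged
// for re-dispatch with an explicit passed/failed instruction.
function normalizeVerdict(result: GateResult): GateResult & { redispatch: boolean } {
  if (result.verdict === "omitted" && !result.reason?.trim()) {
    return { ...result, verdict: "failed", redispatch: true };
  }
  return { ...result, redispatch: false };
}

// A phase proceeds only when no normalized verdict is "failed".
function phaseMayProceed(results: GateResult[]): boolean {
  return results.map(normalizeVerdict).every((r) => r.verdict !== "failed");
}
```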
+
+## Outcome Learning for Model Selection
+
+UOK tracks model success/failure per task-type using Bayesian updating:
+
+```
+P(model_i succeeds | task_type) = (successes + prior_weight) / (total_trials + prior_weight)
+```
+
+**Mechanism:**
+
+- After each task completes, UOK logs: `{ model, task_type, succeeded: bool, latency_ms, tokens }`
+- Model scores updated dynamically; different models get different confidence per phase/task
+- Prior weights prevent early abandonment (new models get the benefit of the doubt)
+- Used by `benchmark-selector.ts` to route future similar tasks to higher-scoring models
+
+**Current limitation:** Learning updates episodically (per-task), not continuously; recovery paths don't feed learning back.
+
+## Self-Evolution Mechanisms
+
+### Self-Report Collection
+Agents and gates file `sf_self_report` with anomalies during dispatch:
+- Example: "validation-reviewer prompt lacks explicit rubric for criterion vs. implementation gap"
+- Reports captured in `upstream-feedback.jsonl` and `.sf/SELF-FEEDBACK.md` (when dogfooding)
+- **Status:** Collection works; triage pipeline incomplete (reports not automatically processed into fixes)
+
+### Knowledge Compounding
+`KNOWLEDGE.md` stores judgment-log entries from completed slices:
+- Format: `[when] [what] [why] [confidence]` (e.g., "Python 3.12 incompatible with X library — avoid for now")
+- Persists across milestones; can be injected into future dispatch prompts
+- **Status:** Storage works; injection not automatic (requires manual configuration)
+
+### Gate-Based Pattern Detection
+Gates can detect and report repeated failure patterns (e.g., "same requirement-validation failure in S01 and S03")
+- **Status:** Logic exists per gate; no automatic aggregation across gates
+
 ## Invariants
 
 - UOK and the dispatch controller are pure TypeScript — no LLM decisions in the dispatch loop itself.
 - Each dispatch unit runs in a fresh context — no cross-turn state accumulation.
- Planning artifacts are tracked in git; runtime artifacts are never committed. +- DB-backed state is the only executable truth for migrated milestones: planning hierarchy, `sequence` priority/order, validation assessments, gate runs, quality gates, UOK runtime policy, and outcome ledgers all come from SQLite. Markdown/JSON projections are human views, exports, diagnostics, or explicit recovery/import inputs; normal runtime does not fall back to them when `.sf/sf.db` exists and opens. - `SF_RUNTIME_PATTERNS` in `gitignore.ts` is the canonical source of truth for runtime paths. `git-service.ts` (`RUNTIME_EXCLUSION_PATHS`) and `worktree-manager.ts` (`SKIP_*` arrays) must stay synchronized with it. - The user is the end-gate. SF delivers for review, not to production. diff --git a/TODO.md b/TODO.md index 59f3fa29c..f4f863ab4 100644 --- a/TODO.md +++ b/TODO.md @@ -47,6 +47,87 @@ runtime memory or an approved backlog. - **Workflow state machine race protection** — Session transitions during agent_end, idle-wait optimization (gsd-2 commits `71114fccf`, `6d7e4ccb5`, `c162c44bf`, `e3bd04551`). - **Claude-Code CLI Always-Allow persistence** — Grant persistence for non-Bash tools (gsd-2 `a88baeae9`, PR #5096). +### UOK Self-Evolution Research (2026-05-06) + +#### Overview +Research into whether SF's UOK (Unified Operation Kernel) is best-in-class for a self-evolving coder agent. Full research report: see session research folder. + +**Verdict:** UOK is excellent for deterministic autonomous dispatch (beats typical LLM agents, rivals enterprise orchestrators) but only 60-70% complete for true self-evolution. Learning infrastructure exists but isn't actively used. 
+ +#### Critical Finding: Documentation Gap +The implementation has **10+ undocumented features** not explained in ARCHITECTURE.md: +- Five-phase state machine with error recovery paths (PhaseDiscuss → PhasePlan → PhaseExecute → PhaseMerge → PhaseComplete) +- Gate-runner verdict semantics (passed/failed/omitted) + re-dispatch rules +- Outcome learning for model selection (Bayesian blending) +- Stuck-loop detection with recovery thresholds +- Evidence persistence and audit trails +- Sophisticated edge-case handling not documented + +**Action item:** Update ARCHITECTURE.md with full state machine diagram, gate semantics, and recovery paths. + +#### Self-Evolution Status + +**Infrastructure that works:** +- ✅ Self-report collection (sf_self_report captures anomalies during dispatch/validation) +- ✅ Outcome learning (Bayesian model selection per task-type) +- ✅ Knowledge compounding (KNOWLEDGE.md with judgment-log entries) +- ✅ Gate-based pattern detection (gates can detect repeated failures) + +**Feedback loop that's missing:** +- ❌ Triage pipeline — self-reports collected but not processed into fixes +- ❌ Continuous model tuning — learning exists but infrequent, not aggressive +- ❌ Automated knowledge injection — knowledge exists but not used in prompts +- ❌ Cross-gate pattern aggregation — gates run independently, don't see patterns +- ❌ Adaptive thresholds — all timeouts hardcoded, not data-driven +- ❌ Hypothesis testing — no A/B test framework for improvements +- ❌ Regression detection — no metrics monitoring for quality drift + +#### Top 3 Improvements (Quick Wins) + +1. **Close self-report feedback loop** [9/10 impact, 4/10 effort, 2-3 days] + - Auto-triage self-reports, create work items for fixes, promote high-confidence improvements to code + - **Why:** Reports are collected but ignored; this closes the feedback loop + - **Implementation:** Extend commands-todo.js triage logic to process sf_self_report events + +2. 
**Activate continuous model learning** [8/10 impact, 5/10 effort, 3-4 days] + - Track model success/failure per task type + latency + cost; auto-demote failing models; A/B test new models on low-risk tasks + - **Why:** Learning exists but is dormant; this makes dispatch adaptive + - **Implementation:** Enhance benchmark-selector.ts + model-router logic with aggressive per-task-type tracking + +3. **Automate knowledge injection** [7/10 impact, 4/10 effort, 2-3 days] + - Auto-query KNOWLEDGE.md for relevance during milestone planning; inject high-confidence learnings; flag contradictions + - **Why:** Knowledge exists but isn't used; this makes it actionable + - **Implementation:** Add to auto-prompts.js knowledge-injection stage; use semantic similarity scoring + +**Quick-win total:** ~8-10 days for high-leverage improvements that activate the learning loop. + +#### Additional Improvements (Medium-Term, 1-2 Months) + +4. **Continuous gate pattern aggregation** [8/10 impact, 6/10 effort, 3-4 days] — After each phase, detect common gate failure themes across all gates; aggregate into consolidated self-reports; suggest architectural fixes. + +5. **Adaptive timeout tuning** [7/10 impact, 6/10 effort, 3-4 days] — Replace hardcoded timeouts with data-driven values based on task execution history; auto-adjust per task-type. + +6. **Hypothesis testing framework** [9/10 impact, 7/10 effort, 4-5 days] — A/B test improvements on low-stakes tasks; roll back if they introduce regressions; never ship untested changes. + +7. **Cross-milestone federated learning** [8/10 impact, 9/10 effort, 8-10 days] — Share generalizable learnings across projects (same org); test on similar projects first; expand based on results. + +8. **Regression detection & prevention** [7/10 impact, 8/10 effort, 5-6 days] — Track key metrics (success rate, latency, cost, gate failures) across milestones; alert on regressions; auto-rollback bad changes. + +9. 
**Semantic drift detection** [6/10 impact, 7/10 effort, 4-5 days] — Detect when prompts/gate logic have drifted from original intent; file alerts; suggest reverting or documenting. + +10. **Self-hosted telemetry & profiling** [5/10 impact, 8/10 effort, 4-5 days] — When SF runs on itself (dogfooding), profile which phases/gates/model-selections take longest; prioritize optimizations. + +**Medium-term total:** ~25-30 days for comprehensive self-evolution roadmap. + +#### Documentation That Should Be Updated + +- [ ] ARCHITECTURE.md — Full state machine diagram with phase transitions and error recovery paths +- [ ] docs/dev/ADR-* — Document gate verdict semantics (passed/failed/omitted) and re-dispatch behavior +- [ ] User docs — Explain outcome learning, model selection tuning, knowledge compounding workflow +- [ ] Runbook — Stuck-loop detection, timeout adjustment, recovery paths +- [ ] Design guide — Best practices for implementing custom gates with pattern detection +- [ ] New section: "Self-Evolution Architecture" explaining feedback loops, learning mechanisms, and how to extend them + ## Processed Notes - SF auto-loop hardening was converted into SF milestone state on 2026-05-02. diff --git a/TRIAGE_README.md b/TRIAGE_README.md new file mode 100644 index 000000000..ea0c8e13c --- /dev/null +++ b/TRIAGE_README.md @@ -0,0 +1,53 @@ +# TODO.md Triage Instructions + +## What's New + +TODO.md now contains two major sections ready for triage: + +1. **Feature Gaps & Limitations** — 40+ specific gaps identified in the codebase +2. **UOK Self-Evolution Research** — 10 prioritized improvements for SF's self-evolution capabilities + +## How to Triage + +When you have Node 24.15.0+ available: + +```bash +cd /home/mhugo/code/singularity-forge + +# Run the triage command +sf todo triage + +# Or if using npm/nvm +nvm use 24 +npm exec sf -- todo triage +``` + +## What Triage Does + +The triage tool will: +1. Parse TODO.md +2. 
Extract items into structured `.sf/triage/` artifacts +3. Propose categorization and priorities +4. Show you a review interface +5. Either commit to backlog or reset TODO.md to empty dump inbox + +## Key Items to Watch For + +The UOK Self-Evolution section has **3 high-impact quick wins** (8-10 days total): + +1. Close self-report feedback loop [9/10 impact, 2-3 days] +2. Activate continuous model learning [8/10 impact, 3-4 days] +3. Automate knowledge injection [7/10 impact, 2-3 days] + +These should be prioritized if you want to activate SF's learning loop. + +## Full Research Report + +See: `/home/mhugo/snap/copilot-cli/38/.copilot/session-state/2514fa98-076d-48d2-a1f9-c3fd77c4a82a/research/is-our-uok-the-best-for-a-self-evolving-coder-what.md` + +This contains: +- Executive summary +- Detailed analysis of UOK implementation vs. documentation +- 10 improvement suggestions with feasibility assessment +- Competitive analysis (vs. other orchestration systems) +- 15+ citations to code and design docs diff --git a/docs/dev/UOK-SELF-EVOLUTION.md b/docs/dev/UOK-SELF-EVOLUTION.md new file mode 100644 index 000000000..a7b65d2a3 --- /dev/null +++ b/docs/dev/UOK-SELF-EVOLUTION.md @@ -0,0 +1,301 @@ +# UOK Self-Evolution Architecture + +This document explains how Singularity Forge's UOK (Unified Operation Kernel) implements self-evolution — the ability to detect its own failures, learn from them, and improve its own heuristics and dispatch logic. + +## Status Summary + +**Current state:** 60-70% complete. Infrastructure exists; learning loop not fully activated. 
+ +**What works:** +- ✅ Self-report collection during dispatch/validation +- ✅ Outcome learning for model selection (Bayesian) +- ✅ Knowledge compounding (KNOWLEDGE.md) +- ✅ Gate-based pattern detection capability + +**What's missing:** +- ❌ Automated triage pipeline (reports not processed into fixes) +- ❌ Continuous model tuning (learning episodic, not aggressive) +- ❌ Automated knowledge injection (knowledge not used in prompts) +- ❌ Cross-gate pattern aggregation (gates run independently) +- ❌ Adaptive thresholds (timeouts hardcoded, not data-driven) +- ❌ Hypothesis testing (no A/B test framework for improvements) +- ❌ Regression detection (no metrics monitoring) + +--- + +## 1. Self-Report Collection + +**Purpose:** Capture SF-internal observations when something unexpected happens during dispatch, validation, or gate runs. + +**Implementation:** Agents and gates call `sf_self_report(issue, severity, context)` + +**Examples of self-reports filed in production:** +- "validation-reviewer prompt lacks explicit rubric for criterion vs. implementation gap" [low] +- "model timeout on large input (>100K tokens)" [warning] +- "gate inconsistency: requirement-coverage gate failed in S02 but passed in S03 for same requirement" [warning] + +**Storage:** +- Runtime: `~/.sf/agent/upstream-feedback.jsonl` (per-session) +- Dogfooding: `.sf/SELF-FEEDBACK.md` (when SF runs on itself) + +**Current limitation:** +- Reports are collected but not systematically processed +- No automatic triage, dedup, or promotion to code fixes +- No feedback-to-code pipeline + +**Improvement:** See "Top 3 Quick Wins" below. + +--- + +## 2. Outcome Learning for Model Selection + +**Purpose:** Track which models succeed/fail on different task types, then route future tasks to higher-scoring models. + +**Mechanism:** + +UOK maintains a per-task-type model performance matrix: + +``` +model_scores[task_type] = { + claude-sonnet: { successes: 42, failures: 3, latency_ms: [450, 520, ...] 
}, + claude-opus: { successes: 8, failures: 2, latency_ms: [800, 850, ...] }, + minimax: { successes: 15, failures: 10, latency_ms: [350, 400, ...] } +} +``` + +Bayesian update after each task: + +``` +P(model_i succeeds | task_type) = (successes + prior_weight) / (total_trials + prior_weight) +``` + +- Default priors give new/experimental models benefit of the doubt +- Different priors for different model classes (Claude gets higher prior than experimental) +- Used by `benchmark-selector.ts` to pick the best model for next task + +**Location:** Computed during phase transitions; stored in `sf.db` outcome logs. + +**Current limitations:** +- Learning updates episodically (per-task completion), not continuously +- Success/failure is binary — doesn't distinguish "slow success" from "fast success" +- Only applied to task dispatch, not to gate routing +- Recovery paths don't feed learning back + +**Improvement:** Make learning aggressive per-task-type with latency/cost tracking. + +--- + +## 3. Knowledge Compounding + +**Purpose:** Extract high-confidence learnings from completed work and make them available to future milestones. + +**Storage:** `KNOWLEDGE.md` with structured judgment-log entries. + +**Format:** + +```markdown +## [2026-05-06] Python 3.12 stdlib compatibility + +**Verdict:** Active issue — avoid for now + +**Evidence:** Task T02 in M010/S03 discovered that `asyncore` module removed in Python 3.12. Affects legacy integrations. + +**Confidence:** 0.95 (observed failure in live deployment) + +**Recommendation:** Constrain to Python <3.11 in requirements.txt; add explicit warning for users on 3.12. +``` + +**How it should work (ideally):** +1. After slice completion, `memory-extractor.ts` distills high-confidence learnings (confidence >0.8) +2. Next milestone dispatch checks KNOWLEDGE.md for relevance +3. Relevant knowledge injected into dispatch prompts automatically +4. 
Contradictory knowledge flagged as potential architectural drift + +**Current state:** +- Storage works (KNOWLEDGE.md well-formatted) +- Extraction works (memory-extractor.ts analyzes task results) +- Injection is **manual** (must explicitly configure in prompts) +- No automatic relevance matching +- No conflict resolution for contradictory knowledge + +**Improvement:** Automate knowledge injection with semantic relevance scoring. + +--- + +## 4. Gate-Based Pattern Detection + +**Purpose:** Gates can detect and report repeated failure patterns, signaling potential design flaws. + +**Example:** +- Requirement-coverage gate fails in S01 (requirement X not covered) +- Requirement-coverage gate fails in S03 (same requirement X not covered) +- Gate files self-report: "Requirement X failed coverage in multiple slices — suggests design flaw or missing slice" + +**Current state:** +- Logic exists in individual gates +- Each gate runs independently (no cross-gate pattern aggregation) +- Patterns must be explicitly coded in gate logic (not automatic) +- No framework for gate authors to easily add pattern detection + +**Improvement:** Add cross-gate pattern aggregation + automatic theme detection. + +--- + +## Top 3 Quick Wins (8-10 Days Total) + +### 1. Close Self-Report Feedback Loop [9/10 impact, 4/10 effort, 2-3 days] + +**What:** Create an automated triage pipeline that processes self-reports into actionable fixes. + +**Implementation:** +- Extend `commands-todo.js` triage logic to parse `upstream-feedback.jsonl` +- Triage rules: + - Dedup identical reports (same issue filed multiple times) + - Classify by severity: blocker | warning | suggestion + - Auto-create backlog work items for blockers/warnings + - For high-confidence fixes (e.g., "prompt lacks rubric"), propose the fix directly +- Promote fixes into code via new SF slice + +**Why:** Reports are collected but ignored. This closes the feedback loop. 
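The triage rules listed above can be sketched as follows. The record fields and function names here are hypothetical; only `upstream-feedback.jsonl` and the severity levels come from this document:

```typescript
// Illustrative self-report record as it might appear in upstream-feedback.jsonl.
interface SelfReport {
  issue: string;
  severity: "blocker" | "warning" | "suggestion";
}

interface TriageResult {
  backlogItems: string[]; // one work item per unique blocker/warning
  dropped: number;        // deduplicated or suggestion-level reports
}

// Dedup identical reports, then promote blockers/warnings to backlog items.
function triageSelfReports(reports: SelfReport[]): TriageResult {
  const seen = new Set<string>();
  const backlogItems: string[] = [];
  let dropped = 0;
  for (const r of reports) {
    const key = `${r.severity}:${r.issue}`;
    if (seen.has(key)) { dropped++; continue; } // dedup identical reports
    seen.add(key);
    if (r.severity === "suggestion") { dropped++; continue; } // park suggestions
    backlogItems.push(`[${r.severity}] ${r.issue}`);
  }
  return { backlogItems, dropped };
}
```

A real pipeline would additionally propose fixes directly for high-confidence reports, as described above.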
+ +**Code locations:** +- `src/resources/extensions/sf/commands-handlers.js` (sf_self_report implementation) +- `src/resources/extensions/sf/commands/handlers/todo.js` (triage logic to extend) +- `src/resources/extensions/sf/commands-todo.js` (triage prompt + tool) + +--- + +### 2. Activate Continuous Model Learning [8/10 impact, 5/10 effort, 3-4 days] + +**What:** Make model selection adaptive, per-task-type, with failure analysis and automatic demotion. + +**Current state:** +- Outcome tracking exists but learning is infrequent +- All model routing decisions are mostly static (based on configuration, not history) + +**Improvements:** +- Track per-task-type: success rate, latency, cost, token efficiency +- Auto-demote models that fail >50% on specific task types +- A/B test new models against incumbent on low-risk tasks +- Log detailed failure analytics (why did this model fail? timeouts? quality?) + +**Why:** Learning exists but is dormant; this makes dispatch adaptive. + +**Code locations:** +- `packages/pi-ai/src/model-router.ts` (model selection logic) +- `src/auto-dispatch.ts` (outcome logging, task tracking) +- `src/resources/extensions/sf/commands/benchmark-selector.ts` (model scoring display) + +--- + +### 3. Automate Knowledge Injection [7/10 impact, 4/10 effort, 2-3 days] + +**What:** During milestone planning, automatically query KNOWLEDGE.md for relevant learnings and inject them into dispatch prompts. + +**Current state:** +- KNOWLEDGE.md exists and is populated +- Agents never see it (unless manually configured) + +**Improvements:** +- At planning time, query KNOWLEDGE.md with semantic similarity scoring +- Inject high-confidence (>0.8) relevant knowledge into `execute-task`, `plan-slice` prompts +- Flag contradictory knowledge (e.g., "avoid Python 3.12" vs. "adopt Python 3.12") for review +- Track which knowledge was actually used (feedback to knowledge compounding) + +**Why:** Knowledge exists but isn't used; this makes it actionable. 
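The selection step for this quick win can be sketched as below. The entry shape and helper names are hypothetical, and token overlap stands in for the semantic similarity scoring proposed above:

```typescript
// Illustrative knowledge entry — mirrors the KNOWLEDGE.md judgment-log shape.
interface KnowledgeEntry {
  text: string;       // e.g. "Python 3.12 incompatible with X library — avoid for now"
  confidence: number; // 0..1
}

// Naive token-overlap (Jaccard) stand-in for semantic similarity;
// a real implementation would use embeddings.
function similarity(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Select high-confidence (>0.8), relevant entries to inject into a dispatch prompt.
function selectKnowledge(
  taskDescription: string,
  entries: KnowledgeEntry[],
  minSim = 0.1,
): string[] {
  return entries
    .filter((e) => e.confidence > 0.8 && similarity(taskDescription, e.text) >= minSim)
    .map((e) => e.text);
}
```

A production version would also flag contradictory entries for review rather than silently injecting both.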
+ +**Code locations:** +- `src/resources/extensions/sf/auto-prompts.js` (where prompts are loaded; add knowledge injection here) +- `src/resources/extensions/sf/prompts/execute-task.md`, `plan-slice.md` (templates that should reference {{knowledgeInjection}}) +- New module: `src/resources/extensions/sf/knowledge-injector.ts` (semantic matching logic) + +--- + +## Additional Improvements (Medium-Term, 1-2 Months) + +### 4. Continuous Gate Pattern Aggregation [8/10 impact, 6/10 effort, 3-4 days] +After each phase, scan all gate failures for common themes. Aggregate into consolidated self-reports. Suggest architectural fixes. + +### 5. Adaptive Timeout Tuning [7/10 impact, 6/10 effort, 3-4 days] +Replace hardcoded timeouts with data-driven values based on task execution history. Auto-adjust per task-type. + +### 6. Hypothesis Testing Framework [9/10 impact, 7/10 effort, 4-5 days] +A/B test improvements on low-stakes tasks. Roll back if they introduce regressions. Never ship untested changes. + +### 7. Cross-Milestone Federated Learning [8/10 impact, 9/10 effort, 8-10 days] +Share generalizable learnings across projects (same org). Test on similar projects first. + +### 8. Regression Detection & Prevention [7/10 impact, 8/10 effort, 5-6 days] +Track metrics across milestones. Alert on regressions. Auto-rollback bad changes. + +### 9. Semantic Drift Detection [6/10 impact, 7/10 effort, 4-5 days] +Detect when prompts/gate logic have drifted from original intent. Suggest reverting or documenting. + +### 10. Self-Hosted Telemetry [5/10 impact, 8/10 effort, 4-5 days] +When SF runs on itself (dogfooding), profile which phases/gates take longest. Prioritize optimizations. 
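Improvement #5 above (adaptive timeout tuning) can be sketched as a percentile-over-history calculation. Names and defaults are illustrative assumptions, not existing SF code:

```typescript
// Derive a per-task-type timeout from observed durations instead of a
// hardcoded constant: p95 of history plus headroom, with a fallback until
// enough data accumulates.
function adaptiveTimeoutMs(
  historyMs: number[],  // completed-task durations for this task type
  fallbackMs = 300_000, // used until enough history accumulates
  percentile = 0.95,
  headroom = 1.5,
): number {
  if (historyMs.length < 5) return fallbackMs; // not enough data yet
  const sorted = [...historyMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(percentile * sorted.length));
  return Math.round(sorted[idx] * headroom);
}
```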
+
+---
+
+## Architecture Diagram
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                        UOK Dispatch Loop                        │
+└─────────────────────────────────────────────────────────────────┘
+                 ↓
+        ┌────────────────────────┐
+        │  PhaseDiscuss/Plan/    │
+        │  Execute/Merge/        │
+        │  Complete              │
+        └────────────────────────┘
+            ↓           ↓
+  ┌──────────────────┐  ┌──────────────────┐
+  │  Outcome         │  │  Gates Run       │
+  │  Logging         │  │  Parallel        │
+  └──────────────────┘  └──────────────────┘
+            ↓           ↓
+  ┌───────────────────────────────────────┐
+  │ sf.db: Outcome Ledger + Gate Results  │
+  └───────────────────────────────────────┘
+                 ↓
+  ┌──────────────────────────────────────────────┐
+  │  Self-Report Collection                      │
+  │  (agents + gates file anomalies)             │
+  └──────────────────────────────────────────────┘
+       ↓             ↓               ↓
+  ┌────────┐  ┌────────────┐  ┌───────────────┐
+  │ TBD:   │  │ Learning:  │  │ Knowledge:    │
+  │ Triage │  │ Model      │  │ Compounding   │
+  │ Loop   │  │ Selection  │  │ (KNOWLEDGE.md)│
+  └────────┘  └────────────┘  └───────────────┘
+                  ↓               ↓
+  ┌─────────────────────────────────────────┐
+  │         TBD: Automated                  │
+  │         Knowledge Injection             │
+  │         (into next dispatch)            │
+  └─────────────────────────────────────────┘
+
+(TBD = not yet implemented; the Learning and Knowledge boxes exist today but are only partially active)
+```
+
+---
+
+## How to Contribute
+
+To improve self-evolution, pick one of the quick wins above:
+
+1. **Study the code:** Understand how self-reports are filed, how outcome logging works, how KNOWLEDGE.md is structured
+2. **Write a failing test:** Define expected behavior (e.g., "when self-report severity is 'blocker', it creates a backlog item")
+3. **Implement the improvement:** Follow SF coding conventions (see CONTRIBUTING.md)
+4. **Test thoroughly:** Especially recovery paths and edge cases
+5. **Document:** Update this file and ARCHITECTURE.md as behavior changes
+
+---
+
+## References
+
+- **Outcome Learning:** `src/auto-dispatch.ts` (outcome logging), `packages/pi-ai/src/model-router.ts` (model selection)
+- **Self-Reports:** `src/resources/extensions/sf/commands-handlers.js` (sf_self_report), `upstream-feedback.jsonl` (storage)
+- **Knowledge:** `KNOWLEDGE.md` (storage), `src/resources/extensions/sf/memory-extractor.js` (extraction)
+- **Gates:** `src/resources/extensions/sf/prompts/gate-evaluate.md` (gate orchestration)
+- **TODO:** See `TODO.md` and `BACKLOG.md` for prioritized work
diff --git a/docs/dev/architecture.md b/docs/dev/architecture.md
index 239e4fd10..1195cd228 100644
--- a/docs/dev/architecture.md
+++ b/docs/dev/architecture.md
@@ -30,7 +30,7 @@ vscode-extension/ VS Code extension — chat participant (@sf), sidebar
 
 ### State Lives on Disk
 
-Structured `.sf` state is the runtime source of truth. Auto mode reads it, writes it, and advances based on what it finds. Markdown planning files are human projections when structured state exists. No in-memory state survives across sessions. This enables crash recovery, multi-terminal steering, and session resumption.
+Structured `.sf` state is the runtime source of truth. For migrated milestones, `.sf/sf.db` is authoritative for hierarchy, `sequence` priority/order, validation assessments, gates, UOK lifecycle, and outcome ledgers. Markdown and JSON planning files are generated views, exports, or explicit recovery/import inputs; normal auto mode does not fall back to them when the DB exists and opens. No in-memory state survives across sessions. This enables crash recovery, multi-terminal steering, and session resumption.
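The no-fallback rule described above can be sketched as a source-of-truth resolver. The helper and its signature are hypothetical; the real logic lives in `state.ts` and the DB layer:

```typescript
import { existsSync } from "node:fs";

// Hypothetical helper: decide which source of truth the runtime reads for
// planning state. Markdown/JSON projections are consulted only when the
// milestone is unmigrated or the DB is absent/unopenable — never as a
// silent fallback once the DB exists and opens.
function resolveStateSource(opts: {
  dbPath: string;
  dbOpens: (path: string) => boolean; // e.g. attempts a SQLite open
  milestoneMigrated: boolean;
}): "db" | "filesystem" {
  if (opts.milestoneMigrated && existsSync(opts.dbPath) && opts.dbOpens(opts.dbPath)) {
    return "db";
  }
  return "filesystem";
}
```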
### Two-File Loader Pattern @@ -154,7 +154,7 @@ Phase skipping (from token profile) gates steps 2-3: if a phase is skipped, the | `visualizer-data.ts` | Data loading for visualizer tabs | | `visualizer-views.ts` | Tab renderers (progress, deps, metrics, timeline, discussion status) | | `metrics.ts` | Token and cost tracking ledger | -| `state.ts` | State derivation from disk | +| `state.ts` | DB-authoritative state derivation with filesystem fallback only for unmigrated/recovery planning artifacts | | `session-lock.ts` | OS-level exclusive session locking (proper-lockfile) | | `crash-recovery.ts` | Lock file management for crash detection and recovery | | `preferences.ts` | Preference loading, merging, validation | diff --git a/src/resources/extensions/sf/auto-dispatch.js b/src/resources/extensions/sf/auto-dispatch.js index d423ceda2..7996cf7c3 100644 --- a/src/resources/extensions/sf/auto-dispatch.js +++ b/src/resources/extensions/sf/auto-dispatch.js @@ -72,6 +72,7 @@ import { createScheduleStore } from "./schedule/schedule-store.js"; import { getMilestone, getMilestoneSlices, + getMilestoneValidationAssessment, getPendingGates, getSlice, getSliceTasks, @@ -112,6 +113,33 @@ function missingSliceStop(mid, phase) { level: "error", }; } +async function readMilestoneValidationForDispatch(basePath, mid) { + if (isDbAvailable()) { + const assessment = getMilestoneValidationAssessment(mid); + const verdict = + typeof assessment?.status === "string" && assessment.status.trim() + ? assessment.status.trim().toLowerCase() + : undefined; + if (verdict) { + return { + verdict, + content: assessment.full_content ?? "", + path: + assessment.path ?? 
resolveMilestoneFile(basePath, mid, "VALIDATION"), + }; + } + return null; + } + const validationFile = resolveMilestoneFile(basePath, mid, "VALIDATION"); + if (!validationFile) return null; + const content = await loadFile(validationFile); + if (!content) return null; + return { + verdict: extractVerdict(content), + content, + path: validationFile, + }; +} function canonicalPlanStop(mid, plan) { return { action: "stop", @@ -1426,11 +1454,21 @@ export const DISPATCH_RULES = [ // Safety guard (#2675): completion is only automatic after a pass verdict. // Non-pass terminal verdicts are still terminal for validation loops, but // they are not a license to close the milestone. - const validationFile = resolveMilestoneFile(basePath, mid, "VALIDATION"); - if (validationFile) { - const validationContent = await loadFile(validationFile); - if (validationContent) { - const verdict = extractVerdict(validationContent); + const validation = await readMilestoneValidationForDispatch( + basePath, + mid, + ); + if (!validation && isDbAvailable()) { + return { + action: "stop", + reason: `Cannot complete milestone ${mid}: DB has no milestone-validation assessment row. 
Runtime does not fall back to VALIDATION.md when .sf/sf.db is available; run validate-milestone or sf recover.`, + level: "error", + }; + } + if (validation) { + const validationContent = validation.content; + if (validation.verdict) { + const verdict = validation.verdict; if (verdict && verdict !== "pass") { if (verdict === "needs-attention") { const attentionPlan = @@ -1443,7 +1481,7 @@ export const DISPATCH_RULES = [ writeValidationAttentionMarker(basePath, mid, { milestoneId: mid, createdAt: new Date().toISOString(), - source: validationFile, + source: validation.path, remediationRound: parseValidationRemediationRound(validationContent), }); diff --git a/src/resources/extensions/sf/auto/phases.js b/src/resources/extensions/sf/auto/phases.js index 77a13c3b0..2d79a5c6c 100644 --- a/src/resources/extensions/sf/auto/phases.js +++ b/src/resources/extensions/sf/auto/phases.js @@ -40,6 +40,7 @@ import { buildAutonomousSolverMissingCheckpointRepairPrompt, buildAutonomousSolverPromptBlock, buildAutonomousSolverSteeringPromptBlock, + classifyAutonomousSolverMissingCheckpointFailure, consumePendingAutonomousSolverSteering, getConfiguredAutonomousSolverMaxIterations, recordAutonomousSolverMissingCheckpointRetry, @@ -68,6 +69,7 @@ import { rollbackToCheckpoint, } from "../safety/git-checkpoint.js"; import { resolveSafetyHarnessConfig } from "../safety/safety-harness.js"; +import { recordSelfFeedback } from "../self-feedback.js"; import { getMilestoneSlices, getSliceTaskCounts, @@ -2144,8 +2146,16 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { unitType, unitId, ); - if (solverAssessment.action === "missing-checkpoint-retry") { - recordAutonomousSolverMissingCheckpointRetry(s.basePath, unitType, unitId); + while (solverAssessment.action === "missing-checkpoint-retry") { + const diagnosis = classifyAutonomousSolverMissingCheckpointFailure( + currentUnitResult.event?.messages ?? 
[], + ); + recordAutonomousSolverMissingCheckpointRetry( + s.basePath, + unitType, + unitId, + diagnosis, + ); deps.emitJournalEvent({ ts: new Date().toISOString(), flowId: ic.flowId, @@ -2155,10 +2165,13 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { unitType, unitId, iteration: solverAssessment.state?.iteration, + repairAttempt: solverAssessment.repairAttempt, + maxRepairAttempts: solverAssessment.maxRepairAttempts, + classification: diagnosis.classification, }, }); ctx.ui.notify( - `Autonomous solver checkpoint missing for ${unitType} ${unitId}; redispatching one repair turn.`, + `Autonomous solver checkpoint missing for ${unitType} ${unitId}; repair attempt ${solverAssessment.repairAttempt}/${solverAssessment.maxRepairAttempts} (${diagnosis.classification}).`, "warning", ); currentUnitResult = await runUnit( @@ -2171,6 +2184,9 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { solverAssessment.state, unitType, unitId, + diagnosis, + solverAssessment.repairAttempt, + solverAssessment.maxRepairAttempts, ), ); s.lastUnitAgentEndMessages = currentUnitResult.event?.messages ?? null; @@ -2193,6 +2209,56 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { }); } if (solverAssessment.action === "pause") { + const missingCheckpointDiagnosis = + solverAssessment.reason === "solver-missing-checkpoint" + ? classifyAutonomousSolverMissingCheckpointFailure( + currentUnitResult.event?.messages ?? [], + ) + : null; + if (missingCheckpointDiagnosis) { + try { + const feedback = recordSelfFeedback( + { + kind: "solver-missing-checkpoint", + severity: "high", + summary: `Autonomous solver failed to checkpoint after ${solverAssessment.repairAttempts ?? 
"multiple"} repair attempt(s): ${missingCheckpointDiagnosis.classification}`, + evidence: [ + `unit=${unitType} ${unitId}`, + `classification=${missingCheckpointDiagnosis.classification}`, + `summary=${missingCheckpointDiagnosis.summary}`, + `evidencePath=.sf/runtime/autonomous-solver/LOOP.md`, + "", + missingCheckpointDiagnosis.evidence ?? "", + ].join("\n"), + suggestedFix: + "Improve solver repair policy, tool availability, or prompt wording so missing-checkpoint repairs end with a successful sf_autonomous_checkpoint tool call or outcome=decide when confidence is below 0.98.", + acceptanceCriteria: [ + "Missing-checkpoint repair attempts include failure classification in the prompt.", + "Repeated repair failures file self-feedback automatically.", + "Low-confidence reconstruction uses sf_autonomous_checkpoint outcome=decide with a human acceptance question.", + ], + occurredIn: { unitType, unitId }, + source: "runtime", + }, + s.basePath, + ); + deps.emitJournalEvent({ + ts: new Date().toISOString(), + flowId: ic.flowId, + seq: ic.nextSeq(), + eventType: "solver-missing-checkpoint-self-feedback", + data: { + unitType, + unitId, + classification: missingCheckpointDiagnosis.classification, + selfFeedbackId: feedback?.entry?.id, + blocking: feedback?.blocking, + }, + }); + } catch { + // self-feedback is observability; never mask the solver pause + } + } const reason = solverCheckpoint?.outcome === "decide" ? (solverCheckpoint.decisionQuestion ?? solverCheckpoint.summary) @@ -2215,6 +2281,7 @@ export async function runUnitPhase(ic, iterData, loopState, sidecarItem) { maxIterations: solverAssessment.state?.maxIterations, remainingItems: solverCheckpoint?.remainingItems ?? [], evidencePath: ".sf/runtime/autonomous-solver/LOOP.md", + ...(missingCheckpointDiagnosis ? 
{ missingCheckpointDiagnosis } : {}), }, }); ctx.ui.notify( diff --git a/src/resources/extensions/sf/autonomous-solver.js b/src/resources/extensions/sf/autonomous-solver.js index 89ff29554..6118ad168 100644 --- a/src/resources/extensions/sf/autonomous-solver.js +++ b/src/resources/extensions/sf/autonomous-solver.js @@ -28,6 +28,7 @@ const MAX_RENDERED_ITEMS = 12; const DEFAULT_SOLVER_MAX_ITERATIONS = 30000; const MIN_SOLVER_MAX_ITERATIONS = 1; const MAX_SOLVER_MAX_ITERATIONS = 100000; +const DEFAULT_MISSING_CHECKPOINT_REPAIR_ATTEMPTS = 4; function solverDir(basePath) { return join(sfRoot(basePath), "runtime", "autonomous-solver"); @@ -221,7 +222,9 @@ export function buildAutonomousSolverPromptBlock(state) { "", "This is SF's built-in solver loop. It is not a separate Ralph workflow. Work one bounded, useful chunk; preserve enough state for the next autonomous iteration to continue without guessing.", "", - "Before ending the turn, call `sf_autonomous_checkpoint` with:", + "Hard requirement: before ending the turn, call the actual `sf_autonomous_checkpoint` tool. 
Writing SUMMARY.md, LOOP.md, task files, chat prose, or any other artifact is useful evidence, but it is not a checkpoint and does not satisfy this requirement.", + "", + "Call `sf_autonomous_checkpoint` with:", '- `outcome: "complete"` only when this unit\'s normal completion tool/artifact is also done.', '- `outcome: "continue"` when you made real progress but more autonomous iterations are needed.', '- `outcome: "blocked"` when the next step cannot proceed without unavailable facts, credentials, or a broken environment.', @@ -239,6 +242,7 @@ export function buildAutonomousSolverPromptBlock(state) { "", "If you are executing an `execute-task` unit and the task is finished, `sf_task_complete` remains mandatory; `sf_autonomous_checkpoint` does not replace it.", "If you need another iteration, leave exact remaining items in the checkpoint rather than ending with vague prose.", + "Your final autonomous action should be the checkpoint tool call unless a required completion tool such as sf_task_complete must be called immediately before it.", ].join("\n"); } @@ -328,22 +332,123 @@ export function recordAutonomousSolverMissingCheckpointRetry( basePath, unitType, unitId, + diagnosis = null, ) { const state = readJson(statePath(basePath)); if (!sameUnit(state, unitType, unitId)) return null; + const ts = nowIso(); + const attempts = getMissingCheckpointRepairAttempts(state); + const nextAttempt = attempts.length + 1; const nextState = { ...state, status: "running", - updatedAt: nowIso(), + updatedAt: ts, + missingCheckpointRepairs: [ + ...attempts, + { + attempt: nextAttempt, + iteration: state.iteration, + ts, + mode: missingCheckpointRepairMode(nextAttempt), + ...(diagnosis ? { diagnosis } : {}), + }, + ], missingCheckpointRetry: { iteration: state.iteration, - ts: nowIso(), + ts, + attempt: nextAttempt, + mode: missingCheckpointRepairMode(nextAttempt), + ...(diagnosis ? 
{ diagnosis } : {}), }, }; writeState(basePath, nextState); return nextState; } +/** + * Classify why a solver turn omitted the checkpoint tool. + * + * Purpose: make the missing-checkpoint repair prompt adaptive and turn repeated + * model failure patterns into self-feedback instead of manual operator forensics. + * + * Consumer: runUnitPhase before repair redispatch and before missing-checkpoint + * pause/self-feedback. + */ +export function classifyAutonomousSolverMissingCheckpointFailure(messages) { + const text = stringifyMessages(messages); + const lower = text.toLowerCase(); + if (!text.trim()) { + return { + classification: "no-transcript", + summary: "No agent-end transcript was available to classify.", + evidence: "", + }; + } + const mentionsCheckpoint = lower.includes("sf_autonomous_checkpoint"); + const mentionsToolUnavailable = + /(unknown|unavailable|not available|not found|no such) tool/.test(lower) || + (lower.includes("sf_autonomous_checkpoint") && + /(unavailable|not available|not found|unknown)/.test(lower)); + const mentionsToolFailure = + lower.includes("error in sf_autonomous_checkpoint") || + (lower.includes("sf_autonomous_checkpoint") && + /(failed|error|exception|invalid)/.test(lower)); + const mentionsFileSubstitute = + /\bsummary\.md\b/i.test(text) || + /\bloop\.md\b/i.test(text) || + /\bt\d+[-_].*summary\.md\b/i.test(text) || + lower.includes("checkpoint is now saved to") || + lower.includes("checkpoint saved to") || + lower.includes("summary file"); + const falselyClaimsSaved = + (lower.includes("checkpoint") || + lower.includes("sf_autonomous_checkpoint")) && + /(saved|recorded|complete|now saved)/.test(lower); + if (mentionsToolUnavailable) { + return { + classification: "checkpoint-tool-unavailable", + summary: "The transcript suggests the checkpoint tool was unavailable.", + evidence: truncateEvidence(text), + }; + } + if (mentionsToolFailure) { + return { + classification: "checkpoint-tool-failed", + summary: "The transcript 
suggests the checkpoint tool call failed.", + evidence: truncateEvidence(text), + }; + } + if (mentionsFileSubstitute) { + return { + classification: "file-substituted-for-checkpoint", + summary: + "The agent wrote or referenced a summary/projection file instead of calling the checkpoint tool.", + evidence: truncateEvidence(text), + }; + } + if (falselyClaimsSaved) { + return { + classification: "claimed-checkpoint-without-tool", + summary: + "The agent claimed checkpoint completion without a successful checkpoint record.", + evidence: truncateEvidence(text), + }; + } + if (mentionsCheckpoint) { + return { + classification: "mentioned-checkpoint-without-tool", + summary: + "The agent discussed sf_autonomous_checkpoint but did not record a checkpoint.", + evidence: truncateEvidence(text), + }; + } + return { + classification: "no-checkpoint-tool-call", + summary: "The agent ended without calling sf_autonomous_checkpoint.", + evidence: truncateEvidence(text), + }; +} + /** * Classify the completed solver turn into the next loop action. 
* @@ -368,20 +473,24 @@ export function assessAutonomousSolverTurn(basePath, unitType, unitId) { checkpoint?.unitId === unitId && Number(checkpoint?.iteration) === Number(state.iteration); if (!hasCurrentCheckpoint) { - const alreadyRetried = - Number(state.missingCheckpointRetry?.iteration) === - Number(state.iteration); - if (alreadyRetried) { + const repairAttempts = getMissingCheckpointRepairAttempts(state).filter( + (attempt) => Number(attempt.iteration) === Number(state.iteration), + ).length; + if (repairAttempts >= DEFAULT_MISSING_CHECKPOINT_REPAIR_ATTEMPTS) { return { action: "pause", reason: "solver-missing-checkpoint", state, + repairAttempts, + maxRepairAttempts: DEFAULT_MISSING_CHECKPOINT_REPAIR_ATTEMPTS, }; } return { action: "missing-checkpoint-retry", reason: "solver-missing-checkpoint", state, + repairAttempt: repairAttempts + 1, + maxRepairAttempts: DEFAULT_MISSING_CHECKPOINT_REPAIR_ATTEMPTS, }; } if ( @@ -507,15 +616,97 @@ export function buildAutonomousSolverMissingCheckpointRepairPrompt( state, unitType, unitId, + diagnosis = null, + repairAttempt = 1, + maxRepairAttempts = DEFAULT_MISSING_CHECKPOINT_REPAIR_ATTEMPTS, ) { - return [ + const mode = missingCheckpointRepairMode(repairAttempt); + const lines = [ "## Checkpoint Required", "", `Your previous autonomous turn for ${unitType} ${unitId} ended without calling sf_autonomous_checkpoint for iteration ${state?.iteration ?? "unknown"}.`, - "Do not continue implementation work in this repair turn.", - "Inspect the work you just performed, then call sf_autonomous_checkpoint with the correct outcome and all eight PDD fields.", + `Repair attempt: ${repairAttempt} of ${maxRepairAttempts}.`, + `Repair mode: ${mode}.`, + ]; + if (diagnosis?.classification) { + lines.push( + "", + "Detected failure pattern:", + `- ${diagnosis.classification}: ${diagnosis.summary ?? 
"missing checkpoint"}`, + ); + } + if (repairAttempt <= 1) { + lines.push("Do not continue implementation work in this repair turn."); + } else { + lines.push( + "You may inspect transcript context and existing artifacts to reconstruct the checkpoint accurately; do not perform unrelated implementation work.", + ); + } + lines.push( + "Inspect the work already performed, then call the actual sf_autonomous_checkpoint tool with the correct outcome and all eight PDD fields.", + "Do not write a summary file as a substitute. Do not say the checkpoint is saved unless the sf_autonomous_checkpoint tool call succeeds.", + "Your final action in this repair turn must be the sf_autonomous_checkpoint tool call.", + ); + if (repairAttempt >= 2) { + lines.push( + "If artifacts show real progress, checkpoint that progress even if the prior turn did not explain it cleanly.", + ); + } + if (repairAttempt >= 3) { + lines.push( + 'If your confidence that the reconstructed checkpoint is correct is below 0.98, call sf_autonomous_checkpoint with outcome="decide" and put the human acceptance question in decisionQuestion.', + ); + } + if (repairAttempt >= maxRepairAttempts) { + lines.push( + 'This is the final automatic repair attempt. 
Prefer outcome="decide" over guessing; autonomous mode will pause with your decision question for human acceptance.', + ); + } + lines.push( "If no useful progress happened, use outcome=blocked and explain why.", - ].join("\n"); + ); + return lines.join("\n"); +} + +function getMissingCheckpointRepairAttempts(state) { + if (Array.isArray(state?.missingCheckpointRepairs)) { + return state.missingCheckpointRepairs; + } + if (state?.missingCheckpointRetry) return [state.missingCheckpointRetry]; + return []; +} + +function missingCheckpointRepairMode(attempt) { + if (attempt <= 1) return "strict-checkpoint-tool-only"; + if (attempt === 2) return "artifact-and-transcript-reconstruction"; + if (attempt === 3) return "confidence-gated-reconstruction"; + return "final-human-acceptance-gate"; +} + +function stringifyMessages(messages) { + if (!Array.isArray(messages)) return ""; + return messages + .map((message) => { + if (typeof message === "string") return message; + if (!message || typeof message !== "object") return String(message ?? ""); + if (typeof message.content === "string") return message.content; + if (Array.isArray(message.content)) { + return message.content + .map((part) => { + if (typeof part === "string") return part; + if (part?.text) return String(part.text); + return JSON.stringify(part); + }) + .join("\n"); + } + return JSON.stringify(message); + }) + .join("\n"); +} + +function truncateEvidence(text, limit = 4000) { + const value = String(text ?? ""); + return value.length > limit ? 
`${value.slice(0, limit)}\n[truncated]` : value; } /** diff --git a/src/resources/extensions/sf/bootstrap/db-tools.js b/src/resources/extensions/sf/bootstrap/db-tools.js index 3dc7c248f..83d459ff8 100644 --- a/src/resources/extensions/sf/bootstrap/db-tools.js +++ b/src/resources/extensions/sf/bootstrap/db-tools.js @@ -936,6 +936,8 @@ export function registerDbTools(pi) { "Checkpoint autonomous solver progress with PDD fields and semantic outcome", promptGuidelines: [ "Call sf_autonomous_checkpoint before ending an autonomous unit turn.", + "Do not write SUMMARY.md, LOOP.md, task files, or chat prose as a substitute for this tool call.", + "The checkpoint is recorded only when this actual tool returns success.", "Use outcome=complete only when the normal unit completion artifact/tool is also complete.", "Use outcome=continue when you made real progress but the unit needs another autonomous iteration.", "Use outcome=blocked for missing facts, credentials, broken environment, or impossible next steps.", diff --git a/src/resources/extensions/sf/canonical-milestone-plan.js b/src/resources/extensions/sf/canonical-milestone-plan.js index c6f8f2916..55c2ba0ab 100644 --- a/src/resources/extensions/sf/canonical-milestone-plan.js +++ b/src/resources/extensions/sf/canonical-milestone-plan.js @@ -61,6 +61,10 @@ function blockedResult(source, milestoneId, reason, paths) { }; } +function dbExists(basePath) { + return existsSync(projectDbPath(basePath)); +} + function normalizeStringArray(value) { if (!Array.isArray(value)) return []; return value.filter((item) => typeof item === "string"); @@ -143,6 +147,7 @@ function normalizeSliceFromProjection(raw) { ), isSketch: raw?.isSketch === true || raw?.is_sketch === 1, sketchScope: String(raw?.sketchScope ?? raw?.sketch_scope ?? ""), + sequence: Number(raw?.sequence ?? 
0), }; } @@ -159,12 +164,28 @@ function readDbPlan(basePath, milestoneId) { ) { openDatabase(dbPath); } + const openedDbPath = getDbPath(); + if ( + isDbAvailable() && + openedDbPath && + openedDbPath !== dbPath && + !existsSync(dbPath) + ) { + return null; + } if (!isDbAvailable()) return null; return readTransaction(() => { const milestone = getMilestone(milestoneId); - if (!milestone) return null; + if (!milestone) { + return { missing: "milestone" }; + } const slices = getMilestoneSlices(milestoneId); - if (slices.length === 0) return null; + if (slices.length === 0) { + return { + missing: "slices", + milestone: normalizeMilestoneFromDb(milestone), + }; + } return { milestone: normalizeMilestoneFromDb(milestone), slices: slices.map(normalizeSliceFromDb), @@ -202,18 +223,41 @@ function readProjectionPlan(basePath, milestoneId) { * Consumer: auto dispatch and doctor migration flows that need milestone * slices without parsing ROADMAP.md as executable state. */ -export function getCanonicalMilestonePlan(basePath, milestoneId) { +export function getCanonicalMilestonePlan(basePath, milestoneId, options = {}) { const paths = { db: projectDbPath(basePath), projection: roadmapJsonPath(basePath, milestoneId), markdown: roadmapMdPath(basePath, milestoneId), }; + const allowProjectionFallback = options.allowProjectionFallback === true; try { const dbPlan = readDbPlan(basePath, milestoneId); + if (dbPlan?.missing) { + return blockedResult( + `db-missing-${dbPlan.missing}`, + milestoneId, + `.sf/sf.db is available, so ${milestoneId} must be read from DB rows. Missing ${dbPlan.missing}; projection files are export/recovery only. Run /sf doctor or sf recover to reconcile.`, + paths, + ); + } if (dbPlan) return okResult("db", dbPlan.milestone, dbPlan.slices, paths); - } catch { - // DB availability is opportunistic for this accessor; projection is the - // structured fallback. Markdown remains non-executable. 
+ } catch (err) { + if (dbExists(basePath)) { + return blockedResult( + "db-unavailable", + milestoneId, + `.sf/sf.db exists but could not be read: ${err instanceof Error ? err.message : String(err)}. Runtime does not fall back to projections when DB is expected; run sf recover.`, + paths, + ); + } + } + if (dbExists(basePath) && !allowProjectionFallback) { + return blockedResult( + "db-required", + milestoneId, + `.sf/sf.db exists, so projection files are not executable runtime state. Populate DB rows or run sf recover.`, + paths, + ); } try { const projectionPlan = readProjectionPlan(basePath, milestoneId); diff --git a/src/resources/extensions/sf/doctor-engine-checks.js b/src/resources/extensions/sf/doctor-engine-checks.js index 9a73ed11a..c4bcd01f4 100644 --- a/src/resources/extensions/sf/doctor-engine-checks.js +++ b/src/resources/extensions/sf/doctor-engine-checks.js @@ -1,9 +1,18 @@ -import { existsSync, readdirSync, renameSync, rmSync, statSync } from "node:fs"; +import { + existsSync, + readdirSync, + readFileSync, + renameSync, + rmSync, + statSync, +} from "node:fs"; import { join } from "node:path"; import { milestonesDir, resolveMilestoneFile } from "./paths.js"; import { _getAdapter, getAllMilestones, + getMilestoneSlices, + getMilestoneValidationAssessment, isDbAvailable, openDatabase, } from "./sf-db.js"; @@ -12,12 +21,86 @@ import { summarizeParityHealth, writeParityReport, } from "./uok/parity-report.js"; +import { extractVerdict } from "./verdict-parser.js"; import { readEvents } from "./workflow-events.js"; import { renderAllProjections } from "./workflow-projections.js"; const LEGACY_MILESTONE_DIR_RE = /^(M\d+)-.+$/; const LEGACY_SLICE_DIR_RE = /^(S\d+)-.+$/; +function projectionDriftIssues(basePath, milestoneId) { + const issues = []; + const roadmapJsonPath = join( + basePath, + ".sf", + "milestones", + milestoneId, + `${milestoneId}-ROADMAP.json`, + ); + if (roadmapJsonPath && existsSync(roadmapJsonPath)) { + try { + const projected = 
JSON.parse(readFileSync(roadmapJsonPath, "utf-8")); + const dbSlices = getMilestoneSlices(milestoneId).map((slice) => ({ + id: slice.id, + status: slice.status ?? "", + sequence: Number(slice.sequence ?? 0), + })); + const projectedSlices = Array.isArray(projected?.slices) + ? projected.slices.map((slice) => ({ + id: String(slice?.id ?? slice?.sliceId ?? ""), + status: String(slice?.status ?? ""), + sequence: Number(slice?.sequence ?? 0), + })) + : []; + if (JSON.stringify(dbSlices) !== JSON.stringify(projectedSlices)) { + issues.push({ + severity: "warning", + code: "db_projection_roadmap_drift", + scope: "milestone", + unitId: milestoneId, + message: `${milestoneId}-ROADMAP.json differs from DB slice order/status. Projection is export/recovery only; DB remains authoritative.`, + file: roadmapJsonPath, + fixable: false, + }); + } + } catch (err) { + issues.push({ + severity: "warning", + code: "db_projection_roadmap_invalid", + scope: "milestone", + unitId: milestoneId, + message: `${milestoneId}-ROADMAP.json could not be parsed: ${err instanceof Error ? err.message : String(err)}. Projection is not executable runtime state.`, + file: roadmapJsonPath, + fixable: false, + }); + } + } + const validationPath = resolveMilestoneFile( + basePath, + milestoneId, + "VALIDATION", + ); + if (validationPath && existsSync(validationPath)) { + const dbAssessment = getMilestoneValidationAssessment(milestoneId); + if (dbAssessment?.status) { + const fileVerdict = extractVerdict(readFileSync(validationPath, "utf-8")); + const dbVerdict = String(dbAssessment.status).trim().toLowerCase(); + if (fileVerdict && fileVerdict !== dbVerdict) { + issues.push({ + severity: "warning", + code: "db_projection_validation_drift", + scope: "milestone", + unitId: milestoneId, + message: `${milestoneId}-VALIDATION.md verdict "${fileVerdict}" differs from DB assessment "${dbVerdict}". 
DB remains authoritative.`,
+          file: validationPath,
+          fixable: false,
+        });
+      }
+    }
+  }
+  return issues;
+}
+
 function legacyBareId(name, pattern) {
   const match = name.match(pattern);
   return match?.[1] ?? null;
 }
@@ -297,6 +380,16 @@ export async function checkEngineHealth(
       } catch {
         // Non-fatal — duplicate ID check failed
       }
+      // e. DB/projection drift. Projections are useful for humans, exports,
+      // and recovery, but runtime must not treat them as executable state when
+      // the DB is available.
+      try {
+        for (const milestone of getAllMilestones()) {
+          issues.push(...projectionDriftIssues(basePath, milestone.id));
+        }
+      } catch {
+        // Non-fatal — projection drift checks must not block doctor.
+      }
     }
   } catch {
     // Non-fatal — DB constraint checks failed entirely
diff --git a/src/resources/extensions/sf/prompts/gate-evaluate.md b/src/resources/extensions/sf/prompts/gate-evaluate.md
index 583b7c746..28e546a63 100644
--- a/src/resources/extensions/sf/prompts/gate-evaluate.md
+++ b/src/resources/extensions/sf/prompts/gate-evaluate.md
@@ -33,6 +33,12 @@ Gate agents return one of `passed`, `failed`, or `omitted`.
 - `failed` — the gate question was answerable for this slice and the answer is "concern that must be addressed before execution".
 - `omitted` — the gate question is *not applicable* to this slice (e.g., no auth surface for an auth gate; no existing requirements touched for a requirements gate).
+**Gate vs. Task Scope Rubric:**
+
+- GATE concern = blocker to execution (violates architecture, breaks contract, unmet acceptance criterion).
+- TASK detail = implementation note that doesn't block (edge case refinement, testing strategy, performance tuning).
+
+When a gate runs and surfaces a concern, ask: does this prevent honest proof of the milestone's stated success? If yes, it's a failed gate. If it's a nice-to-have improvement but the core criterion is met, the gate passes and the concern becomes backlog hardening.
+ **`omitted` is not a hedge.** It exists for "the question does not apply here", not for "I am not sure". If the gate ran and the question was answerable, the agent must decide `passed` or `failed`. Each `omitted` verdict must include a one-line reason naming what is absent (e.g., "no network calls in this slice"). When the batch returns, scan for `omitted` verdicts without a reason. Treat any unexplained `omitted` as failed-to-decide and re-dispatch that gate with an explicit instruction to pick `passed` or `failed`. diff --git a/src/resources/extensions/sf/roadmap-json-projection.js b/src/resources/extensions/sf/roadmap-json-projection.js index 68309aa74..7db4475ce 100644 --- a/src/resources/extensions/sf/roadmap-json-projection.js +++ b/src/resources/extensions/sf/roadmap-json-projection.js @@ -1,12 +1,13 @@ /** * roadmap-json-projection.js - structured roadmap projection rendering. * - * Purpose: keep dispatch fallback state machine-readable while ROADMAP.md - * remains a human display artifact. + * Purpose: keep DB-origin roadmap state machine-readable for export and + * recovery while ROADMAP.md remains a human display artifact. */ import { mkdirSync } from "node:fs"; import { join } from "node:path"; import { atomicWriteSync } from "./atomic-write.js"; +import { getDbPath } from "./sf-db.js"; function normalizeStringArray(value) { if (!Array.isArray(value)) return []; @@ -62,14 +63,15 @@ function normalizeSlice(sliceRow) { ), isSketch: sliceRow.isSketch === true || sliceRow.is_sketch === 1, sketchScope: String(sliceRow.sketchScope ?? sliceRow.sketch_scope ?? ""), + sequence: Number(sliceRow.sequence ?? 0), }; } /** * Render structured ROADMAP.json content from database rows. * - * Purpose: provide a deterministic fallback projection for dispatch when the - * SQLite database is unavailable. + * Purpose: provide a deterministic DB projection for export/recovery without + * making JSON a peer source of truth for normal runtime dispatch. 
* * Consumer: roadmap renderers invoked by planning and reassessment tools. */ @@ -77,6 +79,9 @@ export function renderRoadmapJsonProjectionContent(milestoneRow, sliceRows) { return JSON.stringify( { schemaVersion: 1, + origin: "db-projection", + generatedAt: new Date().toISOString(), + sourceDbPath: getDbPath() ?? null, milestone: normalizeMilestone(milestoneRow), slices: sliceRows.map(normalizeSlice), }, diff --git a/src/resources/extensions/sf/sf-db.js b/src/resources/extensions/sf/sf-db.js index 54d694587..6289b5e56 100644 --- a/src/resources/extensions/sf/sf-db.js +++ b/src/resources/extensions/sf/sf-db.js @@ -3293,6 +3293,21 @@ export function getAssessment(path) { .get({ ":path": path }); return row ?? null; } +export function getAssessmentByScope(milestoneId, scope) { + if (!currentDb) return null; + const row = currentDb + .prepare( + `SELECT * FROM assessments + WHERE milestone_id = :mid AND scope = :scope + ORDER BY created_at DESC + LIMIT 1`, + ) + .get({ ":mid": milestoneId, ":scope": scope }); + return row ?? 
null; +} +export function getMilestoneValidationAssessment(milestoneId) { + return getAssessmentByScope(milestoneId, "milestone-validation"); +} // ─── Quality Gates ─────────────────────────────────────────────────────── function rowToGate(row) { return { diff --git a/src/resources/extensions/sf/state.js b/src/resources/extensions/sf/state.js index 2c90b4582..76efc7eb3 100644 --- a/src/resources/extensions/sf/state.js +++ b/src/resources/extensions/sf/state.js @@ -32,6 +32,7 @@ import { getAllMilestones, getMilestone, getMilestoneSlices, + getMilestoneValidationAssessment, getPendingGateCountForTurn, getReplanHistory, getSlice, @@ -113,6 +114,35 @@ export function isMilestoneComplete(roadmap) { export function isValidationTerminal(validationContent) { return extractVerdict(validationContent) != null; } +function getDbMilestoneValidationVerdict(milestoneId) { + if (!isDbAvailable()) return undefined; + const assessment = getMilestoneValidationAssessment(milestoneId); + const status = assessment?.status; + return typeof status === "string" && status.trim() + ? status.trim().toLowerCase() + : undefined; +} +async function readMilestoneValidationVerdict(basePath, milestoneId, load) { + const dbVerdict = getDbMilestoneValidationVerdict(milestoneId); + if (dbVerdict) { + return { terminal: true, verdict: dbVerdict }; + } + if (isDbAvailable()) { + return { terminal: false, verdict: undefined, source: "db-missing" }; + } + const validationFile = resolveMilestoneFile( + basePath, + milestoneId, + "VALIDATION", + ); + const validationContent = validationFile ? await load(validationFile) : null; + return { + terminal: validationContent + ? isValidationTerminal(validationContent) + : false, + verdict: validationContent ? 
extractVerdict(validationContent) : undefined, + }; +} const CACHE_TTL_MS = 5000; let _stateCache = null; // ── Telemetry counters for derive-path observability ──────────────────────── @@ -221,19 +251,13 @@ export async function deriveState(basePath) { } if (synced) dbMilestones = getAllMilestones(); } - if (dbMilestones.length > 0) { - const stopDbTimer = debugTime("derive-state-db"); - result = await deriveStateFromDb(basePath); - stopDbTimer({ - phase: result.phase, - milestone: result.activeMilestone?.id, - }); - _telemetry.dbDeriveCount++; - } else { - // DB open but no milestones on disk either — use filesystem path - result = await _deriveStateImpl(basePath); - _telemetry.markdownDeriveCount++; - } + const stopDbTimer = debugTime("derive-state-db"); + result = await deriveStateFromDb(basePath); + stopDbTimer({ + phase: result.phase, + milestone: result.activeMilestone?.id, + }); + _telemetry.dbDeriveCount++; } else { // Only warn when DB initialization was attempted and failed — not when // the DB simply hasn't been opened yet (e.g. during before_agent_start @@ -463,17 +487,8 @@ async function buildRegistryAndFindActive( } } if (allSlicesDone) { - const validationFile = resolveMilestoneFile( - basePath, - m.id, - "VALIDATION", - ); - const validationContent = validationFile - ? await loadFile(validationFile) - : null; - const validationTerminal = validationContent - ? isValidationTerminal(validationContent) - : false; + const { terminal: validationTerminal } = + await readMilestoneValidationVerdict(basePath, m.id, loadFile); // DB-authoritative (#4179): completeness is already decided by // completeMilestoneIds above. 
If we reached this branch, the DB says // the milestone is NOT complete — so any SUMMARY file on disk is an @@ -617,20 +632,12 @@ async function handleAllSlicesDone( milestoneProgress, sliceProgress, ) { - const validationFile = resolveMilestoneFile( - basePath, - activeMilestone.id, - "VALIDATION", - ); - const validationContent = validationFile - ? await loadFile(validationFile) - : null; - const validationTerminal = validationContent - ? isValidationTerminal(validationContent) - : false; - const verdict = validationContent - ? extractVerdict(validationContent) - : undefined; + const { terminal: validationTerminal, verdict } = + await readMilestoneValidationVerdict( + basePath, + activeMilestone.id, + loadFile, + ); if (!validationTerminal || verdict === "needs-remediation") { return { activeMilestone, @@ -1410,16 +1417,8 @@ export async function _deriveStateImpl(basePath) { if (complete) { // All slices done — check validation and summary state const summaryFile = resolveMilestoneFile(basePath, mid, "SUMMARY"); - const validationFile = resolveMilestoneFile(basePath, mid, "VALIDATION"); - const validationContent = validationFile - ? await cachedLoadFile(validationFile) - : null; - const validationTerminal = validationContent - ? isValidationTerminal(validationContent) - : false; - const verdict = validationContent - ? 
extractVerdict(validationContent) - : undefined; + const { terminal: validationTerminal, verdict } = + await readMilestoneValidationVerdict(basePath, mid, cachedLoadFile); // needs-remediation is terminal but requires re-validation (#3596) const needsRevalidation = !validationTerminal || verdict === "needs-remediation"; @@ -1659,20 +1658,12 @@ export async function _deriveStateImpl(basePath) { } // Check if active milestone needs validation or completion (all slices done) if (isMilestoneComplete(activeRoadmap)) { - const validationFile = resolveMilestoneFile( - basePath, - activeMilestone.id, - "VALIDATION", - ); - const validationContent = validationFile - ? await cachedLoadFile(validationFile) - : null; - const validationTerminal = validationContent - ? isValidationTerminal(validationContent) - : false; - const verdict = validationContent - ? extractVerdict(validationContent) - : undefined; + const { terminal: validationTerminal, verdict } = + await readMilestoneValidationVerdict( + basePath, + activeMilestone.id, + cachedLoadFile, + ); const sliceProgress = { done: activeRoadmap.slices.length, total: activeRoadmap.slices.length, diff --git a/src/resources/extensions/sf/tests/autonomous-solver.test.mjs b/src/resources/extensions/sf/tests/autonomous-solver.test.mjs index 42e689815..08c77b3f5 100644 --- a/src/resources/extensions/sf/tests/autonomous-solver.test.mjs +++ b/src/resources/extensions/sf/tests/autonomous-solver.test.mjs @@ -7,7 +7,9 @@ import { appendAutonomousSolverSteering, assessAutonomousSolverTurn, beginAutonomousSolverIteration, + buildAutonomousSolverMissingCheckpointRepairPrompt, buildAutonomousSolverPromptBlock, + classifyAutonomousSolverMissingCheckpointFailure, consumePendingAutonomousSolverSteering, getConfiguredAutonomousSolverMaxIterations, readLatestAutonomousSolverCheckpoint, @@ -121,35 +123,85 @@ describe("autonomous solver", () => { expect(prompt).toContain("/sf autonomous iteration 3 of 12"); 
     expect(prompt).toContain("sf_autonomous_checkpoint");
+    expect(prompt).toContain("Writing SUMMARY.md");
+    expect(prompt).toContain("is not a checkpoint");
+    expect(prompt).toContain("final autonomous action");
     expect(prompt).toContain("Purpose:");
     expect(prompt).toContain("Consumer:");
     expect(prompt).toContain("Failure boundary:");
     expect(prompt).toContain('outcome: "decide"');
   });

-  test("assessAutonomousSolverTurn_missing_checkpoint_retries_once_then_pauses", () => {
+  test("buildAutonomousSolverMissingCheckpointRepairPrompt_rejects_file_substitutes", () => {
+    const prompt = buildAutonomousSolverMissingCheckpointRepairPrompt(
+      { iteration: 2 },
+      "research-slice",
+      "M012/parallel-research",
+    );
+
+    expect(prompt).toContain("actual sf_autonomous_checkpoint tool");
+    expect(prompt).toContain("Do not write a summary file as a substitute");
+    expect(prompt).toContain("tool call succeeds");
+    expect(prompt).toContain("final action");
+  });
+
+  test("buildAutonomousSolverMissingCheckpointRepairPrompt_escalates_to_confidence_gated_decide", () => {
+    const prompt = buildAutonomousSolverMissingCheckpointRepairPrompt(
+      { iteration: 2 },
+      "research-slice",
+      "M012/parallel-research",
+      { classification: "file-substituted-for-checkpoint", summary: "summary" },
+      3,
+      4,
+    );
+
+    expect(prompt).toContain("Repair attempt: 3 of 4");
+    expect(prompt).toContain("confidence");
+    expect(prompt).toContain("0.98");
+    expect(prompt).toContain('outcome="decide"');
+    expect(prompt).toContain("decisionQuestion");
+  });
+
+  test("assessAutonomousSolverTurn_missing_checkpoint_escalates_repairs_then_pauses", () => {
     const project = makeProject();
     beginAutonomousSolverIteration(project, "execute-task", "M001/S01/T01");

-    const first = assessAutonomousSolverTurn(
-      project,
-      "execute-task",
-      "M001/S01/T01",
-    );
-    expect(first.action).toBe("missing-checkpoint-retry");
+    for (let attempt = 1; attempt <= 4; attempt++) {
+      const assessment = assessAutonomousSolverTurn(
+        project,
+        "execute-task",
+        "M001/S01/T01",
+      );
+      expect(assessment.action).toBe("missing-checkpoint-retry");
+      expect(assessment.repairAttempt).toBe(attempt);
+      recordAutonomousSolverMissingCheckpointRetry(
+        project,
+        "execute-task",
+        "M001/S01/T01",
+      );
+    }

-    recordAutonomousSolverMissingCheckpointRetry(
+    const final = assessAutonomousSolverTurn(
       project,
       "execute-task",
       "M001/S01/T01",
     );
-    const second = assessAutonomousSolverTurn(
-      project,
-      "execute-task",
-      "M001/S01/T01",
-    );
-    expect(second.action).toBe("pause");
-    expect(second.reason).toBe("solver-missing-checkpoint");
+    expect(final.action).toBe("pause");
+    expect(final.reason).toBe("solver-missing-checkpoint");
+    expect(final.repairAttempts).toBe(4);
+  });
+
+  test("classifyAutonomousSolverMissingCheckpointFailure_detects_summary_substitute", () => {
+    const diagnosis = classifyAutonomousSolverMissingCheckpointFailure([
+      {
+        role: "assistant",
+        content:
+          "The checkpoint is now saved to .sf/milestones/M012/slices/parallel-research/tasks/T01-SUMMARY.md",
+      },
+    ]);
+
+    expect(diagnosis.classification).toBe("file-substituted-for-checkpoint");
+    expect(diagnosis.summary).toContain("summary");
   });

   test("assessAutonomousSolverTurn_continue_and_blocked_are_authoritative", () => {
diff --git a/src/resources/extensions/sf/tests/canonical-milestone-plan.test.mjs b/src/resources/extensions/sf/tests/canonical-milestone-plan.test.mjs
index f739d6049..177568bb8 100644
--- a/src/resources/extensions/sf/tests/canonical-milestone-plan.test.mjs
+++ b/src/resources/extensions/sf/tests/canonical-milestone-plan.test.mjs
@@ -59,6 +59,16 @@ describe("getCanonicalMilestonePlan", () => {
       risk: "low",
       depends: [],
       demo: "DB slice demo.",
+      sequence: 2,
+    });
+    insertSlice({
+      milestoneId: "M321",
+      id: "S00",
+      title: "Higher priority DB slice",
+      status: "pending",
+      risk: "low",
+      depends: [],
+      demo: "Runs first.",
       sequence: 1,
     });
     writeMilestoneFile(
@@ -73,8 +83,8 @@ describe("getCanonicalMilestonePlan", () => {
       expect(result.safe).toBe(true);
       expect(result.blocked).toBe(false);
       expect(result.source).toBe("db");
-      expect(result.slices).toHaveLength(1);
-      expect(result.slices[0]).toMatchObject({ id: "S01", title: "DB slice" });
+      expect(result.slices).toHaveLength(2);
+      expect(result.slices.map((slice) => slice.id)).toEqual(["S00", "S01"]);
     } finally {
       closeDatabase();
       cleanup(base);
@@ -129,6 +139,44 @@ describe("getCanonicalMilestonePlan", () => {
     }
   });

+  test("db_backed_project_when_db_slices_missing_blocks_projection_fallback", () => {
+    const base = makeTempDir("sf-canonical-plan-");
+    try {
+      mkdirSync(join(base, ".sf"), { recursive: true });
+      openDatabase(join(base, ".sf", "sf.db"));
+      insertMilestone({
+        id: "M324",
+        title: "DB missing slices",
+        status: "active",
+      });
+      writeMilestoneFile(
+        base,
+        "M324",
+        "M324-ROADMAP.json",
+        JSON.stringify(
+          {
+            origin: "db-projection",
+            milestone: { id: "M324", title: "Projected milestone" },
+            slices: [{ id: "S99", title: "Stale projected slice" }],
+          },
+          null,
+          2,
+        ),
+      );
+
+      const result = getCanonicalMilestonePlan(base, "M324");
+
+      expect(result.safe).toBe(false);
+      expect(result.source).toBe("db-missing-slices");
+      expect(result.reason).toContain(
+        "projection files are export/recovery only",
+      );
+    } finally {
+      closeDatabase();
+      cleanup(base);
+    }
+  });
+
   test("markdown_only_when_no_structured_state_returns_unsafe_blocked_result", () => {
     const base = makeTempDir("sf-canonical-plan-");
     try {
diff --git a/src/resources/extensions/sf/tests/doctor-db-projection-drift.test.mjs b/src/resources/extensions/sf/tests/doctor-db-projection-drift.test.mjs
new file mode 100644
index 000000000..790b5f22a
--- /dev/null
+++ b/src/resources/extensions/sf/tests/doctor-db-projection-drift.test.mjs
@@ -0,0 +1,89 @@
+/**
+ * doctor-db-projection-drift.test.mjs - DB/projection drift diagnostics.
+ *
+ * Purpose: prove doctor reports generated-file drift without treating JSON or
+ * Markdown projections as executable runtime truth.
+ */
+import assert from "node:assert/strict";
+import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import { afterEach, test } from "vitest";
+import { checkEngineHealth } from "../doctor-engine-checks.js";
+import {
+  closeDatabase,
+  insertAssessment,
+  insertMilestone,
+  insertSlice,
+  openDatabase,
+} from "../sf-db.js";
+
+const tmpDirs = [];
+
+afterEach(() => {
+  closeDatabase();
+  while (tmpDirs.length > 0) {
+    rmSync(tmpDirs.pop(), { recursive: true, force: true });
+  }
+});
+
+function makeProject() {
+  const dir = mkdtempSync(join(tmpdir(), "sf-db-projection-drift-"));
+  tmpDirs.push(dir);
+  mkdirSync(join(dir, ".sf", "milestones", "M901"), { recursive: true });
+  openDatabase(join(dir, ".sf", "sf.db"));
+  insertMilestone({
+    id: "M901",
+    title: "Projection drift",
+    status: "active",
+  });
+  insertSlice({
+    milestoneId: "M901",
+    id: "S01",
+    title: "DB slice",
+    status: "pending",
+    sequence: 1,
+  });
+  return dir;
+}
+
+test("checkEngineHealth_when_projection_files_disagree_with_db_reports_drift", async () => {
+  const project = makeProject();
+  const milestoneDir = join(project, ".sf", "milestones", "M901");
+  const validationPath = join(milestoneDir, "M901-VALIDATION.md");
+  writeFileSync(
+    join(milestoneDir, "M901-ROADMAP.json"),
+    `${JSON.stringify(
+      {
+        origin: "db-projection",
+        milestone: { id: "M901", title: "Projection drift" },
+        slices: [{ id: "S99", status: "complete", sequence: 99 }],
+      },
+      null,
+      2,
+    )}\n`,
+  );
+  writeFileSync(
+    validationPath,
+    ["---", "verdict: needs-remediation", "---", "", "# stale"].join("\n"),
+  );
+  insertAssessment({
+    path: validationPath,
+    milestoneId: "M901",
+    scope: "milestone-validation",
+    status: "pass",
+    fullContent: ["---", "verdict: pass", "---", "", "# db"].join("\n"),
+  });
+  const issues = [];
+
+  await checkEngineHealth(project, issues, [], () => false);
+
+  assert.equal(
+    issues.some((issue) => issue.code === "db_projection_roadmap_drift"),
+    true,
+  );
+  assert.equal(
+    issues.some((issue) => issue.code === "db_projection_validation_drift"),
+    true,
+  );
+});
diff --git a/src/resources/extensions/sf/tests/milestone-validation-db-authority.test.mjs b/src/resources/extensions/sf/tests/milestone-validation-db-authority.test.mjs
new file mode 100644
index 000000000..be4a6fbfe
--- /dev/null
+++ b/src/resources/extensions/sf/tests/milestone-validation-db-authority.test.mjs
@@ -0,0 +1,132 @@
+/**
+ * milestone-validation-db-authority.test.mjs — validation verdict source of truth.
+ *
+ * Purpose: prove DB-backed projects derive milestone validation phase from the
+ * structured assessment row, with VALIDATION.md only used when DB is unavailable.
+ */
+import assert from "node:assert/strict";
+import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import { afterEach, test } from "vitest";
+import {
+  closeDatabase,
+  insertAssessment,
+  insertMilestone,
+  insertSlice,
+  openDatabase,
+} from "../sf-db.js";
+import { deriveState, invalidateStateCache } from "../state.js";
+
+const tmpDirs = [];
+
+afterEach(() => {
+  closeDatabase();
+  invalidateStateCache();
+  while (tmpDirs.length > 0) {
+    const dir = tmpDirs.pop();
+    if (dir) rmSync(dir, { recursive: true, force: true });
+  }
+});
+
+function makeProject() {
+  const dir = mkdtempSync(join(tmpdir(), "sf-validation-db-authority-"));
+  tmpDirs.push(dir);
+  mkdirSync(join(dir, ".sf", "milestones", "M001"), { recursive: true });
+  openDatabase(join(dir, ".sf", "sf.db"));
+  insertMilestone({
+    id: "M001",
+    title: "Validation DB authority",
+    status: "active",
+  });
+  insertSlice({
+    milestoneId: "M001",
+    id: "S01",
+    title: "Done slice",
+    status: "complete",
+    sequence: 1,
+  });
+  writeFileSync(
+    join(dir, ".sf", "milestones", "M001", "M001-ROADMAP.md"),
+    [
+      "# M001: Validation DB authority",
+      "",
+      "## Slice Overview",
+      "| ID | Slice | Risk | Depends | Done | After this |",
+      "|----|-------|------|---------|------|------------|",
+      "| S01 | Done slice | low | - | ✅ | done |",
+      "",
+    ].join("\n"),
+  );
+  return dir;
+}
+
+test("deriveState_when_db_validation_pass_and_file_stale_completes_validation_phase", async () => {
+  const project = makeProject();
+  const validationPath = join(
+    project,
+    ".sf",
+    "milestones",
+    "M001",
+    "M001-VALIDATION.md",
+  );
+  writeFileSync(
+    validationPath,
+    [
+      "---",
+      "verdict: needs-remediation",
+      "remediation_round: 1",
+      "---",
+      "",
+      "# stale validation projection",
+      "",
+    ].join("\n"),
+  );
+  insertAssessment({
+    path: validationPath,
+    milestoneId: "M001",
+    status: "pass",
+    scope: "milestone-validation",
+    fullContent: [
+      "---",
+      "verdict: pass",
+      "remediation_round: 2",
+      "---",
+      "",
+      "# DB validation pass",
+      "",
+    ].join("\n"),
+  });
+
+  const state = await deriveState(project);
+
+  assert.equal(state.phase, "completing-milestone");
+  assert.equal(state.activeMilestone?.id, "M001");
+});
+
+test("deriveState_when_db_validation_missing_and_file_has_verdict_requires_db_assessment", async () => {
+  const project = makeProject();
+  const validationPath = join(
+    project,
+    ".sf",
+    "milestones",
+    "M001",
+    "M001-VALIDATION.md",
+  );
+  writeFileSync(
+    validationPath,
+    [
+      "---",
+      "verdict: pass",
+      "---",
+      "",
+      "# stale validation projection",
+      "",
+    ].join("\n"),
+  );
+
+  const state = await deriveState(project);
+
+  assert.equal(state.phase, "validating-milestone");
+  assert.equal(state.activeMilestone?.id, "M001");
+});
diff --git a/src/resources/extensions/sf/tests/roadmap-json-projection.test.mjs b/src/resources/extensions/sf/tests/roadmap-json-projection.test.mjs
index 33f8182e7..a9eeeaeb0 100644
--- a/src/resources/extensions/sf/tests/roadmap-json-projection.test.mjs
+++ b/src/resources/extensions/sf/tests/roadmap-json-projection.test.mjs
@@ -36,7 +36,7 @@ function makeProject() {
 }

 describe("roadmap JSON projection", () => {
-  test("renderRoadmapFromDb_writes_structured_projection_for_dispatch_fallback", async () => {
+  test("renderRoadmapFromDb_writes_db_origin_projection_with_sequence", async () => {
     const project = makeProject();
     insertMilestone({
       id: "M451",
@@ -64,15 +64,25 @@ describe("roadmap JSON projection", () => {
     assert.equal(existsSync(result.roadmapJsonPath), true);
     const json = JSON.parse(readFileSync(result.roadmapJsonPath, "utf-8"));
     assert.equal(json.schemaVersion, 1);
+    assert.equal(json.origin, "db-projection");
+    assert.equal(typeof json.generatedAt, "string");
+    assert.equal(json.sourceDbPath, join(project, ".sf", "sf.db"));
     assert.equal(json.milestone.id, "M451");
     assert.deepEqual(
-      json.slices.map((slice) => [slice.id, slice.title, slice.risk]),
-      [["S01", "Projected slice", "medium"]],
+      json.slices.map((slice) => [
+        slice.id,
+        slice.title,
+        slice.risk,
+        slice.sequence,
+      ]),
+      [["S01", "Projected slice", "medium", 1]],
     );

     closeDatabase();
     rmSync(join(project, ".sf", "sf.db"), { force: true });
-    const fallback = getCanonicalMilestonePlan(project, "M451");
+    const fallback = getCanonicalMilestonePlan(project, "M451", {
+      allowProjectionFallback: true,
+    });
     assert.equal(fallback.safe, true);
     assert.equal(fallback.source, "projection");
diff --git a/src/resources/extensions/sf/workflow-manifest.js b/src/resources/extensions/sf/workflow-manifest.js
index 9e129a36e..f5d57e1e9 100644
--- a/src/resources/extensions/sf/workflow-manifest.js
+++ b/src/resources/extensions/sf/workflow-manifest.js
@@ -1,7 +1,12 @@
 import { existsSync, mkdirSync, readFileSync } from "node:fs";
 import { join } from "node:path";
 import { atomicWriteSync } from "./atomic-write.js";
-import { _getAdapter, readTransaction, restoreManifest } from "./sf-db.js";
+import {
+  _getAdapter,
+  getDbPath,
+  readTransaction,
+  restoreManifest,
+} from "./sf-db.js";

 // ─── helpers ─────────────────────────────────────────────────────────────

 /**
@@ -214,6 +219,8 @@ export function snapshotState() {
   }));
   const result = {
     version: 1,
+    origin: "db-projection",
+    source_db_path: getDbPath() ?? null,
     exported_at: new Date().toISOString(),
     milestones,
     slices,