diff --git a/TODO.md b/TODO.md
index f4f863ab4..645f33d0e 100644
--- a/TODO.md
+++ b/TODO.md
@@ -5,131 +5,15 @@ runtime memory or an approved backlog.
 
 ## Untriaged Notes
 
-### Feature Gaps & Limitations (2026-05-06)
-
-#### Critical Path Gaps
-- **Monolithic auto-dispatch/auto-prompts files** (61KB, 123KB) — acknowledged in ADR-001 as needing decomposition. Blocks easy navigation and testing of autonomous dispatch logic.
-- **Extension-provided models incomplete** — Extensions cannot reliably register custom model variants. Model selection system needs refactoring to expose before_model_select hook properly.
-- **No typed environment schema** — SF_* env vars have no runtime validation. Missing config is silent and hard to debug. Need zod/io-ts schema in env.ts.
-
-#### Backlog Features (BUILD_PLAN.md Tier 1-2)
-- **Product-audit phase auto-fire** — Tool callable but PhaseMerge/PhaseComplete dispatch not wired. Manual ports from gsd-2 needed.
-- **Extended config-overlay keys** — Missing context_compact_at, context_hard_limit, unit_timeout_by_phase, max_agents_by_phase, turn_input_required, hot_cache_turns, etc. Users cannot tune critical perf/timeout settings.
-- **Architecture doc auto-update** — No fast-dispatch at phase-end to detect if ARCHITECTURE.md/CONVENTIONS.md/STACK.md drifted. Auto-propose diffs for user approval.
-- **Semantic checkpoint chapters** — No per-turn semantic "chapter" grouping for crash-resume context. Phase transitions inferred but not labeled. Impacts Hindsight recall usefulness.
-- **Custom Anthropic SSE parser** — Still using @anthropic-ai/sdk client.messages.stream(). Should port pi-mono custom SSE parser (~200 LOC, 3 commits) to filter unknown-event + handle proxy events (issue #3708).
-- **Symlinked package dedup** — Selectors/loaders show duplicates when packages/resources/skills/sessions symlinked (dev, CI). Port from pi-mono PR #3818.
-- **Extension API setWorkingVisible()** — ctx.ui.setWorkingVisible() not yet added. Prevents extensions from hiding built-in working-loader; limits TUI customization.
-
-#### Provider Expansion (BUILD_PLAN.md Tier 0.5 - gsd-2 ports)
-- **Cloudflare Workers AI provider** — Not yet in routing list. Ready in pi-mono PR #3851; 1-line port.
-- **Azure Cognitive Services base URL** — Azure OpenAI Responses endpoint support not ported from pi-mono PR #3799.
-- **Local LLM SSE timeout (5-min cutoff)** — Ollama/LM Studio over 5 min hit UND_ERR_BODY_TIMEOUT. Fix available in pi-mono `d0907b6d8` (1 commit).
-- **Bedrock inference profile normalization** — Prompt-caching checks fail on inference profile ARNs. Fix in pi-mono `7c487bb60` (1 commit).
-
-#### Testing Gaps
-- **Coverage thresholds too low** — 40% is acceptable but should be 60%+ for autonomous/critical paths (auto-dispatch, recovery, state machine). Add property-based testing (fast-check) for state transitions.
-- **No end-to-end milestone lifecycle tests** — Missing integration tests covering full milestone flow.
-- **No fault-injection/chaos tests** — Recovery paths (stuck-loop detection, timeout recovery, runaway guards) lack chaos/fault-injection coverage.
-
-#### Minor/Polish
-- **Biome schema version mismatch** — biome.json v2.4.13 vs CLI v2.4.14. Run `biome migrate`.
-- **MCP package completeness unclear** — docs reference mcp-server but completeness unknown. Verify packages/mcp-server/ is production-ready and document status.
-- **Headless assistant-text preview deferred** — Notification categorization done; assistantTextBuffer/thinkingBuffer separation incomplete (see headless.ts).
-
-#### gsd-2 Safety/Correctness Ports (BUILD_PLAN.md Tier 0.5)
-- **Bash evidence persistence race** — Close mid-unit re-dispatch race (gsd-2 `da7dd56e7`, PR #5056→#5058). Bash tool calls can lose evidence between dispatch and re-dispatch.
-- **Project-controlled surface hardening** — Full fix supersedes partial cherry-pick at `66ff949c1` (gsd-2 `65ca5aa2e`).
-- **Web_search injection narrowing** — Only inject web_search when provider accepts it (gsd-2 `4370bedf3`).
-- **Symlinked .sf staging self-heal** — Data-loss prevention when staging dir is symlink (gsd-2 `9340f1e9b`, PR #4423).
-- **Milestone KNOWLEDGE injection budgeting** — Prevents scope knowledge from blowing context budget (gsd-2 `58d3d4d6c`, PR #4721).
-- **MCP-server stdout-buffer deadlock** — Large-output MCP tools could hang (gsd-2 `bb747ec57`).
-- **Workflow state machine race protection** — Session transitions during agent_end, idle-wait optimization (gsd-2 commits `71114fccf`, `6d7e4ccb5`, `c162c44bf`, `e3bd04551`).
-- **Claude-Code CLI Always-Allow persistence** — Grant persistence for non-Bash tools (gsd-2 `a88baeae9`, PR #5096).
-
-### UOK Self-Evolution Research (2026-05-06)
-
-#### Overview
-Research into whether SF's UOK (Unified Operation Kernel) is best-in-class for a self-evolving coder agent. Full research report: see session research folder.
-
-**Verdict:** UOK is excellent for deterministic autonomous dispatch (beats typical LLM agents, rivals enterprise orchestrators) but only 60-70% complete for true self-evolution. Learning infrastructure exists but isn't actively used.
-
-#### Critical Finding: Documentation Gap
-The implementation has **10+ undocumented features** not explained in ARCHITECTURE.md:
-- Five-phase state machine with error recovery paths (PhaseDiscuss → PhasePlan → PhaseExecute → PhaseMerge → PhaseComplete)
-- Gate-runner verdict semantics (passed/failed/omitted) + re-dispatch rules
-- Outcome learning for model selection (Bayesian blending)
-- Stuck-loop detection with recovery thresholds
-- Evidence persistence and audit trails
-- Sophisticated edge-case handling not documented
-
-**Action item:** Update ARCHITECTURE.md with full state machine diagram, gate semantics, and recovery paths.
-
-#### Self-Evolution Status
-
-**Infrastructure that works:**
-- ✅ Self-report collection (sf_self_report captures anomalies during dispatch/validation)
-- ✅ Outcome learning (Bayesian model selection per task-type)
-- ✅ Knowledge compounding (KNOWLEDGE.md with judgment-log entries)
-- ✅ Gate-based pattern detection (gates can detect repeated failures)
-
-**Feedback loop that's missing:**
-- ❌ Triage pipeline — self-reports collected but not processed into fixes
-- ❌ Continuous model tuning — learning exists but infrequent, not aggressive
-- ❌ Automated knowledge injection — knowledge exists but not used in prompts
-- ❌ Cross-gate pattern aggregation — gates run independently, don't see patterns
-- ❌ Adaptive thresholds — all timeouts hardcoded, not data-driven
-- ❌ Hypothesis testing — no A/B test framework for improvements
-- ❌ Regression detection — no metrics monitoring for quality drift
-
-#### Top 3 Improvements (Quick Wins)
-
-1. **Close self-report feedback loop** [9/10 impact, 4/10 effort, 2-3 days]
-   - Auto-triage self-reports, create work items for fixes, promote high-confidence improvements to code
-   - **Why:** Reports are collected but ignored; this closes the feedback loop
-   - **Implementation:** Extend commands-todo.js triage logic to process sf_self_report events
-
-2. **Activate continuous model learning** [8/10 impact, 5/10 effort, 3-4 days]
-   - Track model success/failure per task type + latency + cost; auto-demote failing models; A/B test new models on low-risk tasks
-   - **Why:** Learning exists but is dormant; this makes dispatch adaptive
-   - **Implementation:** Enhance benchmark-selector.ts + model-router logic with aggressive per-task-type tracking
-
-3. **Automate knowledge injection** [7/10 impact, 4/10 effort, 2-3 days]
-   - Auto-query KNOWLEDGE.md for relevance during milestone planning; inject high-confidence learnings; flag contradictions
-   - **Why:** Knowledge exists but isn't used; this makes it actionable
-   - **Implementation:** Add to auto-prompts.js knowledge-injection stage; use semantic similarity scoring
-
-**Quick-win total:** ~8-10 days for high-leverage improvements that activate the learning loop.
-
-#### Additional Improvements (Medium-Term, 1-2 Months)
-
-4. **Continuous gate pattern aggregation** [8/10 impact, 6/10 effort, 3-4 days] — After each phase, detect common gate failure themes across all gates; aggregate into consolidated self-reports; suggest architectural fixes.
-
-5. **Adaptive timeout tuning** [7/10 impact, 6/10 effort, 3-4 days] — Replace hardcoded timeouts with data-driven values based on task execution history; auto-adjust per task-type.
-
-6. **Hypothesis testing framework** [9/10 impact, 7/10 effort, 4-5 days] — A/B test improvements on low-stakes tasks; roll back if they introduce regressions; never ship untested changes.
-
-7. **Cross-milestone federated learning** [8/10 impact, 9/10 effort, 8-10 days] — Share generalizable learnings across projects (same org); test on similar projects first; expand based on results.
-
-8. **Regression detection & prevention** [7/10 impact, 8/10 effort, 5-6 days] — Track key metrics (success rate, latency, cost, gate failures) across milestones; alert on regressions; auto-rollback bad changes.
-
-9. **Semantic drift detection** [6/10 impact, 7/10 effort, 4-5 days] — Detect when prompts/gate logic have drifted from original intent; file alerts; suggest reverting or documenting.
-
-10. **Self-hosted telemetry & profiling** [5/10 impact, 8/10 effort, 4-5 days] — When SF runs on itself (dogfooding), profile which phases/gates/model-selections take longest; prioritize optimizations.
-
-**Medium-term total:** ~25-30 days for comprehensive self-evolution roadmap.
-
-#### Documentation That Should Be Updated
-
-- [ ] ARCHITECTURE.md — Full state machine diagram with phase transitions and error recovery paths
-- [ ] docs/dev/ADR-* — Document gate verdict semantics (passed/failed/omitted) and re-dispatch behavior
-- [ ] User docs — Explain outcome learning, model selection tuning, knowledge compounding workflow
-- [ ] Runbook — Stuck-loop detection, timeout adjustment, recovery paths
-- [ ] Design guide — Best practices for implementing custom gates with pattern detection
-- [ ] New section: "Self-Evolution Architecture" explaining feedback loops, learning mechanisms, and how to extend them
+No untriaged notes. Add raw dumps here temporarily, then promote them to
+`docs/plans/`, `docs/adr/`, `docs/specs/`, or another durable project artifact
+before starting implementation.
 
 ## Processed Notes
 
 - SF auto-loop hardening was converted into SF milestone state on 2026-05-02.
   Continue from `.sf/STATE.md`, `.sf/milestones/M013/`, and
   `.sf/milestones/M014/` instead of reusing the old raw dump.
+- Feature gaps and UOK self-evolution research from 2026-05-06 were triaged on
+  2026-05-06 into `docs/plans/todo-triage-2026-05-06-plan.md` and existing
+  durable docs such as `docs/dev/UOK-SELF-EVOLUTION.md`.
diff --git a/docs/FRONTEND.md b/docs/FRONTEND.md
index 0a293d633..9be7ee574 100644
--- a/docs/FRONTEND.md
+++ b/docs/FRONTEND.md
@@ -1,4 +1,4 @@
-
+
 # Frontend
 
 Record frontend architecture, component ownership, accessibility constraints, and browser support here.
diff --git a/docs/RECORDS_KEEPER.md b/docs/RECORDS_KEEPER.md
index d3992aa83..83126bc66 100644
--- a/docs/RECORDS_KEEPER.md
+++ b/docs/RECORDS_KEEPER.md
@@ -1,4 +1,4 @@
-
+
 # Records Keeper
 
 The records keeper keeps repo memory ordered after meaningful changes. Run this checklist at milestone close, after architecture changes, after product behavior changes, and whenever docs/source disagree.
diff --git a/docs/generated/db-schema.md b/docs/generated/db-schema.md
index 7de63e6ac..f79294a01 100644
--- a/docs/generated/db-schema.md
+++ b/docs/generated/db-schema.md
@@ -1,4 +1,4 @@
-
+
 # Database Schema
 
 Generated or refreshed schema notes belong here. Do not hand-maintain stale schema copies.
diff --git a/docs/plans/todo-triage-2026-05-06-plan.md b/docs/plans/todo-triage-2026-05-06-plan.md
new file mode 100644
index 000000000..5c2455ea5
--- /dev/null
+++ b/docs/plans/todo-triage-2026-05-06-plan.md
@@ -0,0 +1,54 @@
+# TODO Inbox Triage Plan — 2026-05-06
+
+## Summary
+
+Root `TODO.md` is a raw dump inbox, not a roadmap. The 2026-05-06 dump has been promoted into this durable plan and cross-referenced with existing roadmap/design documents. Future agents should use this plan and the referenced docs instead of treating the old raw dump as instruction.
+
+## Existing Durable Homes
+
+These raw notes already had a suitable home and should be continued there:
+
+| Raw note | Durable home | Disposition |
+|---|---|---|
+| Product-audit phase auto-fire | `BUILD_PLAN.md` Tier 1+ active follow-up | Existing roadmap item |
+| Extended config-overlay keys | `BUILD_PLAN.md` Tier 1.4 | Existing v3 blocker |
+| Architecture doc auto-update | `BUILD_PLAN.md` Tier 2.2 | Existing strong item |
+| Semantic checkpoint chapters | `BUILD_PLAN.md` Tier 2.3 | Existing v3.1 item |
+| Custom Anthropic SSE parser | `BUILD_PLAN.md` Tier 0 | Existing deferred port |
+| Symlinked package dedup | `BUILD_PLAN.md` Tier 0 | Existing port item |
+| Extension API `setWorkingVisible()` | `BUILD_PLAN.md` Tier 0 | Existing port item |
+| Cloudflare Workers AI provider | `BUILD_PLAN.md` Tier 0 | Existing provider item |
+| Azure Cognitive Services base URL | `BUILD_PLAN.md` Tier 0 | Existing provider item |
+| Local LLM SSE timeout | `BUILD_PLAN.md` Tier 0 | Already marked done |
+| Bedrock inference profile normalization | `BUILD_PLAN.md` Tier 0 | Already marked done |
+| gsd-2 safety/correctness ports | `BUILD_PLAN.md` Tier 0.5 | Existing critical-port list |
+| Self-report feedback loop | `docs/dev/UOK-SELF-EVOLUTION.md` quick win 1 | Existing self-evolution plan |
+| Continuous model learning | `docs/dev/UOK-SELF-EVOLUTION.md` quick win 2 | Existing self-evolution plan |
+| Automated knowledge injection | `docs/dev/UOK-SELF-EVOLUTION.md` quick win 3 | Existing self-evolution plan |
+| Gate pattern aggregation, adaptive thresholds, hypothesis testing, regression detection | `docs/dev/UOK-SELF-EVOLUTION.md` medium-term roadmap | Existing self-evolution plan |
+
+## Newly Promoted Roadmap Items
+
+These were not clearly represented as durable roadmap items and should be planned as slices before implementation:
+
+| Item | Why | Suggested tier | Implementation note |
+|---|---|---|---|
+| Typed SF environment schema | `SF_*` env vars should fail early with actionable diagnostics instead of late runtime surprises. | Tier 1 | Add an SF-owned env schema module and route startup/tool validation through it. |
+| Autonomous-path coverage ratchet | Global coverage thresholds are too broad; autonomous/recovery paths need higher targeted confidence. | Tier 2 | Start with file-family thresholds or focused test suites for dispatch, recovery, UOK runtime, and validation. |
+| End-to-end milestone lifecycle tests | DB-only runtime state needs integration proof across plan, execute, validate, and complete. | Tier 2 | Add a minimal lifecycle fixture that exercises DB rows as executable truth. |
+| Fault-injection recovery tests | Stuck-loop, timeout, runaway, stale lock, and projection drift recovery are high-risk paths. | Tier 2 | Add deterministic fault fixtures before adding broader chaos coverage. |
+| MCP package completeness audit | Docs mention MCP surfaces, but production completeness is unclear. | Tier 2 | Inspect `packages/mcp-server/`, record supported contracts, gaps, and deferred work. |
+| Biome schema version cleanup | Tooling drift creates noisy lint/config failures. | Tier 3 | Run `biome migrate` as a focused tooling cleanup. |
+| Headless assistant-text preview completion | Prior headless work deferred buffer separation. | Tier 2 | Finish `assistantTextBuffer` / `thinkingBuffer` separation and preview flushing. |
+
+## Explicitly Deferred
+
+| Item | Decision |
+|---|---|
+| `auto-dispatch.js` / `auto-prompts.js` decomposition | Known design debt, but explicitly out of scope until requested as a dedicated refactor. Do not start it while fixing DB authority, UOK safety, or roadmap triage. |
+
+## Acceptance Criteria
+
+- `TODO.md` contains no untriaged raw notes.
+- New work starts from this plan, `BUILD_PLAN.md`, or `docs/dev/UOK-SELF-EVOLUTION.md`, not from deleted raw dump text.
+- Items that need implementation are converted into SF milestone/slice/task state before code changes begin.
diff --git a/docs/product-specs/index.md b/docs/product-specs/index.md
index 73d9c544e..b7abe0bbf 100644
--- a/docs/product-specs/index.md
+++ b/docs/product-specs/index.md
@@ -1,4 +1,4 @@
-
+
 # Product Specs
 
 Durable user-facing behavior, workflows, and product decisions live here.
diff --git a/docs/references/design-system-reference-llms.txt b/docs/references/design-system-reference-llms.txt
index 412ecec16..8ae16d2b8 100644
--- a/docs/references/design-system-reference-llms.txt
+++ b/docs/references/design-system-reference-llms.txt
@@ -1,2 +1,2 @@
-
+
 Reference slot for design-system guidance intended for LLM consumption.
diff --git a/docs/references/nixpacks-llms.txt b/docs/references/nixpacks-llms.txt
index 1c0e4e8fa..1f201b6f9 100644
--- a/docs/references/nixpacks-llms.txt
+++ b/docs/references/nixpacks-llms.txt
@@ -1,2 +1,2 @@
-
+
 Reference slot for Nixpacks deployment/build guidance intended for LLM consumption.
diff --git a/docs/references/uv-llms.txt b/docs/references/uv-llms.txt
index f81049ee3..8d72d0836 100644
--- a/docs/references/uv-llms.txt
+++ b/docs/references/uv-llms.txt
@@ -1,2 +1,2 @@
-
+
 Reference slot for uv/Python tooling guidance intended for LLM consumption.
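
Reviewer note on the "Typed SF environment schema" row in the new triage plan: the plan asks that `SF_*` env vars fail early with actionable diagnostics, but does not say what the schema module looks like. The following is only a minimal sketch of that idea, written without any schema library. Every name in it (`EnvSpec`, `validateEnv`, `SF_EXAMPLE_TIMEOUT_MS`, `SF_EXAMPLE_LOG_LEVEL`) is hypothetical and does not come from the SF codebase.

```typescript
// Hypothetical sketch: none of these names exist in the SF repository.
type EnvSpec = {
  required: boolean;
  // Converts the raw string into a typed value; throws on invalid input.
  parse: (raw: string) => unknown;
};

function parsePositiveInt(raw: string): number {
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`expected a positive integer, got "${raw}"`);
  }
  return n;
}

function parseLogLevel(raw: string): string {
  const allowed = ["debug", "info", "warn", "error"];
  if (!allowed.includes(raw)) {
    throw new Error(`expected one of ${allowed.join(", ")}, got "${raw}"`);
  }
  return raw;
}

// Illustrative variables only; the plan does not list real SF_* names.
const exampleSchema: Record<string, EnvSpec> = {
  SF_EXAMPLE_TIMEOUT_MS: { required: true, parse: parsePositiveInt },
  SF_EXAMPLE_LOG_LEVEL: { required: false, parse: parseLogLevel },
};

// Validates every declared variable and collects all problems at once,
// so a single startup failure can report every misconfigured variable.
function validateEnv(
  env: Record<string, string | undefined>,
  schema: Record<string, EnvSpec> = exampleSchema,
): { values: Record<string, unknown>; errors: string[] } {
  const values: Record<string, unknown> = {};
  const errors: string[] = [];
  for (const [name, spec] of Object.entries(schema)) {
    const raw = env[name];
    if (raw === undefined || raw === "") {
      if (spec.required) errors.push(`${name}: missing required variable`);
      continue;
    }
    try {
      values[name] = spec.parse(raw);
    } catch (e) {
      const message = e instanceof Error ? e.message : String(e);
      errors.push(`${name}: ${message}`);
    }
  }
  return { values, errors };
}
```

At startup a caller would presumably pass `process.env` and abort with the joined `errors` list when it is non-empty. Collecting every error instead of throwing on the first one is a design choice aimed at the "actionable diagnostics" wording in the plan; a real implementation could equally build on zod or io-ts, as the deleted raw note suggested.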