sf snapshot: uncommitted changes after 321m inactivity

Mikael Hugo 2026-05-06 21:53:05 +02:00
parent 48fb05aad8
commit 8fd59e156d
9 changed files with 67 additions and 129 deletions

TODO.md

@@ -5,131 +5,15 @@ runtime memory or an approved backlog.
## Untriaged Notes
### Feature Gaps & Limitations (2026-05-06)
#### Critical Path Gaps
- **Monolithic auto-dispatch/auto-prompts files** (61KB, 123KB) — acknowledged in ADR-001 as needing decomposition. Blocks easy navigation and testing of autonomous dispatch logic.
- **Extension-provided models incomplete** — Extensions cannot reliably register custom model variants. Model selection system needs refactoring to expose before_model_select hook properly.
- **No typed environment schema** — SF_* env vars have no runtime validation. Missing config is silent and hard to debug. Need zod/io-ts schema in env.ts.
#### Backlog Features (BUILD_PLAN.md Tier 1-2)
- **Product-audit phase auto-fire** — Tool callable but PhaseMerge/PhaseComplete dispatch not wired. Manual ports from gsd-2 needed.
- **Extended config-overlay keys** — Missing context_compact_at, context_hard_limit, unit_timeout_by_phase, max_agents_by_phase, turn_input_required, hot_cache_turns, etc. Users cannot tune critical perf/timeout settings.
- **Architecture doc auto-update** — No fast-dispatch at phase-end to detect if ARCHITECTURE.md/CONVENTIONS.md/STACK.md drifted. Auto-propose diffs for user approval.
- **Semantic checkpoint chapters** — No per-turn semantic "chapter" grouping for crash-resume context. Phase transitions inferred but not labeled. Impacts Hindsight recall usefulness.
- **Custom Anthropic SSE parser** — Still using @anthropic-ai/sdk client.messages.stream(). Should port pi-mono custom SSE parser (~200 LOC, 3 commits) to filter unknown-event + handle proxy events (issue #3708).
- **Symlinked package dedup** — Selectors/loaders show duplicates when packages/resources/skills/sessions symlinked (dev, CI). Port from pi-mono PR #3818.
- **Extension API setWorkingVisible()** — ctx.ui.setWorkingVisible() not yet added. Prevents extensions from hiding built-in working-loader; limits TUI customization.
#### Provider Expansion (BUILD_PLAN.md Tier 0.5 - gsd-2 ports)
- **Cloudflare Workers AI provider** — Not yet in routing list. Ready in pi-mono PR #3851; 1-line port.
- **Azure Cognitive Services base URL** — Azure OpenAI Responses endpoint support not ported from pi-mono PR #3799.
- **Local LLM SSE timeout (5-min cutoff)** — Ollama/LM Studio over 5 min hit UND_ERR_BODY_TIMEOUT. Fix available in pi-mono `d0907b6d8` (1 commit).
- **Bedrock inference profile normalization** — Prompt-caching checks fail on inference profile ARNs. Fix in pi-mono `7c487bb60` (1 commit).
#### Testing Gaps
- **Coverage thresholds too low** — 40% is acceptable but should be 60%+ for autonomous/critical paths (auto-dispatch, recovery, state machine). Add property-based testing (fast-check) for state transitions.
- **No end-to-end milestone lifecycle tests** — Missing integration tests covering full milestone flow.
- **No fault-injection/chaos tests** — Recovery paths (stuck-loop detection, timeout recovery, runaway guards) lack chaos/fault-injection coverage.
#### Minor/Polish
- **Biome schema version mismatch** — biome.json v2.4.13 vs CLI v2.4.14. Run `biome migrate`.
- **MCP package completeness unclear** — docs reference mcp-server but completeness unknown. Verify packages/mcp-server/ is production-ready and document status.
- **Headless assistant-text preview deferred** — Notification categorization done; assistantTextBuffer/thinkingBuffer separation incomplete (see headless.ts).
#### gsd-2 Safety/Correctness Ports (BUILD_PLAN.md Tier 0.5)
- **Bash evidence persistence race** — Close mid-unit re-dispatch race (gsd-2 `da7dd56e7`, PRs #5056, #5058). Bash tool calls can lose evidence between dispatch and re-dispatch.
- **Project-controlled surface hardening** — Full fix supersedes partial cherry-pick at `66ff949c1` (gsd-2 `65ca5aa2e`).
- **Web_search injection narrowing** — Only inject web_search when provider accepts it (gsd-2 `4370bedf3`).
- **Symlinked .sf staging self-heal** — Data-loss prevention when staging dir is symlink (gsd-2 `9340f1e9b`, PR #4423).
- **Milestone KNOWLEDGE injection budgeting** — Prevents scope knowledge from blowing context budget (gsd-2 `58d3d4d6c`, PR #4721).
- **MCP-server stdout-buffer deadlock** — Large-output MCP tools could hang (gsd-2 `bb747ec57`).
- **Workflow state machine race protection** — Session transitions during agent_end, idle-wait optimization (gsd-2 commits `71114fccf`, `6d7e4ccb5`, `c162c44bf`, `e3bd04551`).
- **Claude-Code CLI Always-Allow persistence** — Grant persistence for non-Bash tools (gsd-2 `a88baeae9`, PR #5096).
### UOK Self-Evolution Research (2026-05-06)
#### Overview
Research into whether SF's UOK (Unified Operation Kernel) is best-in-class for a self-evolving coder agent. Full research report: see session research folder.
**Verdict:** UOK is excellent for deterministic autonomous dispatch (beats typical LLM agents, rivals enterprise orchestrators) but only 60-70% complete for true self-evolution. Learning infrastructure exists but isn't actively used.
#### Critical Finding: Documentation Gap
The implementation has **10+ undocumented features** not explained in ARCHITECTURE.md:
- Five-phase state machine with error recovery paths (PhaseDiscuss → PhasePlan → PhaseExecute → PhaseMerge → PhaseComplete)
- Gate-runner verdict semantics (passed/failed/omitted) + re-dispatch rules
- Outcome learning for model selection (Bayesian blending)
- Stuck-loop detection with recovery thresholds
- Evidence persistence and audit trails
- Sophisticated edge-case handling not documented
**Action item:** Update ARCHITECTURE.md with full state machine diagram, gate semantics, and recovery paths.
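
To make the "Bayesian blending" item above concrete, here is a minimal sketch. It is not SF's actual implementation. It assumes blending means combining a prior success rate with observed outcomes through a Beta-Binomial posterior mean. The names `ModelStats`, `blendedSuccessRate`, `pickModel`, and the `priorStrength` parameter are hypothetical.

```typescript
// Hypothetical sketch: these types and names are illustrative assumptions,
// not identifiers taken from the SF codebase.
interface ModelStats {
  model: string;
  priorSuccessRate: number; // belief before any outcomes, in [0, 1]
  successes: number;        // observed successes for one task type
  attempts: number;         // observed attempts for one task type
}

// Posterior mean of a Beta prior updated with binomial outcomes.
// priorStrength acts as a pseudo-count: larger values trust the prior longer.
function blendedSuccessRate(stats: ModelStats, priorStrength = 10): number {
  return (
    (priorStrength * stats.priorSuccessRate + stats.successes) /
    (priorStrength + stats.attempts)
  );
}

// Pick the candidate with the highest blended success rate.
function pickModel(candidates: ModelStats[], priorStrength = 10): ModelStats {
  if (candidates.length === 0) {
    throw new Error("no candidate models");
  }
  return candidates.reduce((best, candidate) =>
    blendedSuccessRate(candidate, priorStrength) >
    blendedSuccessRate(best, priorStrength)
      ? candidate
      : best,
  );
}
```

With few attempts the prior dominates, so a new model is not demoted after one bad run. As attempts accumulate, the observed rate takes over.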
#### Self-Evolution Status
**Infrastructure that works:**
- ✅ Self-report collection (sf_self_report captures anomalies during dispatch/validation)
- ✅ Outcome learning (Bayesian model selection per task-type)
- ✅ Knowledge compounding (KNOWLEDGE.md with judgment-log entries)
- ✅ Gate-based pattern detection (gates can detect repeated failures)
**Feedback loop that's missing:**
- ❌ Triage pipeline — self-reports collected but not processed into fixes
- ❌ Continuous model tuning — learning exists but infrequent, not aggressive
- ❌ Automated knowledge injection — knowledge exists but not used in prompts
- ❌ Cross-gate pattern aggregation — gates run independently, don't see patterns
- ❌ Adaptive thresholds — all timeouts hardcoded, not data-driven
- ❌ Hypothesis testing — no A/B test framework for improvements
- ❌ Regression detection — no metrics monitoring for quality drift
#### Top 3 Improvements (Quick Wins)
1. **Close self-report feedback loop** [9/10 impact, 4/10 effort, 2-3 days]
- Auto-triage self-reports, create work items for fixes, promote high-confidence improvements to code
- **Why:** Reports are collected but ignored; this closes the feedback loop
- **Implementation:** Extend commands-todo.js triage logic to process sf_self_report events
2. **Activate continuous model learning** [8/10 impact, 5/10 effort, 3-4 days]
- Track model success/failure per task type + latency + cost; auto-demote failing models; A/B test new models on low-risk tasks
- **Why:** Learning exists but is dormant; this makes dispatch adaptive
- **Implementation:** Enhance benchmark-selector.ts + model-router logic with aggressive per-task-type tracking
3. **Automate knowledge injection** [7/10 impact, 4/10 effort, 2-3 days]
- Auto-query KNOWLEDGE.md for relevance during milestone planning; inject high-confidence learnings; flag contradictions
- **Why:** Knowledge exists but isn't used; this makes it actionable
- **Implementation:** Add to auto-prompts.js knowledge-injection stage; use semantic similarity scoring
**Quick-win total:** ~8-10 days for high-leverage improvements that activate the learning loop.
#### Additional Improvements (Medium-Term, 1-2 Months)
4. **Continuous gate pattern aggregation** [8/10 impact, 6/10 effort, 3-4 days] — After each phase, detect common gate failure themes across all gates; aggregate into consolidated self-reports; suggest architectural fixes.
5. **Adaptive timeout tuning** [7/10 impact, 6/10 effort, 3-4 days] — Replace hardcoded timeouts with data-driven values based on task execution history; auto-adjust per task-type.
6. **Hypothesis testing framework** [9/10 impact, 7/10 effort, 4-5 days] — A/B test improvements on low-stakes tasks; roll back if they introduce regressions; never ship untested changes.
7. **Cross-milestone federated learning** [8/10 impact, 9/10 effort, 8-10 days] — Share generalizable learnings across projects (same org); test on similar projects first; expand based on results.
8. **Regression detection & prevention** [7/10 impact, 8/10 effort, 5-6 days] — Track key metrics (success rate, latency, cost, gate failures) across milestones; alert on regressions; auto-rollback bad changes.
9. **Semantic drift detection** [6/10 impact, 7/10 effort, 4-5 days] — Detect when prompts/gate logic have drifted from original intent; file alerts; suggest reverting or documenting.
10. **Self-hosted telemetry & profiling** [5/10 impact, 8/10 effort, 4-5 days] — When SF runs on itself (dogfooding), profile which phases/gates/model-selections take longest; prioritize optimizations.
**Medium-term total:** ~25-30 days for comprehensive self-evolution roadmap.
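
The adaptive timeout item (5) could be approached along these lines. This is a sketch under stated assumptions, not an existing SF API: the `TimeoutPolicy` shape, its field names, and the nearest-rank percentile choice are all hypothetical.

```typescript
// Hypothetical sketch: the policy fields and function name are assumptions
// made for illustration only.
interface TimeoutPolicy {
  defaultMs: number;   // used until enough history exists
  minSamples: number;  // minimum history size before adapting
  percentile: number;  // e.g. 0.95 for p95
  headroom: number;    // multiplier applied on top of the percentile
  floorMs: number;     // never go below this
  ceilingMs: number;   // never go above this
}

function adaptiveTimeoutMs(
  historyMs: readonly number[],
  policy: TimeoutPolicy,
): number {
  if (historyMs.length < policy.minSamples) {
    return policy.defaultMs;
  }
  // Nearest-rank percentile over a sorted copy of the history.
  const sorted = historyMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil(policy.percentile * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  const observed = sorted[index] ?? policy.defaultMs;
  const proposed = Math.round(observed * policy.headroom);
  // Clamp so one outlier cannot produce an unbounded timeout.
  return Math.min(policy.ceilingMs, Math.max(policy.floorMs, proposed));
}
```

Keeping a floor, a ceiling, and a minimum sample count means the data-driven value only replaces the hardcoded default once there is enough history to justify it.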
#### Documentation That Should Be Updated
- [ ] ARCHITECTURE.md — Full state machine diagram with phase transitions and error recovery paths
- [ ] docs/dev/ADR-* — Document gate verdict semantics (passed/failed/omitted) and re-dispatch behavior
- [ ] User docs — Explain outcome learning, model selection tuning, knowledge compounding workflow
- [ ] Runbook — Stuck-loop detection, timeout adjustment, recovery paths
- [ ] Design guide — Best practices for implementing custom gates with pattern detection
- [ ] New section: "Self-Evolution Architecture" explaining feedback loops, learning mechanisms, and how to extend them
No untriaged notes. Add raw dumps here temporarily, then promote them to
`docs/plans/`, `docs/adr/`, `docs/specs/`, or another durable project artifact
before starting implementation.
## Processed Notes
- SF auto-loop hardening was converted into SF milestone state on 2026-05-02.
Continue from `.sf/STATE.md`, `.sf/milestones/M013/`, and
`.sf/milestones/M014/` instead of reusing the old raw dump.
- Feature gaps and UOK self-evolution research from 2026-05-06 were triaged on
2026-05-06 into `docs/plans/todo-triage-2026-05-06-plan.md` and existing
durable docs such as `docs/dev/UOK-SELF-EVOLUTION.md`.

@@ -1,4 +1,4 @@
<!-- sf-doc: version=2.75.3 template=docs/FRONTEND.md state=pending hash=sha256:03087953d690c9902d35297720d1482262c1610e3050084f891db3be711571ef -->
<!-- sf-doc: version=0.0.0 template=docs/FRONTEND.md state=pending hash=sha256:03087953d690c9902d35297720d1482262c1610e3050084f891db3be711571ef -->
# Frontend
Record frontend architecture, component ownership, accessibility constraints, and browser support here.

@@ -1,4 +1,4 @@
<!-- sf-doc: version=2.75.3 template=docs/RECORDS_KEEPER.md state=pending hash=sha256:3872de9cd72bd9129814a5e77e3b86abe76bef33f3ca34e04ae7582b4cfd066a -->
<!-- sf-doc: version=0.0.0 template=docs/RECORDS_KEEPER.md state=pending hash=sha256:3872de9cd72bd9129814a5e77e3b86abe76bef33f3ca34e04ae7582b4cfd066a -->
# Records Keeper
The records keeper keeps repo memory ordered after meaningful changes. Run this checklist at milestone close, after architecture changes, after product behavior changes, and whenever docs/source disagree.

@@ -1,4 +1,4 @@
<!-- sf-doc: version=2.75.3 template=docs/generated/db-schema.md state=pending hash=sha256:8488a607c1a2981654a3b030600d2e10627d132ebd0c75700648a08dede93368 -->
<!-- sf-doc: version=0.0.0 template=docs/generated/db-schema.md state=pending hash=sha256:8488a607c1a2981654a3b030600d2e10627d132ebd0c75700648a08dede93368 -->
# Database Schema
Generated or refreshed schema notes belong here. Do not hand-maintain stale schema copies.

@@ -0,0 +1,54 @@
# TODO Inbox Triage Plan — 2026-05-06
## Summary
Root `TODO.md` is a raw dump inbox, not a roadmap. The 2026-05-06 dump has been promoted into this durable plan and cross-referenced with existing roadmap/design documents. Future agents should use this plan and the referenced docs instead of treating the old raw dump as instruction.
## Existing Durable Homes
These raw notes already had a suitable home and should be continued there:
| Raw note | Durable home | Disposition |
|---|---|---|
| Product-audit phase auto-fire | `BUILD_PLAN.md` Tier 1+ active follow-up | Existing roadmap item |
| Extended config-overlay keys | `BUILD_PLAN.md` Tier 1.4 | Existing v3 blocker |
| Architecture doc auto-update | `BUILD_PLAN.md` Tier 2.2 | Existing strong item |
| Semantic checkpoint chapters | `BUILD_PLAN.md` Tier 2.3 | Existing v3.1 item |
| Custom Anthropic SSE parser | `BUILD_PLAN.md` Tier 0 | Existing deferred port |
| Symlinked package dedup | `BUILD_PLAN.md` Tier 0 | Existing port item |
| Extension API `setWorkingVisible()` | `BUILD_PLAN.md` Tier 0 | Existing port item |
| Cloudflare Workers AI provider | `BUILD_PLAN.md` Tier 0 | Existing provider item |
| Azure Cognitive Services base URL | `BUILD_PLAN.md` Tier 0 | Existing provider item |
| Local LLM SSE timeout | `BUILD_PLAN.md` Tier 0 | Already marked done |
| Bedrock inference profile normalization | `BUILD_PLAN.md` Tier 0 | Already marked done |
| gsd-2 safety/correctness ports | `BUILD_PLAN.md` Tier 0.5 | Existing critical-port list |
| Self-report feedback loop | `docs/dev/UOK-SELF-EVOLUTION.md` quick win 1 | Existing self-evolution plan |
| Continuous model learning | `docs/dev/UOK-SELF-EVOLUTION.md` quick win 2 | Existing self-evolution plan |
| Automated knowledge injection | `docs/dev/UOK-SELF-EVOLUTION.md` quick win 3 | Existing self-evolution plan |
| Gate pattern aggregation, adaptive thresholds, hypothesis testing, regression detection | `docs/dev/UOK-SELF-EVOLUTION.md` medium-term roadmap | Existing self-evolution plan |
## Newly Promoted Roadmap Items
These were not clearly represented as durable roadmap items and should be planned as slices before implementation:
| Item | Why | Suggested tier | Implementation note |
|---|---|---|---|
| Typed SF environment schema | `SF_*` env vars should fail early with actionable diagnostics instead of late runtime surprises. | Tier 1 | Add an SF-owned env schema module and route startup/tool validation through it. |
| Autonomous-path coverage ratchet | Global coverage thresholds are too broad; autonomous/recovery paths need higher targeted confidence. | Tier 2 | Start with file-family thresholds or focused test suites for dispatch, recovery, UOK runtime, and validation. |
| End-to-end milestone lifecycle tests | DB-only runtime state needs integration proof across plan, execute, validate, and complete. | Tier 2 | Add a minimal lifecycle fixture that exercises DB rows as executable truth. |
| Fault-injection recovery tests | Stuck-loop, timeout, runaway, stale lock, and projection drift recovery are high-risk paths. | Tier 2 | Add deterministic fault fixtures before adding broader chaos coverage. |
| MCP package completeness audit | Docs mention MCP surfaces, but production completeness is unclear. | Tier 2 | Inspect `packages/mcp-server/`, record supported contracts, gaps, and deferred work. |
| Biome schema version cleanup | Tooling drift creates noisy lint/config failures. | Tier 3 | Run `biome migrate` as a focused tooling cleanup. |
| Headless assistant-text preview completion | Prior headless work deferred buffer separation. | Tier 2 | Finish `assistantTextBuffer` / `thinkingBuffer` separation and preview flushing. |
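
The "Typed SF environment schema" row above could start from something like the following dependency-free sketch. It is not the planned module itself. The variable names `SF_LOG_LEVEL` and `SF_UNIT_TIMEOUT_MS`, the `SfEnv` shape, and `parseSfEnv` are hypothetical, chosen only to show early failure with actionable diagnostics.

```typescript
// Hypothetical sketch: SF_LOG_LEVEL and SF_UNIT_TIMEOUT_MS are illustrative
// names, not confirmed SF configuration keys.
type EnvSource = Record<string, string | undefined>;

const LOG_LEVELS = ["debug", "info", "warn", "error"] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

interface SfEnv {
  logLevel: LogLevel;
  unitTimeoutMs: number;
}

type EnvResult =
  | { ok: true; env: SfEnv }
  | { ok: false; issues: string[] };

// Collect every problem in one pass so the user sees all of them at once.
function parseSfEnv(source: EnvSource): EnvResult {
  const issues: string[] = [];

  const rawLevel = source["SF_LOG_LEVEL"] ?? "info";
  const logLevel = LOG_LEVELS.find((level) => level === rawLevel);
  if (logLevel === undefined) {
    issues.push(
      `SF_LOG_LEVEL must be one of ${LOG_LEVELS.join(", ")}; got "${rawLevel}"`,
    );
  }

  const rawTimeout = source["SF_UNIT_TIMEOUT_MS"];
  let unitTimeoutMs = 0;
  if (rawTimeout === undefined || rawTimeout.trim() === "") {
    issues.push("SF_UNIT_TIMEOUT_MS is required");
  } else {
    unitTimeoutMs = Number(rawTimeout);
    if (!Number.isInteger(unitTimeoutMs) || unitTimeoutMs <= 0) {
      issues.push(
        `SF_UNIT_TIMEOUT_MS must be a positive integer; got "${rawTimeout}"`,
      );
    }
  }

  if (logLevel === undefined || issues.length > 0) {
    return { ok: false, issues };
  }
  return { ok: true, env: { logLevel, unitTimeoutMs } };
}
```

A startup path could call `parseSfEnv` with the process environment and abort with the joined `issues` when `ok` is false, so missing config is reported once, up front. A schema library such as zod, as the original note suggested, would replace the hand-written checks.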
## Explicitly Deferred
| Item | Decision |
|---|---|
| `auto-dispatch.js` / `auto-prompts.js` decomposition | Known design debt, but explicitly out of scope until requested as a dedicated refactor. Do not start it while fixing DB authority, UOK safety, or roadmap triage. |
## Acceptance Criteria
- `TODO.md` contains no untriaged raw notes.
- New work starts from this plan, `BUILD_PLAN.md`, or `docs/dev/UOK-SELF-EVOLUTION.md`, not from deleted raw dump text.
- Items that need implementation are converted into SF milestone/slice/task state before code changes begin.

@@ -1,4 +1,4 @@
<!-- sf-doc: version=2.75.3 template=docs/product-specs/index.md state=pending hash=sha256:ca3477e8d74fe277a2e0b2cdb3f03c235e294015a6ece2f571a82acc7475d31c -->
<!-- sf-doc: version=0.0.0 template=docs/product-specs/index.md state=pending hash=sha256:ca3477e8d74fe277a2e0b2cdb3f03c235e294015a6ece2f571a82acc7475d31c -->
# Product Specs
Durable user-facing behavior, workflows, and product decisions live here.

@@ -1,2 +1,2 @@
<!-- sf-doc: version=2.75.3 template=docs/references/design-system-reference-llms.txt state=pending hash=sha256:5a5a35a3f80c8b4433ad30c1f155b1e8c7fd245ce2a3def9627daa9f40854eb3 -->
<!-- sf-doc: version=0.0.0 template=docs/references/design-system-reference-llms.txt state=pending hash=sha256:5a5a35a3f80c8b4433ad30c1f155b1e8c7fd245ce2a3def9627daa9f40854eb3 -->
Reference slot for design-system guidance intended for LLM consumption.

@@ -1,2 +1,2 @@
<!-- sf-doc: version=2.75.3 template=docs/references/nixpacks-llms.txt state=pending hash=sha256:22f9a8549e3ced71d0b0a912c6dcdfb2ec83a573168ee1b44ca266f1eb0307bf -->
<!-- sf-doc: version=0.0.0 template=docs/references/nixpacks-llms.txt state=pending hash=sha256:22f9a8549e3ced71d0b0a912c6dcdfb2ec83a573168ee1b44ca266f1eb0307bf -->
Reference slot for Nixpacks deployment/build guidance intended for LLM consumption.

@@ -1,2 +1,2 @@
<!-- sf-doc: version=2.75.3 template=docs/references/uv-llms.txt state=pending hash=sha256:e8a998667c0f830a15b68e207f6b69e6377dd7e82728833f842678f72864e9b6 -->
<!-- sf-doc: version=0.0.0 template=docs/references/uv-llms.txt state=pending hash=sha256:e8a998667c0f830a15b68e207f6b69e6377dd7e82728833f842678f72864e9b6 -->
Reference slot for uv/Python tooling guidance intended for LLM consumption.