# UOK Self-Evolution Architecture

This document explains how Singularity Forge's UOK (Unified Operation Kernel) implements self-evolution — the ability to detect its own failures, learn from them, and improve its own heuristics and dispatch logic.

## Status Summary

**Current state:** 60-70% complete. Infrastructure exists; learning loop not fully activated.

**What works:**

- ✅ Self-report collection during dispatch/validation
- ✅ Outcome learning for model selection (Bayesian)
- ✅ Knowledge compounding (KNOWLEDGE.md)
- ✅ Gate-based pattern detection capability

**What's missing:**

- ❌ Automated triage pipeline (reports not processed into fixes)
- ❌ Continuous model tuning (learning episodic, not aggressive)
- ❌ Automated knowledge injection (knowledge not used in prompts)
- ❌ Cross-gate pattern aggregation (gates run independently)
- ❌ Adaptive thresholds (timeouts hardcoded, not data-driven)
- ❌ Hypothesis testing (no A/B test framework for improvements)
- ❌ Regression detection (no metrics monitoring)

---

## 1. Self-Report Collection

**Purpose:** Capture SF-internal observations when something unexpected happens during dispatch, validation, or gate runs.

**Implementation:** Agents and gates call `sf_self_report(issue, severity, context)`.

**Examples of self-reports filed in production:**

- "validation-reviewer prompt lacks explicit rubric for criterion vs. implementation gap" [low]
- "model timeout on large input (>100K tokens)" [warning]
- "gate inconsistency: requirement-coverage gate failed in S02 but passed in S03 for same requirement" [warning]

**Storage:**

- Runtime: `~/.sf/agent/upstream-feedback.jsonl` (per-session)
- Dogfooding: `.sf/SELF-FEEDBACK.md` (when SF runs on itself)

**Current limitations:**

- Reports are collected but not systematically processed
- No automatic triage, dedup, or promotion to code fixes
- No feedback-to-code pipeline

**Improvement:** See "Top 3 Quick Wins" below.

---

## 2. Outcome Learning for Model Selection

**Purpose:** Track which models succeed/fail on different task types, then route future tasks to higher-scoring models.

**Mechanism:** UOK maintains a per-task-type model performance matrix:

```
model_scores[task_type] = {
  claude-sonnet: { successes: 42, failures: 3,  latency_ms: [450, 520, ...] },
  claude-opus:   { successes: 8,  failures: 2,  latency_ms: [800, 850, ...] },
  minimax:       { successes: 15, failures: 10, latency_ms: [350, 400, ...] }
}
```

Bayesian update after each task:

```
P(model_i succeeds | task_type) = (successes + prior_weight) / (total_trials + prior_weight)
```

- Default priors give new/experimental models the benefit of the doubt
- Different priors for different model classes (Claude gets a higher prior than experimental models)
- Used by `benchmark-selector.ts` to pick the best model for the next task

**Location:** Computed during phase transitions; stored in `sf.db` outcome logs.

**Current limitations:**

- Learning updates episodically (per-task completion), not continuously
- Success/failure is binary — doesn't distinguish "slow success" from "fast success"
- Only applied to task dispatch, not to gate routing
- Recovery paths don't feed learning back

**Improvement:** Make learning aggressive per-task-type with latency/cost tracking.

---

## 3. Knowledge Compounding

**Purpose:** Extract high-confidence learnings from completed work and make them available to future milestones.

**Storage:** `KNOWLEDGE.md` with structured judgment-log entries.

**Format:**

```markdown
## [2026-05-06] Python 3.12 stdlib compatibility

**Verdict:** Active issue — avoid for now
**Evidence:** Task T02 in M010/S03 discovered that `asyncore` module removed in Python 3.12. Affects legacy integrations.
**Confidence:** 0.95 (observed failure in live deployment)
**Recommendation:** Constrain to Python <3.11 in requirements.txt; add explicit warning for users on 3.12.
```

**How it should work (ideally):**

1. After slice completion, `memory-extractor.ts` distills high-confidence learnings (confidence >0.8)
2. The next milestone dispatch checks KNOWLEDGE.md for relevance
3. Relevant knowledge is injected into dispatch prompts automatically
4. Contradictory knowledge is flagged as potential architectural drift

**Current state:**

- Storage works (KNOWLEDGE.md well-formatted)
- Extraction works (memory-extractor.ts analyzes task results)
- Injection is **manual** (must explicitly configure in prompts)
- No automatic relevance matching
- No conflict resolution for contradictory knowledge

**Improvement:** Automate knowledge injection with semantic relevance scoring.

---

## 4. Gate-Based Pattern Detection

**Purpose:** Gates can detect and report repeated failure patterns, signaling potential design flaws.

**Example:**

- Requirement-coverage gate fails in S01 (requirement X not covered)
- Requirement-coverage gate fails in S03 (same requirement X not covered)
- Gate files self-report: "Requirement X failed coverage in multiple slices — suggests design flaw or missing slice"

**Current state:**

- Logic exists in individual gates
- Each gate runs independently (no cross-gate pattern aggregation)
- Patterns must be explicitly coded in gate logic (not automatic)
- No framework for gate authors to easily add pattern detection

**Improvement:** Add cross-gate pattern aggregation + automatic theme detection.

---

## Top 3 Quick Wins (8-10 Days Total)

### 1. Close Self-Report Feedback Loop [9/10 impact, 4/10 effort, 2-3 days]

**What:** Create an automated triage pipeline that processes self-reports into actionable fixes.
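The core of such a pipeline — dedup identical reports, then promote blockers and warnings to backlog items — could look roughly like the sketch below. All type and function names here are hypothetical illustrations, not the actual SF API:

```typescript
// Hypothetical triage sketch: collapse duplicate self-reports and
// promote blockers/warnings to backlog items. Names are illustrative.

type Severity = "blocker" | "warning" | "suggestion" | "low";

interface SelfReport {
  issue: string;
  severity: Severity;
  context?: string;
}

interface BacklogItem {
  title: string;
  severity: Severity;
  occurrences: number;
}

// Collapse identical reports (same issue text) into one entry,
// counting how often each was filed.
function dedupReports(
  reports: SelfReport[],
): Map<string, { report: SelfReport; count: number }> {
  const seen = new Map<string, { report: SelfReport; count: number }>();
  for (const r of reports) {
    const entry = seen.get(r.issue);
    if (entry) entry.count += 1;
    else seen.set(r.issue, { report: r, count: 1 });
  }
  return seen;
}

// Only blockers and warnings become backlog items; suggestions and
// low-severity notes stay in the feedback log.
function triage(reports: SelfReport[]): BacklogItem[] {
  const items: BacklogItem[] = [];
  for (const { report, count } of dedupReports(reports).values()) {
    if (report.severity === "blocker" || report.severity === "warning") {
      items.push({ title: report.issue, severity: report.severity, occurrences: count });
    }
  }
  return items;
}
```

The occurrence count doubles as a priority signal: an issue filed many times across sessions is a stronger candidate for promotion into an SF slice.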
**Implementation:**

- Extend `commands-todo.js` triage logic to parse `upstream-feedback.jsonl`
- Triage rules:
  - Dedup identical reports (same issue filed multiple times)
  - Classify by severity: blocker | warning | suggestion
  - Auto-create backlog work items for blockers/warnings
  - For high-confidence fixes (e.g., "prompt lacks rubric"), propose the fix directly
- Promote fixes into code via a new SF slice

**Why:** Reports are collected but ignored. This closes the feedback loop.

**Code locations:**

- `src/resources/extensions/sf/commands-handlers.js` (sf_self_report implementation)
- `src/resources/extensions/sf/commands/handlers/todo.js` (triage logic to extend)
- `src/resources/extensions/sf/commands-todo.js` (triage prompt + tool)

---

### 2. Activate Continuous Model Learning [8/10 impact, 5/10 effort, 3-4 days]

**What:** Make model selection adaptive, per-task-type, with failure analysis and automatic demotion.

**Current state:**

- Outcome tracking exists but learning is infrequent
- Model routing decisions are mostly static (based on configuration, not history)

**Improvements:**

- Track per-task-type: success rate, latency, cost, token efficiency
- Auto-demote models that fail >50% on specific task types
- A/B test new models against the incumbent on low-risk tasks
- Log detailed failure analytics (why did this model fail? timeouts? quality?)

**Why:** Learning exists but is dormant; this makes dispatch adaptive.

**Code locations:**

- `packages/pi-ai/src/model-router.ts` (model selection logic)
- `src/auto-dispatch.ts` (outcome logging, task tracking)
- `src/resources/extensions/sf/commands/benchmark-selector.ts` (model scoring display)

---

### 3. Automate Knowledge Injection [7/10 impact, 4/10 effort, 2-3 days]

**What:** During milestone planning, automatically query KNOWLEDGE.md for relevant learnings and inject them into dispatch prompts.
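This flow — filter entries by confidence, score them for relevance against the task, and render the survivors as a prompt section — might be sketched as below. The entry shape, function names, and the keyword-overlap scorer are assumptions for illustration; a production version would more likely use embedding-based similarity:

```typescript
// Illustrative knowledge-injection sketch: select KNOWLEDGE.md entries
// relevant to a task and render them for a prompt template slot.
// All names here are hypothetical.

interface KnowledgeEntry {
  verdict: string;
  evidence: string;
  confidence: number; // 0..1, as recorded in KNOWLEDGE.md
}

// Crude relevance proxy: fraction of significant task words that
// appear in the entry's verdict or evidence text.
function relevance(taskDescription: string, entry: KnowledgeEntry): number {
  const words = taskDescription
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length > 3);
  if (words.length === 0) return 0;
  const text = `${entry.verdict} ${entry.evidence}`.toLowerCase();
  const hits = words.filter((w) => text.includes(w)).length;
  return hits / words.length;
}

// Keep entries above both thresholds and render them as a markdown
// section (e.g. to fill a {{knowledgeInjection}} template slot).
function buildKnowledgeInjection(
  taskDescription: string,
  entries: KnowledgeEntry[],
  minConfidence = 0.8,
  minRelevance = 0.3,
): string {
  const relevant = entries
    .filter((e) => e.confidence > minConfidence)
    .filter((e) => relevance(taskDescription, e) >= minRelevance);
  if (relevant.length === 0) return "";
  const lines = relevant.map((e) => `- ${e.verdict} (confidence ${e.confidence})`);
  return ["## Relevant prior learnings", ...lines].join("\n");
}
```

Returning an empty string when nothing qualifies matters: injecting an empty "prior learnings" section into every prompt would waste tokens and dilute attention.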
**Current state:**

- KNOWLEDGE.md exists and is populated
- Agents never see it (unless manually configured)

**Improvements:**

- At planning time, query KNOWLEDGE.md with semantic similarity scoring
- Inject high-confidence (>0.8) relevant knowledge into `execute-task` and `plan-slice` prompts
- Flag contradictory knowledge (e.g., "avoid Python 3.12" vs. "adopt Python 3.12") for review
- Track which knowledge was actually used (feedback to knowledge compounding)

**Why:** Knowledge exists but isn't used; this makes it actionable.

**Code locations:**

- `src/resources/extensions/sf/auto-prompts.js` (where prompts are loaded; add knowledge injection here)
- `src/resources/extensions/sf/prompts/execute-task.md`, `plan-slice.md` (templates that should reference `{{knowledgeInjection}}`)
- New module: `src/resources/extensions/sf/knowledge-injector.ts` (semantic matching logic)

---

## Additional Improvements (Medium-Term, 1-2 Months)

### 4. Continuous Gate Pattern Aggregation [8/10 impact, 6/10 effort, 3-4 days]

After each phase, scan all gate failures for common themes. Aggregate them into consolidated self-reports. Suggest architectural fixes.

### 5. Adaptive Timeout Tuning [7/10 impact, 6/10 effort, 3-4 days]

Replace hardcoded timeouts with data-driven values based on task execution history. Auto-adjust per task type.

### 6. Hypothesis Testing Framework [9/10 impact, 7/10 effort, 4-5 days]

A/B test improvements on low-stakes tasks. Roll back if they introduce regressions. Never ship untested changes.

### 7. Cross-Milestone Federated Learning [8/10 impact, 9/10 effort, 8-10 days]

Share generalizable learnings across projects (same org). Test on similar projects first.

### 8. Regression Detection & Prevention [7/10 impact, 8/10 effort, 5-6 days]

Track metrics across milestones. Alert on regressions. Auto-rollback bad changes.

### 9. Semantic Drift Detection [6/10 impact, 7/10 effort, 4-5 days]

Detect when prompts/gate logic have drifted from original intent. Suggest reverting or documenting the drift.

### 10. Self-Hosted Telemetry [5/10 impact, 8/10 effort, 4-5 days]

When SF runs on itself (dogfooding), profile which phases/gates take longest. Prioritize optimizations accordingly.

---

## Architecture Diagram

```
          ┌─────────────────────────┐
          │    UOK Dispatch Loop    │
          └─────────────────────────┘
                       ↓
          ┌────────────────────────┐
          │  PhaseDiscuss/Plan/    │
          │  Execute/Merge/        │
          │  Complete              │
          └────────────────────────┘
                ↓             ↓
    ┌──────────────────┐  ┌──────────────────┐
    │  Outcome         │  │  Gates Run       │
    │  Logging         │  │  Parallel        │
    └──────────────────┘  └──────────────────┘
                ↓             ↓
    ┌──────────────────────────────────────┐
    │ sf.db: Outcome Ledger + Gate Results │
    └──────────────────────────────────────┘
                       ↓
      ┌──────────────────────────────────┐
      │     Self-Report Collection       │
      │  (agents + gates file anomalies) │
      └──────────────────────────────────┘
          ↓              ↓              ↓
    ┌──────────┐  ┌─────────────┐  ┌────────────────┐
    │ TBD:     │  │ Learning:   │  │ Knowledge:     │
    │ Triage   │  │ Model       │  │ Compounding    │
    │ Loop     │  │ Selection   │  │ (KNOWLEDGE.md) │
    └──────────┘  └─────────────┘  └────────────────┘
                         ↓              ↓
              ┌─────────────────────────┐
              │ TBD: Automated          │
              │ Knowledge Injection     │
              │ (into next dispatch)    │
              └─────────────────────────┘

(TBD = To Be Done; strikethrough items are implemented but inactive)
```

---

## How to Contribute

To improve self-evolution, pick one of the quick wins above:

1. **Study the code:** Understand how self-reports are filed, how outcome logging works, and how KNOWLEDGE.md is structured
2. **Write a failing test:** Define expected behavior (e.g., "when self-report severity is 'blocker', it creates a backlog item")
3. **Implement the improvement:** Follow SF coding conventions (see CONTRIBUTING.md)
4. **Test thoroughly:** Especially recovery paths and edge cases
5. **Document:** Update this file and ARCHITECTURE.md as behavior changes

---

## References

- **Outcome Learning:** `src/auto-dispatch.ts` (outcome logging), `packages/pi-ai/src/model-router.ts` (model selection)
- **Self-Reports:** `src/resources/extensions/sf/commands-handlers.js` (sf_self_report), `upstream-feedback.jsonl` (storage)
- **Knowledge:** `KNOWLEDGE.md` (storage), `src/resources/extensions/sf/memory-extractor.js` (extraction)
- **Gates:** `src/resources/extensions/sf/prompts/gate-evaluate.md` (gate orchestration)
- **TODO:** See `TODO.md` and `BACKLOG.md` for prioritized work