# UOK Self-Evolution Architecture

This document explains how Singularity Forge's UOK (Unified Operation Kernel) implements self-evolution — the ability to detect its own failures, learn from them, and improve its own heuristics and dispatch logic.

## Status Summary

**Current state:** 60-70% complete. Infrastructure exists; learning loop not fully activated.
**What works:**

- ✅ Self-report collection during dispatch/validation
- ✅ Outcome learning for model selection (Bayesian)
- ✅ Knowledge compounding (KNOWLEDGE.md)
- ✅ Gate-based pattern detection capability

**What's missing:**

- ❌ Automated triage pipeline (reports not processed into fixes)
- ❌ Continuous model tuning (learning episodic, not aggressive)
- ❌ Automated knowledge injection (knowledge not used in prompts)
- ❌ Cross-gate pattern aggregation (gates run independently)
- ❌ Adaptive thresholds (timeouts hardcoded, not data-driven)
- ❌ Hypothesis testing (no A/B test framework for improvements)
- ❌ Regression detection (no metrics monitoring)

---
## 1. Self-Report Collection

**Purpose:** Capture SF-internal observations when something unexpected happens during dispatch, validation, or gate runs.

**Implementation:** Agents and gates call `sf_self_report(issue, severity, context)`.
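
For concreteness, a minimal sketch of what filing a report could look like. The `sf_self_report(issue, severity, context)` signature and the JSONL path come from this document; the `SelfReport` shape, the severity union, and the append logic are illustrative assumptions, not the actual implementation in `commands-handlers.js`.

```ts
import { appendFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Severity levels observed in this document's examples and triage rules.
type Severity = "blocker" | "warning" | "suggestion" | "low";

interface SelfReport {
  issue: string;
  severity: Severity;
  context: Record<string, unknown>;
  timestamp: string;
}

// Mirrors the sf_self_report(issue, severity, context) call described above.
function sfSelfReport(issue: string, severity: Severity, context: Record<string, unknown> = {}): void {
  const report: SelfReport = { issue, severity, context, timestamp: new Date().toISOString() };
  // Runtime storage is per-session JSONL (see "Storage" below).
  const path = join(homedir(), ".sf", "agent", "upstream-feedback.jsonl");
  appendFileSync(path, JSON.stringify(report) + "\n");
}

// Example: a gate flagging an inconsistency it observed.
sfSelfReport(
  "gate inconsistency: requirement-coverage gate failed in S02 but passed in S03 for same requirement",
  "warning",
  { gate: "requirement-coverage", slices: ["S02", "S03"] },
);
```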
**Examples of self-reports filed in production:**

- "validation-reviewer prompt lacks explicit rubric for criterion vs. implementation gap" [low]
- "model timeout on large input (>100K tokens)" [warning]
- "gate inconsistency: requirement-coverage gate failed in S02 but passed in S03 for same requirement" [warning]

**Storage:**

- Runtime: `~/.sf/agent/upstream-feedback.jsonl` (per-session)
- Dogfooding: `.sf/SELF-FEEDBACK.md` (when SF runs on itself)

**Current limitations:**

- Reports are collected but not systematically processed
- No automatic triage, dedup, or promotion to code fixes
- No feedback-to-code pipeline

**Improvement:** See "Top 3 Quick Wins" below.

---
## 2. Outcome Learning for Model Selection

**Purpose:** Track which models succeed/fail on different task types, then route future tasks to higher-scoring models.

**Mechanism:**

UOK maintains a per-task-type model performance matrix:

```
model_scores[task_type] = {
  "claude-sonnet": { successes: 42, failures: 3,  latency_ms: [450, 520, ...] },
  "claude-opus":   { successes: 8,  failures: 2,  latency_ms: [800, 850, ...] },
  "minimax":       { successes: 15, failures: 10, latency_ms: [350, 400, ...] }
}
```

Bayesian update after each task:

```
P(model_i succeeds | task_type) = (successes + prior_weight) / (total_trials + prior_weight)
```

- Default priors give new/experimental models benefit of the doubt
- Different priors for different model classes (Claude gets higher prior than experimental)
- Used by `benchmark-selector.ts` to pick the best model for the next task (sketched below)
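
A minimal sketch of the update and selection step, assuming an in-memory score table. The real matrix lives in `sf.db`, and the `ModelStats`/`ScoreTable` names here are illustrative assumptions, not the actual schema.

```ts
// Illustrative shapes only, not the sf.db schema.
interface ModelStats {
  successes: number;
  failures: number;
  latencyMs: number[];
}

type ScoreTable = Record<string, Record<string, ModelStats>>; // task_type -> model -> stats

// Smoothed success estimate from the formula above. prior_weight acts as
// pseudo-observations: with zero trials the estimate starts at 1.0, which is
// what gives new/experimental models the benefit of the doubt.
function successProbability(stats: ModelStats, priorWeight: number): number {
  const trials = stats.successes + stats.failures;
  return (stats.successes + priorWeight) / (trials + priorWeight);
}

// Pick the highest-scoring model for a task type, with per-model priors.
function pickModel(scores: ScoreTable, taskType: string, priors: Record<string, number>): string | null {
  let best: string | null = null;
  let bestP = -Infinity;
  for (const [model, stats] of Object.entries(scores[taskType] ?? {})) {
    const p = successProbability(stats, priors[model] ?? 1);
    if (p > bestP) {
      bestP = p;
      best = model;
    }
  }
  return best;
}
```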
**Location:** Computed during phase transitions; stored in `sf.db` outcome logs.

**Current limitations:**

- Learning updates episodically (per-task completion), not continuously
- Success/failure is binary — doesn't distinguish "slow success" from "fast success"
- Only applied to task dispatch, not to gate routing
- Recovery paths don't feed learning back

**Improvement:** Make learning aggressive per-task-type with latency/cost tracking.

---
## 3. Knowledge Compounding

**Purpose:** Extract high-confidence learnings from completed work and make them available to future milestones.

**Storage:** `KNOWLEDGE.md` with structured judgment-log entries.

**Format:**

```markdown
## [2026-05-06] Python 3.12 stdlib compatibility

**Verdict:** Active issue — avoid for now

**Evidence:** Task T02 in M010/S03 discovered that the `asyncore` module was removed in Python 3.12. Affects legacy integrations.

**Confidence:** 0.95 (observed failure in live deployment)

**Recommendation:** Constrain to Python <3.12 in requirements.txt; add an explicit warning for users on 3.12.
```
**How it should work (ideally):**

1. After slice completion, `memory-extractor.ts` distills high-confidence learnings (confidence >0.8)
2. Next milestone dispatch checks KNOWLEDGE.md for relevance (see the sketch after this list)
3. Relevant knowledge injected into dispatch prompts automatically
4. Contradictory knowledge flagged as potential architectural drift
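
As a sketch of the relevance check in step 2, keyword overlap stands in for real semantic scoring here; the `KnowledgeEntry` shape and the thresholds are assumptions, not existing SF code.

```ts
// Assumed entry shape, distilled from the KNOWLEDGE.md format above.
interface KnowledgeEntry {
  title: string;
  recommendation: string;
  confidence: number;
}

// Crude lexical overlap as a placeholder for semantic similarity scoring.
function relevance(entry: KnowledgeEntry, taskDescription: string): number {
  const taskWords = new Set(taskDescription.toLowerCase().split(/\W+/));
  const entryWords = entry.title.toLowerCase().split(/\W+/).filter(Boolean);
  const hits = entryWords.filter((w) => taskWords.has(w)).length;
  return entryWords.length === 0 ? 0 : hits / entryWords.length;
}

// Only high-confidence, relevant entries would be injected into dispatch prompts.
function selectKnowledge(entries: KnowledgeEntry[], task: string): KnowledgeEntry[] {
  return entries.filter((e) => e.confidence > 0.8 && relevance(e, task) > 0.5);
}
```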
**Current state:**

- Storage works (KNOWLEDGE.md well-formatted)
- Extraction works (memory-extractor.ts analyzes task results)
- Injection is **manual** (must explicitly configure in prompts)
- No automatic relevance matching
- No conflict resolution for contradictory knowledge

**Improvement:** Automate knowledge injection with semantic relevance scoring.

---
## 4. Gate-Based Pattern Detection

**Purpose:** Gates can detect and report repeated failure patterns, signaling potential design flaws.

**Example:**

- Requirement-coverage gate fails in S01 (requirement X not covered)
- Requirement-coverage gate fails in S03 (same requirement X not covered)
- Gate files self-report: "Requirement X failed coverage in multiple slices — suggests design flaw or missing slice"
**Current state:**

- Logic exists in individual gates
- Each gate runs independently (no cross-gate pattern aggregation)
- Patterns must be explicitly coded in gate logic (not automatic)
- No framework for gate authors to easily add pattern detection

**Improvement:** Add cross-gate pattern aggregation + automatic theme detection.
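
A minimal sketch of what cross-gate aggregation could look like; `GateFailure` and the grouping key are assumptions about how gate results might be represented, not the actual `sf.db` schema.

```ts
// Assumed shape for a recorded gate failure.
interface GateFailure {
  gate: string;    // e.g. "requirement-coverage"
  slice: string;   // e.g. "S01"
  subject: string; // e.g. "requirement X"
}

// Group failures by (gate, subject); the same subject failing across multiple
// slices becomes one consolidated self-report instead of isolated observations.
function aggregatePatterns(failures: GateFailure[], minRepeats = 2): string[] {
  const buckets = new Map<string, GateFailure[]>();
  for (const f of failures) {
    const key = `${f.gate}: ${f.subject}`;
    const group = buckets.get(key);
    if (group) group.push(f);
    else buckets.set(key, [f]);
  }
  const reports: string[] = [];
  for (const [key, group] of buckets) {
    if (group.length >= minRepeats) {
      const slices = group.map((g) => g.slice).join(", ");
      reports.push(`${key} failed in ${slices}: suggests a design flaw or missing slice`);
    }
  }
  return reports;
}
```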
---
## Top 3 Quick Wins (8-10 Days Total)

### 1. Close Self-Report Feedback Loop [9/10 impact, 4/10 effort, 2-3 days]

**What:** Create an automated triage pipeline that processes self-reports into actionable fixes.

**Implementation:**

- Extend `commands-todo.js` triage logic to parse `upstream-feedback.jsonl`
- Triage rules (a code sketch follows this list):
  - Dedup identical reports (same issue filed multiple times)
  - Classify by severity: blocker | warning | suggestion
  - Auto-create backlog work items for blockers/warnings
  - For high-confidence fixes (e.g., "prompt lacks rubric"), propose the fix directly
- Promote fixes into code via new SF slice
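
A sketch of the triage pass, assuming the JSONL report shape from Section 1; `createBacklogItem` is a hypothetical hook, not an existing SF API.

```ts
import { readFileSync } from "node:fs";

interface SelfReport {
  issue: string;
  severity: "blocker" | "warning" | "suggestion" | "low";
}

// Hypothetical hook into the backlog; the real mechanism would live in todo.js.
declare function createBacklogItem(report: SelfReport, occurrences: number): void;

function triage(feedbackPath: string): void {
  const lines = readFileSync(feedbackPath, "utf8").split("\n").filter(Boolean);
  const reports = lines.map((line) => JSON.parse(line) as SelfReport);

  // Dedup: identical issue text collapses into one entry with a count.
  const counts = new Map<string, { report: SelfReport; count: number }>();
  for (const r of reports) {
    const entry = counts.get(r.issue);
    if (entry) entry.count += 1;
    else counts.set(r.issue, { report: r, count: 1 });
  }

  // Blockers and warnings become backlog work items; suggestions wait for review.
  for (const { report, count } of counts.values()) {
    if (report.severity === "blocker" || report.severity === "warning") {
      createBacklogItem(report, count);
    }
  }
}
```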
**Why:** Reports are collected but ignored. This closes the feedback loop.

**Code locations:**

- `src/resources/extensions/sf/commands-handlers.js` (sf_self_report implementation)
- `src/resources/extensions/sf/commands/handlers/todo.js` (triage logic to extend)
- `src/resources/extensions/sf/commands-todo.js` (triage prompt + tool)

---
### 2. Activate Continuous Model Learning [8/10 impact, 5/10 effort, 3-4 days]

**What:** Make model selection adaptive, per task type, with failure analysis and automatic demotion.

**Current state:**

- Outcome tracking exists, but learning is infrequent
- Model routing decisions are mostly static (based on configuration, not history)

**Improvements:**

- Track per task type: success rate, latency, cost, token efficiency
- Auto-demote models that fail >50% on specific task types (see the sketch below)
- A/B test new models against the incumbent on low-risk tasks
- Log detailed failure analytics (why did this model fail? timeouts? quality?)
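
The demotion rule could be as simple as the following sketch; the minimum-trials threshold and the `ModelRecord` shape are assumptions, not `model-router.ts` internals.

```ts
interface ModelRecord {
  successes: number;
  failures: number;
}

// Demote a model on a task type once there is enough evidence and >50% failures.
function shouldDemote(record: ModelRecord, minTrials = 10): boolean {
  const trials = record.successes + record.failures;
  if (trials < minTrials) return false; // not enough evidence yet
  return record.failures / trials > 0.5;
}
```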
**Why:** Learning exists but is dormant; this makes dispatch adaptive.

**Code locations:**

- `packages/pi-ai/src/model-router.ts` (model selection logic)
- `src/auto-dispatch.ts` (outcome logging, task tracking)
- `src/resources/extensions/sf/commands/benchmark-selector.ts` (model scoring display)

---
### 3. Automate Knowledge Injection [7/10 impact, 4/10 effort, 2-3 days]

**What:** During milestone planning, automatically query KNOWLEDGE.md for relevant learnings and inject them into dispatch prompts.

**Current state:**

- KNOWLEDGE.md exists and is populated
- Agents never see it (unless manually configured)

**Improvements:**

- At planning time, query KNOWLEDGE.md with semantic similarity scoring
- Inject high-confidence (>0.8) relevant knowledge into `execute-task`, `plan-slice` prompts (sketched below)
- Flag contradictory knowledge (e.g., "avoid Python 3.12" vs. "adopt Python 3.12") for review
- Track which knowledge was actually used (feedback to knowledge compounding)
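
A sketch of the injection step. The `{{knowledgeInjection}}` placeholder name comes from the code locations below; the entry shape and rendering are illustrative assumptions.

```ts
interface KnowledgeEntry {
  title: string;
  recommendation: string;
  confidence: number;
}

// Render selected entries as a bullet list for the prompt.
function renderKnowledge(entries: KnowledgeEntry[]): string {
  if (entries.length === 0) return "";
  const bullets = entries
    .map((e) => `- ${e.title}: ${e.recommendation} (confidence ${e.confidence})`)
    .join("\n");
  return `Relevant prior learnings:\n${bullets}`;
}

// Fill the placeholder in a template such as execute-task.md before dispatch.
function injectIntoPrompt(template: string, entries: KnowledgeEntry[]): string {
  return template.replace("{{knowledgeInjection}}", renderKnowledge(entries));
}
```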
**Why:** Knowledge exists but isn't used; this makes it actionable.

**Code locations:**

- `src/resources/extensions/sf/auto-prompts.js` (where prompts are loaded; add knowledge injection here)
- `src/resources/extensions/sf/prompts/execute-task.md`, `plan-slice.md` (templates that should reference `{{knowledgeInjection}}`)
- New module: `src/resources/extensions/sf/knowledge-injector.ts` (semantic matching logic)

---
## Additional Improvements (Medium-Term, 1-2 Months)

### 4. Continuous Gate Pattern Aggregation [8/10 impact, 6/10 effort, 3-4 days]

After each phase, scan all gate failures for common themes. Aggregate into consolidated self-reports. Suggest architectural fixes.

### 5. Adaptive Timeout Tuning [7/10 impact, 6/10 effort, 3-4 days]

Replace hardcoded timeouts with data-driven values based on task execution history. Auto-adjust per task-type.
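
For example, a data-driven timeout could be a high percentile of observed durations plus headroom; the percentile, multiplier, and fallback here are assumptions for illustration.

```ts
// Compute a timeout from a task type's execution history (durations in ms).
function adaptiveTimeoutMs(historyMs: number[], fallbackMs = 120_000): number {
  if (historyMs.length < 5) return fallbackMs; // too little data: keep the default
  const sorted = [...historyMs].sort((a, b) => a - b);
  const p95 = sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))];
  return Math.ceil(p95 * 1.5); // 50% headroom over the 95th percentile
}
```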
### 6. Hypothesis Testing Framework [9/10 impact, 7/10 effort, 4-5 days]

A/B test improvements on low-stakes tasks. Roll back if they introduce regressions. Never ship untested changes.

### 7. Cross-Milestone Federated Learning [8/10 impact, 9/10 effort, 8-10 days]

Share generalizable learnings across projects (same org). Test on similar projects first.

### 8. Regression Detection & Prevention [7/10 impact, 8/10 effort, 5-6 days]

Track metrics across milestones. Alert on regressions. Auto-rollback bad changes.

### 9. Semantic Drift Detection [6/10 impact, 7/10 effort, 4-5 days]

Detect when prompts/gate logic have drifted from original intent. Suggest reverting or documenting.

### 10. Self-Hosted Telemetry [5/10 impact, 8/10 effort, 4-5 days]

When SF runs on itself (dogfooding), profile which phases/gates take longest. Prioritize optimizations.

---
## Architecture Diagram

```
┌────────────────────────────────────────────────────────────┐
│                     UOK Dispatch Loop                      │
└────────────────────────────────────────────────────────────┘
                              ↓
                 ┌────────────────────────┐
                 │  Phases: Discuss/Plan/ │
                 │  Execute/Merge/        │
                 │  Complete              │
                 └────────────────────────┘
                    ↓                ↓
       ┌──────────────────┐  ┌──────────────────┐
       │     Outcome      │  │    Gates Run     │
       │     Logging      │  │    Parallel      │
       └──────────────────┘  └──────────────────┘
                    ↓                ↓
       ┌──────────────────────────────────────┐
       │ sf.db: Outcome Ledger + Gate Results │
       └──────────────────────────────────────┘
                         ↓
    ┌──────────────────────────────────────────────┐
    │            Self-Report Collection            │
    │       (agents + gates file anomalies)        │
    └──────────────────────────────────────────────┘
         ↓               ↓               ↓
    ┌────────┐    ┌────────────┐  ┌──────────────┐
    │ TBD:   │    │ Learning:  │  │ Knowledge:   │
    │ Triage │    │ Model      │  │ Compounding  │
    │ Loop   │    │ Selection  │  │(KNOWLEDGE.md)│
    └────────┘    └────────────┘  └──────────────┘
                        ↓               ↓
       ┌─────────────────────────────────────────┐
       │            TBD: Automated               │
       │          Knowledge Injection            │
       │          (into next dispatch)           │
       └─────────────────────────────────────────┘

(TBD = To Be Done)
```
---

## How to Contribute

To improve self-evolution, pick one of the quick wins above:

1. **Study the code:** Understand how self-reports are filed, how outcome logging works, how KNOWLEDGE.md is structured
2. **Write a failing test:** Define expected behavior (e.g., "when self-report severity is 'blocker', it creates a backlog item")
3. **Implement the improvement:** Follow SF coding conventions (see CONTRIBUTING.md)
4. **Test thoroughly:** Especially recovery paths and edge cases
5. **Document:** Update this file and ARCHITECTURE.md as behavior changes

---
## References

- **Outcome Learning:** `src/auto-dispatch.ts` (outcome logging), `packages/pi-ai/src/model-router.ts` (model selection)
- **Self-Reports:** `src/resources/extensions/sf/commands-handlers.js` (sf_self_report), `upstream-feedback.jsonl` (storage)
- **Knowledge:** `KNOWLEDGE.md` (storage), `src/resources/extensions/sf/memory-extractor.js` (extraction)
- **Gates:** `src/resources/extensions/sf/prompts/gate-evaluate.md` (gate orchestration)
- **TODO:** See `TODO.md` and `BACKLOG.md` for prioritized work