singularity-forge/docs/dev/UOK-SELF-EVOLUTION.md

UOK Self-Evolution Architecture

This document explains how Singularity Forge's UOK (Unified Operation Kernel) implements self-evolution — the ability to detect its own failures, learn from them, and improve its own heuristics and dispatch logic.

Status Summary

Current state: 60-70% complete. Infrastructure exists; learning loop not fully activated.

What works:

  • Self-report collection during dispatch/validation
  • Outcome learning for model selection (Bayesian)
  • Knowledge compounding (KNOWLEDGE.md)
  • Gate-based pattern detection capability

What's missing:

  • Automated triage pipeline (reports not processed into fixes)
  • Continuous model tuning (learning is episodic, not continuous)
  • Automated knowledge injection (knowledge not used in prompts)
  • Cross-gate pattern aggregation (gates run independently)
  • Adaptive thresholds (timeouts hardcoded, not data-driven)
  • Hypothesis testing (no A/B test framework for improvements)
  • Regression detection (no metrics monitoring)

1. Self-Report Collection

Purpose: Capture SF-internal observations when something unexpected happens during dispatch, validation, or gate runs.

Implementation: Agents and gates call sf_self_report(issue, severity, context)

Examples of self-reports filed in production:

  • "validation-reviewer prompt lacks explicit rubric for criterion vs. implementation gap" [low]
  • "model timeout on large input (>100K tokens)" [warning]
  • "gate inconsistency: requirement-coverage gate failed in S02 but passed in S03 for same requirement" [warning]

Storage:

  • Runtime: ~/.sf/agent/upstream-feedback.jsonl (per-session)
  • Dogfooding: .sf/SELF-FEEDBACK.md (when SF runs on itself)
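As a rough sketch, a self-report helper that appends one JSON object per line to the runtime JSONL file might look like the following. The field names and the helper itself are illustrative assumptions, not SF's actual `sf_self_report` implementation:

```typescript
// Hypothetical sketch of sf_self_report. Field names (ts, issue, severity,
// context) are assumptions; the real handler lives in commands-handlers.js.
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

type Severity = "blocker" | "warning" | "suggestion" | "low";

interface SelfReport {
  ts: string;                        // ISO timestamp of when the anomaly was observed
  issue: string;                     // human-readable description
  severity: Severity;
  context: Record<string, unknown>;  // free-form dispatch/gate context
}

function sfSelfReport(
  issue: string,
  severity: Severity,
  context: Record<string, unknown> = {},
  file = path.join(os.homedir(), ".sf/agent/upstream-feedback.jsonl"),
): SelfReport {
  const report: SelfReport = { ts: new Date().toISOString(), issue, severity, context };
  fs.mkdirSync(path.dirname(file), { recursive: true });
  // JSONL: one self-contained JSON object per line, so triage can stream it.
  fs.appendFileSync(file, JSON.stringify(report) + "\n");
  return report;
}
```

JSONL keeps each report independently parseable, which matters later when the triage pipeline deduplicates and classifies reports line by line.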

Current limitation:

  • Reports are collected but not systematically processed
  • No automatic triage, dedup, or promotion to code fixes
  • No feedback-to-code pipeline

Improvement: See "Top 3 Quick Wins" below.


2. Outcome Learning for Model Selection

Purpose: Track which models succeed/fail on different task types, then route future tasks to higher-scoring models.

Mechanism:

UOK maintains a per-task-type model performance matrix:

model_scores[task_type] = {
  "claude-sonnet": { successes: 42, failures: 3, latency_ms: [450, 520, ...] },
  "claude-opus": { successes: 8, failures: 2, latency_ms: [800, 850, ...] },
  "minimax": { successes: 15, failures: 10, latency_ms: [350, 400, ...] }
}

Bayesian update after each task:

P(model_i succeeds | task_type) = (successes + prior_weight) / (total_trials + prior_weight)
  • Default priors give new/experimental models benefit of the doubt
  • Different priors for different model classes (Claude gets higher prior than experimental)
  • Used by benchmark-selector.ts to pick the best model for next task

Location: Computed during phase transitions; stored in sf.db outcome logs.
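The update rule above can be sketched in TypeScript. The `ModelStats` shape, prior values, and `pickModel` helper are illustrative assumptions; sf.db's actual schema and benchmark-selector.ts may differ:

```typescript
// Sketch of the Bayesian success estimate described above.
interface ModelStats { successes: number; failures: number; latencyMs: number[] }

// P(success) = (successes + priorWeight) / (trials + priorWeight).
// With zero trials the estimate is 1.0, giving new models the benefit of the doubt.
function successEstimate(stats: ModelStats, priorWeight: number): number {
  const trials = stats.successes + stats.failures;
  return (stats.successes + priorWeight) / (trials + priorWeight);
}

// Per-task outcome update (the "Bayesian update after each task" step).
function recordOutcome(stats: ModelStats, ok: boolean, latencyMs: number): void {
  if (ok) stats.successes += 1; else stats.failures += 1;
  stats.latencyMs.push(latencyMs);
}

// Route to the highest-scoring model, roughly what benchmark-selector.ts
// is described as doing; per-model priors let Claude-class models start higher.
function pickModel(scores: Map<string, ModelStats>, priors: Map<string, number>): string {
  let best = "";
  let bestScore = -1;
  for (const [model, stats] of scores) {
    const s = successEstimate(stats, priors.get(model) ?? 1);
    if (s > bestScore) { bestScore = s; best = model; }
  }
  return best;
}
```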

Current limitations:

  • Learning updates episodically (per-task completion), not continuously
  • Success/failure is binary — doesn't distinguish "slow success" from "fast success"
  • Only applied to task dispatch, not to gate routing
  • Recovery paths don't feed learning back

Improvement: Make learning continuous per task type, with latency and cost tracking.


3. Knowledge Compounding

Purpose: Extract high-confidence learnings from completed work and make them available to future milestones.

Storage: KNOWLEDGE.md with structured judgment-log entries.

Format:

## [2026-05-06] Python 3.12 stdlib compatibility

**Verdict:** Active issue — avoid for now

**Evidence:** Task T02 in M010/S03 discovered that the `asyncore` module was removed in Python 3.12. Affects legacy integrations.

**Confidence:** 0.95 (observed failure in live deployment)

**Recommendation:** Constrain to Python <3.12 in requirements.txt; add explicit warning for users on 3.12.

How it should work (ideally):

  1. After slice completion, memory-extractor.ts distills high-confidence learnings (confidence >0.8)
  2. Next milestone dispatch checks KNOWLEDGE.md for relevance
  3. Relevant knowledge injected into dispatch prompts automatically
  4. Contradictory knowledge flagged as potential architectural drift
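Steps 1 and 2 above can be sketched as a parser for KNOWLEDGE.md-style entries plus a confidence filter. The regexes match the entry format shown in this document; the real memory-extractor may store entries differently:

```typescript
// Hedged sketch: parse KNOWLEDGE.md entries and keep only high-confidence
// learnings for injection. Entry shape is an assumption based on the format above.
interface KnowledgeEntry { title: string; verdict: string; confidence: number; body: string }

function parseKnowledge(md: string): KnowledgeEntry[] {
  const entries: KnowledgeEntry[] = [];
  // Entries start with "## [date] title" on their own line.
  const chunks = md.split(/^## /m).filter((c) => c.trim().length > 0);
  for (const chunk of chunks) {
    const title = chunk.split("\n", 1)[0].trim();
    const verdict = /\*\*Verdict:\*\*\s*(.+)/.exec(chunk)?.[1]?.trim() ?? "";
    const confidence = parseFloat(/\*\*Confidence:\*\*\s*([\d.]+)/.exec(chunk)?.[1] ?? "0");
    entries.push({ title, verdict, confidence, body: chunk });
  }
  return entries;
}

// Step 1's ">0.8 confidence" cut: only these would be candidates for injection.
function injectable(entries: KnowledgeEntry[], minConfidence = 0.8): KnowledgeEntry[] {
  return entries.filter((e) => e.confidence > minConfidence);
}
```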

Current state:

  • Storage works (KNOWLEDGE.md well-formatted)
  • Extraction works (memory-extractor.ts analyzes task results)
  • Injection is manual (must explicitly configure in prompts)
  • No automatic relevance matching
  • No conflict resolution for contradictory knowledge

Improvement: Automate knowledge injection with semantic relevance scoring.


4. Gate-Based Pattern Detection

Purpose: Gates can detect and report repeated failure patterns, signaling potential design flaws.

Example:

  • Requirement-coverage gate fails in S01 (requirement X not covered)
  • Requirement-coverage gate fails in S03 (same requirement X not covered)
  • Gate files self-report: "Requirement X failed coverage in multiple slices — suggests design flaw or missing slice"
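The pattern in the example above (the same requirement failing the same gate in multiple slices) can be sketched as a small aggregation. The types and the two-slice threshold are assumptions, not SF's gate framework:

```typescript
// Illustrative sketch: group gate failures by (gate, requirement) and flag
// any combination that failed in 2+ distinct slices.
interface GateFailure { gate: string; slice: string; requirement: string }

function repeatedFailures(failures: GateFailure[], minSlices = 2): Map<string, string[]> {
  const bySlices = new Map<string, Set<string>>();
  for (const f of failures) {
    const key = `${f.gate}:${f.requirement}`;
    if (!bySlices.has(key)) bySlices.set(key, new Set());
    bySlices.get(key)!.add(f.slice);
  }
  // Keep only patterns seen across enough distinct slices to suggest a design flaw.
  const flagged = new Map<string, string[]>();
  for (const [key, slices] of bySlices) {
    if (slices.size >= minSlices) flagged.set(key, [...slices].sort());
  }
  return flagged;
}
```

Each flagged entry could then be filed as a consolidated self-report rather than one report per slice.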

Current state:

  • Logic exists in individual gates
  • Each gate runs independently (no cross-gate pattern aggregation)
  • Patterns must be explicitly coded in gate logic (not automatic)
  • No framework for gate authors to easily add pattern detection

Improvement: Add cross-gate pattern aggregation + automatic theme detection.


Top 3 Quick Wins (8-10 Days Total)

1. Close Self-Report Feedback Loop [9/10 impact, 4/10 effort, 2-3 days]

What: Create an automated triage pipeline that processes self-reports into actionable fixes.

Implementation:

  • Extend commands-todo.js triage logic to parse upstream-feedback.jsonl
  • Triage rules:
    • Dedup identical reports (same issue filed multiple times)
    • Classify by severity: blocker | warning | suggestion
    • Auto-create backlog work items for blockers/warnings
    • For high-confidence fixes (e.g., "prompt lacks rubric"), propose the fix directly
  • Promote fixes into code via new SF slice
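The dedup and classification rules above can be sketched as follows. The `Report` and `BacklogItem` shapes are assumptions for illustration, not the actual commands-todo.js data model:

```typescript
// Hedged sketch of the triage rules: dedup identical reports, then promote
// blockers and warnings to backlog work items.
interface Report { issue: string; severity: "blocker" | "warning" | "suggestion" }
interface BacklogItem { title: string; severity: string; occurrences: number }

function triage(reports: Report[]): BacklogItem[] {
  // Dedup: the same issue filed multiple times collapses to one, with a count.
  const counts = new Map<string, { report: Report; n: number }>();
  for (const r of reports) {
    const hit = counts.get(r.issue);
    if (hit) hit.n += 1; else counts.set(r.issue, { report: r, n: 1 });
  }
  // Only blockers and warnings become backlog items; suggestions are dropped here.
  const items: BacklogItem[] = [];
  for (const { report, n } of counts.values()) {
    if (report.severity === "blocker" || report.severity === "warning") {
      items.push({ title: report.issue, severity: report.severity, occurrences: n });
    }
  }
  return items;
}
```

The occurrence count doubles as a crude priority signal: a report filed five times probably deserves a fix before one filed once.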

Why: Reports are collected but ignored. This closes the feedback loop.

Code locations:

  • src/resources/extensions/sf/commands-handlers.js (sf_self_report implementation)
  • src/resources/extensions/sf/commands/handlers/todo.js (triage logic to extend)
  • src/resources/extensions/sf/commands-todo.js (triage prompt + tool)

2. Activate Continuous Model Learning [8/10 impact, 5/10 effort, 3-4 days]

What: Make model selection adaptive, per-task-type, with failure analysis and automatic demotion.

Current state:

  • Outcome tracking exists but learning is infrequent
  • Model routing decisions are mostly static (based on configuration, not history)

Improvements:

  • Track per-task-type: success rate, latency, cost, token efficiency
  • Auto-demote models that fail >50% on specific task types
  • A/B test new models against incumbent on low-risk tasks
  • Log detailed failure analytics (why did this model fail? timeouts? quality?)
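The auto-demotion rule above ("fail >50% on specific task types") might look like this. The minimum sample size is an assumption added so one unlucky failure can't demote a model:

```typescript
// Sketch: demote any model whose failure rate exceeds 50% for a task type,
// requiring a minimum number of trials before judging (an assumption).
interface Outcome { model: string; taskType: string; ok: boolean }

function modelsToDemote(outcomes: Outcome[], minTrials = 5): string[] {
  const stats = new Map<string, { ok: number; fail: number }>();
  for (const o of outcomes) {
    const key = `${o.model}@${o.taskType}`;
    const s = stats.get(key) ?? { ok: 0, fail: 0 };
    if (o.ok) s.ok += 1; else s.fail += 1;
    stats.set(key, s);
  }
  const demoted: string[] = [];
  for (const [key, s] of stats) {
    const trials = s.ok + s.fail;
    if (trials >= minTrials && s.fail / trials > 0.5) demoted.push(key);
  }
  return demoted.sort();
}
```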

Why: Learning exists but is dormant; this makes dispatch adaptive.

Code locations:

  • packages/pi-ai/src/model-router.ts (model selection logic)
  • src/auto-dispatch.ts (outcome logging, task tracking)
  • src/resources/extensions/sf/commands/benchmark-selector.ts (model scoring display)

3. Automate Knowledge Injection [7/10 impact, 4/10 effort, 2-3 days]

What: During milestone planning, automatically query KNOWLEDGE.md for relevant learnings and inject them into dispatch prompts.

Current state:

  • KNOWLEDGE.md exists and is populated
  • Agents never see it (unless manually configured)

Improvements:

  • At planning time, query KNOWLEDGE.md with semantic similarity scoring
  • Inject high-confidence (>0.8) relevant knowledge into execute-task, plan-slice prompts
  • Flag contradictory knowledge (e.g., "avoid Python 3.12" vs. "adopt Python 3.12") for review
  • Track which knowledge was actually used (feedback to knowledge compounding)
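A minimal sketch of the relevance-scoring step: a real implementation would likely use embeddings, but token overlap (Jaccard similarity) stands in here so the flow is concrete. The threshold value is an assumption:

```typescript
// Hedged sketch of "query KNOWLEDGE.md with semantic similarity scoring",
// using token overlap as a placeholder for real semantic matching.
function tokens(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9.]+/g) ?? []);
}

// Jaccard similarity: |shared tokens| / |all tokens|.
function relevance(taskPrompt: string, knowledgeEntry: string): number {
  const a = tokens(taskPrompt);
  const b = tokens(knowledgeEntry);
  let shared = 0;
  for (const t of a) if (b.has(t)) shared += 1;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : shared / union;
}

// Entries scoring above the threshold would be injected into the dispatch prompt.
function selectKnowledge(taskPrompt: string, entries: string[], threshold = 0.1): string[] {
  return entries.filter((e) => relevance(taskPrompt, e) >= threshold);
}
```

Tracking which selected entries actually influenced the task (the last bullet above) would feed usefulness back into knowledge compounding.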

Why: Knowledge exists but isn't used; this makes it actionable.

Code locations:

  • src/resources/extensions/sf/auto-prompts.js (where prompts are loaded; add knowledge injection here)
  • src/resources/extensions/sf/prompts/execute-task.md, plan-slice.md (templates that should reference {{knowledgeInjection}})
  • New module: src/resources/extensions/sf/knowledge-injector.ts (semantic matching logic)

Additional Improvements (Medium-Term, 1-2 Months)

4. Continuous Gate Pattern Aggregation [8/10 impact, 6/10 effort, 3-4 days]

After each phase, scan all gate failures for common themes. Aggregate into consolidated self-reports. Suggest architectural fixes.

5. Adaptive Timeout Tuning [7/10 impact, 6/10 effort, 3-4 days]

Replace hardcoded timeouts with data-driven values based on task execution history. Auto-adjust per task-type.
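One plausible shape for this, sketched under stated assumptions (the p95 percentile, the 1.5x margin, and the minimum-history cutoff are all illustrative choices, not SF policy):

```typescript
// Sketch of data-driven timeouts: a high percentile of historical task
// durations for this task type, plus a safety margin.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) throw new Error("no history");
  const idx = Math.min(sortedMs.length - 1, Math.ceil((p / 100) * sortedMs.length) - 1);
  return sortedMs[Math.max(0, idx)];
}

function adaptiveTimeoutMs(historyMs: number[], fallbackMs = 120_000): number {
  // Too little data: keep the static default rather than overfit to noise.
  if (historyMs.length < 10) return fallbackMs;
  const sorted = [...historyMs].sort((x, y) => x - y);
  return Math.ceil(percentile(sorted, 95) * 1.5); // p95 plus 50% margin
}
```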

6. Hypothesis Testing Framework [9/10 impact, 7/10 effort, 4-5 days]

A/B test improvements on low-stakes tasks. Roll back if they introduce regressions. Never ship untested changes.

7. Cross-Milestone Federated Learning [8/10 impact, 9/10 effort, 8-10 days]

Share generalizable learnings across projects (same org). Test on similar projects first.

8. Regression Detection & Prevention [7/10 impact, 8/10 effort, 5-6 days]

Track metrics across milestones. Alert on regressions. Auto-rollback bad changes.

9. Semantic Drift Detection [6/10 impact, 7/10 effort, 4-5 days]

Detect when prompts/gate logic have drifted from original intent. Suggest reverting or documenting.

10. Self-Hosted Telemetry [5/10 impact, 8/10 effort, 4-5 days]

When SF runs on itself (dogfooding), profile which phases/gates take longest. Prioritize optimizations.


Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                      UOK Dispatch Loop                          │
└─────────────────────────────────────────────────────────────────┘
                                ↓
                   ┌────────────────────────┐
                   │  PhaseDiscuss/Plan/    │
                   │  Execute/Merge/        │
                   │  Complete              │
                   └────────────────────────┘
                   ↓                    ↓
         ┌──────────────────┐  ┌──────────────────┐
         │  Outcome         │  │  Gates Run       │
         │  Logging         │  │  Parallel        │
         └──────────────────┘  └──────────────────┘
                   ↓                    ↓
         ┌──────────────────────────────────────┐
         │  sf.db: Outcome Ledger + Gate Results│
         └──────────────────────────────────────┘
                   ↓
    ┌──────────────────────────────────────────────┐
    │      Self-Report Collection                  │
    │  (agents + gates file anomalies)             │
    └──────────────────────────────────────────────┘
         ↓           ↓           ↓
     ┌────────┐  ┌────────────┐  ┌───────────────┐
     │ TBD:   │  │ Learning:  │  │ Knowledge:    │
     │ Triage │  │ Model      │  │ Compounding   │
     │ Loop   │  │ Selection  │  │ (KNOWLEDGE.md)│
     └────────┘  └────────────┘  └───────────────┘
         ↓                            ↓
    ┌─────────────────────────────────────────┐
    │  TBD: Automated                         │
    │  Knowledge Injection                    │
    │  (into next dispatch)                   │
    └─────────────────────────────────────────┘

(TBD = To Be Done; the other components exist but are not yet fully active)

How to Contribute

To improve self-evolution, pick one of the quick wins above:

  1. Study the code: Understand how self-reports are filed, how outcome logging works, how KNOWLEDGE.md is structured
  2. Write a failing test: Define expected behavior (e.g., "when self-report severity is 'blocker', it creates a backlog item")
  3. Implement the improvement: Follow SF coding conventions (see CONTRIBUTING.md)
  4. Test thoroughly: Especially recovery paths and edge cases
  5. Document: Update this file and ARCHITECTURE.md as behavior changes

References

  • Outcome Learning: src/auto-dispatch.ts (outcome logging), packages/pi-ai/src/model-router.ts (model selection)
  • Self-Reports: src/resources/extensions/sf/commands-handlers.js (sf_self_report), upstream-feedback.jsonl (storage)
  • Knowledge: KNOWLEDGE.md (storage), src/resources/extensions/sf/memory-extractor.js (extraction)
  • Gates: src/resources/extensions/sf/prompts/gate-evaluate.md (gate orchestration)
  • TODO: See TODO.md and BACKLOG.md for prioritized work