singularity-forge/TODO.md
2026-04-30 09:21:24 +02:00

TODO

Dump anything here.

SF agentic engineering / harness / memory / eval context dump:

We want a low-friction dump inbox that turns rough human notes into project evals, harness work, memory requirements, docs, tests, or implementation tasks. Root TODO.md is the dump place. AGENTS.md carries the durable instruction: agents should read TODO.md when present, triage it, and clear processed notes after converting them into reviewable artifacts.

Important split:

  • AGENTS.md = durable startup-visible instructions.
  • TODO.md = messy temporary dump inbox.
  • Memory = experience store.
  • GEPA/DSPy/self-evolution = offline lab.
  • Runtime agent = uses approved skills/prompts/tools/memory, not unreviewed evolved candidates.

Harness.io note:

  • Harness Agents are AI workers inside Harness CI/CD pipelines.
  • They inherit pipeline context, secrets, RBAC, approvals, logs, and OPA policy.
  • Useful SF lesson: run agents inside a governed workflow with permissions, logs, approvals, artifacts, reusable templates, and reviewable outputs.
  • This is different from repo-native test/eval harnesses, but the control-plane pattern is valuable.

Current SF state:

  • Auto-mode safety harness exists and is default-on: evidence collection, file-change validation, evidence cross-reference, destructive command warnings, content validation, checkpoints. Auto rollback is off by default.
  • gate-evaluate exists but is opt-in via gate_evaluation.enabled.
  • Repo-native harness evolution is mostly read-only/proposed today: /sf harness profile records repo facts in .sf/sf.db, but does not yet enforce harness/manifest gates or write harness/, gates/, eval suites, or CI files.

Slow conversion of TS into fast agents:

  • Do not rewrite the deterministic SF state machine into LLM behavior.
  • Keep TypeScript for CLI, TUI, extension API, preferences, state machine, DB schema, safety gates, prompt rendering, workflow orchestration, and file ownership rules.
  • Convert fuzzy/read-only work into narrow agents: repo profiling interpretation, TODO triage, eval generation, harness proposal, failure analysis, review, remediation proposals, memory extraction, drift detection.
  • SF remains the orchestrator and ledger. Agents consume typed jobs and return structured JSON.

Possible AgentJob shape:

```ts
type AgentJob =
  | { kind: "repo_profile"; cwd: string }
  | { kind: "todo_triage"; cwd: string; todoPath: string }
  | { kind: "eval_candidate_generation"; cwd: string; sources: string[] }
  | { kind: "failure_analysis"; cwd: string; runId: string }
  | { kind: "harness_proposal"; cwd: string; profileId: string };
```
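A matching result envelope and dispatch function might look like the sketch below. This is illustrative only: `AgentResult` and `runAgentJob` are hypothetical names, not existing SF APIs, and the job union is restated so the sketch is self-contained.

```typescript
// Job union restated from the note above so this sketch stands alone.
type AgentJob =
  | { kind: "repo_profile"; cwd: string }
  | { kind: "todo_triage"; cwd: string; todoPath: string }
  | { kind: "eval_candidate_generation"; cwd: string; sources: string[] }
  | { kind: "failure_analysis"; cwd: string; runId: string }
  | { kind: "harness_proposal"; cwd: string; profileId: string };

// Hypothetical envelope: every agent returns structured JSON that SF can ledger.
type AgentResult =
  | { ok: true; kind: AgentJob["kind"]; artifacts: string[]; summary: string }
  | { ok: false; kind: AgentJob["kind"]; error: string };

// Hypothetical dispatch: SF stays the orchestrator; agents only consume typed jobs.
async function runAgentJob(job: AgentJob): Promise<AgentResult> {
  switch (job.kind) {
    case "todo_triage":
      // ...invoke the narrow triage agent against job.todoPath here...
      return { ok: true, kind: job.kind, artifacts: [], summary: "triaged" };
    default:
      return { ok: false, kind: job.kind, error: "no agent registered" };
  }
}
```

The discriminated union keeps each agent's inputs narrow and lets the orchestrator exhaustively handle job kinds at compile time.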

First useful agents:

  • TODO triage agent: reads TODO.md, creates eval candidates, implementation tasks, memory facts, docs/harness suggestions, then clears processed notes.
  • Eval candidate agent: converts notes/session failures into JSONL with task_input, expected_behavior, failure_mode, evidence, source.
  • Repo profile interpretation agent: uses deterministic TS repo-profiler output and identifies missing gates/evals/docs.
  • Harness proposal agent: produces dry-run proposals only; no tracked file writes except reviewed artifacts later.
  • Remediation agent: later, after evals are stable, takes failing evals and proposes code/test patches.

Speed strategy:

  • Deterministic TS: scan files, parse manifests, read git state, write DB rows.
  • Cheap/local model agents: classify dump notes, summarize failures, label risk.
  • Strong model agents: propose harnesses, generate eval rubrics, repair complex failures.
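The tier split above could be encoded as a simple routing table. A minimal sketch, assuming illustrative job-kind and tier names (none of these are SF config keys):

```typescript
// Illustrative routing for the speed strategy above; names are hypothetical.
type ModelTier = "deterministic_ts" | "cheap_local" | "strong";

const jobTier: Record<string, ModelTier> = {
  repo_scan: "deterministic_ts",       // scan files, parse manifests, read git state
  todo_classify: "cheap_local",        // classify dump notes, label risk
  failure_summarize: "cheap_local",    // summarize failures
  harness_proposal: "strong",          // propose harnesses
  eval_rubric_generation: "strong",    // generate eval rubrics
};

function tierFor(jobKind: string): ModelTier {
  // Unknown work defaults to the cheap classification tier, not the strong tier.
  return jobTier[jobKind] ?? "cheap_local";
}
```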

Desired pipeline: TODO.md dump -> triage agent -> eval candidate JSONL / backlog / docs / tests -> reviewed project artifact -> eval suite / harness gate -> self-evolution can consume later.

Potential eval candidate JSONL shape:

```json
{ "id": "sf.todo-triage.001", "task_input": "...", "expected_behavior": "...", "failure_mode": "...", "evidence": "...", "source": "TODO.md" }
```
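A minimal validator for that shape could look like this. A sketch only: the field list comes from the example line above, and `parseEvalCandidate` is a hypothetical helper, not an existing SF function.

```typescript
// Sketch: validate one eval-candidate JSONL line against the shape above.
interface EvalCandidate {
  id: string;
  task_input: string;
  expected_behavior: string;
  failure_mode: string;
  evidence: string;
  source: string;
}

const REQUIRED_FIELDS: (keyof EvalCandidate)[] = [
  "id", "task_input", "expected_behavior", "failure_mode", "evidence", "source",
];

function parseEvalCandidate(line: string): EvalCandidate | null {
  try {
    const obj = JSON.parse(line);
    // Every required field must be a non-empty string; otherwise reject the line.
    for (const f of REQUIRED_FIELDS) {
      if (typeof obj[f] !== "string" || obj[f].length === 0) return null;
    }
    return obj as EvalCandidate;
  } catch {
    return null; // not valid JSON at all
  }
}
```

Rejecting lines with missing or empty fields keeps downstream eval suites from silently ingesting half-formed candidates.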

Self-evolution principle:

  • Repeated failure -> add eval first, then fix behavior.
  • Raw memory/dump notes are evidence, not approved behavior.
  • GEPA/DSPy output must become reviewable diffs against skills/prompts/tool descriptions and pass held-out evals plus deterministic gates.

GEPA/DSPy placement across SF vs memory/brain:

  • GEPA/DSPy should not run inside normal SF runtime turns and should not live as direct mutable memory behavior.
  • SF owns the project workflow control plane: TODO triage, backlog handoff, eval artifacts, harness proposals, deterministic gates, reviewed diffs, and dispatch rules.
  • Memory/brain owns durable experience: session traces, user corrections, repeated failures, successful patterns, evidence IDs, source sessions, and recall/export APIs.
  • Memory/brain should expose dataset export surfaces for SF/self-evolution: "give me candidate eval cases for this repo/risk/skill/tool from past evidence".
  • GEPA/DSPy consumes approved eval datasets and memory-exported candidates offline, proposes prompt/skill/tool-description diffs, and hands those diffs back to SF as reviewable implementation work.
  • Accepted GEPA outputs become tracked repo artifacts or versioned SF resources, not raw memory entries.
  • Future home should be an offline evolution runner, either a separate repo such as singularity-evolution or a clearly isolated SF package/command such as packages/evolution plus /sf evolve .... It should read .sf/triage/evals/*.evals.jsonl, approved harness evals, and memory-exported eval candidates; run DSPy/GEPA; then write candidate diffs/reports under .sf/evolution/ or a review branch. It must not mutate live prompts, skills, memory, or tool descriptions directly.
  • End state: ACE Coder is the consolidation target for brain/memory, self-evolution, and agent workbench capabilities. It already has memory tiers and an evolution workspace, so it should eventually host the optimizer and long-running experiment service: consume SF eval artifacts and Singularity Memory exports, run GEPA/DSPy/genetic search, then return reports and candidate diffs to SF.
  • Near-term rule: keep execution in SF. ACE Coder can be the eventual consolidation target, but its execution loop is not as battle-tested as SF today. Start with SF's working tools, explicit artifacts, and deterministic gates; move capabilities behind stable contracts only after they are proven.
  • singularity-memory should migrate into ACE over time, but through a bridge rather than a wholesale copy. Keep the SF memory plugin contract stable, map Singularity Memory evidence/export APIs onto ACE memory concepts, compare quality/latency/operability, then swap the backend when ACE satisfies the contract.
  • Checked finding: Singularity Memory is the better current external brain contract for SF/Crush-style runners. It already has standalone MCP+HTTP, bank isolation, retain/recall/reflect, OpenAPI clients, thin tool adapters, VectorChord/BM25/RRF retrieval, optional reranking, and a Go migration path. ACE should eventually host this, but SF should keep targeting the Singularity Memory contract until ACE proves parity behind that same boundary.
  • Target topology: ACE is the central brain/workbench/evolution service; lightweight repo-local runners such as SF, Crush, or customer-approved agents run inside customer repositories. Those runners collect traces, triage TODO/self-report inputs, execute deterministic gates, and submit evidence/results back to ACE. ACE learns, evolves prompts/skills/tools offline, and returns reviewed candidate diffs or policies for the local runner to apply.
  • SF-to-Crush direction: preserve the parts of SF that are already working well (AGENTS/TODO triage, .sf/triage artifacts, backlog promotion, harness/eval gates, dispatch rules, and reviewable diffs) but make them usable from a Crush-style repo-local runner. In that shape, Crush is the customer-repo execution surface, SF is the workflow/gate library or adapter, and ACE Coder is the linked brain/workbench that stores memory, runs evolution, and sends back policies or candidate patches.
  • SF-to-vtcode/Rust direction: port the hot, deterministic SF pieces toward a Rust/vtcode-style core over time: repo scanning, artifact IO, dispatch state, gate execution, JSONL triage stores, and local runner protocol glue. Keep the current TS implementation as the working reference until the Rust path proves parity.
  • UX/runtime preference: keep Charm-style terminal UX where it adds operator clarity, and keep Crush in view as the fast repo-local execution surface. Rust/vtcode should optimize the core and protocol layer, not erase the good local workflow experience.
  • ACE creates/manages agents, memories, eval suites, skills, and policies. External/customer repos stay outside the ACE server boundary: repo-local runners own checkout access, file edits, tests, secrets exposure, and side effects, then report traces/results/artifacts back to ACE.

Proper info flow:

  • Raw human dump: root TODO.md.
  • Raw agent self-report: .sf/BACKLOG.md and ~/.sf/agent/upstream-feedback.jsonl.
  • Raw session-derived evidence: Singularity Memory / brain.
  • First normalizer: /sf todo triage for TODO.md now; future /sf inbox triage should normalize TODO.md + self-feedback + memory exports through the same schema.
  • Normalized pending items live in .sf/triage/inbox/*.jsonl with source, kind, evidence, status, and created_at.
  • Human-readable triage reports live in .sf/triage/reports/*.md.
  • Eval-ready cases live in .sf/triage/evals/*.evals.jsonl.
  • Human/planner-visible implementation tasks may be copied into .sf/BACKLOG.md with /sf todo triage --backlog, but auto-mode must not execute backlog directly. Planning/reassessment proposes promotion; user or explicit command approves promotion into roadmap/slice/task artifacts.
  • Memory-worthy notes are retained by memory/brain only after triage attaches evidence/source; raw TODO notes are not memory.
  • Preferred triage model tier: MiniMax M2.7 highspeed when available, then MiniMax M2.5 highspeed, then other cheap/fast classification models. Triage is structuring/classification, not final code editing.
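The normalized inbox items above carry source, kind, evidence, status, and created_at. A rough type sketch, assuming the union members are illustrative guesses rather than a fixed schema:

```typescript
// Sketch of a normalized .sf/triage/inbox/*.jsonl item.
// Field names come from the notes above; union members are hypothetical.
interface InboxItem {
  source: "TODO.md" | "self_report" | "memory_export" | string;
  kind: "eval_candidate" | "implementation_task" | "memory_fact" | "doc" | string;
  evidence: string;    // evidence ID or excerpt tying the item back to its raw input
  status: "pending" | "promoted" | "rejected" | string;
  created_at: string;  // ISO-8601 timestamp
}

// Only evidence-backed, non-rejected items are candidates for memory retention;
// raw TODO notes without evidence are not memory.
function isRetainable(item: InboxItem): boolean {
  return item.evidence.length > 0 && item.status !== "rejected";
}
```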