# TODO
Dump anything here.
SF agentic engineering / harness / memory / eval context dump:
We want a low-friction dump inbox that turns rough human notes into project
evals, harness work, memory requirements, docs, tests, or implementation tasks.
Root TODO.md is the dump place. AGENTS.md carries the durable instruction:
agents should read TODO.md when present, triage it, and clear processed notes
after converting them into reviewable artifacts.
Important split:
- AGENTS.md = durable startup-visible instructions.
- TODO.md = messy temporary dump inbox.
- Memory = experience store.
- GEPA/DSPy/self-evolution = offline lab.
- Runtime agent = uses approved skills/prompts/tools/memory, not unreviewed
evolved candidates.
Harness.io note:
- Harness Agents are AI workers inside Harness CI/CD pipelines.
- They inherit pipeline context, secrets, RBAC, approvals, logs, and OPA policy.
- Useful SF lesson: run agents inside a governed workflow with permissions,
logs, approvals, artifacts, reusable templates, and reviewable outputs.
- This is different from repo-native test/eval harnesses, but the control-plane
pattern is valuable.
Current SF state:
- Auto-mode safety harness exists and is default-on: evidence collection,
file-change validation, evidence cross-reference, destructive command
warnings, content validation, checkpoints. Auto rollback is off by default.
- gate-evaluate exists but is opt-in via gate_evaluation.enabled.
- Repo-native harness evolution is mostly read-only/proposed today:
/sf harness profile records repo facts in .sf/sf.db, but does not yet enforce
harness/manifest gates or write harness/, gates/, eval suites, or CI files.
Slow conversion of TS into fast agents:
- Do not rewrite the deterministic SF state machine into LLM behavior.
- Keep TypeScript for CLI, TUI, extension API, preferences, state machine, DB
schema, safety gates, prompt rendering, workflow orchestration, and file
ownership rules.
- Convert fuzzy/read-only work into narrow agents: repo profiling
interpretation, TODO triage, eval generation, harness proposal, failure
analysis, review, remediation proposals, memory extraction, drift detection.
- SF remains the orchestrator and ledger. Agents consume typed jobs and return
structured JSON.
Possible AgentJob shape:
```typescript
type AgentJob =
  | { kind: "repo_profile"; cwd: string }
  | { kind: "todo_triage"; cwd: string; todoPath: string }
  | { kind: "eval_candidate_generation"; cwd: string; sources: string[] }
  | { kind: "failure_analysis"; cwd: string; runId: string }
  | { kind: "harness_proposal"; cwd: string; profileId: string };
```
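The orchestrator side can stay deterministic with an exhaustive switch over that union. A minimal sketch, assuming invented names (`AgentResult`, `routeJob` are illustrations, not existing SF APIs):

```typescript
// Reuses the AgentJob union above; AgentResult is an invented sketch of the
// structured-JSON envelope agents hand back to the SF orchestrator.
type AgentJob =
  | { kind: "repo_profile"; cwd: string }
  | { kind: "todo_triage"; cwd: string; todoPath: string }
  | { kind: "eval_candidate_generation"; cwd: string; sources: string[] }
  | { kind: "failure_analysis"; cwd: string; runId: string }
  | { kind: "harness_proposal"; cwd: string; profileId: string };

interface AgentResult {
  jobKind: AgentJob["kind"];
  ok: boolean;
  artifacts: string[]; // paths to reviewable outputs, e.g. under .sf/triage/
}

// Exhaustive switch: routing stays in deterministic TS, and the compiler
// flags any new AgentJob kind that is not handled here.
function routeJob(job: AgentJob): string {
  switch (job.kind) {
    case "repo_profile":
      return `profile:${job.cwd}`;
    case "todo_triage":
      return `triage:${job.todoPath}`;
    case "eval_candidate_generation":
      return `evals:${job.sources.length} sources`;
    case "failure_analysis":
      return `failure:${job.runId}`;
    case "harness_proposal":
      return `harness:${job.profileId}`;
  }
}
```

The discriminated-union pattern is the point here: SF remains the ledger, and agents only ever fill the typed payloads.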
First useful agents:
- TODO triage agent: reads TODO.md, creates eval candidates, implementation
tasks, memory facts, docs/harness suggestions, then clears processed notes.
- Eval candidate agent: converts notes/session failures into JSONL with
task_input, expected_behavior, failure_mode, evidence, source.
- Repo profile interpretation agent: uses deterministic TS repo-profiler output
and identifies missing gates/evals/docs.
- Harness proposal agent: produces dry-run proposals only; no tracked file
writes except reviewed artifacts later.
- Remediation agent: later, after evals are stable, takes failing evals and
proposes code/test patches.
Speed strategy:
- Deterministic TS: scan files, parse manifests, read git state, write DB rows.
- Cheap/local model agents: classify dump notes, summarize failures, label risk.
- Strong model agents: propose harnesses, generate eval rubrics, repair complex
failures.
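The tiering above can be encoded as a deterministic lookup. A sketch with illustrative task names and tier labels (none of these identifiers exist in SF today):

```typescript
// Illustrative tier routing for the speed strategy: deterministic TS work
// never calls a model, fuzzy classification goes cheap, proposals go strong.
type ModelTier = "deterministic" | "cheap" | "strong";

const TIER_BY_TASK: Record<string, ModelTier> = {
  scan_files: "deterministic",
  parse_manifests: "deterministic",
  classify_dump_note: "cheap",
  summarize_failure: "cheap",
  label_risk: "cheap",
  propose_harness: "strong",
  generate_eval_rubric: "strong",
};

function tierFor(task: string): ModelTier {
  // Default to the strong tier so unknown work is never silently
  // under-resourced.
  return TIER_BY_TASK[task] ?? "strong";
}
```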
Desired pipeline:
TODO.md dump -> triage agent -> eval candidate JSONL / backlog / docs / tests
-> reviewed project artifact -> eval suite / harness gate -> self-evolution
can consume later.
Potential eval candidate JSONL shape:
```json
{
  "id": "sf.todo-triage.001",
  "task_input": "...",
  "expected_behavior": "...",
  "failure_mode": "...",
  "evidence": "...",
  "source": "TODO.md"
}
```
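A minimal parser/validator for that record shape could look like the sketch below. Field names come from the shape above; the validation rule (every field must be a non-missing string) is an assumption:

```typescript
// Eval candidate record, matching the JSONL shape sketched above.
interface EvalCandidate {
  id: string;
  task_input: string;
  expected_behavior: string;
  failure_mode: string;
  evidence: string;
  source: string;
}

const REQUIRED: (keyof EvalCandidate)[] = [
  "id", "task_input", "expected_behavior", "failure_mode", "evidence", "source",
];

// Parse one JSONL file's contents: skip blank lines and drop records with
// missing fields so malformed dump notes never reach the eval suite.
function parseEvalCandidates(jsonl: string): EvalCandidate[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as Record<string, unknown>)
    .filter((rec) => REQUIRED.every((k) => typeof rec[k] === "string"))
    .map((rec) => rec as unknown as EvalCandidate);
}
```

Rejecting rather than repairing bad records keeps the triage pipeline honest: a dropped record shows up as a triage gap, not as a silently mutated eval case.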
Self-evolution principle:
- Repeated failure -> add eval first, then fix behavior.
- Raw memory/dump notes are evidence, not approved behavior.
- GEPA/DSPy output must become reviewable diffs against skills/prompts/tool
descriptions and pass held-out evals plus deterministic gates.
GEPA/DSPy placement across SF vs memory/brain:
- GEPA/DSPy should not run inside normal SF runtime turns and should not live
as direct mutable memory behavior.
- SF owns the project workflow control plane: TODO triage, backlog handoff,
eval artifacts, harness proposals, deterministic gates, reviewed diffs, and
dispatch rules.
- Memory/brain owns durable experience: session traces, user corrections,
repeated failures, successful patterns, evidence IDs, source sessions, and
recall/export APIs.
- Memory/brain should expose dataset export surfaces for SF/self-evolution:
"give me candidate eval cases for this repo/risk/skill/tool from past
evidence".
- GEPA/DSPy consumes approved eval datasets and memory-exported candidates
offline, proposes prompt/skill/tool-description diffs, and hands those diffs
back to SF as reviewable implementation work.
- Accepted GEPA outputs become tracked repo artifacts or versioned SF resources,
not raw memory entries.
- Future home should be an offline evolution runner, either a separate repo
such as `singularity-evolution` or a clearly isolated SF package/command such
as `packages/evolution` plus `/sf evolve ...`. It should read
`.sf/triage/evals/*.evals.jsonl`, approved harness evals, and memory-exported
eval candidates; run DSPy/GEPA; then write candidate diffs/reports under
`.sf/evolution/` or a review branch. It must not mutate live prompts,
skills, memory, or tool descriptions directly.
- End state: ACE Coder is the consolidation target for brain/memory,
self-evolution, and agent workbench capabilities. It already has memory tiers
and an evolution workspace, so it should eventually host the optimizer and
long-running experiment service: consume SF eval artifacts and Singularity
Memory exports, run GEPA/DSPy/genetic search, then return reports and
candidate diffs to SF.
- Near-term rule: keep execution in SF. ACE Coder can be the eventual
consolidation target, but its execution loop is not as battle-tested as SF
today. Start with SF's working tools, explicit artifacts, and deterministic
gates; move capabilities behind stable contracts only after they are proven.
- `singularity-memory` should migrate into ACE over time, but through a bridge
rather than a wholesale copy. Keep the SF memory plugin contract stable, map
Singularity Memory evidence/export APIs onto ACE memory concepts, compare
quality/latency/operability, then swap the backend when ACE satisfies the
contract.
- Checked finding: Singularity Memory is the better current external brain
contract for SF/Crush-style runners. It already has standalone MCP+HTTP,
bank isolation, retain/recall/reflect, OpenAPI clients, thin tool adapters,
VectorChord/BM25/RRF retrieval, optional reranking, and a Go migration path.
ACE should eventually host this, but SF should keep targeting the
Singularity Memory contract until ACE proves parity behind that same
boundary.
- Target topology: ACE is the central brain/workbench/evolution service;
lightweight repo-local runners such as SF, Crush, or customer-approved
agents run inside customer repositories. Those runners collect traces,
triage TODO/self-report inputs, execute deterministic gates, and submit
evidence/results back to ACE. ACE learns, evolves prompts/skills/tools
offline, and returns reviewed candidate diffs or policies for the local
runner to apply.
- SF-to-Crush direction: preserve the parts of SF that are already working
  well (AGENTS/TODO triage, `.sf/triage` artifacts, backlog promotion,
  harness/eval gates, dispatch rules, and reviewable diffs), but make them
  usable from a Crush-style repo-local runner. In that shape, Crush is the
  customer-repo execution surface, SF is the workflow/gate library or adapter,
  and ACE Coder is the linked brain/workbench that stores memory, runs
  evolution, and sends back policies or candidate patches.
- SF-to-vtcode/Rust direction: port the hot, deterministic SF pieces toward a
Rust/vtcode-style core over time: repo scanning, artifact IO, dispatch state,
gate execution, JSONL triage stores, and local runner protocol glue. Keep the
current TS implementation as the working reference until the Rust path proves
parity.
- UX/runtime preference: keep Charm-style terminal UX where it adds operator
clarity, and keep Crush in view as the fast repo-local execution surface.
Rust/vtcode should optimize the core and protocol layer, not erase the good
local workflow experience.
- ACE creates/manages agents, memories, eval suites, skills, and policies.
External/customer repos stay outside the ACE server boundary: repo-local
runners own checkout access, file edits, tests, secrets exposure, and side
effects, then report traces/results/artifacts back to ACE.
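Under that topology, the runner-to-ACE boundary could be a small typed payload. Everything in this sketch is invented (there is no such ACE or SF API today); it only illustrates the direction of the data flow described above:

```typescript
// Invented sketch of the repo-local runner -> ACE submission boundary.
// The runner owns checkout access and side effects; only evidence crosses.
interface RunnerSubmission {
  runner: "sf" | "crush" | string;   // which repo-local runner produced this
  repo: string;                      // customer repo identifier, not a checkout
  traces: string[];                  // session trace IDs retained locally
  gateResults: { gate: string; passed: boolean }[];
  artifacts: string[];               // paths under .sf/triage/ etc.
}

// ACE replies with reviewable candidates, never direct mutations.
interface AceResponse {
  candidateDiffs: string[];  // diffs the runner opens for human review
  policies: string[];        // policy updates, applied only after approval
}

function allGatesPassed(sub: RunnerSubmission): boolean {
  return sub.gateResults.every((g) => g.passed);
}
```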
Proper info flow:
- Raw human dump: root TODO.md.
- Raw agent self-report: .sf/BACKLOG.md and ~/.sf/agent/upstream-feedback.jsonl.
- Raw session-derived evidence: Singularity Memory / brain.
- First normalizer: /sf todo triage for TODO.md now; future /sf inbox triage
should normalize TODO.md + self-feedback + memory exports through the same
schema.
- Normalized pending items live in .sf/triage/inbox/*.jsonl with source, kind,
evidence, status, and created_at.
- Human-readable triage reports live in .sf/triage/reports/*.md.
- Eval-ready cases live in .sf/triage/evals/*.evals.jsonl.
- Human/planner-visible implementation tasks may be copied into .sf/BACKLOG.md
with /sf todo triage --backlog, but auto-mode must not execute backlog
directly. Planning/reassessment proposes promotion; user or explicit command
approves promotion into roadmap/slice/task artifacts.
- Memory-worthy notes are retained by memory/brain only after triage attaches
evidence/source; raw TODO notes are not memory.
- Preferred triage model tier: MiniMax M2.7 highspeed when available, then
MiniMax M2.5 highspeed, then other cheap/fast classification models. Triage
is structuring/classification, not final code editing.
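The normalized inbox record described above (source, kind, evidence, status, created_at) could be typed like this; the concrete enum values for `kind` and `status` are assumptions, since only the field names are fixed so far:

```typescript
// Normalized pending inbox item for .sf/triage/inbox/*.jsonl, using the
// fields named above; the kind/status value sets are assumed, not final.
interface InboxItem {
  source: "TODO.md" | "self-report" | "memory" | string;
  kind: "eval_candidate" | "impl_task" | "memory_fact" | "doc" | string;
  evidence: string;
  status: "pending" | "promoted" | "discarded";
  created_at: string; // ISO-8601 timestamp
}

// Every raw note enters as pending; promotion into backlog/roadmap artifacts
// is a separate, explicitly approved step.
function normalizeNote(source: string, kind: string, evidence: string): InboxItem {
  return {
    source,
    kind,
    evidence,
    status: "pending",
    created_at: new Date().toISOString(),
  };
}
```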