# TODO
Dump anything here.
SF agentic engineering / harness / memory / eval context dump:
We want a low-friction dump inbox that turns rough human notes into project
evals, harness work, memory requirements, docs, tests, or implementation tasks.
Root TODO.md is the dump place. AGENTS.md carries the durable instruction:
agents should read TODO.md when present, triage it, and clear processed notes
after converting them into reviewable artifacts.
Important split:
- AGENTS.md = durable startup-visible instructions.
- TODO.md = messy temporary dump inbox.
- Memory = experience store.
- GEPA/DSPy/self-evolution = offline lab.
- Runtime agent = uses approved skills/prompts/tools/memory, not unreviewed
evolved candidates.
Harness.io note:
- Harness Agents are AI workers inside Harness CI/CD pipelines.
- They inherit pipeline context, secrets, RBAC, approvals, logs, and OPA policy.
- Useful SF lesson: run agents inside a governed workflow with permissions,
logs, approvals, artifacts, reusable templates, and reviewable outputs.
- This is different from repo-native test/eval harnesses, but the control-plane
pattern is valuable.
Current SF state:
- Auto-mode safety harness exists and is default-on: evidence collection,
file-change validation, evidence cross-reference, destructive command
warnings, content validation, checkpoints. Auto rollback is off by default.
- gate-evaluate exists but is opt-in via gate_evaluation.enabled.
- Repo-native harness evolution is mostly read-only/proposed today:
/sf harness profile records repo facts in .sf/sf.db, but does not yet enforce
harness/manifest gates or write harness/, gates/, eval suites, or CI files.
Slow conversion of TS into fast agents:
- Do not rewrite the deterministic SF state machine into LLM behavior.
- Keep TypeScript for CLI, TUI, extension API, preferences, state machine, DB
schema, safety gates, prompt rendering, workflow orchestration, and file
ownership rules.
- Convert fuzzy/read-only work into narrow agents: repo profiling
interpretation, TODO triage, eval generation, harness proposal, failure
analysis, review, remediation proposals, memory extraction, drift detection.
- SF remains the orchestrator and ledger. Agents consume typed jobs and return
structured JSON.
Possible AgentJob shape:
```typescript
type AgentJob =
  | { kind: "repo_profile"; cwd: string }
  | { kind: "todo_triage"; cwd: string; todoPath: string }
  | { kind: "eval_candidate_generation"; cwd: string; sources: string[] }
  | { kind: "failure_analysis"; cwd: string; runId: string }
  | { kind: "harness_proposal"; cwd: string; profileId: string };
```
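The orchestrator side can stay deterministic with an exhaustive switch over that union. A minimal sketch, assuming invented names (`AgentResult`, `routeJob` are illustrations, not existing SF APIs):

```typescript
// Reuses the AgentJob union above; AgentResult is an invented sketch of the
// structured-JSON envelope agents hand back to the SF orchestrator.
type AgentJob =
  | { kind: "repo_profile"; cwd: string }
  | { kind: "todo_triage"; cwd: string; todoPath: string }
  | { kind: "eval_candidate_generation"; cwd: string; sources: string[] }
  | { kind: "failure_analysis"; cwd: string; runId: string }
  | { kind: "harness_proposal"; cwd: string; profileId: string };

interface AgentResult {
  jobKind: AgentJob["kind"];
  ok: boolean;
  artifacts: string[]; // paths to reviewable outputs, e.g. under .sf/triage/
}

// Exhaustive switch: routing stays in deterministic TS, and the compiler
// flags any new AgentJob kind that is not handled here.
function routeJob(job: AgentJob): string {
  switch (job.kind) {
    case "repo_profile":
      return `profile:${job.cwd}`;
    case "todo_triage":
      return `triage:${job.todoPath}`;
    case "eval_candidate_generation":
      return `evals:${job.sources.length} sources`;
    case "failure_analysis":
      return `failure:${job.runId}`;
    case "harness_proposal":
      return `harness:${job.profileId}`;
  }
}
```

The discriminated-union pattern is the point here: SF remains the ledger, and agents only ever fill the typed payloads.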
First useful agents:
- TODO triage agent: reads TODO.md, creates eval candidates, implementation
tasks, memory facts, docs/harness suggestions, then clears processed notes.
- Eval candidate agent: converts notes/session failures into JSONL with
task_input, expected_behavior, failure_mode, evidence, source.
- Repo profile interpretation agent: uses deterministic TS repo-profiler output
and identifies missing gates/evals/docs.
- Harness proposal agent: produces dry-run proposals only; no tracked file
writes except reviewed artifacts later.
- Remediation agent: later, after evals are stable, takes failing evals and
proposes code/test patches.
Speed strategy:
- Deterministic TS: scan files, parse manifests, read git state, write DB rows.
- Cheap/local model agents: classify dump notes, summarize failures, label risk.
- Strong model agents: propose harnesses, generate eval rubrics, repair complex
failures.
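The tiering above can be encoded as a deterministic lookup. A sketch with illustrative task names and tier labels (none of these identifiers exist in SF today):

```typescript
// Illustrative tier routing for the speed strategy: deterministic TS work
// never calls a model, fuzzy classification goes cheap, proposals go strong.
type ModelTier = "deterministic" | "cheap" | "strong";

const TIER_BY_TASK: Record<string, ModelTier> = {
  scan_files: "deterministic",
  parse_manifests: "deterministic",
  classify_dump_note: "cheap",
  summarize_failure: "cheap",
  label_risk: "cheap",
  propose_harness: "strong",
  generate_eval_rubric: "strong",
};

function tierFor(task: string): ModelTier {
  // Default to the strong tier so unknown work is never silently
  // under-resourced.
  return TIER_BY_TASK[task] ?? "strong";
}
```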
Desired pipeline:
TODO.md dump -> triage agent -> eval candidate JSONL / backlog / docs / tests
-> reviewed project artifact -> eval suite / harness gate -> self-evolution
can consume later.
Potential eval candidate JSONL shape:
```json
{
  "id": "sf.todo-triage.001",
  "task_input": "...",
  "expected_behavior": "...",
  "failure_mode": "...",
  "evidence": "...",
  "source": "TODO.md"
}
```
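A minimal parser/validator for that record shape could look like the sketch below. Field names come from the shape above; the validation rule (every field must be a non-missing string) is an assumption:

```typescript
// Eval candidate record, matching the JSONL shape sketched above.
interface EvalCandidate {
  id: string;
  task_input: string;
  expected_behavior: string;
  failure_mode: string;
  evidence: string;
  source: string;
}

const REQUIRED: (keyof EvalCandidate)[] = [
  "id", "task_input", "expected_behavior", "failure_mode", "evidence", "source",
];

// Parse one JSONL file's contents: skip blank lines and drop records with
// missing fields so malformed dump notes never reach the eval suite.
function parseEvalCandidates(jsonl: string): EvalCandidate[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as Record<string, unknown>)
    .filter((rec) => REQUIRED.every((k) => typeof rec[k] === "string"))
    .map((rec) => rec as unknown as EvalCandidate);
}
```

Rejecting rather than repairing bad records keeps the triage pipeline honest: a dropped record shows up as a triage gap, not as a silently mutated eval case.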
Self-evolution principle:
- Repeated failure -> add eval first, then fix behavior.
- Raw memory/dump notes are evidence, not approved behavior.
- GEPA/DSPy output must become reviewable diffs against skills/prompts/tool
descriptions and pass held-out evals plus deterministic gates.
GEPA/DSPy placement across SF vs memory/brain:
- GEPA/DSPy should not run inside normal SF runtime turns and should not live
as direct mutable memory behavior.
- SF owns the project workflow control plane: TODO triage, backlog handoff,
eval artifacts, harness proposals, deterministic gates, reviewed diffs, and
dispatch rules.
- Memory/brain owns durable experience: session traces, user corrections,
repeated failures, successful patterns, evidence IDs, source sessions, and
recall/export APIs.
- Memory/brain should expose dataset export surfaces for SF/self-evolution:
"give me candidate eval cases for this repo/risk/skill/tool from past
evidence".
- GEPA/DSPy consumes approved eval datasets and memory-exported candidates
offline, proposes prompt/skill/tool-description diffs, and hands those diffs
back to SF as reviewable implementation work.
- Accepted GEPA outputs become tracked repo artifacts or versioned SF resources,
not raw memory entries.
- Future home should be an offline evolution runner, either a separate repo
such as `singularity-evolution` or a clearly isolated SF package/command such
as `packages/evolution` plus `/sf evolve ...`. It should read
`.sf/triage/evals/*.evals.jsonl`, approved harness evals, and memory-exported
eval candidates; run DSPy/GEPA; then write candidate diffs/reports under
`.sf/evolution/` or a review branch. It must not mutate live prompts,
skills, memory, or tool descriptions directly.
- End state: ACE Coder is the consolidation target for brain/memory,
self-evolution, and agent workbench capabilities. It already has memory tiers
and an evolution workspace, so it should eventually host the optimizer and
long-running experiment service: consume SF eval artifacts and Singularity
Memory exports, run GEPA/DSPy/genetic search, then return reports and
candidate diffs to SF.
- Near-term rule: keep execution in SF. ACE Coder can be the eventual
consolidation target, but its execution loop is not as battle-tested as SF
today. Start with SF's working tools, explicit artifacts, and deterministic
gates; move capabilities behind stable contracts only after they are proven.
- `singularity-memory` should migrate into ACE over time, but through a bridge
rather than a wholesale copy. Keep the SF memory plugin contract stable, map
Singularity Memory evidence/export APIs onto ACE memory concepts, compare
quality/latency/operability, then swap the backend when ACE satisfies the
contract.
- Checked finding: Singularity Memory is the better current external brain
contract for SF/Crush-style runners. It already has standalone MCP+HTTP,
bank isolation, retain/recall/reflect, OpenAPI clients, thin tool adapters,
VectorChord/BM25/RRF retrieval, optional reranking, and a Go migration path.
ACE should eventually host this, but SF should keep targeting the
Singularity Memory contract until ACE proves parity behind that same
boundary.
- Target topology: ACE is the central brain/workbench/evolution service;
lightweight repo-local runners such as SF, Crush, or customer-approved
agents run inside customer repositories. Those runners collect traces,
triage TODO/self-report inputs, execute deterministic gates, and submit
evidence/results back to ACE. ACE learns, evolves prompts/skills/tools
offline, and returns reviewed candidate diffs or policies for the local
runner to apply.
- SF-to-Crush direction: preserve the parts of SF that are already working
  well (AGENTS/TODO triage, `.sf/triage` artifacts, backlog promotion,
  harness/eval gates, dispatch rules, and reviewable diffs), but make them
  usable from a Crush-style repo-local runner. In that shape, Crush is the
  customer-repo execution surface, SF is the workflow/gate library or adapter,
  and ACE Coder is the linked brain/workbench that stores memory, runs
  evolution, and sends back policies or candidate patches.
- SF-to-vtcode/Rust direction: port the hot, deterministic SF pieces toward a
Rust/vtcode-style core over time: repo scanning, artifact IO, dispatch state,
gate execution, JSONL triage stores, and local runner protocol glue. Keep the
current TS implementation as the working reference until the Rust path proves
parity.
- UX/runtime preference: keep Charm-style terminal UX where it adds operator
clarity, and keep Crush in view as the fast repo-local execution surface.
Rust/vtcode should optimize the core and protocol layer, not erase the good
local workflow experience.
- ACE creates/manages agents, memories, eval suites, skills, and policies.
External/customer repos stay outside the ACE server boundary: repo-local
runners own checkout access, file edits, tests, secrets exposure, and side
effects, then report traces/results/artifacts back to ACE.
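Under that topology, the runner-to-ACE boundary could be a small typed payload. Everything in this sketch is invented (there is no such ACE or SF API today); it only illustrates the direction of the data flow described above:

```typescript
// Invented sketch of the repo-local runner -> ACE submission boundary.
// The runner owns checkout access and side effects; only evidence crosses.
interface RunnerSubmission {
  runner: "sf" | "crush" | string;   // which repo-local runner produced this
  repo: string;                      // customer repo identifier, not a checkout
  traces: string[];                  // session trace IDs retained locally
  gateResults: { gate: string; passed: boolean }[];
  artifacts: string[];               // paths under .sf/triage/ etc.
}

// ACE replies with reviewable candidates, never direct mutations.
interface AceResponse {
  candidateDiffs: string[];  // diffs the runner opens for human review
  policies: string[];        // policy updates, applied only after approval
}

function allGatesPassed(sub: RunnerSubmission): boolean {
  return sub.gateResults.every((g) => g.passed);
}
```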
Proper info flow:
- Raw human dump: root TODO.md.
- Raw agent self-report: .sf/BACKLOG.md and ~/.sf/agent/upstream-feedback.jsonl.
- Raw session-derived evidence: Singularity Memory / brain.
- First normalizer: /sf todo triage for TODO.md now; future /sf inbox triage
should normalize TODO.md + self-feedback + memory exports through the same
schema.
- Normalized pending items live in .sf/triage/inbox/*.jsonl with source, kind,
evidence, status, and created_at.
- Human-readable triage reports live in .sf/triage/reports/*.md.
- Eval-ready cases live in .sf/triage/evals/*.evals.jsonl.
- Human/planner-visible implementation tasks may be copied into .sf/BACKLOG.md
with /sf todo triage --backlog, but auto-mode must not execute backlog
directly. Planning/reassessment proposes promotion; user or explicit command
approves promotion into roadmap/slice/task artifacts.
- Memory-worthy notes are retained by memory/brain only after triage attaches
evidence/source; raw TODO notes are not memory.
- Preferred triage model tier: MiniMax M2.7 highspeed when available, then
MiniMax M2.5 highspeed, then other cheap/fast classification models. Triage
is structuring/classification, not final code editing.
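The normalized inbox record described above (source, kind, evidence, status, created_at) could be typed like this; the concrete enum values for `kind` and `status` are assumptions, since only the field names are fixed so far:

```typescript
// Normalized pending inbox item for .sf/triage/inbox/*.jsonl, using the
// fields named above; the kind/status value sets are assumed, not final.
interface InboxItem {
  source: "TODO.md" | "self-report" | "memory" | string;
  kind: "eval_candidate" | "impl_task" | "memory_fact" | "doc" | string;
  evidence: string;
  status: "pending" | "promoted" | "discarded";
  created_at: string; // ISO-8601 timestamp
}

// Every raw note enters as pending; promotion into backlog/roadmap artifacts
// is a separate, explicitly approved step.
function normalizeNote(source: string, kind: string, evidence: string): InboxItem {
  return {
    source,
    kind,
    evidence,
    status: "pending",
    created_at: new Date().toISOString(),
  };
}
```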