# TODO

Dump anything here.

SF agentic engineering / harness / memory / eval context dump:

We want a low-friction dump inbox that turns rough human notes into project
evals, harness work, memory requirements, docs, tests, or implementation tasks.
Root TODO.md is the dump place. AGENTS.md carries the durable instruction:
agents should read TODO.md when present, triage it, and clear processed notes
after converting them into reviewable artifacts.
Important split:

- AGENTS.md = durable startup-visible instructions.
- TODO.md = messy temporary dump inbox.
- Memory = experience store.
- GEPA/DSPy/self-evolution = offline lab.
- Runtime agent = uses approved skills/prompts/tools/memory, not unreviewed
  evolved candidates.

Harness.io note:

- Harness Agents are AI workers inside Harness CI/CD pipelines.
- They inherit pipeline context, secrets, RBAC, approvals, logs, and OPA policy.
- Useful SF lesson: run agents inside a governed workflow with permissions,
  logs, approvals, artifacts, reusable templates, and reviewable outputs.
- This is different from repo-native test/eval harnesses, but the control-plane
  pattern is valuable.

Current SF state:

- Auto-mode safety harness exists and is default-on: evidence collection,
  file-change validation, evidence cross-referencing, destructive-command
  warnings, content validation, and checkpoints. Auto rollback is off by
  default.
- gate-evaluate exists but is opt-in via gate_evaluation.enabled.
- Repo-native harness evolution is mostly read-only/proposed today:
  /sf harness profile records repo facts in .sf/sf.db, but it does not yet
  enforce harness/manifest gates or write harness/, gates/, eval suites, or
  CI files.

Slow conversion of TS into fast agents:

- Do not rewrite the deterministic SF state machine into LLM behavior.
- Keep TypeScript for the CLI, TUI, extension API, preferences, state machine,
  DB schema, safety gates, prompt rendering, workflow orchestration, and file
  ownership rules.
- Convert fuzzy/read-only work into narrow agents: repo profiling
  interpretation, TODO triage, eval generation, harness proposals, failure
  analysis, review, remediation proposals, memory extraction, and drift
  detection.
- SF remains the orchestrator and ledger. Agents consume typed jobs and return
  structured JSON.

Possible AgentJob shape:

```ts
type AgentJob =
  | { kind: "repo_profile"; cwd: string }
  | { kind: "todo_triage"; cwd: string; todoPath: string }
  | { kind: "eval_candidate_generation"; cwd: string; sources: string[] }
  | { kind: "failure_analysis"; cwd: string; runId: string }
  | { kind: "harness_proposal"; cwd: string; profileId: string };
```
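
The orchestrator/ledger split could then be wired up roughly as below. This is
a hypothetical sketch, not existing SF code: `dispatch`, `AgentResult`, and the
handler bodies are illustrative, and the job union is trimmed to two kinds.

```typescript
// Hypothetical sketch: SF stays the orchestrator; each narrow agent is a
// handler that takes one typed job and must return structured JSON.
// Job union trimmed to two kinds for brevity.
type AgentJob =
  | { kind: "repo_profile"; cwd: string }
  | { kind: "todo_triage"; cwd: string; todoPath: string };

// Illustrative result envelope; the real SF output schema may differ.
interface AgentResult {
  ok: boolean;
  artifacts: string[]; // paths the agent proposed or wrote
  notes: string;
}

// One handler per job kind; the orchestrator selects it deterministically.
const handlers: Record<AgentJob["kind"], (job: AgentJob) => AgentResult> = {
  repo_profile: () => ({ ok: true, artifacts: [".sf/sf.db"], notes: "profiled repo" }),
  todo_triage: () => ({ ok: true, artifacts: [".sf/triage/inbox/pending.jsonl"], notes: "triaged TODO.md" }),
};

function dispatch(job: AgentJob): AgentResult {
  const result = handlers[job.kind](job);
  // The ledger only records structured output; reject anything malformed.
  if (typeof result.ok !== "boolean" || !Array.isArray(result.artifacts)) {
    throw new Error(`agent for ${job.kind} returned unstructured output`);
  }
  return result;
}
```

The point of the shape: routing and validation stay deterministic TS; only the
handler bodies would ever call a model.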

First useful agents:

- TODO triage agent: reads TODO.md, creates eval candidates, implementation
  tasks, memory facts, and docs/harness suggestions, then clears processed
  notes.
- Eval candidate agent: converts notes/session failures into JSONL with
  task_input, expected_behavior, failure_mode, evidence, and source.
- Repo profile interpretation agent: uses deterministic TS repo-profiler output
  and identifies missing gates/evals/docs.
- Harness proposal agent: produces dry-run proposals only; no tracked file
  writes except reviewed artifacts later.
- Remediation agent: later, after evals are stable, takes failing evals and
  proposes code/test patches.

Speed strategy:

- Deterministic TS: scan files, parse manifests, read git state, write DB rows.
- Cheap/local model agents: classify dump notes, summarize failures, label risk.
- Strong model agents: propose harnesses, generate eval rubrics, repair complex
  failures.

Desired pipeline:

TODO.md dump -> triage agent -> eval candidate JSONL / backlog / docs / tests
-> reviewed project artifact -> eval suite / harness gate -> self-evolution
can consume later.

Potential eval candidate JSONL shape:

```json
{
  "id": "sf.todo-triage.001",
  "task_input": "...",
  "expected_behavior": "...",
  "failure_mode": "...",
  "evidence": "...",
  "source": "TODO.md"
}
```
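
A minimal loader for that shape might look like the following. The function
name and error wording are illustrative, but the required keys mirror the
fields above.

```typescript
// Hypothetical JSONL loader for eval candidates: one JSON object per line,
// blank lines skipped, records missing a required field rejected loudly.
interface EvalCandidate {
  id: string;
  task_input: string;
  expected_behavior: string;
  failure_mode: string;
  evidence: string;
  source: string;
}

const REQUIRED: (keyof EvalCandidate)[] = [
  "id", "task_input", "expected_behavior", "failure_mode", "evidence", "source",
];

function parseEvalCandidates(jsonl: string): EvalCandidate[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line, i) => {
      const rec = JSON.parse(line) as Record<string, unknown>;
      for (const key of REQUIRED) {
        if (typeof rec[key] !== "string") {
          throw new Error(`line ${i + 1}: missing or non-string field "${key}"`);
        }
      }
      return rec as unknown as EvalCandidate;
    });
}
```

Failing loudly here matters: a silently-dropped field would let malformed
candidates leak into eval suites.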

Self-evolution principle:

- Repeated failure -> add eval first, then fix behavior.
- Raw memory/dump notes are evidence, not approved behavior.
- GEPA/DSPy output must become reviewable diffs against skills/prompts/tool
  descriptions and pass held-out evals plus deterministic gates.
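
That acceptance rule (held-out evals plus deterministic gates, all must pass)
can be sketched as a pure check; the report shape below is an assumption, not
an existing SF type.

```typescript
// Hypothetical promotion check for a GEPA/DSPy candidate diff: promotable
// only when every held-out eval AND every deterministic gate passed, and at
// least one held-out eval actually ran.
interface CandidateReport {
  heldOutEvals: { id: string; passed: boolean }[];
  deterministicGates: { name: string; passed: boolean }[];
}

function isPromotable(report: CandidateReport): boolean {
  return (
    report.heldOutEvals.length > 0 &&
    report.heldOutEvals.every((e) => e.passed) &&
    report.deterministicGates.every((g) => g.passed)
  );
}
```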

GEPA/DSPy placement across SF vs memory/brain:

- GEPA/DSPy should not run inside normal SF runtime turns and should not live
  as direct mutable memory behavior.
- SF owns the project workflow control plane: TODO triage, backlog handoff,
  eval artifacts, harness proposals, deterministic gates, reviewed diffs, and
  dispatch rules.
- Memory/brain owns durable experience: session traces, user corrections,
  repeated failures, successful patterns, evidence IDs, source sessions, and
  recall/export APIs.
- Memory/brain should expose dataset export surfaces for SF/self-evolution:
  "give me candidate eval cases for this repo/risk/skill/tool from past
  evidence".
- GEPA/DSPy consumes approved eval datasets and memory-exported candidates
  offline, proposes prompt/skill/tool-description diffs, and hands those diffs
  back to SF as reviewable implementation work.
- Accepted GEPA outputs become tracked repo artifacts or versioned SF
  resources, not raw memory entries.
- Future home should be an offline evolution runner, either a separate repo
  such as `singularity-evolution` or a clearly isolated SF package/command
  such as `packages/evolution` plus `/sf evolve ...`. It should read
  `.sf/triage/evals/*.evals.jsonl`, approved harness evals, and memory-exported
  eval candidates; run DSPy/GEPA; then write candidate diffs/reports under
  `.sf/evolution/` or a review branch. It must not mutate live prompts,
  skills, memory, or tool descriptions directly.
- End state: ACE Coder is the consolidation target for brain/memory,
  self-evolution, and agent workbench capabilities. It already has memory
  tiers and an evolution workspace, so it should eventually host the optimizer
  and long-running experiment service: consume SF eval artifacts and
  Singularity Memory exports, run GEPA/DSPy/genetic search, then return
  reports and candidate diffs to SF.
- Near-term rule: keep execution in SF. ACE Coder can be the eventual
  consolidation target, but its execution loop is not as battle-tested as SF's
  is today. Start with SF's working tools, explicit artifacts, and
  deterministic gates; move capabilities behind stable contracts only after
  they are proven.
- `singularity-memory` should migrate into ACE over time, but through a bridge
  rather than a wholesale copy. Keep the SF memory plugin contract stable, map
  Singularity Memory evidence/export APIs onto ACE memory concepts, compare
  quality/latency/operability, then swap the backend when ACE satisfies the
  contract.
- Checked finding: Singularity Memory is the better current external-brain
  contract for SF/Crush-style runners. It already has standalone MCP+HTTP,
  bank isolation, retain/recall/reflect, OpenAPI clients, thin tool adapters,
  VectorChord/BM25/RRF retrieval, optional reranking, and a Go migration path.
  ACE should eventually host this, but SF should keep targeting the
  Singularity Memory contract until ACE proves parity behind that same
  boundary.
- Target topology: ACE is the central brain/workbench/evolution service;
  lightweight repo-local runners such as SF, Crush, or customer-approved
  agents run inside customer repositories. Those runners collect traces,
  triage TODO/self-report inputs, execute deterministic gates, and submit
  evidence/results back to ACE. ACE learns, evolves prompts/skills/tools
  offline, and returns reviewed candidate diffs or policies for the local
  runner to apply.
- SF-to-Crush direction: preserve the parts of SF that are already working
  well (AGENTS/TODO triage, `.sf/triage` artifacts, backlog promotion,
  harness/eval gates, dispatch rules, and reviewable diffs) but make them
  usable from a Crush-style repo-local runner. In that shape, Crush is the
  customer-repo execution surface, SF is the workflow/gate library or adapter,
  and ACE Coder is the linked brain/workbench that stores memory, runs
  evolution, and sends back policies or candidate patches.
- SF-to-vtcode/Rust direction: port the hot, deterministic SF pieces toward a
  Rust/vtcode-style core over time: repo scanning, artifact IO, dispatch
  state, gate execution, JSONL triage stores, and local runner protocol glue.
  Keep the current TS implementation as the working reference until the Rust
  path proves parity.
- UX/runtime preference: keep Charm-style terminal UX where it adds operator
  clarity, and keep Crush in view as the fast repo-local execution surface.
  Rust/vtcode should optimize the core and protocol layer, not erase the good
  local workflow experience.
- ACE creates/manages agents, memories, eval suites, skills, and policies.
  External/customer repos stay outside the ACE server boundary: repo-local
  runners own checkout access, file edits, tests, secrets exposure, and side
  effects, then report traces/results/artifacts back to ACE.

Proper info flow:

- Raw human dump: root TODO.md.
- Raw agent self-report: .sf/BACKLOG.md and ~/.sf/agent/upstream-feedback.jsonl.
- Raw session-derived evidence: Singularity Memory / brain.
- First normalizer: /sf todo triage for TODO.md now; a future /sf inbox triage
  should normalize TODO.md + self-feedback + memory exports through the same
  schema.
- Normalized pending items live in .sf/triage/inbox/*.jsonl with source, kind,
  evidence, status, and created_at.
- Human-readable triage reports live in .sf/triage/reports/*.md.
- Eval-ready cases live in .sf/triage/evals/*.evals.jsonl.
- Human/planner-visible implementation tasks may be copied into .sf/BACKLOG.md
  with /sf todo triage --backlog, but auto-mode must not execute the backlog
  directly. Planning/reassessment proposes promotion; the user or an explicit
  command approves promotion into roadmap/slice/task artifacts.
- Memory-worthy notes are retained by memory/brain only after triage attaches
  evidence/source; raw TODO notes are not memory.
- Preferred triage model tier: MiniMax M2.7 highspeed when available, then
  MiniMax M2.5 highspeed, then other cheap/fast classification models. Triage
  is structuring/classification, not final code editing.
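
The normalized inbox record above could be typed roughly as follows. The field
names come from the note itself, but the `kind` and `status` union members are
illustrative guesses, not a fixed SF schema.

```typescript
// Hypothetical shape for one .sf/triage/inbox/*.jsonl record.
// Union members below are illustrative, not a fixed SF schema.
type InboxSource = "TODO.md" | "self-report" | "memory-export";
type InboxKind = "eval_candidate" | "implementation_task" | "memory_fact" | "doc" | "test";

interface InboxItem {
  source: InboxSource;
  kind: InboxKind;
  evidence: string;   // pointer back to the raw note, session, or trace
  status: "pending" | "promoted" | "dropped";
  created_at: string; // ISO-8601 timestamp
}

// New items always enter as "pending"; promotion is a separate reviewed step.
function newInboxItem(source: InboxSource, kind: InboxKind, evidence: string): InboxItem {
  return { source, kind, evidence, status: "pending", created_at: new Date().toISOString() };
}
```

One record per line in the JSONL file keeps the inbox appendable by any of the
three raw sources without coordination.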