# Repo-native Harness Template Kits and Evaluation Research

## Purpose
This document captures the external patterns sf should borrow for repo-native harness generation and evolution. It is a design input for ADR-018 and repo-native-harness-architecture.md.
## Summary
Use template kits for the first scaffold, then let sf specialize them from source, docs, CI, evidence, and Singularity Memory.
The useful split:
- Backstage-style templates for parameterized scaffolding.
- promptfoo-style declarative eval matrices, assertions, caching, red-team packs, and reports.
- OpenAI Evals-style custom eval registries and completion/system adapters.
- Ragas-style RAG and agent metrics, datasets, experiments, and testset generation.
- IBM-style agentic engineering governance: human oversight, modular tasks, RAG-grounded context, review loops, and CI integration.
sf's improvement is that the harness is repo-native and evolving. It is not a one-shot template and not only prompt evaluation.
## Add Now vs Later
| Timing | Add to markdown/spec | Add to runtime flow |
|---|---|---|
| Now | Template kit contract, untracked observation policy, judge rig, eval runner result shape, memory feedback loop. | No runtime prompt changes yet. |
| First implementation | Profiler output schema and harness manifest schema. | Read-only profiling at session start or sf init. |
| Next | Concrete kits for agent-runtime, rag-system, web-app, go-service, and nix-project. | Manifest-driven gate/eval runner in verify phase. |
| Later | Calibration policy for model judges and drift policy for harness evolution. | Go/Charm judge-calibration service, prompt injection of recalled harness lessons, and automatic harness-update proposals. |
This keeps the current docs ahead of implementation while avoiding a hidden behavior change in the agent loop.
The model-judge runner is intentionally not implemented in this documentation slice. When it is built, it should be a Go/Charm service adjacent to the Singularity Memory platform, not a repo-local skill pack and not a hidden SF core behavior change.
## What To Borrow
| Source | Useful pattern | SF adaptation |
|---|---|---|
| Backstage Software Templates | Skeletons with parameters, reviewable creation flow, action logs | HarnessTemplateKit registry with dry-run diffs and explicit ownership. |
| promptfoo evals | Declarative cases, providers, prompts, assertions, matrix comparison | harness/evals/*.jsonl plus sf eval run, with optional promptfoo import/export. |
| promptfoo model-graded metrics | llm-rubric, factuality, context recall/relevance/faithfulness, trajectory goal success | SF judge rig with calibrated rubrics and deterministic gates as primary blockers. |
| promptfoo red teaming | Security-focused attack packs and CI reports | Repo-specific red-team suites for agents, tools, MCP, browser, and RAG. |
| OpenAI Evals | Custom evals for LLM systems and private task-specific eval data | SF eval suites tied to repo risks, run IDs, and tracked artifacts. |
| Ragas | RAG metrics, agent/tool metrics, testset generation, experiment loop | Retrieval/agent suites with recall@k, context recall, tool-call F1, goal accuracy, and near-miss fixtures. |
| IBM agentic engineering | Governance, human oversight, modular work, RAG grounding, CI review loops | SF enforces reviewable harness changes, evidence ledgers, and memory-backed evolution. |
## What Not To Borrow
| Pattern | Why not |
|---|---|
| One-shot scaffolding as "done" | Repos drift. Harnesses need lifecycle management. |
| Judge-only pass/fail | Too risky for correctness, security, migrations, infra, and production behavior. |
| External SaaS as source of truth | sf should run locally and keep repo contracts in the repo. |
| Prompt-only evals | Software agents also mutate files, call tools, run commands, and change deployment risk. |
| Hidden ownership of generated files | Every file adoption must be explicit and reviewable. |
## Template Kit Shape
Each kit must answer:
- What risk does it cover?
- What repo signals make it applicable?
- What files will it write?
- What commands will it run?
- What evidence blocks or passes?
- What drift should make sf revisit it?
Example kit manifest:

```json
{
  "id": "rag-system",
  "appliesWhen": ["embeddings", "retrieval", "vector-db"],
  "writes": [
    "harness/manifest.json",
    "harness/evals/retrieval-recall.jsonl",
    "harness/evals/structural-near-misses.jsonl",
    "gates/rag-eval.sh"
  ],
  "commands": [
    {
      "id": "rag-eval",
      "command": "sf eval run harness/evals/retrieval-recall.jsonl",
      "blocks": true
    }
  ],
  "evolutionRules": [
    "When retrieval code changes, rerun recall suites.",
    "When a missed document causes a failure, add it as a near-miss case.",
    "When chunking changes, compare recall and answer faithfulness before promotion."
  ]
}
```
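A minimal Go sketch of a decoding target for this manifest; the type and field names mirror the JSON keys above but are illustrative assumptions, not a frozen schema:

```go
package harness

// KitManifest mirrors the example kit manifest above. Field names follow
// the JSON keys; the Go type itself is an illustrative assumption.
type KitManifest struct {
	ID             string       `json:"id"`
	AppliesWhen    []string     `json:"appliesWhen"`    // repo signals that make the kit applicable
	Writes         []string     `json:"writes"`         // files the kit proposes to write, surfaced as a dry-run diff
	Commands       []KitCommand `json:"commands"`       // gate and eval commands the kit registers
	EvolutionRules []string     `json:"evolutionRules"` // drift conditions that should make sf revisit the kit
}

// KitCommand is one command owned by the kit.
type KitCommand struct {
	ID      string `json:"id"`
	Command string `json:"command"`
	Blocks  bool   `json:"blocks"` // true when a failure blocks the verify phase
}
```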
## Core Kits
### Agent Runtime

Use when the repo has agent loops, MCP tools, prompt routers, model providers, or autonomous workflows.

Files:

- harness/evals/agent-tool-safety.jsonl
- harness/evals/trajectory-goal-success.jsonl
- harness/evals/injection-red-team.jsonl
- gates/agent-fixture-replay.sh
- gates/agent-red-team.sh
Metrics:
- Fixture replay pass rate.
- Tool-call exact match for required calls.
- Tool-call F1 for flexible trajectories (sketched after this list).
- Forbidden tool call count.
- Goal success judge score, advisory until calibrated.
- Prompt injection refusal/containment.
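A minimal sketch of the tool-call F1 item above, matching calls as strings; the function name and the multiset matching rule are assumptions (sf's real matcher may normalize tool arguments before comparing):

```go
package evals

// ToolCallF1 scores an actual trajectory's tool calls against the expected
// ones as multisets of call signatures. Returns precision, recall, and F1.
func ToolCallF1(expected, actual []string) (precision, recall, f1 float64) {
	remaining := map[string]int{}
	for _, call := range expected {
		remaining[call]++
	}
	matched := 0
	for _, call := range actual {
		if remaining[call] > 0 {
			remaining[call]--
			matched++
		}
	}
	if len(actual) > 0 {
		precision = float64(matched) / float64(len(actual))
	}
	if len(expected) > 0 {
		recall = float64(matched) / float64(len(expected))
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}
```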
### RAG / Retrieval

Use when the repo retrieves docs, code, tickets, memories, vectors, or search results.

Files:

- harness/evals/retrieval-recall.jsonl
- harness/evals/context-faithfulness.jsonl
- harness/evals/structural-near-misses.jsonl
- gates/retrieval-eval.sh
Metrics:
- recall@k (sketched below).
- MRR (mean reciprocal rank; also sketched below).
- NDCG (normalized discounted cumulative gain).
- Context recall.
- Context faithfulness.
- Noise sensitivity.
- Near-miss failure rate.
Important: do not optimize precision alone. A smaller, cleaner context that drops required evidence is a regression.
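A sketch of the first two metrics above (function names are assumptions); the suite-level MRR is the mean of the per-query reciprocal ranks:

```go
package evals

// RecallAtK is the fraction of relevant documents that appear in the
// top-k retrieved results for one query.
func RecallAtK(retrieved []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 || k <= 0 {
		return 0
	}
	if k > len(retrieved) {
		k = len(retrieved)
	}
	hits := 0
	for _, id := range retrieved[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// ReciprocalRank is 1/rank of the first relevant result, 0 if none appear.
// Average this across a suite's queries to get MRR.
func ReciprocalRank(retrieved []string, relevant map[string]bool) float64 {
	for i, id := range retrieved {
		if relevant[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}
```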
### Web App

Use when the repo has browser-visible UI.

Files:

- harness/web/smoke.spec.ts
- harness/web/a11y.spec.ts
- harness/web/visual.spec.ts
- gates/web-smoke.sh
Metrics:
- User workflow pass/fail.
- Accessibility violations by severity.
- Screenshot diff threshold.
- Console error count.
- Network failure count.
- Core route performance budget.
### Go / Windows Service

Use when the repo has Go services or Windows agents.

Files:

- gates/go-test.sh
- gates/go-vet.sh
- gates/windows-build.sh
- harness/contracts/windows-service.json

Metrics:

- go test ./...
- go vet ./...
- GOOS=windows GOARCH=amd64 go build
- Service install/start/stop contract in a Windows-capable environment.
### Nix Project

Use when the repo has flake.nix, shell.nix, or direnv policy.

Files:

- gates/nix-flake-check.sh
- gates/dev-shell-smoke.sh
- harness/contracts/devshell.json

Metrics:

- nix flake check
- Dev shell starts and exposes expected tool versions (a verification sketch follows this list).
- Direnv policy is explicit.
- Build caches are used where configured.
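A sketch of the dev-shell version verification mentioned above; the harness/contracts/devshell.json shape (tool name mapped to an expected version substring) and the use of nix develop -c are assumptions:

```go
package gates

import (
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// DevShellContract is an assumed shape for harness/contracts/devshell.json:
// tool name -> version substring expected in `<tool> --version` output.
type DevShellContract struct {
	Tools map[string]string `json:"tools"`
}

// CheckDevShell runs each contracted tool inside the flake dev shell and
// verifies that the expected version substring appears in its output.
func CheckDevShell(contractPath string) error {
	raw, err := os.ReadFile(contractPath)
	if err != nil {
		return err
	}
	var c DevShellContract
	if err := json.Unmarshal(raw, &c); err != nil {
		return err
	}
	for tool, version := range c.Tools {
		out, err := exec.Command("nix", "develop", "-c", tool, "--version").CombinedOutput()
		if err != nil {
			return fmt.Errorf("%s: %w", tool, err)
		}
		if !strings.Contains(string(out), version) {
			return fmt.Errorf("%s: want version %q, got %q", tool, version, strings.TrimSpace(string(out)))
		}
	}
	return nil
}
```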
### Charm Service

Use when the repo has Go services using Charm libraries such as Wish, Bubble Tea, fantasy, or x/vcr.

Files:

- gates/charm-go-test.sh
- gates/wish-smoke.sh
- harness/vcr/sessions/*.jsonl
Metrics:
- SSH app starts and responds.
- TUI smoke trace replays.
- x/vcr recording can be replayed.
- Metrics endpoint responds when promwish is enabled.
## Eval Runner Contract
An SF eval run should produce:
```json
{
  "schema": "sf.eval.result.v1",
  "suiteId": "agent-tool-safety",
  "runId": "01J...",
  "profileId": "01J...",
  "startedAt": 1770000000000,
  "endedAt": 1770000001200,
  "cases": 42,
  "passed": 40,
  "failed": 2,
  "score": 0.9523,
  "blocking": true,
  "judgePromptHash": "sha256:...",
  "provider": "openai:gpt-5-mini",
  "reportPath": ".sf/reports/evals/agent-tool-safety/01J....json"
}
```
Store the summary in .sf/sf.db. Store larger artifacts as files. Retain only durable lessons in Singularity Memory.
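As one possible decoding target, the same result shape as a Go type; the field names follow the JSON above, while the type name and package are assumptions:

```go
package evals

// Result mirrors the sf.eval.result.v1 payload above. The type name is an
// illustrative assumption; timestamps are Unix milliseconds per the example.
type Result struct {
	Schema          string  `json:"schema"`
	SuiteID         string  `json:"suiteId"`
	RunID           string  `json:"runId"`
	ProfileID       string  `json:"profileId"`
	StartedAt       int64   `json:"startedAt"`
	EndedAt         int64   `json:"endedAt"`
	Cases           int     `json:"cases"`
	Passed          int     `json:"passed"`
	Failed          int     `json:"failed"`
	Score           float64 `json:"score"`
	Blocking        bool    `json:"blocking"`
	JudgePromptHash string  `json:"judgePromptHash"`
	Provider        string  `json:"provider"`
	ReportPath      string  `json:"reportPath"`
}
```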
For model-judge cases, the runner must also store calibration metadata:
- Rubric path and content hash.
- Judge provider, model ID, temperature, and output schema version.
- Calibration suite ID and held-out suite ID.
- False-pass rate, false-block rate, precision/recall/F1, quorum disagreement, and rerun-stability summary.
- Raw judge output reference for later audit.
Model-judge suites are advisory until their calibration metadata says they are eligible to block.
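A minimal sketch of how that advisory-until-calibrated rule could be expressed; the threshold values here are placeholders, not agreed policy:

```go
package evals

// Calibration summarizes a judge suite's measured error rates against a
// human-labeled held-out set. Field names are illustrative assumptions.
type Calibration struct {
	FalsePassRate  float64 // judge passed, human label says fail
	FalseBlockRate float64 // judge blocked, human label says pass
	RerunStability float64 // agreement rate across repeated runs, 0..1
}

// EligibleToBlock keeps a model-judge suite advisory until its measured
// error rates clear the (placeholder) thresholds.
func EligibleToBlock(c Calibration) bool {
	return c.FalsePassRate <= 0.02 &&
		c.FalseBlockRate <= 0.05 &&
		c.RerunStability >= 0.95
}
```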
## Singularity Memory Feedback Loop
Retain:
- Harnesses that repeatedly passed and caught real regressions.
- Failures that required human correction.
- Judge rubrics that showed false positives or false negatives.
- Repo-specific risk notes discovered during profiling.
- Useful untracked observations, still marked as observations.
Recall:
- Before planning harness changes.
- Before dispatching high-risk work.
- Before running model-judge evals.
- Before deleting or replacing any generated harness file.
Feedback:
- Positive when recalled memory helped a gate pass or prevented a repeat bug.
- Negative when recalled memory was stale or led to a bad recommendation.
- Validation when a memory still matches current repo evidence.
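One way to record these three feedback signals against a memory, sketched below; the type and field names are assumptions:

```go
package memory

// FeedbackKind classifies the outcome of recalling a harness memory.
type FeedbackKind string

const (
	FeedbackPositive   FeedbackKind = "positive"   // recall helped a gate pass or prevented a repeat bug
	FeedbackNegative   FeedbackKind = "negative"   // recall was stale or led to a bad recommendation
	FeedbackValidation FeedbackKind = "validation" // memory still matches current repo evidence
)

// Feedback ties an outcome back to the memory and the run that produced it.
type Feedback struct {
	MemoryID string       `json:"memoryId"`
	RunID    string       `json:"runId"`
	Kind     FeedbackKind `json:"kind"`
	Note     string       `json:"note,omitempty"`
}
```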
## References
- OpenAI Evals: https://github.com/openai/evals
- promptfoo intro: https://www.promptfoo.dev/docs/intro/
- promptfoo model-graded metrics: https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/
- Ragas docs: https://docs.ragas.io/en/stable/
- Backstage Software Templates: https://backstage.io/docs/features/software-templates/
- IBM agentic engineering: https://www.ibm.com/think/topics/agentic-engineering