sf already has a phase machine, verification gates, fixture replay, safety checks, worktree isolation, local `.sf/sf.db` state, and a planned Singularity Memory integration. The missing product layer is making those mechanisms useful in every repository sf works on.
Generic scaffolds are useful only on day one. A repo changes shape: packages appear, CI changes, tools move, untracked documents show up, risk concentrates in different modules, and failure modes repeat. sf should learn that shape and evolve the repo's harnesses over time.
Recent ecosystem references point in the same direction:
- OpenAI Evals uses explicit eval suites and custom evaluators for LLM systems.
- promptfoo uses declarative test cases, deterministic assertions, model-graded assertions, trajectory checks, caching, concurrency, red-team packs, and CI-friendly reports.
- Ragas treats RAG and agent evaluation as datasets, metrics, and repeated experiments.
- Backstage Software Templates show the right starting point for parameterized scaffolding, but they stop at creation time.
- IBM's agentic engineering framing emphasizes human oversight, modular agent work, RAG-grounded context, governance, review loops, and CI integration.
## Decision
sf will own a **Repo-native Harness Evolution** system.
The system starts from template kits, then adapts them to the repository by reading source code, docs, CI, package manifests, existing tests, git state, and prior run evidence. It writes durable files into the repo only through normal sf planning and review flows, and it records operational evidence in `.sf/sf.db` plus durable lessons in Singularity Memory.
### Flow integration stance
Add the contract to markdown now. Add runtime flow behavior later behind tests.
| Stage | Behavior | Guardrail |
| --- | --- | --- |
| Explicit opt-in later | Enable Harness Writer for reviewed diffs. | Allows tracked harness files only when a unit plan claims them and the user accepts the diff. |
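The opt-in guardrail can be sketched as a single predicate. This is an illustration, not sf's actual API: `UnitPlan`, `claimed_files`, and `harness_writer_enabled` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class UnitPlan:
    # Files this unit plan explicitly claims as its outputs (hypothetical shape).
    claimed_files: set = field(default_factory=set)

def may_write_harness_file(path: str, plan: UnitPlan,
                           diff_accepted: bool,
                           harness_writer_enabled: bool) -> bool:
    """A tracked harness file may be written only when the feature is
    opted in, the unit plan claims the path, and the user accepted the diff."""
    return (harness_writer_enabled
            and path in plan.claimed_files
            and diff_accepted)
```

All three conditions must hold; indexing or observing a file never satisfies any of them on its own.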
SQLite is not the knowledge backend. Singularity Memory is not the orchestration ledger. Repo files are the human-reviewable contract.
### Untracked files
sf MUST understand untracked files, but MUST NOT silently own them.
Untracked files are first-class observations:
- They are included in repo profiling, risk classification, and context summaries.
- They are recorded in `.sf/sf.db` as `observed_only` until explicitly adopted.
- They may produce Singularity Memory entries tagged as repo observations.
- They may influence planning and harness recommendations.
Untracked files are not automatically staged, deleted, overwritten, renamed, or treated as sf output unless a unit plan claims them and the resulting diff is reviewed. Indexing is not ownership.
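One way to record untracked files as `observed_only` rows in `.sf/sf.db`, as a sketch: the table name `repo_observations` and its columns are an assumed shape, not sf's actual schema.

```python
import sqlite3

def record_untracked(db_path: str, paths: list[str]) -> None:
    # Hypothetical table: untracked paths start as observed_only and stay
    # that way until a reviewed unit plan adopts them.
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS repo_observations (
            path   TEXT PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'observed_only'
        )""")
    con.executemany(
        "INSERT OR IGNORE INTO repo_observations (path) VALUES (?)",
        [(p,) for p in paths])
    con.commit()
    con.close()

def adopt(db_path: str, path: str) -> None:
    # Called only after a unit plan claims the file and the diff is reviewed.
    con = sqlite3.connect(db_path)
    con.execute(
        "UPDATE repo_observations SET status = 'adopted' WHERE path = ?",
        (path,))
    con.commit()
    con.close()
```

`INSERT OR IGNORE` makes repeated profiling runs idempotent: re-observing a file never resets a status that a review already promoted.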
### Promptfoo inspiration
sf should use promptfoo as design inspiration, not as the core control plane.
Adopt these ideas:
- Declarative eval suites.
- Matrix runs across prompts, providers, models, fixtures, and repo states.
- Deterministic assertions first, model judges second.
- Rubric prompts versioned as artifacts.
- Trajectory checks for agents and tool use.
- Red-team packs for injection, data leakage, unsafe actions, and policy failures.
- Caching and concurrency for cheap repeated runs.
- Local-first execution with CI reports.
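The adopted ideas can be expressed as a small declarative suite structure. This is a sketch only; the field names are illustrative and match neither promptfoo's schema nor sf's.

```python
from dataclasses import dataclass, field

@dataclass
class Assertion:
    kind: str           # "deterministic", "metric", or "model_judge"
    check: str          # e.g. an exit-code rule, a threshold, or a rubric id
    blocking: bool = False

@dataclass
class EvalCase:
    fixture: str
    assertions: list[Assertion] = field(default_factory=list)

@dataclass
class EvalSuite:
    name: str
    # Matrix axes: every case runs against each provider/repo-state combination.
    providers: list[str] = field(default_factory=list)
    repo_states: list[str] = field(default_factory=list)
    cases: list[EvalCase] = field(default_factory=list)

    def runs(self):
        # Expand the matrix into concrete (provider, repo_state, case) runs,
        # which a runner can then execute concurrently and cache by key.
        for p in self.providers:
            for s in self.repo_states:
                for c in self.cases:
                    yield (p, s, c)
```

Because each run is a pure tuple of suite coordinates, the runner can cache results by `(provider, repo_state, fixture)` and re-run only invalidated entries, which is what makes repeated matrix runs cheap.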
Do not copy these blindly:
- Do not make LLM judges the only pass/fail authority for high-risk work.
- Do not let web dashboards become the source of truth.
- Do not treat prompt-level evals as enough for software engineering. sf needs repo, diff, tool, CI, and runtime evidence.
### Judge rig
sf will define a judge rig with four evaluator classes:
| Class | Examples | Blocking policy |
| --- | --- | --- |
| Deterministic | executable unit tests, fixture replay, deterministic assertions | Can block alone |
| Structural | AST checks, API contract checks, migration checks, dependency graph, changed-file ownership | Can block alone when rule is exact |
| Retrieval and agent metrics | recall@k, MRR, NDCG, context recall, tool-call F1, trajectory goal success | Blocks when threshold is calibrated |
| Model judge | rubrics, factuality, faithfulness, human-readable quality, comparative selection | Advisory by default, blocking only with calibration and quorum |
High-risk suites require at least one deterministic or structural gate. Model judges can summarize, rank, and flag, but they do not replace executable evidence.
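The blocking policy above can be sketched as a function over evaluator verdicts. The class names mirror the table; the quorum rule for model judges is one plausible reading of "calibration and quorum", not a specified algorithm.

```python
from dataclasses import dataclass

# Classes whose exact rules may block a suite on their own.
HARD_GATES = {"deterministic", "structural"}

@dataclass
class Verdict:
    evaluator_class: str    # "deterministic", "structural", "metric", "model_judge"
    passed: bool
    calibrated: bool = False  # metrics and judges block only when calibrated

def suite_passes(verdicts: list[Verdict], high_risk: bool,
                 judge_quorum: int = 2) -> bool:
    # High-risk suites require at least one deterministic or structural gate.
    if high_risk and not any(v.evaluator_class in HARD_GATES for v in verdicts):
        return False
    for v in verdicts:
        if v.evaluator_class in HARD_GATES and not v.passed:
            return False          # exact rules block alone
        if v.evaluator_class == "metric" and v.calibrated and not v.passed:
            return False          # calibrated thresholds block
    # Model judges are advisory: they block only when calibrated
    # AND failing in quorum.
    failing_judges = [v for v in verdicts
                      if v.evaluator_class == "model_judge"
                      and v.calibrated and not v.passed]
    return len(failing_judges) < judge_quorum
```

Note the asymmetry: a single failing hard gate vetoes the suite, while model judges must be both calibrated and in agreement before their failures count, which keeps executable evidence as the authority for high-risk work.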