
# ADR-018: Repo-native harness evolution

**Date**: 2026-04-29
**Status**: proposed

## Context

sf already has a phase machine, verification gates, fixture replay, safety checks, worktree isolation, local `.sf/sf.db` state, and a planned Singularity Memory integration. What is missing is a product layer that makes those mechanisms useful in every repository sf works on.

Generic scaffolds are useful only on day one. A repo changes shape: packages appear, CI changes, tools move, untracked documents show up, risk concentrates in different modules, and failure modes repeat. sf should learn that shape and evolve the repo's harnesses over time.

Recent ecosystem references point in the same direction:

- OpenAI Evals uses explicit eval suites and custom evaluators for LLM systems.
- promptfoo uses declarative test cases, deterministic assertions, model-graded assertions, trajectory checks, caching, concurrency, red-team packs, and CI-friendly reports.
- Ragas treats RAG and agent evaluation as datasets, metrics, and repeated experiments.
- Backstage Software Templates show the right starting point for parameterized scaffolding, but they stop at creation time.
- IBM's agentic engineering framing emphasizes human oversight, modular agent work, RAG-grounded context, governance, review loops, and CI integration.

## Decision

sf will own a **Repo-native Harness Evolution** system.

The system starts from template kits, then adapts them to the repository by reading source code, docs, CI, package manifests, existing tests, git state, and prior run evidence. It writes durable files into the repo only through normal sf planning and review flows, and it records operational evidence in `.sf/sf.db` plus durable lessons in Singularity Memory.

### Flow integration stance

Add the contract to the markdown docs now. Add runtime flow behavior later, behind tests.

The first implementation should not start by changing the worker prompt or
writing repo-local harness files. It should add a pre-plan profile snapshot and
a post-unit evidence retention hook, because those are observable and testable
without changing every dispatch. Once those are stable, sf can inject
harness/memory context into planning and verification prompts.

Near-term repository-write boundary:

- All repositories use the same sf built-in skills and harness behavior.
- sf MUST NOT generate repo-local custom skill packs such as `.agents/skills/`
  for project repos.
- sf MUST NOT create tracked `harness/`, `gates/`, CI, or repo spec files as
  part of normal initialization.
- The only project-level file write allowed by this stream before the explicit
  harness-writer phase is sf project preferences/config, such as
  `.sf/PREFERENCES.md` or `.sf/preferences.md`, when the user asks for project
  preferences.
- `.sf/sf.db` may record git-ignored operational state, including repo profiles
  and untracked-file observations. Recording state there does not imply
  ownership of the repo, and the database must not be staged by default.
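
The boundary above can be read as a small path classifier. The following is a minimal sketch under stated assumptions, not sf's implementation: the function name `near_term_write_decision`, its flag, and the three return labels are hypothetical, and only the file paths come from this ADR.

```python
from pathlib import PurePosixPath

# Paths taken from this ADR; everything else in this sketch is hypothetical.
PREFERENCE_FILES = {".sf/PREFERENCES.md", ".sf/preferences.md"}
LEDGER_FILE = ".sf/sf.db"


def near_term_write_decision(path: str, *, user_requested_preferences: bool = False) -> str:
    """Classify a proposed write as 'ledger', 'allow', or 'deny'.

    Covers only the period before the explicit harness-writer phase.
    """
    normalized = PurePosixPath(path).as_posix()  # collapses a leading "./"
    if normalized == LEDGER_FILE:
        # Operational state: writable, but never staged by default.
        return "ledger"
    if normalized in PREFERENCE_FILES:
        # Allowed only when the user asked for project preferences.
        return "allow" if user_requested_preferences else "deny"
    # harness/, gates/, CI, repo specs, and .agents/skills/ all land here.
    return "deny"
```

For example, `near_term_write_decision("harness/manifest.json")` returns `"deny"`, while `.sf/PREFERENCES.md` is allowed only when `user_requested_preferences=True`.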

| When | Flow addition | Why |
|---|---|---|
| Now, in docs | Define repo profiling, untracked observation, harness planning, eval/judge rig, and memory retention contracts. | Gives implementation a stable target. |
| First code slice | Add read-only repo profile snapshot before planning. | Lets sf understand repo shape without taking ownership or writing tracked files. |
| Second code slice | Add post-unit evidence retention into `.sf/sf.db` and Singularity Memory. | Converts gate results into future guidance. |
| Third code slice | Add harness proposal generation as a planning artifact. | Produces dry-run proposals only; no tracked repo files are written. |
| Later | Inject harness/memory context into runtime prompts and workflow templates. | This changes agent behavior and needs regression fixtures. |
| Explicit opt-in later | Enable Harness Writer for reviewed diffs. | Allows tracked harness files only when a unit plan claims them and the user accepts the diff. |

### Files, database, and memory

Use all three layers, with separate responsibilities:

| Layer | Role | Examples |
|---|---|---|
| Tracked repo files | Future durable contract and executable harness after explicit opt-in | `SPEC.md`, `ARCHITECTURE.md`, `harness/manifest.json`, `harness/evals/*.jsonl`, `gates/*.sh`, CI workflow snippets |
| `.sf/sf.db` | Operational state and evidence ledger | repo profile snapshots, harness inventory, eval runs, gate results, drift events, untracked-file observations |
| Singularity Memory | Cross-session knowledge | proven patterns, anti-patterns, recurring failures, repo-specific risk notes, judge calibration lessons |

SQLite (`.sf/sf.db`) is not the knowledge backend. Singularity Memory is not the orchestration ledger. Repo files are the human-reviewable contract.

### Untracked files

sf MUST understand untracked files, but MUST NOT silently own them.

Untracked files are first-class observations:

- They are included in repo profiling, risk classification, and context summaries.
- They are recorded in `.sf/sf.db` as `observed_only` until explicitly adopted.
- They may produce Singularity Memory entries tagged as repo observations.
- They may influence planning and harness recommendations.

Untracked files are not automatically staged, deleted, overwritten, renamed, or treated as sf output unless a unit plan claims them and the resulting diff is reviewed. Indexing is not ownership.
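
To make "indexing is not ownership" concrete, here is a minimal sketch of how observations might be recorded. It assumes the input is the text of `git status --porcelain` (v1 format, where untracked entries start with `?? `). The table name, column names, and the `adopted` state are hypothetical; only `observed_only` comes from this ADR. Paths that git quotes because of special characters are not handled here.

```python
import sqlite3


def parse_untracked(porcelain: str) -> list[str]:
    """Return paths that `git status --porcelain` marks as untracked ('??')."""
    return [line[3:] for line in porcelain.splitlines() if line.startswith("?? ")]


def record_observations(conn: sqlite3.Connection, paths: list[str]) -> None:
    """Store untracked paths; new rows default to 'observed_only'."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS untracked_observations ("
        " path TEXT PRIMARY KEY,"
        " ownership TEXT NOT NULL DEFAULT 'observed_only')"
    )
    # INSERT OR IGNORE keeps existing rows, so re-observing a file
    # never resets a state that was changed by explicit adoption.
    conn.executemany(
        "INSERT OR IGNORE INTO untracked_observations (path) VALUES (?)",
        [(p,) for p in paths],
    )
    conn.commit()


def stageable(conn: sqlite3.Connection) -> list[str]:
    """Only explicitly adopted paths are ever candidates for staging."""
    rows = conn.execute(
        "SELECT path FROM untracked_observations"
        " WHERE ownership = 'adopted' ORDER BY path"
    )
    return [row[0] for row in rows]
```

In this sketch nothing becomes stageable by being observed; a separate, reviewed step would have to flip a row's ownership state first.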

### Promptfoo inspiration

sf should use promptfoo as design inspiration, not as the core control plane.

Adopt these ideas:

- Declarative eval suites.
- Matrix runs across prompts, providers, models, fixtures, and repo states.
- Deterministic assertions first, model judges second.
- Rubric prompts versioned as artifacts.
- Trajectory checks for agents and tool use.
- Red-team packs for injection, data leakage, unsafe actions, and policy failures.
- Caching and concurrency for cheap repeated runs.
- Local-first execution with CI reports.

Do not copy these blindly:

- Do not make LLM judges the only pass/fail authority for high-risk work.
- Do not let web dashboards become the source of truth.
- Do not treat prompt-level evals as enough for software engineering. sf needs repo, diff, tool, CI, and runtime evidence.
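
Two of the adopted ideas, matrix runs and "deterministic assertions first, model judges second", can be sketched in a few lines. This is an illustration under assumptions, not promptfoo's API or sf's runner: `expand_matrix`, `evaluate`, and the result keys are hypothetical names.

```python
from itertools import product
from typing import Callable, Optional


def expand_matrix(axes: dict[str, list[str]]) -> list[dict[str, str]]:
    """Expand named axes (prompt, provider, fixture, ...) into one dict per cell."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*(axes[n] for n in names))]


def evaluate(
    output: str,
    deterministic: list[tuple[str, Callable[[str], bool]]],
    model_judge: Optional[Callable[[str], float]] = None,
) -> dict:
    """Run deterministic assertions first; consult the judge only if they pass."""
    failures = [name for name, check in deterministic if not check(output)]
    if failures:
        # A deterministic failure is final; no judge call is spent on it.
        return {"passed": False, "failures": failures, "judge_score": None}
    # The judge score is attached as advisory data and does not change 'passed'.
    score = model_judge(output) if model_judge else None
    return {"passed": True, "failures": [], "judge_score": score}
```

Skipping the judge after a deterministic failure keeps repeated runs cheap and keeps the pass/fail decision grounded in executable checks.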

### Judge rig

sf will define a judge rig with four evaluator classes:

| Class | Use | Authority |
|---|---|---|
| Deterministic | compile, tests, typecheck, lint, schema, exact match, forbidden diff, secret scan | Can block alone |
| Structural | AST checks, API contract checks, migration checks, dependency graph, changed-file ownership | Can block alone when rule is exact |
| Retrieval and agent metrics | recall@k, MRR, NDCG, context recall, tool-call F1, trajectory goal success | Can block once the threshold is calibrated |
| Model judge | rubrics, factuality, faithfulness, human-readable quality, comparative selection | Advisory by default, blocking only with calibration and quorum |

High-risk suites require at least one deterministic or structural gate. Model judges can summarize, rank, and flag, but they do not replace executable evidence.

Implementation stance: deterministic and structural gates can stay in sf core,
but model-judge execution and calibration should be a later Go/Charm service
using `fantasy`/`catwalk`, with durable calibration lessons retained into
Singularity Memory. This ADR documents the contract now; it does not enable a
runtime judge service or prompt-behavior change yet.
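
The authority rules in the judge rig table could combine into a suite verdict roughly as follows. This is a sketch under assumptions: the `Result` type, the verdict labels, and the default quorum of 2 are hypothetical, and structural results are assumed to come from exact rules only.

```python
from dataclasses import dataclass

BLOCKING_ALONE = {"deterministic", "structural"}


@dataclass(frozen=True)
class Result:
    evaluator_class: str      # "deterministic" | "structural" | "metric" | "model_judge"
    passed: bool
    calibrated: bool = False  # a calibrated threshold or judge backs this result


def suite_verdict(results: list[Result], *, high_risk: bool, judge_quorum: int = 2) -> str:
    """Return 'invalid', 'block', 'advisory', or 'pass' for one suite run."""
    hard = [r for r in results if r.evaluator_class in BLOCKING_ALONE]
    if high_risk and not hard:
        # High-risk suites need at least one deterministic or structural gate.
        return "invalid"
    if any(not r.passed for r in hard):
        return "block"
    if any(r.evaluator_class == "metric" and r.calibrated and not r.passed for r in results):
        return "block"
    judge_failures = [
        r for r in results
        if r.evaluator_class == "model_judge" and r.calibrated and not r.passed
    ]
    if len(judge_failures) >= judge_quorum:
        # Model judges block only with calibration and quorum.
        return "block"
    # Any remaining failure (uncalibrated metric or judge) is advisory only.
    return "advisory" if any(not r.passed for r in results) else "pass"
```

An uncalibrated judge failure therefore surfaces as `"advisory"` and never blocks, which matches the table's default authority for model judges.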

## Architecture

The system is split into bounded components:

| Component | Purpose |
|---|---|
| Repo Profiler | Builds a structured profile from files, docs, tests, CI, manifests, migrations, containers, git status, and recent history. |
| Risk Classifier | Maps the profile to risk families: web, agent, RAG, infrastructure, Windows service, database, security, release, performance. |
| Harness Planner | Compares current harnesses to risk requirements and proposes missing gates/evals/docs. |
| Template Kit Registry | Holds parameterized harness templates and adapter rules. |
| Harness Writer | Writes reviewed repo files and never overwrites user-owned files without a planned diff. |
| Evidence Runner | Runs gates/evals, captures reports, stores summaries in `.sf/sf.db`. |
| Memory Retainer | Stores proven patterns and anti-patterns in Singularity Memory after evidence is known. |
| Evolution Engine | Detects drift and schedules harness updates when repo reality changes. |

Detailed design is in `repo-native-harness-architecture.md`.
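
As an illustration of the Harness Planner's comparison step, the sketch below computes which required gates are missing for each detected risk family. The function name, the requirement mapping, and the gate names in the usage example are hypothetical; the detailed design document is the authority on the real interfaces.

```python
def plan_gaps(
    risk_families: set[str],
    requirements: dict[str, list[str]],
    existing: set[str],
) -> dict[str, list[str]]:
    """Return {risk family: missing gate names} for families with unmet requirements."""
    gaps: dict[str, list[str]] = {}
    for risk in sorted(risk_families):
        missing = sorted(set(requirements.get(risk, ())) - existing)
        if missing:
            gaps[risk] = missing
    return gaps


# Hypothetical requirement table and harness inventory.
requirements = {
    "database": ["migration-check", "schema-lint"],
    "web": ["e2e-smoke"],
}
proposal = plan_gaps({"database", "web"}, requirements, {"schema-lint", "e2e-smoke"})
```

The output is a proposal only; under this ADR, writing any of the missing gates into the repo remains the Harness Writer's job after review.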

## Consequences

**Positive**

- sf becomes safer over time inside each repo instead of repeating the same bootstrapping work.
- Harnesses match the actual repository rather than a generic stack guess.
- Repeated failures become anti-pattern memory and then concrete gates.
- Untracked but important files are visible to sf without being accidentally committed or destroyed.
- AI evaluation in the style of promptfoo, OpenAI Evals, and Ragas becomes one part of a broader engineering harness.

**Negative**

- More metadata exists in `.sf/sf.db`.
- Harness drift detection creates new planning work.
- Judge calibration becomes a real maintenance surface.
- Repo owners need a clear review flow for adopting generated harness files.

## Risks and mitigations

| Risk | Mitigation |
|---|---|
| sf commits files it only observed | Observation records carry ownership state. `observed_only` files are never staged by default. |
| Generated harnesses become template junk | Every generated harness must have a named risk, consumer, command, and acceptance threshold. |
| LLM judge drift creates false confidence | Judge prompts are versioned, calibration sets are mandatory before blocking use, and deterministic gates remain primary. |
| Memory accumulates stale advice | Singularity Memory entries carry evidence IDs, maturity, decay, and negative feedback. |
| Repo-specific harnesses become hard to understand | Human-readable specs and manifests live in tracked files; `.sf/sf.db` is only the ledger. |
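
The "template junk" mitigation lends itself to a mechanical check. The sketch below assumes a harness entry is a plain dictionary with the keys `risk`, `consumer`, `command`, and `threshold`; those key names and the function name are hypothetical, since the manifest format is not defined in this ADR.

```python
REQUIRED_FIELDS = ("risk", "consumer", "command", "threshold")


def _blank(value: object) -> bool:
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def missing_fields(entry: dict) -> list[str]:
    """Return the required fields that a generated harness entry lacks."""
    return [field for field in REQUIRED_FIELDS if _blank(entry.get(field))]
```

A numeric threshold of `0` counts as present in this sketch, because only `None` and blank strings are treated as missing. A planner could reject any proposal for which `missing_fields` returns a non-empty list.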

## Out of scope

- Replacing existing repo CI systems.
- Making promptfoo, Ragas, LangSmith, or any external platform a required dependency.
- Silent migration of all repositories to one universal harness layout.
- Auto-committing untracked files because they were indexed.

## Sequencing

| Stage | Work | Result |
|---|---|---|
| 1 | Add repo profile snapshots and untracked observation model. | sf understands repo shape without taking ownership. |
| 2 | Add template kit registry and harness manifest format. | sf can generate dry-run harness proposals without writing repo files. |
| 3 | Add judge rig and eval suite runner. | AI and agent behavior becomes measurable; model judges remain advisory until calibrated. |
| 4 | Connect evidence to Singularity Memory. | Patterns and anti-patterns improve future dispatch. |
| 5 | Add Go/Charm judge-calibration service. | Calibrated judge runs can be reused across repos and retained into Singularity Memory. |
| 6 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo as proposals. |
| 7 | Add explicit opt-in Harness Writer. | Reviewed repo diffs can create tracked harness files; repo-local skills remain out of scope unless separately accepted. |

## References

- `SPEC.md` sections 13 and 16.
- `docs/dev/architecture.md`.
- `docs/dev/ci-cd-pipeline.md`.
- OpenAI Evals: https://github.com/openai/evals
- promptfoo docs: https://www.promptfoo.dev/docs/intro/
- promptfoo model-graded metrics: https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/
- Ragas docs: https://docs.ragas.io/en/stable/
- Backstage Software Templates: https://backstage.io/docs/features/software-templates/
- IBM agentic engineering: https://www.ibm.com/think/topics/agentic-engineering