# Repo-native Harness Architecture

## Purpose

This document defines how sf builds, runs, and evolves repository-specific harnesses while preserving the split between tracked repo contracts, `.sf/sf.db` operational state, and Singularity Memory.

## Goals

- Generate harnesses that match the repo's actual stack, risks, and production contract.
- Understand untracked files without silently owning them.
- Use deterministic evidence before model judgment.
- Retain proven lessons and anti-patterns in Singularity Memory.
- Evolve harnesses when the repo changes.
- Keep every generated file reviewable by the repo owner.

## Non-goals

- Replacing the repo's existing test runner or CI provider.
- Treating LLM judge scores as sufficient for critical engineering correctness.
- Storing memories or embeddings inside `.sf/sf.db`.
- Auto-staging untracked files that sf merely observed.

## System Flow

```
repo files + git + docs + CI + package manifests + prior runs
        |
        v
Repo Profiler
        |
        v
Risk Classifier
        |
        v
Harness Planner <---- Singularity Memory recall
        |
        v
Template Kit Registry
        |
        v
Harness Writer ---- tracked files: SPEC, ARCHITECTURE, harness/, gates/, CI snippets
        |
        v
Evidence Runner ---- .sf/sf.db: runs, cases, results, observations, drift
        |
        v
Memory Retainer ---- Singularity Memory: patterns, anti-patterns, repo risk notes
        |
        v
Evolution Engine ---- schedules harness update proposals
```

## Auto-flow Integration

Repo-native harnessing should enter the sf flow in stages. The early stages are read-only or evidence-only; prompt behavior changes come after fixtures exist.

| Flow point | Add now or later | Behavior |
|---|---|---|
| Session start / `sf init` | First implementation slice | Create a read-only `RepoProfile` snapshot from source, docs, CI, manifests, git status, and prior run history. |
| Plan phase | Later, after profiler tests | Surface missing harness coverage as a planning input, not as an automatic file write. |
| Execute phase | Later | Allow a task to adopt a proposed harness file only when the task plan claims it. |
| Verify phase | First implementation slice after manifest | Run harness commands and eval suites declared in `harness/manifest.json`. |
| PostUnit hook | First implementation slice | Store evidence summaries in `.sf/sf.db`; retain durable learnings and anti-patterns in Singularity Memory. |
| Reassess phase | Later | Use failed gates and repeated drift to propose harness updates. |
| Workflow prompt injection | Last | Inject top harness lessons and anti-patterns into prompts only after fixture coverage proves it improves outcomes. |

The immediate flow contract is:

1. Observe repo shape.
2. Record untracked files as observations only.
3. Compare observed risk against existing harness coverage.
4. Propose harness changes as reviewable artifacts.
5. Run accepted harnesses.
6. Retain evidence-backed lessons.

Do not jump directly to automatic prompt injection. That is where stale or noisy memory can degrade agent behavior before the evidence path is reliable.

## Data Ownership

| Data | Stored in | Why |
|---|---|---|
| Human contract | Tracked repo files | Reviewable, diffable, travels with code. |
| Executable gates and eval cases | Tracked repo files | CI can run them without sf internals. |
| Run history | `.sf/sf.db` | Local operational evidence and fast queries. |
| Repo profile snapshots | `.sf/sf.db` | Derived state, can be recomputed. |
| Untracked-file observations | `.sf/sf.db` | Important context, but not owned by sf. |
| Learnings and anti-patterns | Singularity Memory | Durable knowledge across sessions and tools. |
| Large reports | `.sf/reports/` or harness report dirs | Avoid bloating SQLite and prompts. |

## Repo Profiler

The profiler reads the repository and emits a `RepoProfile` snapshot:

```ts
interface RepoProfile {
  profileId: string;
  projectHash: string;
  git: {
    head: string | null;
    branch: string | null;
    remoteHash: string | null;
    dirty: boolean;
    changedFiles: RepoFileObservation[];
  };
  stacks: StackSignal[];
  entrypoints: EntrypointSignal[];
  tests: TestSignal[];
  ci: CiSignal[];
  docs: DocumentSignal[];
  dataStores: DataStoreSignal[];
  networkSurfaces: NetworkSurfaceSignal[];
  riskHints: RiskHint[];
  createdAt: number;
}
```

Inputs include:

- `git status --short`, `git ls-files`, current branch, remote, recent history.
- Package manifests such as `package.json`, `go.mod`, `Cargo.toml`, `pyproject.toml`, `flake.nix`, Dockerfiles, Compose files, devcontainers.
- Test directories, scripts, CI workflows, lint configs, migrations, fixtures, browser tests, smoke tests.
- Documentation such as `SPEC.md`, `ARCHITECTURE.md`, `AGENTS.md`, ADRs, runbooks, deployment docs.
- Source structure, entry points, route maps, command definitions, service definitions, generated files.

## Untracked File Policy

Untracked files are part of repo reality. sf must see them.

```ts
interface RepoFileObservation {
  path: string;
  gitStatus: "tracked" | "modified" | "deleted" | "renamed" | "untracked" | "ignored";
  ownership: "sf_generated" | "user_owned" | "observed_only" | "candidate_harness";
  language: string | null;
  sizeBytes: number;
  contentHash: string | null;
  summary: string | null;
  firstSeenAt: number;
  lastSeenAt: number;
  adoptedAt: number | null;
  adoptionUnitId: string | null;
}
```

Rules:

- `untracked` defaults to `observed_only`.
- `observed_only` files can influence context, risk classification, and memory.
- `observed_only` files cannot be staged, deleted, reformatted, moved, or overwritten by automatic flows.
- A file becomes `sf_generated` or `candidate_harness` only when a unit plan declares that ownership and the diff is reviewable.
- Repeated observations can produce a harness recommendation, not an automatic commit.

This lets sf understand documents, scratch specs, generated reports, and local experiments without turning them into accidental repository history.

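The ownership rules above can be sketched as a small guard layer. This is a minimal sketch under stated assumptions, not sf's implementation: `ObservedFile` is a reduced stand-in for `RepoFileObservation`, and `defaultOwnership`, `canAutoMutate`, `adopt`, and the `UnitPlan` shape are hypothetical names. Treating `ignored` files as `observed_only`, and every other status as `user_owned`, is also an assumption; the document only specifies the `untracked` default. The "diff is reviewable" condition is a human step and is not modeled.

```ts
type GitStatus = "tracked" | "modified" | "deleted" | "renamed" | "untracked" | "ignored";
type Ownership = "sf_generated" | "user_owned" | "observed_only" | "candidate_harness";
type AdoptableOwnership = "sf_generated" | "candidate_harness";

// Reduced stand-in for RepoFileObservation (only the fields this sketch needs).
interface ObservedFile {
  path: string;
  gitStatus: GitStatus;
  ownership: Ownership;
  adoptedAt: number | null;
  adoptionUnitId: string | null;
}

// Hypothetical unit plan: the paths it explicitly claims and the ownership it declares.
interface UnitPlan {
  unitId: string;
  claims: Map<string, AdoptableOwnership>;
}

// Rule: untracked defaults to observed_only. The other branches are assumptions.
function defaultOwnership(gitStatus: GitStatus): Ownership {
  return gitStatus === "untracked" || gitStatus === "ignored" ? "observed_only" : "user_owned";
}

// Rule: automatic flows must not stage, delete, reformat, move, or overwrite
// observed_only files. Also refusing user_owned files is an assumption.
function canAutoMutate(file: ObservedFile): boolean {
  return file.ownership === "sf_generated" || file.ownership === "candidate_harness";
}

// Rule: ownership changes only when a unit plan declares it for this exact path.
function adopt(file: ObservedFile, plan: UnitPlan, now: number): ObservedFile {
  const declared = plan.claims.get(file.path);
  if (declared === undefined) return file; // no claim: leave the observation untouched
  return { ...file, ownership: declared, adoptedAt: now, adoptionUnitId: plan.unitId };
}
```

Keeping `adopt` pure (it returns a new record instead of mutating) makes the ownership transition easy to log alongside the unit that caused it.
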
## Risk Classifier

The classifier maps `RepoProfile` to required harness families.

| Risk family | Signals | Required harness examples |
|---|---|---|
| Web | Next.js, Playwright, routes, CSS, browser tools | Playwright smoke, a11y, visual diffs, performance budget, browser trace replay |
| Agent | tool registry, prompts, MCP, provider SDKs | fixture replay, trajectory assertions, tool permission tests, injection red-team |
| RAG / retrieval | vector DB, embeddings, search, chunking | recall@k, MRR, NDCG, near-miss sets, faithfulness, context recall |
| Infrastructure | Nix, Docker, CI, deploy scripts | build matrix, secret scan, config validation, rollback checks |
| Database | migrations, SQL, ORM | migration up/down, data contract tests, destructive-change guard |
| Windows service | `_windows.go`, service managers, PowerShell | GOOS windows build, service install smoke, PowerShell contract tests |
| Security | auth, sessions, tokens, secrets | auth bypass tests, CSRF, rate limit, sensitive log scan |
| Performance | native bindings, compile-heavy code, hot loops | benchmark suite, regression threshold, flamegraph capture |

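One way to read the table above is as a rule table from observed signals to risk families. The sketch below assumes signals arrive as plain strings; the real `StackSignal` and `RiskHint` shapes are not defined in this document, so `RULES`, `RiskFamily`, and `classify` are hypothetical, and exact keyword matching is a deliberate simplification.

```ts
type RiskFamily =
  | "web" | "agent" | "rag" | "infrastructure"
  | "database" | "windows_service" | "security" | "performance";

// Hypothetical rule table: lowercase keywords copied from the Signals column.
const RULES: Record<RiskFamily, string[]> = {
  web: ["next.js", "playwright", "routes", "css", "browser tools"],
  agent: ["tool registry", "prompts", "mcp", "provider sdks"],
  rag: ["vector db", "embeddings", "search", "chunking"],
  infrastructure: ["nix", "docker", "ci", "deploy scripts"],
  database: ["migrations", "sql", "orm"],
  windows_service: ["_windows.go", "service managers", "powershell"],
  security: ["auth", "sessions", "tokens", "secrets"],
  performance: ["native bindings", "compile-heavy code", "hot loops"],
};

// Returns every family with at least one matching signal, in table order.
function classify(signals: string[]): RiskFamily[] {
  const seen = new Set(signals.map((signal) => signal.toLowerCase()));
  return (Object.keys(RULES) as RiskFamily[]).filter((family) =>
    RULES[family].some((keyword) => seen.has(keyword)),
  );
}

// A repo with Playwright tests and SQL migrations needs web and database harnesses.
const families = classify(["Playwright", "migrations"]);
```

A repo can land in several families at once, which is why the function returns a list rather than a single label.
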
## Harness Planner

The planner compares the required harness families against the repo's current harness inventory.

Outputs:

- `missing`: risks with no harness coverage.
- `weak`: harness exists but lacks thresholds, fixtures, CI wiring, or reports.
- `stale`: harness references files/scripts that no longer exist.
- `overbroad`: harness is too slow or too generic for the risk.
- `proposed`: exact files and commands to add or modify.

Every proposal must include:

- Purpose.
- Consumer.
- Risk protected.
- Files written.
- Commands run.
- Blocking criteria.
- Rollback path.

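The comparison and the proposal checklist above might look like the following. This is a sketch under assumptions: `HarnessEntry`, `assessCoverage`, and `missingProposalFields` are hypothetical names, the inventory is assumed to hold at most one entry per risk family, the "weak" check covers only thresholds and CI wiring, and `overbroad` is left out because it needs runtime data. Treating an empty string or empty list as a missing proposal field is also an assumption.

```ts
type CoverageStatus = "covered" | "missing" | "weak" | "stale";

interface HarnessEntry {
  riskFamily: string;
  referencedPaths: string[]; // files and scripts the harness depends on
  hasThresholds: boolean;
  hasCiWiring: boolean;
}

function assessCoverage(
  requiredFamilies: string[],
  inventory: HarnessEntry[],
  existingPaths: Set<string>,
): Map<string, CoverageStatus> {
  const report = new Map<string, CoverageStatus>();
  for (const family of requiredFamilies) {
    const entry = inventory.find((candidate) => candidate.riskFamily === family);
    if (entry === undefined) {
      report.set(family, "missing");
    } else if (entry.referencedPaths.some((path) => !existingPaths.has(path))) {
      report.set(family, "stale"); // references something that no longer exists
    } else if (!entry.hasThresholds || !entry.hasCiWiring) {
      report.set(family, "weak");
    } else {
      report.set(family, "covered");
    }
  }
  return report;
}

// The seven items every proposal must include, as hypothetical field names.
const REQUIRED_PROPOSAL_FIELDS = [
  "purpose", "consumer", "riskProtected", "filesWritten",
  "commandsRun", "blockingCriteria", "rollbackPath",
] as const;

// Returns the names of required fields that are absent or empty.
function missingProposalFields(proposal: Record<string, unknown>): string[] {
  return REQUIRED_PROPOSAL_FIELDS.filter((field) => {
    const value = proposal[field];
    if (value === undefined || value === null) return true;
    if (typeof value === "string") return value.trim().length === 0;
    if (Array.isArray(value)) return value.length === 0;
    return false;
  });
}
```

Checking `stale` before `weak` means a harness that points at deleted scripts is reported as stale even if it also lacks thresholds, since the broken reference has to be fixed first.
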
## Template Kit Registry

Template kits are starting points, not permanent truth.

```ts
interface HarnessTemplateKit {
  id: string;
  title: string;
  appliesWhen: RiskHint[];
  writes: TemplateOutput[];
  commands: HarnessCommand[];
  requiredEvidence: EvidenceRequirement[];
  evolutionRules: EvolutionRule[];
}
```

Core kits:

| Kit | Files |
|---|---|
| `go-service` | `harness/manifest.json`, `gates/go-test.sh`, `gates/go-vet.sh`, optional `gates/windows-build.sh` |
| `typescript-cli` | `gates/npm-build.sh`, `gates/typecheck.sh`, fixture replay config |
| `agent-runtime` | `harness/evals/agent/*.jsonl`, trajectory assertions, injection red-team cases |
| `rag-system` | retrieval datasets, recall metrics, near-miss cases, judge rubrics |
| `web-app` | Playwright smoke, visual baseline policy, a11y checks |
| `database` | migration tests, destructive SQL guard, seed data fixtures |
| `nix-project` | `nix flake check`, dev shell smoke, direnv policy checks |
| `charm-service` | Go build/test, Wish SSH smoke, VCR session recording checks |

## Harness Manifest

Each repo can carry a manifest:

```json
{
  "schema": "sf.harness.v1",
  "owner": "sf",
  "generatedBy": "sf",
  "repoProfileId": "01J...",
  "riskFamilies": ["agent", "rag", "web"],
  "commands": [
    {
      "id": "fixture-replay",
      "command": "npm run test:fixtures",
      "phase": "post_slice",
      "blocks": true,
      "timeoutSeconds": 300
    }
  ],
  "evalSuites": [
    {
      "id": "agent-tool-safety",
      "path": "harness/evals/agent-tool-safety.jsonl",
      "runner": "sf-eval",
      "threshold": 0.95
    }
  ]
}
```

The manifest is a tracked contract. `.sf/sf.db` stores run history for the manifest, not the manifest itself.

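For a consumer of the manifest, a typed view plus a few sanity checks can catch a broken contract before anything runs. The types below mirror the example's fields; `validateManifest` and `blockingCommands` are hypothetical helpers, and the specific checks (unique command IDs, positive timeouts, thresholds within 0 to 1) are assumptions rather than rules stated in this document.

```ts
interface ManifestCommand {
  id: string;
  command: string;
  phase: string;
  blocks: boolean;
  timeoutSeconds: number;
}

interface ManifestEvalSuite {
  id: string;
  path: string;
  runner: string;
  threshold: number;
}

interface HarnessManifest {
  schema: string;
  owner: string;
  generatedBy: string;
  repoProfileId: string;
  riskFamilies: string[];
  commands: ManifestCommand[];
  evalSuites: ManifestEvalSuite[];
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function validateManifest(manifest: HarnessManifest): string[] {
  const problems: string[] = [];
  if (manifest.schema !== "sf.harness.v1") {
    problems.push(`unsupported schema: ${manifest.schema}`);
  }
  const seenIds = new Set<string>();
  for (const command of manifest.commands) {
    if (seenIds.has(command.id)) problems.push(`duplicate command id: ${command.id}`);
    seenIds.add(command.id);
    if (!(command.timeoutSeconds > 0)) {
      problems.push(`command ${command.id}: timeoutSeconds must be positive`);
    }
  }
  for (const suite of manifest.evalSuites) {
    if (!(suite.threshold >= 0 && suite.threshold <= 1)) {
      problems.push(`suite ${suite.id}: threshold must be between 0 and 1`);
    }
  }
  return problems;
}

// Commands that can fail a given phase, in manifest order.
function blockingCommands(manifest: HarnessManifest, phase: string): ManifestCommand[] {
  return manifest.commands.filter((command) => command.phase === phase && command.blocks);
}
```

Because the manifest is tracked, checks like these can also run in CI without any access to `.sf/sf.db`.
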
## Judge Rig

The judge rig follows one rule: deterministic evidence first, model judgment second.

### Implementation boundary

The judge rig is documented now, but it is not part of the current repo-profiler slice.

Placement:

- sf core stays in TypeScript for repo profiling, harness proposal planning,
  project preferences/config, and `.sf/sf.db` run ledgers.
- Deterministic and structural assertions can run locally from sf because they
  already map to commands, AST checks, schemas, and git/diff checks.
- Model-judge execution and calibration should be a future Go/Charm service,
  not another TS subsystem. Use `fantasy`/`catwalk` for model/provider routing,
  Go HTTP/MCP APIs for sf integration, and `promwish`-style metrics when it is
  daemonized.
- Durable calibration lessons belong in Singularity Memory. Local `.sf/sf.db`
  stores run IDs, rubric hashes, model IDs, scores, raw output references, and
  pass/fail summaries.
- Repo-local custom skills remain out of scope. Repo-specific eval suites or
  harness files are later opt-in proposals only.

### Case format

```json
{
  "id": "rag-role-reversal-001",
  "kind": "retrieval",
  "input": {
    "query": "Which service owns failover routing?",
    "expected_documents": ["docs/architecture.md#gateway"]
  },
  "assert": [
    { "type": "recall_at_k", "k": 5, "threshold": 1.0 },
    { "type": "context_recall", "threshold": 0.85 },
    { "type": "llm_rubric", "rubric": "Answer must identify the gateway and not the portal as the routing owner.", "advisory": true }
  ],
  "tags": ["rag", "role-reversal", "near-miss"]
}
```

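The `recall_at_k` assertion in the case above, and the related `mrr` metric, have standard definitions that are short enough to sketch. The function names and the choice to throw on an empty expected set are assumptions for illustration; how the actual runner computes these metrics is not specified in this document.

```ts
// recall@k: fraction of expected documents that appear in the top k results.
function recallAtK(ranked: string[], expected: string[], k: number): number {
  const wanted = new Set(expected);
  if (wanted.size === 0) throw new Error("recall@k needs at least one expected document");
  const topK = new Set(ranked.slice(0, k));
  let hits = 0;
  for (const doc of wanted) {
    if (topK.has(doc)) hits += 1;
  }
  return hits / wanted.size;
}

// Reciprocal rank: 1 / position of the first expected document, or 0 if none is retrieved.
function reciprocalRank(ranked: string[], expected: string[]): number {
  const wanted = new Set(expected);
  const index = ranked.findIndex((doc) => wanted.has(doc));
  return index === -1 ? 0 : 1 / (index + 1);
}

// MRR: mean reciprocal rank across all cases in a suite.
function meanReciprocalRank(cases: { ranked: string[]; expected: string[] }[]): number {
  if (cases.length === 0) throw new Error("MRR needs at least one case");
  const total = cases.reduce((sum, c) => sum + reciprocalRank(c.ranked, c.expected), 0);
  return total / cases.length;
}

// Near-miss scenario: the portal doc outranks the gateway doc.
const ranked = ["docs/portal.md", "docs/architecture.md#gateway", "docs/runbook.md"];
const expected = ["docs/architecture.md#gateway"];
const passesAtFive = recallAtK(ranked, expected, 5) >= 1.0;
```

In this scenario recall@5 still passes because the gateway document is retrieved, while the reciprocal rank of 0.5 shows that a near-miss outranked it. That difference is why the table of required harness examples lists ranking metrics alongside recall.
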
### Assertion types

| Type | Blocking default | Notes |
|---|---|---|
| `exit_code` | yes | Command pass/fail. |
| `contains` / `not_contains` | yes | Deterministic text contracts. |
| `json_schema` | yes | Structured output contract. |
| `ast_match` | yes | Code shape and API use. |
| `recall_at_k` | yes when calibrated | Retrieval coverage. |
| `mrr` / `ndcg` | yes when calibrated | Ranking quality. |
| `tool_call_f1` | yes when calibrated | Agent tool precision/recall. |
| `trajectory_goal_success` | no by default | Useful judge signal, requires trace data. |
| `llm_rubric` | no by default | Advisory until calibrated. |
| `factuality` | no by default | Needs references and judge calibration. |
| `select_best` | no | Useful for model/prompt comparison. |

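The blocking defaults in the table can be expressed as a small decision function, with the gate verdict depending only on blocking failures. This is a sketch of the defaults only: `blocksByDefault` and `gateVerdict` are hypothetical names, the single `calibrated` flag is a simplification, and promotion of judge-style assertions to blocking (covered under judge calibration) is intentionally not modeled.

```ts
const DETERMINISTIC = new Set(["exit_code", "contains", "not_contains", "json_schema", "ast_match"]);
const CALIBRATED_METRICS = new Set(["recall_at_k", "mrr", "ndcg", "tool_call_f1"]);

interface AssertionSpec {
  type: string;
  advisory?: boolean; // explicit per-assertion override, as in the case format
}

interface AssertionOutcome {
  spec: AssertionSpec;
  passed: boolean;
}

function blocksByDefault(spec: AssertionSpec, calibrated: boolean): boolean {
  if (spec.advisory === true) return false; // an explicit advisory flag always wins
  if (DETERMINISTIC.has(spec.type)) return true;
  if (CALIBRATED_METRICS.has(spec.type)) return calibrated;
  return false; // trajectory_goal_success, llm_rubric, factuality, select_best
}

// Deterministic evidence decides the gate; advisory failures are reported, not enforced.
function gateVerdict(outcomes: AssertionOutcome[], calibrated: boolean) {
  const blockingFailures: string[] = [];
  const advisoryFailures: string[] = [];
  for (const { spec, passed } of outcomes) {
    if (passed) continue;
    const bucket = blocksByDefault(spec, calibrated) ? blockingFailures : advisoryFailures;
    bucket.push(spec.type);
  }
  return { passed: blockingFailures.length === 0, blockingFailures, advisoryFailures };
}
```

Returning the advisory failures separately keeps them visible in reports, so a noisy rubric can be studied without being able to fail a build.
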
### Judge calibration

Before a model judge can block:

- The rubric file must be tracked.
- The judge model and provider must be pinned.
- A calibration suite with known pass/fail examples must pass.
- A disagreement policy must exist for high-risk suites.
- The runner must store the judge prompt hash, model ID, score, reason, and raw output reference.

For high-risk agent or RAG gates, use either deterministic metrics or a judge quorum. A single uncalibrated model opinion is never enough.

Calibration lifecycle:

1. Build a golden set with known pass, fail, and ambiguous examples from real
   bugs, traces, PR reviews, bad retrievals, prompt-injection attempts, and good
   outputs.
2. Split it into calibration and held-out suites. Tune rubrics only against the
   calibration suite.
3. Pin the judge provider, model ID, temperature, output schema, rubric file,
   and rubric hash.
4. Measure false-pass rate, false-block rate, precision/recall/F1 for the
   failure class, schema validity, quorum disagreement, and rerun stability.
5. Keep the judge advisory until the held-out suite meets the threshold for the
   risk family.
6. Promote to blocking only with either a deterministic/structural companion
   gate or a calibrated judge quorum.
7. Recalibrate when the rubric, judge model, provider, prompt wrapper, eval case
   schema, or target workflow changes.

Default promotion bar:

- Critical/security gates: zero false passes on held-out critical failures, plus
  deterministic or structural companion evidence.
- Product-quality gates: false-block rate low enough that developers do not
  route around the gate; judge remains advisory if noisy.
- RAG/agent metrics: calibrated thresholds for recall@k, MRR/NDCG,
  context-recall, tool-call F1, or trajectory success; model rubrics explain
  failures but do not replace the metric.

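Step 4 of the lifecycle and the critical-gate promotion bar can be made concrete with a confusion-matrix sketch over a held-out suite. The names `JudgedExample`, `calibrationStats`, and `meetsCriticalBar` are hypothetical, "fail" is treated as the positive class, ambiguous golden examples are assumed to be excluded before scoring, and a rate with an empty denominator is reported as 0 by assumption.

```ts
type Verdict = "pass" | "fail";

interface JudgedExample {
  expected: Verdict; // label from the golden set
  judged: Verdict;   // what the model judge said
  critical: boolean; // whether the example is a critical/security failure case
}

const ratio = (numerator: number, denominator: number): number =>
  denominator === 0 ? 0 : numerator / denominator;

function calibrationStats(examples: JudgedExample[]) {
  const count = (predicate: (e: JudgedExample) => boolean) => examples.filter(predicate).length;
  const caughtFailures = count((e) => e.expected === "fail" && e.judged === "fail");
  const falsePasses = count((e) => e.expected === "fail" && e.judged === "pass");
  const falseBlocks = count((e) => e.expected === "pass" && e.judged === "fail");
  const truePasses = count((e) => e.expected === "pass" && e.judged === "pass");
  // Precision and recall are computed for the failure class.
  const precision = ratio(caughtFailures, caughtFailures + falseBlocks);
  const recall = ratio(caughtFailures, caughtFailures + falsePasses);
  return {
    falsePassRate: ratio(falsePasses, caughtFailures + falsePasses),
    falseBlockRate: ratio(falseBlocks, truePasses + falseBlocks),
    precision,
    recall,
    f1: ratio(2 * precision * recall, precision + recall),
  };
}

// Critical bar: zero false passes on held-out critical failures, plus companion evidence.
function meetsCriticalBar(heldOut: JudgedExample[], hasCompanionGate: boolean): boolean {
  const criticalFalsePass = heldOut.some(
    (e) => e.critical && e.expected === "fail" && e.judged === "pass",
  );
  return hasCompanionGate && !criticalFalsePass;
}
```

The two error rates are kept separate on purpose: a false pass lets a real failure through, while a false block is the kind of noise that makes developers route around the gate.
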
## Singularity Memory Integration

Pre-dispatch:

- Recall repo-specific harness lessons from `project/{hash}`.
- Recall global engineering anti-patterns from `global/coding`.
- Inject only the top relevant items into context.
- Keep untracked observations summarized, not pasted wholesale.

Post-unit:

- Retain successful harness changes only after gates pass.
- Retain failures as anti-patterns with source unit and evidence IDs.
- Retain judge calibration results separately from normal coding memories.
- Link memory entries to `.sf/sf.db` run IDs and report paths.

Over time:

- Repeated failing eval cases become anti-patterns.
- Repeated successful fixes mature from candidate to established to proven.
- Stale memories decay unless revalidated by passing evidence.
- Drift events propose new harness tasks when repo reality changes.

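The maturity progression and decay described above could be tracked with a counter and a timestamp. Everything numeric here is an assumption for illustration: the document does not say how many confirmations move a lesson from candidate to established to proven, or how long a memory may go unvalidated, so `ESTABLISHED_AFTER`, `PROVEN_AFTER`, and `STALE_AFTER_MS` are placeholders, and `Lesson`, `recordPassingEvidence`, and `isStale` are hypothetical names rather than a Singularity Memory API.

```ts
type Maturity = "candidate" | "established" | "proven";

interface Lesson {
  maturity: Maturity;
  confirmations: number;   // passing-evidence events that revalidated this lesson
  lastValidatedAt: number; // epoch milliseconds
}

// Placeholder thresholds (assumptions, not values from this document).
const ESTABLISHED_AFTER = 2;
const PROVEN_AFTER = 5;
const STALE_AFTER_MS = 90 * 24 * 60 * 60 * 1000;

// Each passing gate run that exercises the lesson counts as one confirmation.
function recordPassingEvidence(lesson: Lesson, now: number): Lesson {
  const confirmations = lesson.confirmations + 1;
  const maturity: Maturity =
    confirmations >= PROVEN_AFTER ? "proven"
    : confirmations >= ESTABLISHED_AFTER ? "established"
    : "candidate";
  return { maturity, confirmations, lastValidatedAt: now };
}

// A lesson is stale when no passing evidence has revalidated it recently.
function isStale(lesson: Lesson, now: number): boolean {
  return now - lesson.lastValidatedAt > STALE_AFTER_MS;
}
```

Because revalidation resets `lastValidatedAt`, a lesson that keeps being confirmed by passing evidence never decays, which matches the "unless revalidated" rule.
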
## Web As TUI

For web repos, treat the browser as an evented terminal:

- The DOM/accessibility tree is the screen buffer.
- User actions are keypress/click/form events.
- Playwright traces are VCR recordings.
- Visual diffs are frame comparisons.
- Browser console and network logs are stderr/stdout.

The web harness should include action replay, semantic assertions, accessibility checks, screenshot diffs, and performance budgets. It should not rely on screenshots alone.

## Acceptance Criteria

- sf can profile a repo and produce a stable `RepoProfile` snapshot.
- sf records untracked files as `observed_only` and never stages them by default.
- sf can generate a reviewable harness manifest and at least one executable gate from a template kit.
- sf can run a mixed deterministic/model-judge eval suite and store structured results.
- sf retains successful patterns and failed anti-patterns in Singularity Memory with evidence links.
- sf can detect harness drift and propose a follow-up unit instead of silently mutating files.