singularity-forge/docs/dev/repo-native-harness-template-kits.md
2026-04-29 18:28:45 +02:00

9.5 KiB

Repo-native Harness Template Kits and Evaluation Research

Purpose

This document captures the external patterns sf should borrow for repo-native harness generation and evolution. It is a design input for ADR-018 and repo-native-harness-architecture.md.

Summary

Use template kits for the first scaffold, then let sf specialize them from source, docs, CI, evidence, and Singularity Memory.

The useful split:

  • Backstage-style templates for parameterized scaffolding.
  • promptfoo-style declarative eval matrices, assertions, caching, red-team packs, and reports.
  • OpenAI Evals-style custom eval registries and completion/system adapters.
  • Ragas-style RAG and agent metrics, datasets, experiments, and testset generation.
  • IBM-style agentic engineering governance: human oversight, modular tasks, RAG-grounded context, review loops, and CI integration.

sf's improvement is that the harness is repo-native and evolving. It is not a one-shot template and not only prompt evaluation.

Add Now vs Later

Timing Add to markdown/spec Add to runtime flow
Now Template kit contract, untracked observation policy, judge rig, eval runner result shape, memory feedback loop. No runtime prompt changes yet.
First implementation Profiler output schema and harness manifest schema. Read-only profiling at session start or sf init.
Next Concrete kits for agent-runtime, rag-system, web-app, go-service, and nix-project. Manifest-driven gate/eval runner in verify phase.
Later Calibration policy for model judges and drift policy for harness evolution. Go/Charm judge-calibration service, prompt injection of recalled harness lessons, and automatic harness-update proposals.

This keeps the current docs ahead of implementation while avoiding a hidden behavior change in the agent loop.

The model-judge runner is intentionally not implemented in this documentation slice. When it is built, it should be a Go/Charm service adjacent to the Singularity Memory platform, not a repo-local skill pack and not a hidden SF core behavior change.

What To Borrow

Source Useful pattern SF adaptation
Backstage Software Templates Skeletons with parameters, reviewable creation flow, action logs HarnessTemplateKit registry with dry-run diffs and explicit ownership.
promptfoo evals Declarative cases, providers, prompts, assertions, matrix comparison harness/evals/*.jsonl plus sf eval run, with optional promptfoo import/export.
promptfoo model-graded metrics llm-rubric, factuality, context recall/relevance/faithfulness, trajectory goal success SF judge rig with calibrated rubrics and deterministic gates as primary blockers.
promptfoo red teaming Security-focused attack packs and CI reports Repo-specific red-team suites for agents, tools, MCP, browser, and RAG.
OpenAI Evals Custom evals for LLM systems and private task-specific eval data SF eval suites tied to repo risks, run IDs, and tracked artifacts.
Ragas RAG metrics, agent/tool metrics, testset generation, experiment loop Retrieval/agent suites with recall@k, context recall, tool-call F1, goal accuracy, and near-miss fixtures.
IBM agentic engineering Governance, human oversight, modular work, RAG grounding, CI review loops SF enforces reviewable harness changes, evidence ledgers, and memory-backed evolution.

What Not To Borrow

Pattern Why not
One-shot scaffolding as "done" Repos drift. Harnesses need lifecycle management.
Judge-only pass/fail Too risky for correctness, security, migrations, infra, and production behavior.
External SaaS as source of truth sf should run locally and keep repo contracts in the repo.
Prompt-only evals Software agents also mutate files, call tools, run commands, and change deployment risk.
Hidden ownership of generated files Every file adoption must be explicit and reviewable.

Template Kit Shape

Each kit must answer:

  • What risk does it cover?
  • What repo signals make it applicable?
  • What files will it write?
  • What commands will it run?
  • What evidence blocks or passes?
  • What drift should make sf revisit it?
{
  "id": "rag-system",
  "appliesWhen": ["embeddings", "retrieval", "vector-db"],
  "writes": [
    "harness/manifest.json",
    "harness/evals/retrieval-recall.jsonl",
    "harness/evals/structural-near-misses.jsonl",
    "gates/rag-eval.sh"
  ],
  "commands": [
    {
      "id": "rag-eval",
      "command": "sf eval run harness/evals/retrieval-recall.jsonl",
      "blocks": true
    }
  ],
  "evolutionRules": [
    "When retrieval code changes, rerun recall suites.",
    "When a missed document causes a failure, add it as a near-miss case.",
    "When chunking changes, compare recall and answer faithfulness before promotion."
  ]
}

Core Kits

Agent Runtime

Use when the repo has agent loops, MCP tools, prompt routers, model providers, or autonomous workflows.

Files:

  • harness/evals/agent-tool-safety.jsonl
  • harness/evals/trajectory-goal-success.jsonl
  • harness/evals/injection-red-team.jsonl
  • gates/agent-fixture-replay.sh
  • gates/agent-red-team.sh

Metrics:

  • Fixture replay pass rate.
  • Tool-call exact match for required calls.
  • Tool-call F1 for flexible trajectories.
  • Forbidden tool call count.
  • Goal success judge score, advisory until calibrated.
  • Prompt injection refusal/containment.

RAG / Retrieval

Use when the repo retrieves docs, code, tickets, memories, vectors, or search results.

Files:

  • harness/evals/retrieval-recall.jsonl
  • harness/evals/context-faithfulness.jsonl
  • harness/evals/structural-near-misses.jsonl
  • gates/retrieval-eval.sh

Metrics:

  • recall@k.
  • MRR.
  • NDCG.
  • Context recall.
  • Context faithfulness.
  • Noise sensitivity.
  • Near-miss failure rate.

Important: do not optimize precision alone. A smaller, cleaner context that drops required evidence is a regression.

Web App

Use when the repo has browser-visible UI.

Files:

  • harness/web/smoke.spec.ts
  • harness/web/a11y.spec.ts
  • harness/web/visual.spec.ts
  • gates/web-smoke.sh

Metrics:

  • User workflow pass/fail.
  • Accessibility violations by severity.
  • Screenshot diff threshold.
  • Console error count.
  • Network failure count.
  • Core route performance budget.

Go / Windows Service

Use when the repo has Go services or Windows agents.

Files:

  • gates/go-test.sh
  • gates/go-vet.sh
  • gates/windows-build.sh
  • harness/contracts/windows-service.json

Metrics:

  • go test ./....
  • go vet ./....
  • GOOS=windows GOARCH=amd64 go build.
  • Service install/start/stop contract in a Windows-capable environment.

Nix Project

Use when the repo has flake.nix, shell.nix, or direnv policy.

Files:

  • gates/nix-flake-check.sh
  • gates/dev-shell-smoke.sh
  • harness/contracts/devshell.json

Metrics:

  • nix flake check.
  • Dev shell starts and exposes expected tool versions.
  • Direnv policy is explicit.
  • Build caches are used where configured.

Charm Service

Use when the repo has Go services using Charm libraries such as Wish, Bubble Tea, fantasy, or x/vcr.

Files:

  • gates/charm-go-test.sh
  • gates/wish-smoke.sh
  • harness/vcr/sessions/*.jsonl

Metrics:

  • SSH app starts and responds.
  • TUI smoke trace replays.
  • x/vcr recording can be replayed.
  • Metrics endpoint responds when promwish is enabled.

Eval Runner Contract

An SF eval run should produce:

{
  "schema": "sf.eval.result.v1",
  "suiteId": "agent-tool-safety",
  "runId": "01J...",
  "profileId": "01J...",
  "startedAt": 1770000000000,
  "endedAt": 1770000001200,
  "cases": 42,
  "passed": 40,
  "failed": 2,
  "score": 0.9523,
  "blocking": true,
  "judgePromptHash": "sha256:...",
  "provider": "openai:gpt-5-mini",
  "reportPath": ".sf/reports/evals/agent-tool-safety/01J....json"
}

Store the summary in .sf/sf.db. Store larger artifacts as files. Retain only durable lessons in Singularity Memory.

For model-judge cases, the runner must also store calibration metadata:

  • Rubric path and content hash.
  • Judge provider, model ID, temperature, and output schema version.
  • Calibration suite ID and held-out suite ID.
  • False-pass rate, false-block rate, precision/recall/F1, quorum disagreement, and rerun-stability summary.
  • Raw judge output reference for later audit.

Model-judge suites are advisory until their calibration metadata says they are eligible to block.

Singularity Memory Feedback Loop

Retain:

  • Harnesses that repeatedly passed and caught real regressions.
  • Failures that required human correction.
  • Judge rubrics that showed false positives or false negatives.
  • Repo-specific risk notes discovered during profiling.
  • Useful untracked observations, still marked as observations.

Recall:

  • Before planning harness changes.
  • Before dispatching high-risk work.
  • Before running model-judge evals.
  • Before deleting or replacing any generated harness file.

Feedback:

  • Positive when recalled memory helped a gate pass or prevented a repeat bug.
  • Negative when recalled memory was stale or led to a bad recommendation.
  • Validation when a memory still matches current repo evidence.

References