# Repo-native Harness Template Kits and Evaluation Research

## Purpose
This document captures the external patterns sf should borrow for repo-native harness generation and evolution. It is a design input for ADR-018 and repo-native-harness-architecture.md.
## Summary
Use template kits for the first scaffold, then let sf specialize them from source, docs, CI, evidence, and Singularity Memory.
The useful split:
- Backstage-style templates for parameterized scaffolding.
- promptfoo-style declarative eval matrices, assertions, caching, red-team packs, and reports.
- OpenAI Evals-style custom eval registries and completion/system adapters.
- Ragas-style RAG and agent metrics, datasets, experiments, and testset generation.
- IBM-style agentic engineering governance: human oversight, modular tasks, RAG-grounded context, review loops, and CI integration.
sf's improvement is that the harness is repo-native and evolving. It is not a one-shot template and not only prompt evaluation.
## Add Now vs Later
| Timing | Add to markdown/spec | Add to runtime flow |
|---|---|---|
| Now | Template kit contract, untracked observation policy, judge rig, eval runner result shape, memory feedback loop. | No runtime prompt changes yet. |
| First implementation | Profiler output schema and harness manifest schema. | Read-only profiling at session start or sf init. |
| Next | Concrete kits for agent-runtime, rag-system, web-app, go-service, and nix-project. | Manifest-driven gate/eval runner in verify phase. |
| Later | Calibration policy for model judges and drift policy for harness evolution. | Go/Charm judge-calibration service, prompt injection of recalled harness lessons, and automatic harness-update proposals. |
This keeps the current docs ahead of implementation while avoiding a hidden behavior change in the agent loop.
The model-judge runner is intentionally not implemented in this documentation slice. When it is built, it should be a Go/Charm service adjacent to the Singularity Memory platform, not a repo-local skill pack and not a hidden SF core behavior change.
## What To Borrow
| Source | Useful pattern | SF adaptation |
|---|---|---|
| Backstage Software Templates | Skeletons with parameters, reviewable creation flow, action logs | HarnessTemplateKit registry with dry-run diffs and explicit ownership. |
| promptfoo evals | Declarative cases, providers, prompts, assertions, matrix comparison | harness/evals/*.jsonl plus sf eval run, with optional promptfoo import/export. |
| promptfoo model-graded metrics | llm-rubric, factuality, context recall/relevance/faithfulness, trajectory goal success | SF judge rig with calibrated rubrics and deterministic gates as primary blockers. |
| promptfoo red teaming | Security-focused attack packs and CI reports | Repo-specific red-team suites for agents, tools, MCP, browser, and RAG. |
| OpenAI Evals | Custom evals for LLM systems and private task-specific eval data | SF eval suites tied to repo risks, run IDs, and tracked artifacts. |
| Ragas | RAG metrics, agent/tool metrics, testset generation, experiment loop | Retrieval/agent suites with recall@k, context recall, tool-call F1, goal accuracy, and near-miss fixtures. |
| IBM agentic engineering | Governance, human oversight, modular work, RAG grounding, CI review loops | SF enforces reviewable harness changes, evidence ledgers, and memory-backed evolution. |
## What Not To Borrow
| Pattern | Why not |
|---|---|
| One-shot scaffolding as "done" | Repos drift. Harnesses need lifecycle management. |
| Judge-only pass/fail | Too risky for correctness, security, migrations, infra, and production behavior. |
| External SaaS as source of truth | sf should run locally and keep repo contracts in the repo. |
| Prompt-only evals | Software agents also mutate files, call tools, run commands, and change deployment risk. |
| Hidden ownership of generated files | Every file adoption must be explicit and reviewable. |
## Template Kit Shape
Each kit must answer:
- What risk does it cover?
- What repo signals make it applicable?
- What files will it write?
- What commands will it run?
- What evidence blocks or passes?
- What drift should make sf revisit it?
Example kit manifest:

```json
{
  "id": "rag-system",
  "appliesWhen": ["embeddings", "retrieval", "vector-db"],
  "writes": [
    "harness/manifest.json",
    "harness/evals/retrieval-recall.jsonl",
    "harness/evals/structural-near-misses.jsonl",
    "gates/rag-eval.sh"
  ],
  "commands": [
    {
      "id": "rag-eval",
      "command": "sf eval run harness/evals/retrieval-recall.jsonl",
      "blocks": true
    }
  ],
  "evolutionRules": [
    "When retrieval code changes, rerun recall suites.",
    "When a missed document causes a failure, add it as a near-miss case.",
    "When chunking changes, compare recall and answer faithfulness before promotion."
  ]
}
```
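A minimal Go sketch of a decoding target for this manifest; the type and field names mirror the JSON keys above but are illustrative assumptions, not a frozen schema:

```go
package harness

// KitManifest mirrors the example kit manifest above. Field names follow
// the JSON keys; the Go type itself is an illustrative assumption.
type KitManifest struct {
	ID             string       `json:"id"`
	AppliesWhen    []string     `json:"appliesWhen"`    // repo signals that make the kit applicable
	Writes         []string     `json:"writes"`         // files the kit proposes to write, surfaced as a dry-run diff
	Commands       []KitCommand `json:"commands"`       // gate and eval commands the kit registers
	EvolutionRules []string     `json:"evolutionRules"` // drift conditions that should make sf revisit the kit
}

// KitCommand is one command owned by the kit.
type KitCommand struct {
	ID      string `json:"id"`
	Command string `json:"command"`
	Blocks  bool   `json:"blocks"` // true when a failure blocks the verify phase
}
```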
## Core Kits
### Agent Runtime

Use when the repo has agent loops, MCP tools, prompt routers, model providers, or autonomous workflows.

Files:

- harness/evals/agent-tool-safety.jsonl
- harness/evals/trajectory-goal-success.jsonl
- harness/evals/injection-red-team.jsonl
- gates/agent-fixture-replay.sh
- gates/agent-red-team.sh
Metrics:
- Fixture replay pass rate.
- Tool-call exact match for required calls.
- Tool-call F1 for flexible trajectories (sketched after this list).
- Forbidden tool call count.
- Goal success judge score, advisory until calibrated.
- Prompt injection refusal/containment.
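A minimal sketch of the tool-call F1 item above, matching calls as strings; the function name and the multiset matching rule are assumptions (sf's real matcher may normalize tool arguments before comparing):

```go
package evals

// ToolCallF1 scores an actual trajectory's tool calls against the expected
// ones as multisets of call signatures. Returns precision, recall, and F1.
func ToolCallF1(expected, actual []string) (precision, recall, f1 float64) {
	remaining := map[string]int{}
	for _, call := range expected {
		remaining[call]++
	}
	matched := 0
	for _, call := range actual {
		if remaining[call] > 0 {
			remaining[call]--
			matched++
		}
	}
	if len(actual) > 0 {
		precision = float64(matched) / float64(len(actual))
	}
	if len(expected) > 0 {
		recall = float64(matched) / float64(len(expected))
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}
```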
### RAG / Retrieval

Use when the repo retrieves docs, code, tickets, memories, vectors, or search results.

Files:

- harness/evals/retrieval-recall.jsonl
- harness/evals/context-faithfulness.jsonl
- harness/evals/structural-near-misses.jsonl
- gates/retrieval-eval.sh
Metrics:
- recall@k (sketched below).
- MRR (mean reciprocal rank; also sketched below).
- NDCG (normalized discounted cumulative gain).
- Context recall.
- Context faithfulness.
- Noise sensitivity.
- Near-miss failure rate.
Important: do not optimize precision alone. A smaller, cleaner context that drops required evidence is a regression.
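A sketch of the first two metrics above (function names are assumptions); the suite-level MRR is the mean of the per-query reciprocal ranks:

```go
package evals

// RecallAtK is the fraction of relevant documents that appear in the
// top-k retrieved results for one query.
func RecallAtK(retrieved []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 || k <= 0 {
		return 0
	}
	if k > len(retrieved) {
		k = len(retrieved)
	}
	hits := 0
	for _, id := range retrieved[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// ReciprocalRank is 1/rank of the first relevant result, 0 if none appear.
// Average this across a suite's queries to get MRR.
func ReciprocalRank(retrieved []string, relevant map[string]bool) float64 {
	for i, id := range retrieved {
		if relevant[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}
```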
### Web App

Use when the repo has browser-visible UI.

Files:

- harness/web/smoke.spec.ts
- harness/web/a11y.spec.ts
- harness/web/visual.spec.ts
- gates/web-smoke.sh
Metrics:
- User workflow pass/fail.
- Accessibility violations by severity.
- Screenshot diff threshold.
- Console error count.
- Network failure count.
- Core route performance budget.
### Go / Windows Service

Use when the repo has Go services or Windows agents.

Files:

- gates/go-test.sh
- gates/go-vet.sh
- gates/windows-build.sh
- harness/contracts/windows-service.json

Metrics:

- go test ./...
- go vet ./...
- GOOS=windows GOARCH=amd64 go build
- Service install/start/stop contract in a Windows-capable environment.
### Nix Project

Use when the repo has flake.nix, shell.nix, or direnv policy.

Files:

- gates/nix-flake-check.sh
- gates/dev-shell-smoke.sh
- harness/contracts/devshell.json

Metrics:

- nix flake check
- Dev shell starts and exposes expected tool versions (a verification sketch follows this list).
- Direnv policy is explicit.
- Build caches are used where configured.
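A sketch of the dev-shell version verification mentioned above; the harness/contracts/devshell.json shape (tool name mapped to an expected version substring) and the use of nix develop -c are assumptions:

```go
package gates

import (
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// DevShellContract is an assumed shape for harness/contracts/devshell.json:
// tool name -> version substring expected in `<tool> --version` output.
type DevShellContract struct {
	Tools map[string]string `json:"tools"`
}

// CheckDevShell runs each contracted tool inside the flake dev shell and
// verifies that the expected version substring appears in its output.
func CheckDevShell(contractPath string) error {
	raw, err := os.ReadFile(contractPath)
	if err != nil {
		return err
	}
	var c DevShellContract
	if err := json.Unmarshal(raw, &c); err != nil {
		return err
	}
	for tool, version := range c.Tools {
		out, err := exec.Command("nix", "develop", "-c", tool, "--version").CombinedOutput()
		if err != nil {
			return fmt.Errorf("%s: %w", tool, err)
		}
		if !strings.Contains(string(out), version) {
			return fmt.Errorf("%s: want version %q, got %q", tool, version, strings.TrimSpace(string(out)))
		}
	}
	return nil
}
```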
### Charm Service

Use when the repo has Go services using Charm libraries such as Wish, Bubble Tea, fantasy, or x/vcr.

Files:

- gates/charm-go-test.sh
- gates/wish-smoke.sh
- harness/vcr/sessions/*.jsonl
Metrics:
- SSH app starts and responds.
- TUI smoke trace replays.
- x/vcr recording can be replayed.
- Metrics endpoint responds when promwish is enabled.
## Eval Runner Contract
An SF eval run should produce:
```json
{
  "schema": "sf.eval.result.v1",
  "suiteId": "agent-tool-safety",
  "runId": "01J...",
  "profileId": "01J...",
  "startedAt": 1770000000000,
  "endedAt": 1770000001200,
  "cases": 42,
  "passed": 40,
  "failed": 2,
  "score": 0.9523,
  "blocking": true,
  "judgePromptHash": "sha256:...",
  "provider": "openai:gpt-5-mini",
  "reportPath": ".sf/reports/evals/agent-tool-safety/01J....json"
}
```
Store the summary in .sf/sf.db. Store larger artifacts as files. Retain only durable lessons in Singularity Memory.
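As one possible decoding target, the same result shape as a Go type; the field names follow the JSON above, while the type name and package are assumptions:

```go
package evals

// Result mirrors the sf.eval.result.v1 payload above. The type name is an
// illustrative assumption; timestamps are Unix milliseconds per the example.
type Result struct {
	Schema          string  `json:"schema"`
	SuiteID         string  `json:"suiteId"`
	RunID           string  `json:"runId"`
	ProfileID       string  `json:"profileId"`
	StartedAt       int64   `json:"startedAt"`
	EndedAt         int64   `json:"endedAt"`
	Cases           int     `json:"cases"`
	Passed          int     `json:"passed"`
	Failed          int     `json:"failed"`
	Score           float64 `json:"score"`
	Blocking        bool    `json:"blocking"`
	JudgePromptHash string  `json:"judgePromptHash"`
	Provider        string  `json:"provider"`
	ReportPath      string  `json:"reportPath"`
}
```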
For model-judge cases, the runner must also store calibration metadata:
- Rubric path and content hash.
- Judge provider, model ID, temperature, and output schema version.
- Calibration suite ID and held-out suite ID.
- False-pass rate, false-block rate, precision/recall/F1, quorum disagreement, and rerun-stability summary.
- Raw judge output reference for later audit.
Model-judge suites are advisory until their calibration metadata says they are eligible to block.
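A minimal sketch of how that advisory-until-calibrated rule could be expressed; the threshold values here are placeholders, not agreed policy:

```go
package evals

// Calibration summarizes a judge suite's measured error rates against a
// human-labeled held-out set. Field names are illustrative assumptions.
type Calibration struct {
	FalsePassRate  float64 // judge passed, human label says fail
	FalseBlockRate float64 // judge blocked, human label says pass
	RerunStability float64 // agreement rate across repeated runs, 0..1
}

// EligibleToBlock keeps a model-judge suite advisory until its measured
// error rates clear the (placeholder) thresholds.
func EligibleToBlock(c Calibration) bool {
	return c.FalsePassRate <= 0.02 &&
		c.FalseBlockRate <= 0.05 &&
		c.RerunStability >= 0.95
}
```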
## Singularity Memory Feedback Loop
Retain:
- Harnesses that repeatedly passed and caught real regressions.
- Failures that required human correction.
- Judge rubrics that showed false positives or false negatives.
- Repo-specific risk notes discovered during profiling.
- Useful untracked observations, still marked as observations.
Recall:
- Before planning harness changes.
- Before dispatching high-risk work.
- Before running model-judge evals.
- Before deleting or replacing any generated harness file.
Feedback:
- Positive when recalled memory helped a gate pass or prevented a repeat bug.
- Negative when recalled memory was stale or led to a bad recommendation.
- Validation when a memory still matches current repo evidence.
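One way to record these three feedback signals against a memory, sketched below; the type and field names are assumptions:

```go
package memory

// FeedbackKind classifies the outcome of recalling a harness memory.
type FeedbackKind string

const (
	FeedbackPositive   FeedbackKind = "positive"   // recall helped a gate pass or prevented a repeat bug
	FeedbackNegative   FeedbackKind = "negative"   // recall was stale or led to a bad recommendation
	FeedbackValidation FeedbackKind = "validation" // memory still matches current repo evidence
)

// Feedback ties an outcome back to the memory and the run that produced it.
type Feedback struct {
	MemoryID string       `json:"memoryId"`
	RunID    string       `json:"runId"`
	Kind     FeedbackKind `json:"kind"`
	Note     string       `json:"note,omitempty"`
}
```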
## References
- OpenAI Evals: https://github.com/openai/evals
- promptfoo intro: https://www.promptfoo.dev/docs/intro/
- promptfoo model-graded metrics: https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/
- Ragas docs: https://docs.ragas.io/en/stable/
- Backstage Software Templates: https://backstage.io/docs/features/software-templates/
- IBM agentic engineering: https://www.ibm.com/think/topics/agentic-engineering