From 93c1bbcb9ad7d762df0bada75231aa9952f7aa89 Mon Sep 17 00:00:00 2001
From: Mikael Hugo
Date: Wed, 29 Apr 2026 18:28:45 +0200
Subject: [PATCH] docs: plan judge calibration service

---
 BUILD_PLAN.md                                 |  1 +
 .../ADR-018-repo-native-harness-evolution.md  | 13 +++--
 docs/dev/repo-native-harness-architecture.md  | 48 +++++++++++++++++++
 docs/dev/repo-native-harness-template-kits.md | 19 +++++++-
 4 files changed, 77 insertions(+), 4 deletions(-)

diff --git a/BUILD_PLAN.md b/BUILD_PLAN.md
index e254f9deb..b8ac62acb 100644
--- a/BUILD_PLAN.md
+++ b/BUILD_PLAN.md
@@ -101,6 +101,7 @@ These came up during recent ports and refactor passes — tracked here so they d
 | **Full swarm chat for `subagent` tool** | Round-robin debate mode now exists as `subagent({ mode: "debate", rounds: N, tasks: [...] })`, so adversarial reviewers can engage prior-round arguments. Remaining work is Option C from [ADR-011](docs/dev/ADR-011-swarm-chat-and-debate-mode.md): full inbox-based swarm chat after the persistent-agent layer (SPEC §17–18) lands. | 3 | ~3 weeks (depends on persistent-agent layer) |
 | **Singularity Knowledge + Agent Platform (Go re-platform)** | Re-platform Singularity Memory from Python+FastAPI+Postgres+vchord to Go on Charm: charm-server patterns for auth/identity, fantasy as agent runtime, same Postgres+vchord for retrieval, exact wire-contract preserved. Load-bearing for cross-instance knowledge federation AND future central persistent agents (sf SPEC §17). See [ADR-014](docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md) and [`singularity-memory/MIGRATION.md`](https://github.com/singularity-ng/singularity-memory/blob/main/MIGRATION.md). | 1 | ~12 weeks across phases |
 | **Wire sf to Singularity Memory remote-mode** | sf-side: change `memory-store.ts` provider chain from local-SQLite-only to remote-Singularity-Memory → embedded → local-only fallback. Once wired, ~80% of the "should sf instances interlink?" question (ADR-012) is answered for free. Depends on the platform itself being live. | 1 | 1 week post-platform |
+| **Judge calibration + eval runner service** | Documentation-only for now. When implemented, keep SF core in TS for repo profiling and `.sf/sf.db` run ledgers, but build model-judge execution/calibration as a Go/Charm service using `fantasy`/`catwalk`, with durable false-positive/false-negative lessons retained into Singularity Memory. See [repo-native-harness-architecture.md](docs/dev/repo-native-harness-architecture.md#judge-rig). | 2 | ~2-3 weeks after Singularity Memory remote-mode |
 | **sf-worker SSH host** | Build the Go-based SSH worker host for distributed execution (SPEC §22, NEW): `wish` + `xpty`/`conpty` + `promwish`. Orchestrator dispatches over SSH; worker spawns the agent in a real pty per attempt; Prometheus metrics for free. See [ADR-013](docs/dev/ADR-013-network-and-remote-execution.md). | 2 | ~3 weeks |
 | **Charm TUI client (`sf-tui`)** | Build a new Go-based TUI client on `pony` + `ultraviolet` + `bubbles` + `lipgloss` + `glamour` + `huh` + `harmonica` + `x/mosaic`. Talks to sf daemon over RPC. Two-stage replacement of `pi-tui`: ship parallel as `sf --tui=charm`, reach parity, flip default, delete `pi-tui` (sheds ~10k LOC of TS from sf core). See [ADR-017](docs/dev/ADR-017-charm-tui-client.md). | 2 | ~12-16 weeks across stages |
 | **Flight recorder** (`x/vcr`) | Frame-accurate session recording for sf auto-loop dispatches. Go service using `charmbracelet/x/vcr`. Records to `.sf/recordings/{unit-id}.vcr`; `sf replay ` opens TUI player. Frame-level redaction parity with `event-log.jsonl`. See [ADR-015](docs/dev/ADR-015-flight-recorder.md). | 3 | ~3 weeks |
diff --git a/docs/dev/ADR-018-repo-native-harness-evolution.md b/docs/dev/ADR-018-repo-native-harness-evolution.md
index 714fc66c3..350bd72e3 100644
--- a/docs/dev/ADR-018-repo-native-harness-evolution.md
+++ b/docs/dev/ADR-018-repo-native-harness-evolution.md
@@ -116,6 +116,12 @@ sf will define a judge rig with four evaluator classes:
 High-risk suites require at least one deterministic or structural gate. Model judges can summarize, rank, and flag, but they do not replace executable evidence.
 
+Implementation stance: deterministic and structural gates can stay in SF core,
+but model-judge execution and calibration should be a later Go/Charm service
+using `fantasy`/`catwalk`, with durable calibration lessons retained into
+Singularity Memory. This ADR documents the contract now; it does not enable a
+runtime judge service or prompt-behavior change yet.
+
 ## Architecture
 
 The system is split into bounded components:
@@ -173,10 +179,11 @@ Detailed design is in `repo-native-harness-architecture.md`.
 |---|---|---|
 | 1 | Add repo profile snapshots and untracked observation model. | sf understands repo shape without taking ownership. |
 | 2 | Add template kit registry and harness manifest format. | sf can generate dry-run harness proposals without writing repo files. |
-| 3 | Add judge rig and eval suite runner. | AI and agent behavior becomes measurable. |
+| 3 | Add judge rig and eval suite runner. | AI and agent behavior becomes measurable; model judges remain advisory until calibrated. |
 | 4 | Connect evidence to Singularity Memory. | Patterns and anti-patterns improve future dispatch. |
-| 5 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo as proposals. |
-| 6 | Add explicit opt-in Harness Writer. | Reviewed repo diffs can create tracked harness files; repo-local skills remain out of scope unless separately accepted. |
+| 5 | Add Go/Charm judge-calibration service. | Calibrated judge runs can be reused across repos and retained into Singularity Memory. |
+| 6 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo as proposals. |
+| 7 | Add explicit opt-in Harness Writer. | Reviewed repo diffs can create tracked harness files; repo-local skills remain out of scope unless separately accepted. |
 
 ## References
 
diff --git a/docs/dev/repo-native-harness-architecture.md b/docs/dev/repo-native-harness-architecture.md
index 126a5ee58..03ecfbf57 100644
--- a/docs/dev/repo-native-harness-architecture.md
+++ b/docs/dev/repo-native-harness-architecture.md
@@ -255,6 +255,26 @@ The manifest is a tracked contract. `.sf/sf.db` stores run history for the manif
 
 The judge rig follows one rule: deterministic evidence first, model judgment second.
 
+### Implementation boundary
+
+This is documented now; it is not part of the current repo-profiler slice.
+
+Placement:
+
+- SF core stays in TypeScript for repo profiling, harness proposal planning,
+  project preferences/config, and `.sf/sf.db` run ledgers.
+- Deterministic and structural assertions can run locally from SF because they
+  already map to commands, AST checks, schemas, and git/diff checks.
+- Model-judge execution and calibration should be a future Go/Charm service,
+  not another TS subsystem. Use `fantasy`/`catwalk` for model/provider routing,
+  Go HTTP/MCP APIs for SF integration, and `promwish`-style metrics when it is
+  daemonized.
+- Durable calibration lessons belong in Singularity Memory. Local `.sf/sf.db`
+  stores run IDs, rubric hashes, model IDs, scores, raw output references, and
+  pass/fail summaries.
+- Repo-local custom skills remain out of scope. Repo-specific eval suites or
+  harness files are later opt-in proposals only.
+
 ### Case format
 
 ```json
@@ -302,6 +322,34 @@ Before a model judge can block:
 For high-risk agent or RAG gates, use either deterministic metrics or a judge quorum. A single uncalibrated model opinion is never enough.
 
+Calibration lifecycle:
+
+1. Build a golden set with known pass, fail, and ambiguous examples from real
+   bugs, traces, PR reviews, bad retrievals, prompt-injection attempts, and good
+   outputs.
+2. Split it into calibration and held-out suites. Tune rubrics only against the
+   calibration suite.
+3. Pin the judge provider, model ID, temperature, output schema, rubric file,
+   and rubric hash.
+4. Measure false-pass rate, false-block rate, precision/recall/F1 for the
+   failure class, schema validity, quorum disagreement, and rerun stability.
+5. Keep the judge advisory until the held-out suite meets the threshold for the
+   risk family.
+6. Promote to blocking only with either a deterministic/structural companion
+   gate or a calibrated judge quorum.
+7. Recalibrate when the rubric, judge model, provider, prompt wrapper, eval case
+   schema, or target workflow changes.
+
+Default promotion bar:
+
+- Critical/security gates: zero false passes on held-out critical failures, plus
+  deterministic or structural companion evidence.
+- Product-quality gates: false-block rate low enough that developers do not
+  route around the gate; judge remains advisory if noisy.
+- RAG/agent metrics: calibrated thresholds for recall@k, MRR/NDCG,
+  context-recall, tool-call F1, or trajectory success; model rubrics explain
+  failures but do not replace the metric.
+
 ## Singularity Memory Integration
 
 Pre-dispatch:
diff --git a/docs/dev/repo-native-harness-template-kits.md b/docs/dev/repo-native-harness-template-kits.md
index 19a53ead2..738f361e5 100644
--- a/docs/dev/repo-native-harness-template-kits.md
+++ b/docs/dev/repo-native-harness-template-kits.md
@@ -25,10 +25,15 @@ sf's improvement is that the harness is repo-native and evolving. It is not a on
 | Now | Template kit contract, untracked observation policy, judge rig, eval runner result shape, memory feedback loop. | No runtime prompt changes yet. |
 | First implementation | Profiler output schema and harness manifest schema. | Read-only profiling at session start or `sf init`. |
 | Next | Concrete kits for `agent-runtime`, `rag-system`, `web-app`, `go-service`, and `nix-project`. | Manifest-driven gate/eval runner in verify phase. |
-| Later | Calibration policy for model judges and drift policy for harness evolution. | Prompt injection of recalled harness lessons and automatic harness-update proposals. |
+| Later | Calibration policy for model judges and drift policy for harness evolution. | Go/Charm judge-calibration service, prompt injection of recalled harness lessons, and automatic harness-update proposals. |
 
 This keeps the current docs ahead of implementation while avoiding a hidden behavior change in the agent loop.
 
+The model-judge runner is intentionally not implemented in this documentation
+slice. When it is built, it should be a Go/Charm service adjacent to the
+Singularity Memory platform, not a repo-local skill pack and not a hidden SF
+core behavior change.
+
 ## What To Borrow
 
 | Source | Useful pattern | SF adaptation |
@@ -230,6 +235,18 @@ An SF eval run should produce:
 Store the summary in `.sf/sf.db`. Store larger artifacts as files. Retain only durable lessons in Singularity Memory.
 
+For model-judge cases, the runner must also store calibration metadata:
+
+- Rubric path and content hash.
+- Judge provider, model ID, temperature, and output schema version.
+- Calibration suite ID and held-out suite ID.
+- False-pass rate, false-block rate, precision/recall/F1, quorum disagreement,
+  and rerun-stability summary.
+- Raw judge output reference for later audit.
+
+Model-judge suites are advisory until their calibration metadata says they are
+eligible to block.
+
 ## Singularity Memory Feedback Loop
 
 Retain:
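
Step 4 of the calibration lifecycle and the critical/security promotion bar in the `repo-native-harness-architecture.md` hunk can be made concrete with a small sketch. Everything named here (`LabeledCase`, `CalibrationMetrics`, `calibrate`, `meetsCriticalMetricBar`) is hypothetical and does not appear in the patch. The sketch assumes binary pass/fail labels, treats "judge says fail" as the positive class, and leaves out the ambiguous examples, quorum disagreement, and rerun stability that the patch also lists.

```typescript
// Hypothetical shapes for illustration only; the patch defines no such types.
type Verdict = "pass" | "fail";

interface LabeledCase {
  id: string;
  expected: Verdict; // golden label from the held-out suite
  judged: Verdict; // verdict returned by the model judge
}

interface CalibrationMetrics {
  knownFailures: number; // held-out cases whose golden label is "fail"
  falsePassRate: number; // known failures the judge passed / known failures
  falseBlockRate: number; // known passes the judge failed / known passes
  precision: number; // for the failure class
  recall: number; // for the failure class
  f1: number;
}

// Avoid NaN when a class is empty; an empty denominator yields 0.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function calibrate(cases: LabeledCase[]): CalibrationMetrics {
  // Positive class = "fail": tp is a failure the judge correctly blocked.
  let tp = 0;
  let fp = 0;
  let fn = 0;
  let tn = 0;
  for (const c of cases) {
    if (c.expected === "fail" && c.judged === "fail") tp++;
    else if (c.expected === "pass" && c.judged === "fail") fp++;
    else if (c.expected === "fail" && c.judged === "pass") fn++;
    else tn++;
  }
  const precision = ratio(tp, tp + fp);
  const recall = ratio(tp, tp + fn);
  return {
    knownFailures: tp + fn,
    falsePassRate: ratio(fn, tp + fn),
    falseBlockRate: ratio(fp, fp + tn),
    precision,
    recall,
    f1: ratio(2 * precision * recall, precision + recall),
  };
}

// Metric half of the critical/security bar: zero false passes on held-out
// failures. The patch additionally requires deterministic or structural
// companion evidence, which this check does not cover.
function meetsCriticalMetricBar(m: CalibrationMetrics): boolean {
  return m.knownFailures > 0 && m.falsePassRate === 0;
}
```

Requiring at least one known failure keeps an empty or all-pass held-out suite from trivially satisfying the zero-false-pass condition.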