docs: plan judge calibration service

2026-04-29 18:28:45 +02:00
commit 93c1bbcb9a
parent 0d6eca9cdd
4 changed files with 77 additions and 4 deletions
--- a/BUILD_PLAN.md
+++ b/BUILD_PLAN.md
@@ -101,6 +101,7 @@ These came up during recent ports and refactor passes — tracked here so they d
 | **Full swarm chat for `subagent` tool** | Round-robin debate mode now exists as `subagent({ mode: "debate", rounds: N, tasks: [...] })`, so adversarial reviewers can engage prior-round arguments. Remaining work is Option C from [ADR-011](docs/dev/ADR-011-swarm-chat-and-debate-mode.md): full inbox-based swarm chat after the persistent-agent layer (SPEC §17–18) lands. | 3 | ~3 weeks (depends on persistent-agent layer) |
 | **Singularity Knowledge + Agent Platform (Go re-platform)** | Re-platform Singularity Memory from Python+FastAPI+Postgres+vchord to Go on Charm: charm-server patterns for auth/identity, fantasy as agent runtime, same Postgres+vchord for retrieval, exact wire-contract preserved. Load-bearing for cross-instance knowledge federation AND future central persistent agents (sf SPEC §17). See [ADR-014](docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md) and [`singularity-memory/MIGRATION.md`](https://github.com/singularity-ng/singularity-memory/blob/main/MIGRATION.md). | 1 | ~12 weeks across phases |
 | **Wire sf to Singularity Memory remote-mode** | sf-side: change `memory-store.ts` provider chain from local-SQLite-only to remote-Singularity-Memory → embedded → local-only fallback. Once wired, ~80% of the "should sf instances interlink?" question (ADR-012) is answered for free. Depends on the platform itself being live. | 1 | 1 week post-platform |
+| **Judge calibration + eval runner service** | Documentation-only for now. When implemented, keep SF core in TS for repo profiling and `.sf/sf.db` run ledgers, but build model-judge execution/calibration as a Go/Charm service using `fantasy`/`catwalk`, with durable false-positive/false-negative lessons retained into Singularity Memory. See [repo-native-harness-architecture.md](docs/dev/repo-native-harness-architecture.md#judge-rig). | 2 | ~2-3 weeks after Singularity Memory remote-mode |
 | **sf-worker SSH host** | Build the Go-based SSH worker host for distributed execution (SPEC §22, NEW): `wish` + `xpty`/`conpty` + `promwish`. Orchestrator dispatches over SSH; worker spawns the agent in a real pty per attempt; Prometheus metrics for free. See [ADR-013](docs/dev/ADR-013-network-and-remote-execution.md). | 2 | ~3 weeks |
 | **Charm TUI client (`sf-tui`)** | Build a new Go-based TUI client on `pony` + `ultraviolet` + `bubbles` + `lipgloss` + `glamour` + `huh` + `harmonica` + `x/mosaic`. Talks to sf daemon over RPC. Two-stage replacement of `pi-tui`: ship parallel as `sf --tui=charm`, reach parity, flip default, delete `pi-tui` (sheds ~10k LOC of TS from sf core). See [ADR-017](docs/dev/ADR-017-charm-tui-client.md). | 2 | ~12-16 weeks across stages |
 | **Flight recorder** (`x/vcr`) | Frame-accurate session recording for sf auto-loop dispatches. Go service using `charmbracelet/x/vcr`. Records to `.sf/recordings/{unit-id}.vcr`; `sf replay <unit-id>` opens TUI player. Frame-level redaction parity with `event-log.jsonl`. See [ADR-015](docs/dev/ADR-015-flight-recorder.md). | 3 | ~3 weeks |
--- a/docs/dev/ADR-018-repo-native-harness-evolution.md
+++ b/docs/dev/ADR-018-repo-native-harness-evolution.md
@@ -116,6 +116,12 @@ sf will define a judge rig with four evaluator classes:

 High-risk suites require at least one deterministic or structural gate. Model judges can summarize, rank, and flag, but they do not replace executable evidence.

+Implementation stance: deterministic and structural gates can stay in SF core,
+but model-judge execution and calibration should be a later Go/Charm service
+using `fantasy`/`catwalk`, with durable calibration lessons retained into
+Singularity Memory. This ADR documents the contract now; it does not enable a
+runtime judge service or prompt-behavior change yet.
+
 ## Architecture

 The system is split into bounded components:
@@ -173,10 +179,11 @@ Detailed design is in `repo-native-harness-architecture.md`.
 |---|---|---|
 | 1 | Add repo profile snapshots and untracked observation model. | sf understands repo shape without taking ownership. |
 | 2 | Add template kit registry and harness manifest format. | sf can generate dry-run harness proposals without writing repo files. |
-| 3 | Add judge rig and eval suite runner. | AI and agent behavior becomes measurable. |
+| 3 | Add judge rig and eval suite runner. | AI and agent behavior becomes measurable; model judges remain advisory until calibrated. |
 | 4 | Connect evidence to Singularity Memory. | Patterns and anti-patterns improve future dispatch. |
-| 5 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo as proposals. |
-| 6 | Add explicit opt-in Harness Writer. | Reviewed repo diffs can create tracked harness files; repo-local skills remain out of scope unless separately accepted. |
+| 5 | Add Go/Charm judge-calibration service. | Calibrated judge runs can be reused across repos and retained into Singularity Memory. |
+| 6 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo as proposals. |
+| 7 | Add explicit opt-in Harness Writer. | Reviewed repo diffs can create tracked harness files; repo-local skills remain out of scope unless separately accepted. |

 ## References

--- a/docs/dev/repo-native-harness-architecture.md
+++ b/docs/dev/repo-native-harness-architecture.md
@@ -255,6 +255,26 @@ The manifest is a tracked contract. `.sf/sf.db` stores run history for the manif

 The judge rig follows one rule: deterministic evidence first, model judgment second.

+### Implementation boundary
+
+This boundary is documented now; implementing it is not part of the current repo-profiler slice.
+
+Placement:
+
+- SF core stays in TypeScript for repo profiling, harness proposal planning,
+  project preferences/config, and `.sf/sf.db` run ledgers.
+- Deterministic and structural assertions can run locally from SF because they
+  already map to commands, AST checks, schemas, and git/diff checks.
+- Model-judge execution and calibration should be a future Go/Charm service,
+  not another TS subsystem. Use `fantasy`/`catwalk` for model/provider routing,
+  Go HTTP/MCP APIs for SF integration, and `promwish`-style metrics when it is
+  daemonized.
+- Durable calibration lessons belong in Singularity Memory. Local `.sf/sf.db`
+  stores run IDs, rubric hashes, model IDs, scores, raw output references, and
+  pass/fail summaries.
+- Repo-local custom skills remain out of scope. Repo-specific eval suites or
+  harness files are later opt-in proposals only.
+
 ### Case format

 ```json
@@ -302,6 +322,34 @@ Before a model judge can block:

 For high-risk agent or RAG gates, use either deterministic metrics or a judge quorum. A single uncalibrated model opinion is never enough.

+Calibration lifecycle:
+
+1. Build a golden set with known pass, fail, and ambiguous examples from real
+   bugs, traces, PR reviews, bad retrievals, prompt-injection attempts, and good
+   outputs.
+2. Split it into calibration and held-out suites. Tune rubrics only against the
+   calibration suite.
+3. Pin the judge provider, model ID, temperature, output schema, rubric file,
+   and rubric hash.
+4. Measure false-pass rate, false-block rate, precision/recall/F1 for the
+   failure class, schema validity, quorum disagreement, and rerun stability.
+5. Keep the judge advisory until the held-out suite meets the threshold for the
+   risk family.
+6. Promote to blocking only with either a deterministic/structural companion
+   gate or a calibrated judge quorum.
+7. Recalibrate when the rubric, judge model, provider, prompt wrapper, eval case
+   schema, or target workflow changes.
+
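Steps 3 and 7 of the lifecycle above can be sketched as a pinned record plus a comparison that reports what changed. This is a minimal illustration only: `JudgePin`, `hashRubric`, and `changedPins` are hypothetical names, SHA-256 is an assumed choice of hash, and TypeScript is used here for readability even though the plan places the eventual judge service in Go. The sketch covers only the fields listed in step 3; the prompt wrapper, eval case schema, and target workflow from step 7 would need their own pins.

```typescript
import { createHash } from "node:crypto";

// Hypothetical pin record; field names are illustrative, not an SF schema.
interface JudgePin {
  provider: string;
  modelId: string;
  temperature: number;
  outputSchemaVersion: string;
  rubricPath: string;
  rubricHash: string; // hex digest of the rubric file content
}

// Assumption: SHA-256 over the UTF-8 rubric text is the rubric hash.
function hashRubric(rubricText: string): string {
  return createHash("sha256").update(rubricText, "utf8").digest("hex");
}

// Returns the pinned fields whose values differ between the calibrated
// pin and the current configuration. Any non-empty result means the
// judge should be treated as uncalibrated until it is re-measured.
function changedPins(calibrated: JudgePin, current: JudgePin): (keyof JudgePin)[] {
  const keys = Object.keys(calibrated) as (keyof JudgePin)[];
  return keys.filter((key) => calibrated[key] !== current[key]);
}
```

Comparing field by field, rather than hashing the whole pin, keeps the reason for recalibration visible in the run ledger.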
+Default promotion bar:
+
+- Critical/security gates: zero false passes on held-out critical failures, plus
+  deterministic or structural companion evidence.
+- Product-quality gates: false-block rate low enough that developers do not
+  route around the gate; judge remains advisory if noisy.
+- RAG/agent metrics: calibrated thresholds for recall@k, MRR/NDCG,
+  context-recall, tool-call F1, or trajectory success; model rubrics explain
+  failures but do not replace the metric.
+
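The measurements in step 4 and the critical/security bar above reduce to confusion-matrix arithmetic. The sketch below treats "fail" as the positive class and assumes ambiguous golden-set examples are scored separately. All names (`LabeledResult`, `scoreJudge`, `meetsCriticalBar`) are hypothetical, and the requirement that the held-out suite contain at least one known failure is an added assumption so that an empty suite cannot satisfy "zero false passes" trivially.

```typescript
// Hypothetical shapes for illustration; none of these names come from SF.
type Verdict = "pass" | "fail";

interface LabeledResult {
  expected: Verdict; // golden-set label
  judged: Verdict; // what the model judge returned
}

interface JudgeMetrics {
  failureCases: number; // known failures in the suite
  falsePassRate: number; // known failures the judge passed
  falseBlockRate: number; // known passes the judge failed
  precision: number; // for the failure class
  recall: number; // for the failure class
  f1: number;
}

function scoreJudge(results: LabeledResult[]): JudgeMetrics {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (const r of results) {
    if (r.expected === "fail" && r.judged === "fail") tp++;
    else if (r.expected === "pass" && r.judged === "fail") fp++;
    else if (r.expected === "fail" && r.judged === "pass") fn++;
    else tn++;
  }
  // Empty denominators yield 0 rather than NaN.
  const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);
  const precision = ratio(tp, tp + fp);
  const recall = ratio(tp, tp + fn);
  return {
    failureCases: tp + fn,
    falsePassRate: ratio(fn, tp + fn),
    falseBlockRate: ratio(fp, fp + tn),
    precision,
    recall,
    f1: ratio(2 * precision * recall, precision + recall),
  };
}

// Critical/security bar: zero false passes on held-out failures plus a
// deterministic or structural companion gate.
function meetsCriticalBar(heldOut: JudgeMetrics, hasCompanionGate: boolean): boolean {
  return heldOut.failureCases > 0 && heldOut.falsePassRate === 0 && hasCompanionGate;
}
```

The bar should be evaluated on metrics from the held-out suite only, since rubrics are tuned against the calibration suite.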
 ## Singularity Memory Integration

 Pre-dispatch:
--- a/docs/dev/repo-native-harness-template-kits.md
+++ b/docs/dev/repo-native-harness-template-kits.md
@@ -25,10 +25,15 @@ sf's improvement is that the harness is repo-native and evolving. It is not a on
 | Now | Template kit contract, untracked observation policy, judge rig, eval runner result shape, memory feedback loop. | No runtime prompt changes yet. |
 | First implementation | Profiler output schema and harness manifest schema. | Read-only profiling at session start or `sf init`. |
 | Next | Concrete kits for `agent-runtime`, `rag-system`, `web-app`, `go-service`, and `nix-project`. | Manifest-driven gate/eval runner in verify phase. |
-| Later | Calibration policy for model judges and drift policy for harness evolution. | Prompt injection of recalled harness lessons and automatic harness-update proposals. |
+| Later | Calibration policy for model judges and drift policy for harness evolution. | Go/Charm judge-calibration service, prompt injection of recalled harness lessons, and automatic harness-update proposals. |

 This keeps the current docs ahead of implementation while avoiding a hidden behavior change in the agent loop.

+The model-judge runner is intentionally not implemented in this documentation
+slice. When it is built, it should be a Go/Charm service adjacent to the
+Singularity Memory platform, not a repo-local skill pack and not a hidden SF
+core behavior change.
+
 ## What To Borrow

 | Source | Useful pattern | SF adaptation |
@@ -230,6 +235,18 @@ An SF eval run should produce:

 Store the summary in `.sf/sf.db`. Store larger artifacts as files. Retain only durable lessons in Singularity Memory.

+For model-judge cases, the runner must also store calibration metadata:
+
+- Rubric path and content hash.
+- Judge provider, model ID, temperature, and output schema version.
+- Calibration suite ID and held-out suite ID.
+- False-pass rate, false-block rate, precision/recall/F1, quorum disagreement,
+  and rerun-stability summary.
+- Raw judge output reference for later audit.
+
+Model-judge suites are advisory until their calibration metadata says they are
+eligible to block.
+
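One way to read the calibration metadata list above is as a single record per model-judge suite, with the gate mode derived from it. The shape below is a hedged sketch: every field name is hypothetical, `eligibleToBlock` is an assumed flag standing in for "the metadata says the suite is eligible to block", and the check that the calibration and held-out suite IDs differ is an added safeguard, not something the text specifies.

```typescript
// Hypothetical record mirroring the bullet list above; not an SF schema.
interface CalibrationMetadata {
  rubricPath: string;
  rubricHash: string;
  provider: string;
  modelId: string;
  temperature: number;
  outputSchemaVersion: string;
  calibrationSuiteId: string;
  heldOutSuiteId: string;
  falsePassRate: number;
  falseBlockRate: number;
  precision: number;
  recall: number;
  f1: number;
  quorumDisagreement: number;
  rerunStability: string; // free-form summary
  rawOutputRef: string; // pointer to the stored judge output for audit
  eligibleToBlock: boolean; // assumed flag set by the calibration process
}

type GateMode = "advisory" | "blocking";

// Advisory is the default: missing metadata, an unset flag, or a held-out
// suite that is the same as the calibration suite all keep the judge advisory.
function gateMode(meta: CalibrationMetadata | undefined): GateMode {
  if (meta === undefined) return "advisory";
  if (meta.calibrationSuiteId === meta.heldOutSuiteId) return "advisory";
  return meta.eligibleToBlock === true ? "blocking" : "advisory";
}
```

Defaulting to advisory means a suite with absent or incomplete calibration data can never block by accident.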
 ## Singularity Memory Feedback Loop

 Retain: