docs: plan judge calibration service
parent 0d6eca9cdd
commit 93c1bbcb9a
4 changed files with 77 additions and 4 deletions
@ -101,6 +101,7 @@ These came up during recent ports and refactor passes — tracked here so they d
| **Full swarm chat for `subagent` tool** | Round-robin debate mode now exists as `subagent({ mode: "debate", rounds: N, tasks: [...] })`, so adversarial reviewers can engage prior-round arguments. Remaining work is Option C from [ADR-011](docs/dev/ADR-011-swarm-chat-and-debate-mode.md): full inbox-based swarm chat after the persistent-agent layer (SPEC §17–18) lands. | 3 | ~3 weeks (depends on persistent-agent layer) |
| **Singularity Knowledge + Agent Platform (Go re-platform)** | Re-platform Singularity Memory from Python+FastAPI+Postgres+vchord to Go on Charm: charm-server patterns for auth/identity, fantasy as agent runtime, same Postgres+vchord for retrieval, exact wire-contract preserved. Load-bearing for cross-instance knowledge federation AND future central persistent agents (sf SPEC §17). See [ADR-014](docs/dev/ADR-014-singularity-knowledge-and-agent-platform.md) and [`singularity-memory/MIGRATION.md`](https://github.com/singularity-ng/singularity-memory/blob/main/MIGRATION.md). | 1 | ~12 weeks across phases |
| **Wire sf to Singularity Memory remote-mode** | sf-side: change `memory-store.ts` provider chain from local-SQLite-only to remote-Singularity-Memory → embedded → local-only fallback. Once wired, ~80% of the "should sf instances interlink?" question (ADR-012) is answered for free. Depends on the platform itself being live. | 1 | 1 week post-platform |
| **Judge calibration + eval runner service** | Documentation-only for now. When implemented, keep SF core in TS for repo profiling and `.sf/sf.db` run ledgers, but build model-judge execution/calibration as a Go/Charm service using `fantasy`/`catwalk`, with durable false-positive/false-negative lessons retained into Singularity Memory. See [repo-native-harness-architecture.md](docs/dev/repo-native-harness-architecture.md#judge-rig). | 2 | ~2-3 weeks after Singularity Memory remote-mode |
| **sf-worker SSH host** | Build the Go-based SSH worker host for distributed execution (SPEC §22, NEW): `wish` + `xpty`/`conpty` + `promwish`. Orchestrator dispatches over SSH; worker spawns the agent in a real pty per attempt; Prometheus metrics for free. See [ADR-013](docs/dev/ADR-013-network-and-remote-execution.md). | 2 | ~3 weeks |
| **Charm TUI client (`sf-tui`)** | Build a new Go-based TUI client on `pony` + `ultraviolet` + `bubbles` + `lipgloss` + `glamour` + `huh` + `harmonica` + `x/mosaic`. Talks to sf daemon over RPC. Two-stage replacement of `pi-tui`: ship parallel as `sf --tui=charm`, reach parity, flip default, delete `pi-tui` (sheds ~10k LOC of TS from sf core). See [ADR-017](docs/dev/ADR-017-charm-tui-client.md). | 2 | ~12-16 weeks across stages |
| **Flight recorder** (`x/vcr`) | Frame-accurate session recording for sf auto-loop dispatches. Go service using `charmbracelet/x/vcr`. Records to `.sf/recordings/{unit-id}.vcr`; `sf replay <unit-id>` opens TUI player. Frame-level redaction parity with `event-log.jsonl`. See [ADR-015](docs/dev/ADR-015-flight-recorder.md). | 3 | ~3 weeks |
@ -116,6 +116,12 @@ sf will define a judge rig with four evaluator classes:
High-risk suites require at least one deterministic or structural gate. Model judges can summarize, rank, and flag, but they do not replace executable evidence.

Implementation stance: deterministic and structural gates can stay in SF core,
but model-judge execution and calibration should be a later Go/Charm service
using `fantasy`/`catwalk`, with durable calibration lessons retained into
Singularity Memory. This ADR documents the contract now; it does not enable a
runtime judge service or prompt-behavior change yet.

## Architecture

The system is split into bounded components:
@ -173,10 +179,11 @@ Detailed design is in `repo-native-harness-architecture.md`.
|---|---|---|
| 1 | Add repo profile snapshots and untracked observation model. | sf understands repo shape without taking ownership. |
| 2 | Add template kit registry and harness manifest format. | sf can generate dry-run harness proposals without writing repo files. |
| 3 | Add judge rig and eval suite runner. | AI and agent behavior becomes measurable. |
| 3 | Add judge rig and eval suite runner. | AI and agent behavior becomes measurable; model judges remain advisory until calibrated. |
| 4 | Connect evidence to Singularity Memory. | Patterns and anti-patterns improve future dispatch. |
| 5 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo as proposals. |
| 6 | Add explicit opt-in Harness Writer. | Reviewed repo diffs can create tracked harness files; repo-local skills remain out of scope unless separately accepted. |
| 5 | Add Go/Charm judge-calibration service. | Calibrated judge runs can be reused across repos and retained into Singularity Memory. |
| 6 | Add drift detection and automatic harness update proposals. | Harnesses evolve with the repo as proposals. |
| 7 | Add explicit opt-in Harness Writer. | Reviewed repo diffs can create tracked harness files; repo-local skills remain out of scope unless separately accepted. |

## References
@ -255,6 +255,26 @@ The manifest is a tracked contract. `.sf/sf.db` stores run history for the manif
The judge rig follows one rule: deterministic evidence first, model judgment second.

### Implementation boundary

This is documented now; it is not part of the current repo-profiler slice.

Placement:

- SF core stays in TypeScript for repo profiling, harness proposal planning,
  project preferences/config, and `.sf/sf.db` run ledgers.
- Deterministic and structural assertions can run locally from SF because they
  already map to commands, AST checks, schemas, and git/diff checks.
- Model-judge execution and calibration should be a future Go/Charm service,
  not another TS subsystem. Use `fantasy`/`catwalk` for model/provider routing,
  Go HTTP/MCP APIs for SF integration, and `promwish`-style metrics when it is
  daemonized.
- Durable calibration lessons belong in Singularity Memory. Local `.sf/sf.db`
  stores run IDs, rubric hashes, model IDs, scores, raw output references, and
  pass/fail summaries.
- Repo-local custom skills remain out of scope. Repo-specific eval suites or
  harness files are later opt-in proposals only.

### Case format

```json
@ -302,6 +322,34 @@ Before a model judge can block:
For high-risk agent or RAG gates, use either deterministic metrics or a judge quorum. A single uncalibrated model opinion is never enough.

Calibration lifecycle:

1. Build a golden set with known pass, fail, and ambiguous examples from real
   bugs, traces, PR reviews, bad retrievals, prompt-injection attempts, and good
   outputs.
2. Split it into calibration and held-out suites. Tune rubrics only against the
   calibration suite.
3. Pin the judge provider, model ID, temperature, output schema, rubric file,
   and rubric hash.
4. Measure false-pass rate, false-block rate, precision/recall/F1 for the
   failure class, schema validity, quorum disagreement, and rerun stability.
5. Keep the judge advisory until the held-out suite meets the threshold for the
   risk family.
6. Promote to blocking only with either a deterministic/structural companion
   gate or a calibrated judge quorum.
7. Recalibrate when the rubric, judge model, provider, prompt wrapper, eval case
   schema, or target workflow changes.
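
The measurement step in the lifecycle above reduces to a confusion-matrix calculation over the golden set. The following is a minimal sketch under stated assumptions, not sf's implementation: every type and function name is hypothetical, "fail" is treated as the positive class, ambiguous golden cases are skipped, and a rate with an empty denominator is reported as 0. TypeScript is used here only to show the arithmetic; the service itself is planned in Go.

```typescript
// Hypothetical shapes for illustration; none of these names come from sf.
type Verdict = "pass" | "fail";

interface JudgedCase {
  id: string;
  expected: Verdict | "ambiguous"; // golden-set label
  judged: Verdict; // what the model judge returned
}

interface JudgeMetrics {
  falsePassRate: number; // share of known failures the judge let through
  falseBlockRate: number; // share of known passes the judge rejected
  precision: number; // for the failure class
  recall: number; // for the failure class
  f1: number;
}

// Assumption: an empty denominator yields 0 rather than NaN.
const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);

function scoreJudge(cases: JudgedCase[]): JudgeMetrics {
  let tp = 0; // expected fail, judged fail
  let fp = 0; // expected pass, judged fail (false block)
  let fn = 0; // expected fail, judged pass (false pass)
  let tn = 0; // expected pass, judged pass
  for (const c of cases) {
    if (c.expected === "ambiguous") continue; // assumption: not scored
    if (c.expected === "fail" && c.judged === "fail") tp++;
    else if (c.expected === "pass" && c.judged === "fail") fp++;
    else if (c.expected === "fail" && c.judged === "pass") fn++;
    else tn++;
  }
  const precision = ratio(tp, tp + fp);
  const recall = ratio(tp, tp + fn);
  return {
    falsePassRate: ratio(fn, tp + fn),
    falseBlockRate: ratio(fp, fp + tn),
    precision,
    recall,
    f1: ratio(2 * precision * recall, precision + recall),
  };
}
```

Running the same function separately over the calibration suite and the held-out suite keeps the two sets of numbers apart, which matches the rule that rubrics are tuned only against the calibration suite.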

Default promotion bar:

- Critical/security gates: zero false passes on held-out critical failures, plus
  deterministic or structural companion evidence.
- Product-quality gates: false-block rate low enough that developers do not
  route around the gate; judge remains advisory if noisy.
- RAG/agent metrics: calibrated thresholds for recall@k, MRR/NDCG,
  context-recall, tool-call F1, or trajectory success; model rubrics explain
  failures but do not replace the metric.
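
The first two bullets of the promotion bar can be read as a single eligibility check. The sketch below is one possible reading, not a defined sf API: all names are hypothetical, and `maxFalseBlockRate` is a caller-supplied placeholder because the text does not fix a number. RAG/agent gates are left out because the metric itself gates there and the model rubric only explains failures.

```typescript
// Hypothetical names throughout; a sketch of the rule, not sf's promotion logic.
type RiskFamily = "critical" | "product-quality";

interface PromotionInput {
  family: RiskFamily;
  heldOutFalsePasses: number; // count of false passes on the held-out suite
  heldOutFalseBlockRate: number; // false-block rate on the held-out suite
  maxFalseBlockRate: number; // assumption: threshold chosen per team
  hasDeterministicCompanion: boolean; // deterministic or structural gate exists
  hasCalibratedQuorum: boolean; // calibrated judge quorum exists
}

function mayBlock(p: PromotionInput): boolean {
  // Lifecycle step 6: never block on a lone judge.
  if (!p.hasDeterministicCompanion && !p.hasCalibratedQuorum) return false;
  if (p.family === "critical") {
    // Critical/security: zero false passes plus companion evidence.
    return p.heldOutFalsePasses === 0 && p.hasDeterministicCompanion;
  }
  // Product-quality: stay advisory while the judge is too noisy.
  return p.heldOutFalseBlockRate <= p.maxFalseBlockRate;
}
```

Anything for which `mayBlock` returns `false` would stay advisory, which is the default state described in the lifecycle.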

## Singularity Memory Integration

Pre-dispatch:
@ -25,10 +25,15 @@ sf's improvement is that the harness is repo-native and evolving. It is not a on
| Now | Template kit contract, untracked observation policy, judge rig, eval runner result shape, memory feedback loop. | No runtime prompt changes yet. |
| First implementation | Profiler output schema and harness manifest schema. | Read-only profiling at session start or `sf init`. |
| Next | Concrete kits for `agent-runtime`, `rag-system`, `web-app`, `go-service`, and `nix-project`. | Manifest-driven gate/eval runner in verify phase. |
| Later | Calibration policy for model judges and drift policy for harness evolution. | Prompt injection of recalled harness lessons and automatic harness-update proposals. |
| Later | Calibration policy for model judges and drift policy for harness evolution. | Go/Charm judge-calibration service, prompt injection of recalled harness lessons, and automatic harness-update proposals. |

This keeps the current docs ahead of implementation while avoiding a hidden behavior change in the agent loop.

The model-judge runner is intentionally not implemented in this documentation
slice. When it is built, it should be a Go/Charm service adjacent to the
Singularity Memory platform, not a repo-local skill pack and not a hidden SF
core behavior change.

## What To Borrow

| Source | Useful pattern | SF adaptation |
@ -230,6 +235,18 @@ An SF eval run should produce:
Store the summary in `.sf/sf.db`. Store larger artifacts as files. Retain only durable lessons in Singularity Memory.

For model-judge cases, the runner must also store calibration metadata:

- Rubric path and content hash.
- Judge provider, model ID, temperature, and output schema version.
- Calibration suite ID and held-out suite ID.
- False-pass rate, false-block rate, precision/recall/F1, quorum disagreement,
  and rerun-stability summary.
- Raw judge output reference for later audit.

Model-judge suites are advisory until their calibration metadata says they are
eligible to block.
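
The calibration metadata listed above could be carried as one record per model-judge suite. The shape below is a hypothetical sketch: the field names, the `eligibleToBlock` flag, and the `gateMode` helper are assumptions made for illustration, and the document does not define a schema yet.

```typescript
// Hypothetical record shape mirroring the bullet list; not a defined sf schema.
interface CalibrationMetadata {
  rubricPath: string;
  rubricHash: string; // content hash of the rubric file
  judgeProvider: string;
  modelId: string;
  temperature: number;
  outputSchemaVersion: string;
  calibrationSuiteId: string;
  heldOutSuiteId: string;
  falsePassRate: number;
  falseBlockRate: number;
  precision: number;
  recall: number;
  f1: number;
  quorumDisagreement: number; // assumption: stored as a rate in [0, 1]
  rerunStabilitySummary: string; // assumption: free-text summary
  rawOutputRef: string; // pointer to the stored raw judge output
  eligibleToBlock: boolean; // assumption: explicit flag set after calibration
}

type GateMode = "advisory" | "blocking";

// Missing metadata, or metadata without the flag set, keeps the suite advisory.
function gateMode(meta: CalibrationMetadata | undefined): GateMode {
  return meta?.eligibleToBlock === true ? "blocking" : "advisory";
}
```

Defaulting to `"advisory"` when the record is absent means a suite can only become blocking through an explicit, recorded calibration result.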

## Singularity Memory Feedback Loop

Retain: