After cherry-picking the swarm commits, the migration file had v72
declared before v70/v71. When applied to a v69 DB the loop ran v72
first and set appliedVersion=72, so the v70/v71 guards (`if
(appliedVersion < 70)`, then `< 71`) short-circuited and neither
ALTER ran on legacy DBs. Reordered so the file flows v70 → v71 → v72,
matching version numbers; idempotent column probes on fresh DBs
still pass.
Verified: full sf-db-migration suite 13/13 green, including the
v52-and-v27 legacy-fixture paths that exercise the migration ladder
end-to-end.
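The ordering bug can be reproduced in isolation. The sketch below is a hypothetical ladder runner, not the project's migration code: `runLadder` and the step shape are assumptions, and only the `appliedVersion < N` guard and the version numbers come from the description above.

```javascript
// Hypothetical migration ladder. Each step is guarded by
// `appliedVersion < step.version`, mirroring the guard quoted above.
function runLadder(startVersion, steps) {
  let appliedVersion = startVersion;
  const ran = [];
  for (const step of steps) {
    if (appliedVersion < step.version) {
      ran.push(step.version); // stands in for the real ALTER TABLE
      appliedVersion = step.version;
    }
  }
  return { appliedVersion, ran };
}

const misordered = [{ version: 72 }, { version: 70 }, { version: 71 }];
const ordered = [{ version: 70 }, { version: 71 }, { version: 72 }];

console.log(runLadder(69, misordered).ran); // [ 72 ]
console.log(runLadder(69, ordered).ran);    // [ 70, 71, 72 ]
```

Both runs end at appliedVersion 72, which is why the skipped ALTERs were silent on legacy DBs.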
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SF is a purpose-to-software compiler — every self_feedback row must name
the milestone vision or slice goal it's filed against, so triage can
prioritize against purpose rather than treating each row as floating.
- Schema v71 ALTERs self_feedback ADD COLUMN purpose_anchor TEXT.
NULL allowed for legacy rows; fresh-DB CREATE includes the column.
- sf-db-self-feedback.js: insertSelfFeedbackEntry accepts purposeAnchor
(camelCase), stored as :purpose_anchor; listSelfFeedbackEntries({purpose})
pushes a LIKE %fragment% filter into the DB layer so triage doesn't
have to pull the full table.
- rowToSelfFeedback exposes purposeAnchor, falling back to the JSON
projection for legacy rows where the column is NULL.
- headless-feedback CLI: `feedback add --purpose <fragment>` persists
the anchor; `feedback list --purpose <fragment>` filters by it.
Omission stays valid — restoration is additive, not breaking.
- help-text + migration test updated; new vitest covers add/list
round-trip, NULL-on-omit legacy compat, substring match, and the
help-text documentation contract.
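A minimal sketch of how a `--purpose` fragment could be turned into a DB-side substring filter. The table and column names come from the notes above; `buildSelfFeedbackQuery`, the named-parameter shape, and the wildcard escaping are assumptions for illustration, not the actual sf-db-self-feedback.js code.

```javascript
// Hypothetical query builder for listSelfFeedbackEntries({purpose}).
function buildSelfFeedbackQuery({ purpose } = {}) {
  let sql = "SELECT * FROM self_feedback";
  const params = {};
  if (typeof purpose === "string" && purpose.length > 0) {
    // Assumption: escape LIKE wildcards so the fragment matches literally.
    const escaped = purpose.replace(/[\\%_]/g, (ch) => "\\" + ch);
    sql += " WHERE purpose_anchor LIKE :purpose ESCAPE '\\'";
    params.purpose = `%${escaped}%`;
  }
  return { sql, params };
}

console.log(buildSelfFeedbackQuery({ purpose: "dashboard" }).params.purpose); // %dashboard%
```

Legacy rows with a NULL purpose_anchor never match a LIKE predicate, so they only appear when no filter is given.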
Restores the doctrine in docs/adr/0000-purpose-to-software-compiler.md:
"non-trivial artifacts must name their purpose and consumer."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore the purpose-to-software doctrine at the slice gate: every task
the executor closes must name the slice-goal sentence or clause it
served. complete-slice now refuses to flip a slice to complete while
any of its tasks has a NULL purpose_trace, making "did all tasks
actually serve the slice goal" a mechanical check instead of a vibe.
Schema migration v70 adds a nullable purpose_trace TEXT to tasks
(legacy rows stay valid). complete_task refuses without it and quotes
slice.goal in the error so the agent can anchor. insertTask /
updateTaskStatus accept the new field, rowToTask exposes it, and a
new updateTaskPurposeTrace helper covers later corrections.
Restoration of doctrine — see docs/adr/0000-purpose-to-software-compiler.md.
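The slice gate described above can be sketched as a pure check. The function names, the camelCase `purposeTrace` field, and the error wording are hypothetical; treating an empty string like NULL is also an assumption that goes beyond what is stated.

```javascript
// Hypothetical gate: collect ids of tasks that block slice completion.
function tasksMissingPurposeTrace(tasks) {
  return tasks
    .filter((t) => t.purposeTrace == null || String(t.purposeTrace).trim() === "")
    .map((t) => t.id);
}

// Refuse completion while any task lacks a trace, quoting the slice goal
// so the agent has something to anchor against.
function canCompleteSlice(slice, tasks) {
  const missing = tasksMissingPurposeTrace(tasks);
  if (missing.length === 0) return { ok: true };
  return {
    ok: false,
    error:
      `Cannot complete slice ${slice.id}: tasks ${missing.join(", ")} ` +
      `have no purpose_trace. Slice goal: "${slice.goal}"`,
  };
}
```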
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restoration of doctrine: plan-milestone now emits a literal milestone.vision
clause per slice (traces_vision_fragment) so validate-milestone has structured
grounds for assessment instead of re-reading the vision through the LLM every
time. Schema v69 adds the column (NULL allowed for legacy rows); the prompt and
plan_milestone tool start requiring it for new slices, rejecting fragments that
do not appear verbatim in milestone.vision. See docs/adr/0000-purpose-to-software-compiler.md.
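The verbatim rule is a plain substring check. A minimal sketch, assuming slices carry a camelCase `tracesVisionFragment` field; `rejectedVisionFragments` is a hypothetical helper, not the plan_milestone tool itself.

```javascript
// Returns the ids of slices whose fragment is missing or does not
// appear verbatim (case- and whitespace-sensitive) in the vision text.
function rejectedVisionFragments(vision, slices) {
  const rejected = [];
  for (const slice of slices) {
    const fragment = slice.tracesVisionFragment;
    const valid =
      typeof fragment === "string" &&
      fragment.length > 0 &&
      vision.includes(fragment);
    if (!valid) rejected.push(slice.id);
  }
  return rejected;
}
```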
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restores the eight-PDD purpose gate at the autonomous-loop boundary
required by ADR-0000 (SF is a purpose-to-software compiler). The gate
walks milestone vision -> slice.traces_vision_fragment ->
task.purpose_trace before every dispatch and refuses to proceed when
the purpose chain is broken at the vision root (degraded-vision).
- New uok/purpose-coherence.js with a pure verdict function and a
DB-backed adapter. Reads vision/trace columns directly via SQL so
pre-P2/P3 schema migrations are tolerated.
- Wired into auto/phases-pre-dispatch.js alongside resource-version-
guard, pre-dispatch-health-gate, and planning-flow-gate. Fires on
every pre-dispatch turn and emits to the existing trace JSONL.
- Outcome ladder: fail (vision missing -> pause loop), warn (trace
columns missing or NULL -> surface but allow dispatch so legacy DBs
don't hard-break on day one), pass (full chain).
- Tests in tests/uok-purpose-coherence.test.mjs cover the four
contracted states plus the column-missing downgrade path on a
pre-migration schema.
Refs: ADR-0000.
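The outcome ladder can be sketched as a pure function. This is an illustration under assumptions, not the code in uok/purpose-coherence.js: the input shape, the `gaps` list, and the name `purposeVerdict` are hypothetical. A column that does not exist yet is modelled as an undefined property, which takes the same warn path as NULL.

```javascript
// fail: vision missing; warn: any trace missing or NULL; pass: full chain.
function purposeVerdict({ vision, slices = [] }) {
  if (typeof vision !== "string" || vision.trim() === "") {
    return { outcome: "fail", reason: "degraded-vision" };
  }
  const gaps = [];
  for (const slice of slices) {
    if (slice.tracesVisionFragment == null) gaps.push(`slice ${slice.id}`);
    for (const task of slice.tasks ?? []) {
      if (task.purposeTrace == null) gaps.push(`task ${task.id}`);
    }
  }
  return gaps.length > 0 ? { outcome: "warn", gaps } : { outcome: "pass" };
}
```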
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two new doctor checks to checkEngineHealth():
- db_milestone_missing_vision: error when a milestone has no vision
(the WHY/purpose field per ADR-0000)
- db_slice_missing_goal: error when a slice has no goal
(the WHAT/purpose field per ADR-0000)
Both checks are non-fixable (the operator must define purpose).
This aligns with ADR-0000 §Enforcement: "Non-trivial milestones,
slices, tasks, ADRs, specs, tests, and exported symbols must name
their purpose and consumer."
Tests: 2 cases — milestone without vision flagged, slice without
goal flagged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restoration of forgotten doctrine: ADR-0000 declares the eight PDD
fields (Purpose, Consumer, Contract, Failure boundary, Evidence,
Non-goals, Invariants, Assumptions) to be the purpose gate, but
`sf headless new-milestone --context <file>` was accepting any
context including empty or trivially-thin seed docs. This wires a
pre-create check that refuses the run when fields are missing or
too thin, naming exactly which ones so the operator can fix the
seed doc and retry.
- new src/resources/extensions/sf/headless-pdd-check.js: scans
context for the eight fields (heading and inline-label forms) and
reports missing/sparse, plus a minimum-spine check (Purpose +
Consumer + Contract + Evidence-or-Falsifier).
- src/headless.ts calls the check after loadContext, before
bootstrapping .sf/. Refusal exits 1 with formatPddRefusal text.
- --skip-pdd-check is the migration escape hatch (warning printed,
PDD gate bypassed) for milestones that pre-date the gate.
- SF-internal auto-bootstrap (autonomous→new-milestone fallback)
is exempted because the seed is SF-generated, not operator-PDD.
- vitest test covers missing-Purpose, missing-Consumer, all-8,
sparse, inline-label form, Falsifier-as-Evidence spine, and the
doctrine field order.
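A rough sketch of the field scan, covering only the missing-field case. The exact heading and inline-label syntax accepted by headless-pdd-check.js is not spelled out above, so the two regexes here (markdown `#` headings and `Label:` lines, optionally bold) are assumptions, and the sparse and minimum-spine checks are left out.

```javascript
// The eight fields in doctrine order, as listed in ADR-0000.
const PDD_FIELDS = [
  "Purpose", "Consumer", "Contract", "Failure boundary",
  "Evidence", "Non-goals", "Invariants", "Assumptions",
];

// Returns the fields that appear neither as a heading nor as an inline label.
function findMissingPddFields(context) {
  const lines = context.split(/\r?\n/).map((line) => line.trim());
  return PDD_FIELDS.filter((field) => {
    // Let "Failure boundary" / "Non-goals" match with a space or a hyphen.
    const name = field.replace(/[-\s]/g, "[-\\s]");
    const heading = new RegExp(`^#{1,6}\\s*${name}\\s*$`, "i");
    const inline = new RegExp(`^\\**${name}\\**\\s*:`, "i");
    return !lines.some((line) => heading.test(line) || inline.test(line));
  });
}
```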
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: dr-repo M003 had all 8 owning requirements (UNI-01..05,
PIL-01..03) marked Status: complete in .sf/REQUIREMENTS.md, but
the milestone row was still active because its only slice was a
post-migration skipped placeholder. After the previous fix routed
all-skipped milestones to pre-planning, SF ran roadmap-meeting +
plan-milestone and wrote 3 new slices on a milestone whose
contract-level work was already done — burned ~4 LLM turns on
plausibly-adjacent but unwanted re-decomposition.
Root cause: deriveStateFromDb's milestone-completion gate consults
only slice statuses (and indirectly the milestone row's own status
field). It never reads REQUIREMENTS.md to check whether the
contract is already satisfied. The slice-based view collapsed the
real signal.
Fix:
- New parseRequirementsByMilestone(content) helper in files.js:
parses REQUIREMENTS.md, groups entries by their `Primary owning
milestone` field, returns Map<id, {complete, incomplete}>.
- handleAllSlicesDone now reads REQUIREMENTS.md before its
slice-based real-work check. If a milestone has at least one
owning requirement and zero of them are incomplete, route to
completing-milestone with nextAction naming the requirement count
(so the operator can see *why* the milestone is being closed
without manually opening REQUIREMENTS.md).
- Best-effort: REQUIREMENTS.md parse failure falls through to the
existing slice-based rule. Missing file likewise — no regression
for projects that don't keep a requirements file.
Resolves sf-mp74hftw-zud6ba filed via the headless feedback CLI.
End-to-end verified by re-running sf headless query on dr-repo
M003: now reports phase=completing-milestone with the right
requirement-count message.
Tests: 5 new cases — all complete + slice skipped → completing,
some active → pre-planning, zero owning requirements falls through,
missing file falls through, all complete + real slice work still
completes. Existing 4 all-skipped-replan cases still pass.
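The routing rule can be sketched independently of the parser. Here the per-milestone entry is assumed to hold numeric `complete` and `incomplete` counts (the real helper may store lists instead), and `routeByRequirements` is a hypothetical name. A `null` result means "fall through to the existing slice-based rule".

```javascript
// byMilestone: Map<milestoneId, { complete: number, incomplete: number }>
function routeByRequirements(milestoneId, byMilestone) {
  const entry = byMilestone.get(milestoneId);
  if (!entry) return null; // zero owning requirements
  if (entry.complete === 0 || entry.incomplete > 0) return null;
  return {
    phase: "completing-milestone",
    nextAction: `All ${entry.complete} owning requirements are complete`,
  };
}
```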
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
T01: Added integration test auto-halt-self-feedback.test.mjs that proves:
- HaltWatchdog.check() creates a self-feedback DB entry with
kind=runaway-loop:idle-halt, severity=high, blocking=true
- Markdown projection (.sf/SELF-FEEDBACK.md) is regenerated
- Deduplication works (one entry per idle period)
- New heartbeat resets and creates a new entry for the next idle period
T02: Enhanced evidence string to include elapsedMs, iteration, and
thresholdMs explicitly (R003 actionable context requirement).
Tests: 36/36 pass across auto-halt-self-feedback,
auto-halt-watchdog-notify, and self-feedback-db suites.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
handleAllSlicesDone treated isStatusDone uniformly — "complete",
"done", AND "skipped" all counted as "milestone work is finished",
so a milestone whose only slice was skipped would advance to
phase=validating-milestone. That's wrong: a placeholder slice that
was skipped doesn't validate the milestone's success criteria, it
just clears the wedge.
Surfaced concretely in dr-repo M003 (Unified Dashboard + Pilot
Validation): I skipped the migration placeholder via the new
`sf headless skip-slice` CLI, and the next-dispatch reported
`validate-milestone M003` even though no real work had happened on
the milestone. The autonomous loop would then burn an LLM turn
running validate-milestone just to discover the obvious gap.
Fix: differentiate {complete, done} from {skipped} at the gate.
When zero slices carry real-work outcomes, route into the
pre-planning phase so the dispatcher's existing
discuss → research → plan ladder takes over. The PDD/vision is
already in the milestone row, so the planner has the purpose it
needs without operator hand-holding.
Verified end-to-end against dr-repo: `sf headless query` for M003
now reports phase=pre-planning and next dispatch
`roadmap-meeting M003` (the deep-planning entry rule fires first;
discuss/research/plan come after as artifacts land).
Tests: 4 cases — all-skipped → pre-planning, complete+skipped mix
→ validating, legacy "done" alias → validating, multiple skipped
→ pre-planning.
Resolves sf-mp73sk0m-63w88y (filed via headless feedback CLI).
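The gate change reduces to one predicate over slice statuses. A minimal sketch, assuming every slice passed in is already closed; `phaseWhenAllSlicesClosed` is a hypothetical name and the phase strings are the ones quoted above.

```javascript
// Only these statuses count as real work; "skipped" merely clears the wedge.
const REAL_WORK_STATUSES = new Set(["complete", "done"]);

function phaseWhenAllSlicesClosed(slices) {
  const hasRealWork = slices.some((s) => REAL_WORK_STATUSES.has(s.status));
  return hasRealWork ? "validating-milestone" : "pre-planning";
}

console.log(phaseWhenAllSlicesClosed([{ status: "skipped" }])); // pre-planning
```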
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Memory injection telemetry:
- Move counter writes from auto-prompts.js to memory-store.js (where
getRelevantMemoriesRanked/getActiveMemoriesRanked actually fire).
- Track memory_inject_count and memory_inject_chars_total via
runtime_counters table for headless-query reporting.
State-db validation:
- handleAllSlicesDone now checks if any slice carries real work
(status=complete/done) before routing to validation.
- Milestones with all-skipped slices route to "reassess-roadmap"
instead of asking the operator to validate non-existent work.
SM client defense:
- Filter foreign-tenant memories from SM query responses even when
the server returns them (defense-in-depth).
Tests updated: memory-extraction-lifecycle, sf-db-migration,
headless-query-memory-injection, sm-client, memory-tenant-gate.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes sf-mp723nju-2cpeoc. When SM_ENABLED is on, memory retrieval from
Singularity Memory is now scoped to the current project's repoIdentity
tenant. Foreign-tenant memories are filtered client-side and the tenant
filter is sent server-side for SM servers that support it.
Key changes:
- schema v68: ADD COLUMN tenant TEXT on memories table (NULL = legacy)
- insertMemoryRow: persists tenant field on every new record
- backfillMemoryTenants / backfillMemoryTenantRows: idempotent migration
called on session_start when SM_ENABLED is set
- querySmMemories: resolves effectiveTenantId (opts.tenant > opts.tenantId
> SM_TENANT_ID); returns [] when no tenant resolved and crossTenant off
- SM_CROSS_TENANT_ENABLED=1 opt-in bypass with audit warning in console
- register-hooks session_start: calls backfillMemoryTenants when SM active
- 12 new tests in memory-tenant-gate.test.mjs; updated sm-client.test.ts
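The tenant precedence can be sketched as follows. `resolveEffectiveTenant` and its return shape are hypothetical; the precedence order and the two environment variable names are taken from the list above, and the environment is passed in explicitly to keep the sketch pure.

```javascript
// Precedence: opts.tenant > opts.tenantId > SM_TENANT_ID.
function resolveEffectiveTenant(opts = {}, env = {}) {
  const tenant = opts.tenant || opts.tenantId || env.SM_TENANT_ID || null;
  const crossTenant = env.SM_CROSS_TENANT_ENABLED === "1";
  // With no tenant and no cross-tenant opt-in the caller returns [].
  return { tenant, crossTenant, shouldQuery: tenant !== null || crossTenant };
}
```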
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Project Memories section is rendered into every execute-task,
plan-slice, and research-slice prompt. At 10 memories × ~200 chars
each that's ~2K chars/turn injected into the context — real cost,
no operator-visible meter.
Adds two runtime_counters (already-existing key/value store):
memory_inject_chars_total — cumulative section size
memory_inject_count — number of injections
Written by buildProjectMemoriesSection() on every render. Both
writes sit inside a try/catch so a legacy DB without
runtime_counters silently skips rather than blocking prompt build.
`sf headless query` surfaces the cumulative + derived metrics as a
new top-level `memoryInjection` block:
{
total_chars: 12480,
count: 8,
avg_chars: 1560,
estimated_total_tokens: 3120
}
The block is omitted entirely when count is 0 (fresh project / no
prompts rendered yet) so it doesn't clutter the snapshot.
Operators can now correlate prompt size growth against autonomous
run cost without instrumenting the LLM call sites directly. The
estimated_total_tokens is chars/4 — a rough approximation since SF
doesn't tokenise the section, intentionally documented as such.
Resolves sf-mp723yl9-rcxoeh filed via the headless feedback CLI.
Tests: 5 source-level invariants — type carries the section, query
reads counters by name, snapshot omits section on zero, write side
calls both counter functions, write is wrapped in try/catch with
documented failure-mode comment.
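The derived metrics follow directly from the two counters. A small sketch, assuming integer rounding (the rounding rule is not stated above); `buildMemoryInjectionBlock` is a hypothetical name.

```javascript
// Returns undefined when nothing has been injected yet, so the
// snapshot can omit the block entirely.
function buildMemoryInjectionBlock(totalChars, count) {
  if (!count) return undefined;
  return {
    total_chars: totalChars,
    count,
    avg_chars: Math.round(totalChars / count),
    estimated_total_tokens: Math.round(totalChars / 4), // chars/4 approximation
  };
}

console.log(buildMemoryInjectionBlock(12480, 8));
// { total_chars: 12480, count: 8, avg_chars: 1560, estimated_total_tokens: 3120 }
```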
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Even though querySmMemories pins tenantId in the request body sent
to the Singularity Memory server, SF used to accept whatever came
back without verifying. A misconfigured or compromised SM server
could echo memories from other tenants and SF would inject them
into the next execute-task prompt — cross-customer leak.
filterSmMemoriesToTenant() now re-checks every returned memory:
- same-tenant memories pass through
- foreign-tenant memories (memory.tenantId or memory.tenant is set and
differs from expectedTenantId) are dropped, with a one-line warning so
the misconfigured-SM symptom is visible rather than silent
- memories with no tenant claim at all default to allow — matches
the local DB's "NULL tenant = legacy row" rule from schema v68
- SM_REQUIRE_TENANT_CLAIM=true flips the legacy rule to drop
(hard fail-closed mode for operators who want it)
Defensive guards against non-array inputs, missing expectedTenantId
(returns input unchanged so caller-side fail-open semantics are
preserved), and the dual tenantId/tenant field naming.
Tests: 8 cases — same-tenant pass-through, foreign drop, legacy
allow, strict mode drop, tenantId/tenant alias, empty/non-array
defensiveness, missing-expected pass-through, warning emission.
Resolves the cross-project tenant-leak feedback row filed via the
new headless feedback CLI.
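The filtering rules above can be sketched like this. It is not the shipped filterSmMemoriesToTenant(): strict mode is modelled as a `requireClaim` option rather than reading SM_REQUIRE_TENANT_CLAIM, the warning sink is injectable, the warning text is invented, and returning `[]` for non-array input is an assumption.

```javascript
function filterToTenantSketch(memories, expectedTenantId, options = {}) {
  const { requireClaim = false, warn = console.warn } = options;
  if (!Array.isArray(memories)) return [];
  if (!expectedTenantId) return memories; // caller-side fail-open preserved
  return memories.filter((memory) => {
    // Accept either field name for the tenant claim.
    const claimed = memory?.tenantId ?? memory?.tenant ?? null;
    if (claimed === null) return !requireClaim; // legacy row rule
    if (claimed === expectedTenantId) return true;
    warn(`dropping foreign-tenant memory (tenant=${claimed})`);
    return false;
  });
}
```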
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously buildProjectMemoriesSection(`${sTitle} ${tTitle}`) sent
two short strings to the cosine ranker — too sparse for re-ranking
to do meaningful work against the static pool.
buildMemoryRetrievalQuery() (new, exported for tests) enriches the
query with:
- slice.title + task.title: the original signal.
- slice.goal text, front 600 chars: the WHY of the slice; it usually
  names the memory-relevant context the title can't fit.
- top 20 changed files from git diff/status: the WHAT, i.e. what code
  is in play right now; lets cosine ranking promote memories whose
  content references those paths.
Fail-open at each source: DB closed → no goal; not a git repo →
no files; nullish title args don't poison the string. The call
site never has to handle errors.
Bounded so embedding token cost stays predictable: 600-char goal
cap, 20-file cap. Empty inputs collapse to "" so the consumer's
`if (!query.trim())` branch still picks the static fallback.
Tests: 5 cases — titles always present, non-git directory safe,
empty-input collapse, nullish-arg defensiveness, real git repo
surfaces changed file paths.
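A minimal sketch of the bounded query assembly, with the DB and git lookups replaced by plain arguments so it stays pure. The function name and argument shape are hypothetical; the 600-char and 20-file caps and the empty-string collapse come from the description above.

```javascript
function buildRetrievalQuerySketch(input = {}) {
  const { sliceTitle, taskTitle, sliceGoal, changedFiles } = input;
  const parts = [sliceTitle, taskTitle];
  if (typeof sliceGoal === "string") parts.push(sliceGoal.slice(0, 600));
  if (Array.isArray(changedFiles)) parts.push(...changedFiles.slice(0, 20));
  // Nullish or blank parts are dropped so they cannot poison the string.
  return parts
    .filter((part) => typeof part === "string" && part.trim() !== "")
    .map((part) => part.trim())
    .join(" ");
}
```

With no usable input the result is `""`, so a caller's `if (!query.trim())` branch still selects the static fallback.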
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Enables CI and containerised deployments without writing secrets to disk.
auth.json still takes precedence when present.
- readGatewayFromAuthJson now falls back to SF_LLM_GATEWAY_KEY env var
- SF_LLM_GATEWAY_URL env var also supported for endpoint override
- Added tests for env fallback, auth.json preference, and default URL
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Self-feedback triage routing was including paid opencode models even
when the operator policy prefers the free tier. Add
isOpenCodeProvider() + isFreeOpenCodeModelId() and filter the
candidate list before the router scores them.
Also: cosmetic — quote style normalised by the formatter on
buildInlineFixPrompt strings and spawn options object.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests were picking up the developer's real
~/.sf/agent/discovery-cache.json and seeing unexpected models in
output. Pin tests to a guaranteed-missing path via the new
_discoveryCacheFilePath option so the env they observe is solely
what the test constructs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surgical read/write access to ~/.sf/agent/auth.json without touching
the file directly. All mutations go through AuthStorage so file-lock
and chmod-600 invariants are always respected.
  sf key set <provider> <api-key>     add/rotate stored key
  sf key get <provider>               show masked key (last 4 chars)
  sf key remove <provider> [--yes]    remove credential
  sf key list                         list all providers + status
Rationale: SF's source of truth for credentials is auth.json at
runtime — env vars are only used during initial one-time provider
setup. Rotation needs an explicit, audit-friendly path, not implicit
env-driven re-reads. Keys are never echoed in full (last 4 chars
only); remove always prompts unless --yes.
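The masking rule is small enough to show in full. This is a sketch only: the `****` prefix and the handling of very short keys are assumptions, since the exact output format of `sf key get` is not given here.

```javascript
// Show at most the last 4 characters; never reveal a key of 4 chars or fewer.
function maskKeySketch(key) {
  if (typeof key !== "string" || key.length <= 4) return "****";
  return "****" + key.slice(-4);
}

console.log(maskKeySketch("sk-abcdef1234")); // ****1234
```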
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Activity JSONL logs use `type: "custom_message"` with `customType: "sf-auto"`
for assistant reasoning content. The old code only checked `role === "assistant"`,
so every transcript was empty → extraction silently skipped every unit.
Fix: recognise both legacy (`role === "assistant"`) and modern
(`custom_message` with `sf-*` prefix) entry shapes. Also reads the
standalone `text` field used by custom messages.
This is why memory_processed_units had 0 rows despite 34 activity logs.
Tests: 186 files / 1994 tests pass.
Type check: clean.
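The two entry shapes can be recognised with a single predicate. A sketch under assumptions: `assistantTextOf` is a hypothetical helper, and reading a string `content` field for legacy entries is a guess at the legacy shape; the `role`, `type`, `customType`, and `text` fields are the ones named above.

```javascript
// Returns the assistant text of an activity-log entry, or null if the
// entry is not assistant content.
function assistantTextOf(entry) {
  if (!entry || typeof entry !== "object") return null;
  const isLegacy = entry.role === "assistant";
  const isCustom =
    entry.type === "custom_message" &&
    typeof entry.customType === "string" &&
    entry.customType.startsWith("sf-");
  if (!isLegacy && !isCustom) return null;
  if (typeof entry.text === "string") return entry.text; // custom messages
  if (typeof entry.content === "string") return entry.content; // assumed legacy field
  return null;
}
```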
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The memory extraction system has infrastructure (DB tables, LLM prompts,
unit closeout wiring, embedding backfill) but zero processed units and
only self-feedback-resolution memories. This suggests extraction is
failing silently.
Add debugLog() calls throughout extractMemoriesFromUnit() so we can
observe:
- Skip reasons (mutex busy, rate limited, already processed, file too small)
- Start/done lifecycle per unit
- LLM call and parse outcomes
- Error messages on failure and retry
This makes the extraction pipeline observable via --debug or the
journal/debug log without changing behavior.
Tests: 185 files / 1993 tests pass.
Type check: clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds full coverage for the discovery-gating root cause that was
fixed in commits d70d8d3b1 (xiaomi x-api-key auth) and the
subsequent refreshSfManagedProviders + writeSdkDiscoveryCacheEntry
work in model-catalog-cache.js.
Diagnosis recap: kimi-coding, opencode, opencode-go were silent
in ~/.sf/agent/discovery-cache.json because the SDK's
model-discovery.js adapter registry marked them with
StaticDiscoveryAdapter (supportsDiscovery=false), so the SDK's
discoverModels() never attempted them. SF's own
scheduleModelCatalogRefresh DID fetch them but wrote only to the
per-repo runtime cache (basePath/.sf/model-catalog/) and only fired
on session_start — not during --discover. The fix is to mirror the
write to the SDK's discovery cache on both fetch-path AND cache-hit
path, and await it in cli.ts before listModels when --discover is set.
New test sections:
- parseDiscoveredModels: OpenAI {data}/{models} formats, Google
{models[].name} prefix stripping, name-as-id fallback, null on
bad input, OpenRouter pricing extraction
- refreshSfManagedProviders: xiaomi uses x-api-key (not Bearer),
opencode uses Bearer, no-key providers skipped, SDK discovery cache
written on BOTH network-fetch and cache-hit paths, kimi-coding +
opencode-go iterated when keys present
46 tests pass. No regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trailing instrumentation from the discovery investigation. The error
catch still swallows non-fatal failures during --discover, just no
longer prints to stderr.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier commit (44fcfb643) incorrectly disabled phrase on repo-root
because I thought phrase retriever hung on full-workspace scope. After
clearing the corrupted cache (left by killing a mid-build vector process),
testing confirms:
- bm25 alone on repo root: works, 1m 50s cold, instant warm
- phrase alone on repo root: works after cache clear
- bm25+phrase on repo root: works after cache clear
- vector on scoped paths: works after cache build
The "hang" was from a corrupted/stale cache, not a sift bug.
.siftignore is properly excluding files (146K→2,660 indexed).
Revert chooseSiftRetrievers back to bm25,phrase for repo-root.
Tests: 184 files / 1974 tests pass.
Type check: clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Until now the discovery cache stored only model IDs (string[]). The
downstream isZeroCost(model?.cost) check therefore evaluated against
undefined for any dynamically-discovered model, so OpenRouter's
zero-cost-but-not-:free entries (owl-alpha, lyria-3-pro-preview,
lyria-3-clip-preview, openrouter/free) got silently blocked by the
built-in provider policy.
Cache entry shape now: {id, cost?, contextWindow?} per model.
parseDiscoveredModels extracts pricing from OpenRouter's
/api/v1/models response (pricing.prompt/completion/input_cache_read/
input_cache_write → numeric cost.{input,output,cacheRead,cacheWrite}).
Other providers stay {id}-only — their /v1/models endpoints don't
ship pricing.
Migration: on first read of a legacy string[] cache, entries are
converted in-place to {id} objects and the file is rewritten. No cost
backfill (data wasn't there before), but the new readers handle them.
Cost wired into policy: isModelAllowedByBuiltInProviderPolicy calls
lookupDiscoveredModelCost("openrouter", modelId) as a fallback when
the static model registry has no cost data.
Plus: cli.ts --discover now eagerly refreshes SF-managed providers
(opencode, opencode-go, kimi-coding, xiaomi) that the SDK's adapter
doesn't cover — so they populate cache on first --discover instead
of waiting for a session-start lazy refresh.
Tests: 13 new across 5 groups (pricing extraction, round-trip, legacy
migration, policy gate happy/sad paths, Google provider compat).
Full suite: 184 files / 1971 tests, zero regressions.
Real-world result: openrouter/owl-alpha, google/lyria-3-pro-preview,
google/lyria-3-clip-preview, openrouter/free, plus any future
zero-cost models now pass the policy filter on the next discovery
refresh.
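The cache-shape change and the pricing mapping can be sketched together. All three function names are hypothetical, and two behaviours are assumptions: missing or non-numeric price fields are treated as 0, and "zero cost" means every cost component is 0. The real isZeroCost may differ.

```javascript
// Legacy string[] entries become {id} objects; new-style entries pass through.
function migrateCacheEntries(models) {
  return models.map((m) => (typeof m === "string" ? { id: m } : m));
}

// Map OpenRouter-style pricing fields onto numeric cost fields.
function costFromOpenRouterPricing(pricing) {
  if (!pricing) return undefined;
  const num = (value) => {
    const n = Number(value);
    return Number.isFinite(n) ? n : 0; // assumption: absent field counts as 0
  };
  return {
    input: num(pricing.prompt),
    output: num(pricing.completion),
    cacheRead: num(pricing.input_cache_read),
    cacheWrite: num(pricing.input_cache_write),
  };
}

function isZeroCostSketch(cost) {
  return Boolean(cost) && Object.values(cost).every((v) => v === 0);
}
```

An undefined cost is not zero-cost in this sketch, which matches the reported symptom: {id}-only entries cannot pass a free-tier gate until pricing is recorded.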
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: the sift binary's phrase retriever hangs indefinitely when
queried against the full repo-root scope (57K+ files). Earlier tests
mistook this for a general slowness, but isolated testing confirms:
- bm25 alone on repo root: works (1m 30s cold, instant warm)
- phrase alone on repo root: hangs forever
- bm25+phrase on repo root: hangs forever (phrase path blocks)
- all retrievers on scoped subdirs: work correctly
The earlier Rust panic was from a corrupted cache state left by killing
a mid-build vector process. After clearing the cache, bm25 alone works.
Fix: chooseSiftRetrievers now returns retrievers: "bm25" (not "bm25,phrase")
for repo-root scope. Scoped subdirs still get bm25+phrase+vector with
position-aware reranking.
Tests: updated 3 assertions in sift-retriever-scope.test.mjs.
Full suite: 183 files / 1958 tests pass.
Type check: clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three providers were missing from PROVIDER_CATALOG_CONFIG so their
model lists couldn't be auto-discovered. Their wire ids only existed
in packages/ai/src/models.generated.ts as hand-coded entries, meaning
new model variants from these providers required manual catalog edits.
Verified live endpoints respond to /v1/models with bearer auth:
- opencode → https://opencode.ai/zen/v1/models (6 free models)
- opencode-go → https://opencode.ai/zen/go/v1/models (15 models)
- minimax → https://api.minimax.io/v1/models (works)
Added entries:
opencode: baseUrl https://opencode.ai/zen, modelsPath /v1/models
opencode-go: baseUrl https://opencode.ai/zen/go, modelsPath /v1/models
minimax: baseUrl https://api.minimax.io, modelsPath /v1/models
(international endpoint; Chinese-network api.minimaxi.com
still handled separately in the SDK)
Auth keys already wired: OPENCODE_API_KEY, OPENCODE_GO_API_KEY (with
OPENCODE_API_KEY fallback), MINIMAX_API_KEY. No env-api-keys.ts changes.
Combined with 385e0b448 (dynamic canonicalIdFor resolver), new model
variants from these three providers will be auto-grouped in
.sf/model-performance.json without hand-editing CANONICAL_BY_ROUTE.
Live counts after fresh discovery will reveal experimental models
absent from static catalog (e.g. opencode's "big-pickle", opencode-go's
deepseek-v4-pro, mimo-v2.5-pro, hy3-preview). The model-router
tolerates unconventional wire IDs — no naming constraints.
To populate cache: rm -rf ~/.sf/runtime/model-catalog/ + relaunch sf.
Tests: 13 new in provider-catalog-discovery.test.mjs (catalog shape,
modelsPath presence, DISCOVERABLE_PROVIDER_IDS inclusion). Full suite
183 files / 1940 tests pass, zero regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After 385e0b448 added the dynamic discovery-cache resolver to
canonicalIdFor, the 15 identity-strip aliases added in 089bf0cbe for
discovered providers became pure redundancy — the dynamic path
returns the same bare modelId from the discovery cache.
Removed (all canonical == bare modelId, all providers in discovery cache):
- minimax/MiniMax-M2.7, minimax/MiniMax-M2.7-highspeed
- mistral/codestral-latest, mistral/devstral-2512,
mistral/devstral-small-2507, mistral/mistral-large-latest,
mistral/mistral-medium-latest, mistral/mistral-small-latest
- zai/glm-4.5, zai/glm-4.5-air, zai/glm-4.6, zai/glm-4.7,
zai/glm-5, zai/glm-5-turbo, zai/glm-5.1
Kept (real aliases — canonical differs from wire id, NOT identity strips):
- kimi-coding/kimi-for-coding → kimi-k2.6 (Moonshot alias)
- mistral/devstral-medium-2507 → devstral-medium-latest (alias to latest)
- minimax/MiniMax-M2 family lowercase mappings (case-change aliases)
Also kept:
- zai/glm-4.5-flash, zai/glm-4.7-flash (not yet in discovery cache;
flash variants may launch before cache refresh — fast-path safety)
- kimi-coding/kimi-k2.6 + kimi-k2-thinking (kimi-coding cache only
has kimi-for-coding; these resolve via _ENTRY_BY_ROUTE fallback)
Tests: 15 new regression tests in canonical-id-dynamic.test.mjs verify
each removed entry STILL resolves correctly via dynamic discovery.
Total 21/21 in that file, plus 101 model-registry tests, plus 16
canonical-id-mapping tests — all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After commit 089bf0cbe added 23 hand-written aliases for production
route keys, the right structural fix is to also consult the dynamic
model-discovery cache (~/.sf/agent/discovery-cache.json). Otherwise
every new model variant from a discovered provider (ollama-cloud +39
models, openrouter +24, etc.) requires another round of hand-editing.
canonicalIdFor now resolves in this order:
1. CANONICAL_BY_ROUTE (static fast path, retains real aliases like
kimi-coding/kimi-for-coding → kimi-k2.6 where canonical differs)
2. _ENTRY_BY_ROUTE (existing static path)
3. canonicalIdFromDiscovery — reads ~/.sf/agent/discovery-cache.json,
finds (provider, modelId) pair, returns bare modelId
In-memory cache with 60s TTL (DISCOVERY_CACHE_TTL_MS) so the readFileSync
on the hot path becomes one disk read per minute at most. canonicalIdFor
is per-dispatch, not per-token, so the overhead is negligible.
Test hook __setDiscoveryCacheForTest lets vitest inject a cache without
touching the fs.
Tests: 6 new in canonical-id-dynamic.test.mjs (dynamic hit, static-alias
wins over dynamic, cache miss → null, null cache graceful, missing-models
graceful, multiple models per provider). Combined with existing
canonical-id-mapping: 22/22 pass. Full suite 1912 pass, no regressions.
Sanity verified: canonicalIdFor("ollama-cloud/glm-5.1") → "glm-5.1"
(dynamic-only, not in static table); canonicalIdFor("unknown/never")
→ null.
Follow-up (in flight, separate agent): prune the static identity-strip
aliases from CANONICAL_BY_ROUTE for providers in the discovery cache
since they're now redundant with the dynamic resolver.
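The three-step resolution order with a TTL-cached discovery read can be sketched as a factory. Everything here is illustrative: `makeCanonicalResolver`, the injected loader and clock, and the assumed cache shape `{ [provider]: { models: [{ id }] } }` are not taken from model-registry.ts.

```javascript
function makeCanonicalResolver(deps) {
  const { staticAliases, staticEntries, loadDiscoveryCache } = deps;
  const ttlMs = deps.ttlMs ?? 60000;
  const now = deps.now ?? Date.now;
  let cached = null;
  let loadedAt = -Infinity;

  function discovery() {
    // Reload at most once per TTL window.
    if (now() - loadedAt >= ttlMs) {
      cached = loadDiscoveryCache();
      loadedAt = now();
    }
    return cached;
  }

  return function resolve(route) {
    if (staticAliases.has(route)) return staticAliases.get(route); // 1. real aliases
    if (staticEntries.has(route)) return staticEntries.get(route); // 2. static entries
    const slash = route.indexOf("/");
    if (slash < 0) return null;
    const provider = route.slice(0, slash);
    const modelId = route.slice(slash + 1);
    const models = discovery()?.[provider]?.models; // 3. dynamic discovery
    if (!Array.isArray(models)) return null;
    return models.some((m) => m.id === modelId) ? modelId : null;
  };
}
```

Injecting the loader and clock plays the same role as a test hook: a test can supply a cache without touching the filesystem.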
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Autonomous mode's model-fallback chain bypassed enabledModels — when zai
429'd, the chain happily fell through to mistral/codestral-latest even
though only minimax/*, kimi-coding/*, zai/*, ollama-cloud/* were allowed.
Of 52 dispatches in this repo's journal this session, 10 (~19%)
escaped the allowlist (mistral×2, opencode-go×3, google-gemini-cli×5).
enabledModels was honored by interactive cycling (settings-manager.ts)
and by self-feedback-drain.js for triage routing, but
auto-model-selection.js's fallback chain in selectAndApplyModel never
read it.
Now: isModelInEnabledList(provider, modelId, enabledModels) filters
each fallback candidate. Supports exact "provider/model" or
"provider/*" wildcard. Empty/undefined list = open behavior (no
regression for setups without an allowlist).
readEnabledModels reads ~/.sf/agent/settings.json once per chain;
swallows IO errors → undefined → no constraint (safe failure mode).
Escape hatch: SF_BYPASS_ENABLED_MODELS=1 disables the check for
emergency / misconfigured cases.
When ALL candidates are filtered out and the chain exhausts, throws
a clear error directing the operator to add to allowlist or unset.
Tests: 13 in enabled-models-fallback.test.mjs covering pattern matrix,
multi-candidate chain skipping, bypass env, and exhaustion path.
Full suite 1906 pass, no regressions.
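The allowlist match and chain filtering can be sketched as below. The helper names, candidate shape, and error text are hypothetical; the pattern forms (`provider/model`, `provider/*`), the open behaviour for an empty list, and the SF_BYPASS_ENABLED_MODELS=1 escape hatch are the ones described above.

```javascript
function matchesEnabledList(provider, modelId, enabledModels) {
  // Empty or undefined list means no constraint.
  if (!Array.isArray(enabledModels) || enabledModels.length === 0) return true;
  return enabledModels.some(
    (pattern) => pattern === `${provider}/*` || pattern === `${provider}/${modelId}`,
  );
}

function firstAllowedCandidate(candidates, enabledModels, env = {}) {
  const bypass = env.SF_BYPASS_ENABLED_MODELS === "1";
  const pick = candidates.find(
    (c) => bypass || matchesEnabledList(c.provider, c.modelId, enabledModels),
  );
  if (!pick) {
    throw new Error(
      "No fallback candidate is in enabledModels; add one to the allowlist " +
        "or set SF_BYPASS_ENABLED_MODELS=1",
    );
  }
  return pick;
}
```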
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Of 52 dispatches in this repo's journal this session, 51 landed in
.sf/model-performance.json's _unmapped bucket — meaning the live-outcome
learner couldn't tell which provider/model succeeded or failed. Only
1 dispatch (google-gemini-cli/gemini-3-flash-preview) bucketed correctly.
Root cause was NOT just missing aliases — it was a lazy-load race:
- model-learner.js declared canonicalIdFor as a fire-and-forget dynamic
import side-effect at module bottom
- metrics.js called recordOutcome() synchronously after
`await import("./model-learner.js")` resolved — before the registry
injection promise settled
- Result: _canonicalIdForFn was null for the first dispatch every session.
Every session. Since the file shipped.
Why nobody noticed: _unmapped is a bucket, not an error. No throw, no
warning, no UI surface. Selection still worked because benchmark-selector
+ static hand-tuned scores carry the routing decision. Only the
feedback loop (recordOutcome → adjust scores) was silently severed.
Fix:
- model-learner.js: export `registryReady` promise instead of swallowing it
- metrics.js: await registryReady before recordOutcome()
- model-registry.ts: 23 new CANONICAL_BY_ROUTE entries covering the actual
production fallback chain: zai/glm-{4.5,4.5-air,4.5-flash,4.6,4.7,4.7-flash,5,5.1,5-turbo},
mistral/codestral-latest + devstral-2512 + devstral-{small,medium}-* +
mistral-{large,medium,small}-latest, google-gemini-cli/gemini-{2.5-pro,3-flash-preview,3.1-pro-preview},
opencode-go/{glm-5,glm-5.1,mimo-v2-omni,mimo-v2-pro}
Also adds opt-in backfillModelPerformanceFromJournal(basePath) to
reclassify the existing 51 _unmapped records from past journal events.
Never auto-runs; backs up the old file before overwriting.
Tests: 16 in canonical-id-mapping.test.mjs covering pattern matching,
non-mappable cases, bare canonical-id passthrough, and the backfill
path. Full suite 1906 pass, no regressions.
Known follow-up: CANONICAL_BY_ROUTE uses mixed casing (MiniMax-M2.7 vs
minimax-m2) — should be standardized lowercase in a future pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>