Cherry-pick of gsd-build/gsd-2 65ca5aa2e — applies the security hardening
hunks that conflicted minimally:
- mcp-server/env-writer: validate writes against a strict allowlist
- web/api/files: enforce path containment via web/lib/secure-path
- vscode-extension: read binaryPath/autoStart only from trusted
global/default scopes (resolveTrustedSfStartupConfig), avoiding
workspace-controlled override (renamed Gsd → Sf for sf naming)
- New regression tests: mcp-client-security, vscode-startup-security,
web-files-symlink
Skipped hunks (drifted): mcp-server/server.ts, mcp-client/index.ts,
mcp-server/README.md.
Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 a09e01640 — withFileLockSync now actually
acquires a proper-lockfile lock (previously a no-op when proper-lockfile
wasn't required) and throws on ELOCKED contention by default. Adds
onLocked: 'skip' option for best-effort callers that tolerate dropped
entries (audit, journal). Modernizes import style (createRequire/join
from imports rather than ad-hoc require). Path-renames preserved
(gsd-pi → sf-run).
Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 53babec29 — lock-wrapped append half.
Wraps appends to .sf/journal/, .sf/audit/events.jsonl, and the
workflow-logger error log in withFileLockSync (onLocked: skip),
preserving best-effort semantics while preventing torn writes
under contention.
Companion to the atomic-write half landed in 3df56cb94. Path-renames
(gsdRoot→sfRoot, gsd-db→sf-db) preserved during conflict resolution.
Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 9340f1e9b (#4423) — doctor self-heal
detection for symlinked staging directories that can cause silent
data loss. Skips native-git-bridge.ts and git-service test (drifted).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 a4f78731f — handles worktree context fallback
and sanitizes paths in paused session resumption. Skips uok-plan-v2-wiring
test hunk (drifted in sf).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 851507913 (#4056) — defensive parsing
so a corrupt or non-array tasks blob in a milestone row doesn't crash
sf-db reads. Test hunk skipped (sf-db.test.ts has drifted).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 53babec29 (Jeremy <jeremy@fluxlabs.net>)
— atomic-write half only. Eliminates torn-write risk on PROJECT.md
queue sync and reports.json/HTML index regeneration by switching
writeFileSync → atomicWriteSync (tmp+rename).
The companion lock-wrapped-append changes (journal.ts, uok/audit.ts,
workflow-logger.ts) are deferred — they need proper-lockfile +
withFileLockSync helper introduced first.
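A minimal sketch of the tmp+rename technique, for reviewers who have not seen it. This is not the atomicWriteSync implementation in the tree; atomicWriteSketch, the temp-file naming and the demo paths are assumptions made for illustration.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical stand-in for atomicWriteSync: write to a sibling temp file,
// then rename over the target. The temp file lives in the same directory so
// the rename stays on one filesystem and readers see either the old or the
// new content, never a partial write.
function atomicWriteSketch(target: string, data: string): void {
  const tmp = path.join(
    path.dirname(target),
    `.${path.basename(target)}.${process.pid}.tmp`,
  );
  try {
    fs.writeFileSync(tmp, data, "utf8");
    fs.renameSync(tmp, target);
  } catch (err) {
    // Best-effort cleanup so a failed write leaves no stray temp file.
    fs.rmSync(tmp, { force: true });
    throw err;
  }
}

// Usage against a throwaway directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "atomic-write-"));
const demoFile = path.join(demoDir, "reports.json");
atomicWriteSketch(demoFile, JSON.stringify({ reports: [] }));
atomicWriteSketch(demoFile, JSON.stringify({ reports: ["r1"] }));
```

The second call replaces the first file in one rename, which is the property the PROJECT.md and reports.json call sites rely on.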
Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Generalize the code-intelligence hook to support multiple indexer
backends, with sift (rupurt/sift) as a new option next to the existing
project-rag MCP server. Backend is selected via CodebaseMapPreferences.
- code-intelligence.ts: new abstraction + sift backend (detect, resolve,
status, context-block contribution)
- preferences-types.ts: codebaseIndexer field (project-rag | sift | none)
- preferences-validation.ts: validate the new field
- bootstrap/system-context.ts, commands-codebase.ts: dispatch on backend
- tests/code-intelligence.test.ts: sift detection/resolution/status tests
(19 pass, 0 fail)
project-rag path unchanged and continues to work.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SubagentBackgroundJobManager tracks long-running subagent jobs with
status, abort support, and TTL-based eviction of completed results.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extracting a class method as a bare reference loses its 'this' context,
causing 'Cannot read properties of undefined' when minimax (or any
provider) triggers the flat-rate auth-mode lookup.
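A minimal sketch of the failure mode and one common remedy. AuthModeRegistry and isFlatRate are hypothetical names used only for illustration; they are not the identifiers touched by this commit.

```typescript
// Hypothetical class illustrating the bug class.
class AuthModeRegistry {
  private flatRate = new Set<string>(["minimax"]);

  isFlatRate(provider: string): boolean {
    // `this` is undefined when the method is invoked as a bare reference.
    return this.flatRate.has(provider);
  }
}

const registry = new AuthModeRegistry();

// Broken: the extracted reference no longer carries `registry` as `this`.
const bare = registry.isFlatRate;

// Fixed: bind the receiver (an arrow wrapper works equally well).
const bound = registry.isFlatRate.bind(registry);

// Helper that reports a TypeError instead of letting it propagate.
function tryCall(
  fn: (provider: string) => boolean,
  provider: string,
): boolean | "TypeError" {
  try {
    return fn(provider);
  } catch (err) {
    if (err instanceof TypeError) return "TypeError";
    throw err;
  }
}

const bareResult = tryCall(bare, "minimax");
const boundResult = tryCall(bound, "minimax");
```

Binding at the extraction site keeps the call sites unchanged, which is why it is usually the smallest fix for this class of bug.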
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
buildExecuteTaskPrompt was not passing memoriesSection to loadPrompt,
causing headless auto to fail with a template variable error. Also
updated plan-slice-prompt.test.ts to supply the four template variables
(memoriesSection, runtimeContext, phaseAnchorSection, gatesToClose) that
were missing from the test fixture.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The resolver guarded on context.parentURL.includes('/src/') to identify
in-repo source files, but @google/gemini-cli-core installs to
node_modules/@google/gemini-cli-core/dist/src/ which also contains '/src/'.
Relative imports from that dist package (e.g. './config/config.js') were
incorrectly rewritten to './config/config.ts', causing ERR_MODULE_NOT_FOUND
on every test that transitively imports the google-gemini provider.
Fix: add !context.parentURL.includes('/node_modules/') guard.
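A sketch of the guard logic, assuming a helper shaped like the one below. shouldRewriteToTs is a hypothetical name; only the two includes() checks come from this commit.

```typescript
// Hypothetical helper: decides whether a relative ".js" specifier should be
// rewritten to ".ts". Only in-repo sources qualify; anything under
// node_modules is left alone even if its path contains "/src/".
function shouldRewriteToTs(
  specifier: string,
  parentURL: string | undefined,
): boolean {
  if (!parentURL) return false;
  const isRelative =
    specifier.startsWith("./") || specifier.startsWith("../");
  const inRepoSource =
    parentURL.includes("/src/") && !parentURL.includes("/node_modules/");
  return isRelative && specifier.endsWith(".js") && inRepoSource;
}

const inRepo = shouldRewriteToTs(
  "./config/config.js",
  "file:///repo/src/providers/google-gemini.ts",
);
const inDependency = shouldRewriteToTs(
  "./config/config.js",
  "file:///repo/node_modules/@google/gemini-cli-core/dist/src/index.js",
);
```

The second call is the case that previously produced ERR_MODULE_NOT_FOUND.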
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
blocked-models.ts (new):
Persistent per-project blocklist at .sf/runtime/blocked-models.json.
loadBlockedModels / isModelBlocked / blockModel (file-lock-safe write).
milestone-summary-classifier.ts (new):
classifyMilestoneSummaryContent → "success" | "failure" | "unknown".
isTerminalMilestoneSummaryContent: failure summaries are NOT terminal —
lets auto-mode re-enter a milestone after a failed recovery summary.
state.ts:
Phase 1 (completeMilestoneIds) and Phase 2 (registry) now check
isTerminalMilestoneSummaryContent before treating a SUMMARY as complete.
A failure SUMMARY no longer prematurely parks a milestone.
error-classifier.ts:
Add "unsupported-model" ErrorClass kind with regex detection
(model + not-supported/unavailable/no-access + account/plan/tier).
Checked before "permanent" so /account/i in PERMANENT_RE doesn't swallow it.
auto-model-selection.ts:
Wire isModelBlocked() gate in selectAndApplyModel candidate loop:
skips provider-rejected models and continues to fallbacks.
bootstrap/agent-end-recovery.ts:
Handle cls.kind === "unsupported-model": blockModel(), try fallback chain
skipping already-blocked models, pause if no usable fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ports commit 7fb35ca58 from gsd2 (PR #4769) covering four issues:
#4761 — resolveCanonicalMilestoneRoot in worktree-manager.ts routes
validate-milestone through the live worktree path instead of stale
project-root state when a milestone worktree is active.
#4762 — auditOrphanedMilestoneBranches in auto-start.ts now surfaces
in-progress milestone branches with unmerged commits ahead of main
(previously only complete milestones were audited). Gated on
isClosedStatus so parked/other closed statuses are unaffected.
#4764 — worktree-telemetry.ts: typed emit helpers (emitWorktreeCreated,
emitWorktreeMerged, emitWorktreeOrphaned, emitAutoExit, emitWorktreeSync,
emitCanonicalRootRedirect, emitSliceMerged, emitMilestoneResquash) plus
summarizeWorktreeTelemetry aggregator and nearest-rank percentile().
Wired in: worktree-resolver.ts (create/merge events), auto-start.ts
(orphan telemetry), auto.ts stopAuto (auto-exit with normalized reason),
worktree-manager.ts (canonical-root-redirect). Surfaced in forensics.ts
via detectWorktreeOrphans and Worktree Telemetry sections.
#4765 — slice-cadence.ts: mergeSliceToMain squash-merges each slice's
commits onto main as soon as the slice passes validation (opt-in via
git.collapse_cadence: "slice"). resquashMilestoneOnMain collapses N
per-slice commits into one milestone commit at completion. Wired in
auto-post-unit.ts (slice merge after complete-slice with stopAuto on
conflict/error) and worktree-resolver.ts (resquash at mergeAndExit).
AutoSession.milestoneStartShas tracks the pre-first-slice SHA.
GitPreferences and preferences-validation.ts extended with
collapse_cadence and milestone_resquash fields.
Also ports /sf scan command: commands-scan.ts with parseScanArgs,
resolveScanDocuments, buildScanOutputPaths, and handleScan dispatching
a focused codebase assessment prompt to .sf/codebase/.
journal.ts: 9 new JournalEventType values for the telemetry events.
All changes are additive; default behavior (cadence="milestone") unchanged.
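For reference, a sketch of the nearest-rank method named in #4764. This is not the percentile() in worktree-telemetry.ts; the function name, the empty-input behaviour and the range check are assumptions.

```typescript
// Nearest-rank percentile: sort ascending, take the value at
// rank = ceil(p / 100 * n), with rank clamped to at least 1 so p = 0
// returns the minimum. Returns undefined for an empty sample (assumption).
function nearestRankPercentile(
  values: number[],
  p: number,
): number | undefined {
  if (values.length === 0) return undefined;
  if (p < 0 || p > 100) throw new RangeError("p must be within 0..100");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

const sample = [50, 15, 40, 20, 35];
const p30 = nearestRankPercentile(sample, 30);
const p50 = nearestRankPercentile(sample, 50);
const p100 = nearestRankPercentile(sample, 100);
```

Nearest-rank always returns an observed value rather than an interpolated one, which suits small telemetry samples.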
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
reassess-roadmap: flip default from true → false. Most reassess units
conclude "roadmap is fine", burning a session for no change; the
plan-slice prompt now carries a JIT preamble at zero cost. (#4778)
tool-execution: always prefer toolDefinition.label when non-empty,
even when label === name — allows tools to display their canonical
name explicitly. (#4758)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds support for project-local SF extension plugins dropped in
.sf/extensions/. Trust-gated (requires pi trust), symlink-escape safe.
- ecosystem/sf-extension-api.ts: SFExtensionAPI wrapper exposing
getPhase() and getActiveUnit() to third-party handlers; updateSnapshot
refreshes state before_agent_start so handlers see current phase/unit
- ecosystem/loader.ts: discovers .sf/extensions/*.js, loads them via
dynamic import, dispatches factory(api) for each
- register-extension.ts: initializes ecosystemHandlers array, wires loader
- register-hooks.ts: before_agent_start refreshes snapshot then dispatches
ecosystem handlers before returning SF system prompt
- types.ts: SFActiveUnit interface (milestoneId/sliceId/taskId + titles)
- workflow-logger.ts: "ecosystem" added to LogComponent union
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes a bug where per-unit tool narrowing poisoned the policy gate for
subsequent units, causing "Model policy denied dispatch before prompt send"
errors on complete-slice and discuss-milestone (100% Win repro).
Five-part port from gsd2@817031b2a:
- ModelPolicyDispatchBlockedError class with per-model deny reasons
- TOOL_BASELINE WeakMap + clearToolBaseline/restoreToolBaseline lifecycle
- auto-model-selection: use getRequiredWorkflowToolsForAutoUnit as requiredTools
- auto/loop: catch ModelPolicyDispatchBlockedError as non-retryable (pause)
- auto.ts: wire clearToolBaseline at startAuto (fresh only) and stopAuto
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8 fixes from 3rd-pass scan:
1. web/components/sf/tempCodeRunnerFile.tsx: remove orphan VS Code
'Code Runner' artifact (850+ lines duplicated from shell-terminal.tsx).
Unreferenced but compiled into tsc project.
2. sf/phase-anchor.ts: writePhaseAnchor used plain writeFileSync — a crash
mid-write would corrupt the handoff checkpoint that readPhaseAnchor then
silently returns null for, losing cross-phase context. Switched to
atomicWriteSync (already used by sibling files).
3. sf/forensics.ts: same non-atomic writeFileSync on active-forensics.json
marker. Race with a concurrent reader produces an empty object and the
forensics session is lost. Switched to atomicWriteSync.
4. web/auto-dashboard-service.ts: paused-session.json existence was the
intended signal but a corrupt body silently dropped the paused flag so
the UI showed active. Now reports paused on file existence regardless
of body integrity, and warns on corruption.
5. sf/visualizer-data.ts: doctor-history.jsonl parser did .map(JSON.parse)
inside an outer catch. One corrupt line discarded 19 valid entries.
Per-line try/catch preserves the valid rows.
6. sf/files.ts: three parseInt calls without radix (step, total_steps,
totalSteps) — also missing || 0 fallback for NaN.
7. cli.ts: parseInt(process.versions.node) without radix. Split on '.' and
use radix 10 explicitly.
8. sf/slice-parallel-orchestrator.ts: silent 'catch {}' around spawn()
masked worker-spawn failures as 'no workers available'. Matches sibling
parallel-orchestrator.ts pattern — now logs via logWarning.
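Item 5 can be sketched as follows. parseJsonlTolerant is a hypothetical name, and the skipped counter is an addition for illustration; the real parser in visualizer-data.ts may differ.

```typescript
// Parse JSONL one line at a time so a single corrupt line is skipped
// instead of discarding every valid row around it.
function parseJsonlTolerant(text: string): {
  rows: unknown[];
  skipped: number;
} {
  const rows: unknown[] = [];
  let skipped = 0;
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue; // ignore blank lines and trailing newline
    try {
      rows.push(JSON.parse(line));
    } catch {
      skipped += 1; // corrupt line: count it and keep going
    }
  }
  return { rows, skipped };
}

const history = ['{"ok":true}', '{"ok":tru', '{"ok":false}', ""].join("\n");
const parsed = parseJsonlTolerant(history);
```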
Skipped from the scan (need a real lock mechanism, not safe as a one-line
fix):
- sf/auto-dispatch.ts:164 (UAT counter race)
- sf/captures.ts:107 (CAPTURES.md append race)
Deferred (low-value):
- preferences-models.ts, key-manager.ts, auto-timers.ts silent catches
- dead variable in visualizer-data.ts
- google-gemini-cli.ts maxTokens clamp interaction
tsc --noEmit green at root.
The per-session branded welcome overlay was added by the SF rebrand
(9d739dfa5) as a boxed 'Press any key to continue...' splash shown once
per sf session. In practice: Enter doesn't dismiss it and typing renders
as garbled characters behind the overlay, blocking every TUI launch.
Branding was redundant with the header (installed at session_start) and
the footer (git branch + model). Shortcuts are discoverable via help.
Deleting the overlay eliminates the hang vector entirely.
Legacy-extension migration warnings (migrations.ts 'Press any key...')
are unaffected — those are vendored upstream Pi code on a different
code path and only fire when deprecated extensions are present.
resolveModelId now prefers google-gemini-cli over google (direct API) for
bare Gemini/Gemma IDs, matching the operational default after the CLI-core
re-platform. google-vertex is still honoured when it's the current provider.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
postUnitPreVerification now calls stageOnly() for execute-task units when
action=commit, setting stagedPendingCommit=true and capturing task context.
postUnitPostVerification commits the staged index after the gate passes,
using a conventional-commit message built from the task context. Failure is
non-fatal (logWarning + UI warning). 11 structural tests cover the full
deferral lifecycle.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix 2: verification gate no longer passes when no commands are
configured. Empty-commands result now returns passed=false, skipped=true.
Updated verification-gate.test.ts; added skipped-result guard in
auto-verification.ts that warns and continues (not a hard failure).
Fix 3: split auto-verification.ts try/catch into two zones. Zone 1
(gate machinery: prefs load, task lookup, runVerificationGate,
captureRuntimeErrors, runDependencyAudit) catches → pauseAuto + return
"pause". Zone 2 (ancillary: evidence writes, UOK gate, notifications)
catches → logWarning + return "continue". Added verification-fail-
closed.test.ts with 11 structural tests.
Fix 1 (infrastructure): added stageOnly() + commitStaged() to
GitServiceImpl, added stagedPendingCommit flag to AutoSession (cleared
in reset()), marked the runTurnGitAction call site in
postUnitPreVerification with TODO(fix-1-deferral) for the final wiring.
Fix 4: timeout handler in runFinalize now captures hadStagedPending and
hadCommitted before nulling currentUnit. Clears stagedPendingCommit to
prevent orphaned deferred commits. Emits a diagnostic warning for each
case so operators know whether staged-but-uncommitted changes will be
absorbed or whether a commit landed before verification was skipped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace separate dispatchHeadlessBootstrap with one flow:
- dispatchNewMilestoneDiscuss({ auto }) — auto=true uses headless
prompt + rootFiles seed, no pendingAutoStartMap; auto=false uses
discuss prompt with preparation, sets pendingAutoStartMap
- bootstrapNewMilestone() — project setup + ID reservation, called
directly from bootstrapAutoSession instead of the old wrapper
- injectTodoContext() — reads and deletes todo.md/TODO.md/SPEC.md at
project root, injects content as spec into any preamble; called
identically in auto and interactive flows
Removes dispatchHeadlessBootstrap entirely. auto-start.ts now calls
the primitives directly. All three showWorkflowEntry new-milestone
sites use dispatchNewMilestoneDiscuss({ auto: false }).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Antigravity (Google's IDE sandbox product, different from Gemini CLI) is
removed from:
src/onboarding.ts — drop from LLM_PROVIDER_IDS + OAuth-flow picker
src/pi-migration.ts — drop from LLM_PROVIDER_IDS migration list
src/web/onboarding-service.ts — drop from web-UI provider list
src/tests/integration/web-onboarding-contract.test.ts — update contract
src/resources/extensions/sf/doctor-providers.ts — drop from CLI_AUTH_PROVIDERS
src/resources/extensions/sf/key-manager.ts — drop UI listing
src/resources/extensions/sf-usage-bar/index.ts — delete entire quota fetcher block (~200 lines)
packages/pi-coding-agent/src/cli/args.ts — drop PI_AI_ANTIGRAVITY_VERSION doc
packages/pi-coding-agent/src/utils/proxy-server.ts — drop from claude provider chain
Reason: antigravity has no vendor-published core library we can embed
(unlike @google/gemini-cli-core for the Gemini CLI). Continuing to
hand-roll OAuth against daily-cloudcode-pa.sandbox.googleapis.com is
exactly the pattern Google has started banning for third-party tools.
Removing the code removes the ban risk.
pi-ai provider code, OAuth util, and models.generated entries for
google-antigravity are removed in follow-up commits (separated for
reviewability — each layer verified independently).
Build passes. Note: this is a breaking change for any user who had
google-antigravity configured — they'll need to migrate to
google-gemini-cli (OAuth), google (API key), or google-vertex.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gemini had zero benchmark entries in model-benchmarks.json despite
being served by google-gemini-cli (OAuth provider, SF native), google
(API key), google-vertex, google-antigravity, openrouter, etc. Every
gemini-* model in the pi-ai catalog scored 0 in the benchmark selector
— effectively excluded from auto-selection even when allow-listed.
Published numbers from DeepMind model cards + Vellum LLM leaderboard +
Vals AI:
gemini-3-pro-preview: SWE-Verified 76.2, HLE 37.5, AIME25 95,
GPQA-D 91.9, MMLU-Pro 81.0
gemini-3.1-pro-preview: SWE-Verified 78, HLE 41, AIME 97,
GPQA-D 93, MMLU-Pro 83 (Feb 2026)
gemini-3-flash-preview: estimated from Pro-vs-Flash delta
gemini-2.5-pro: SWE-Verified 63.8, HLE 18.8, GPQA-D 84.0,
MMLU-Pro 86
gemini-2.5-flash: estimated from Pro-vs-Flash delta
Context windows reflect Gemini's 1M-2M token capability.
LiveCodeBench Pro Elo (2439 for Gemini 3 Pro) isn't in the 0-100
percent schema — skipped rather than forced. Future: add arena_elo-
style LCB Elo dimension to the schema if we start routing on it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When two models score identically in the benchmark selector — typically
the same underlying weights served by different endpoints — the
previous alphabetical tiebreaker picked the wrong one. dr-repo example:
zai/glm-5.1 score 84.7
opencode-go/glm-5.1 score 84.7
Both are the exact same GLM-5.1 weights. Alphabetical comparison made
opencode-go win ("o" < "z") even though zai is the NATIVE provider.
Fix: new `provider_preference` pref, an ordered list of providers.
Listed providers rank in order, unlisted fall after alphabetically.
Applied as the tie-breaker between score and alphabetical.
Global default shipped in ~/.sf/preferences.md:
kimi-coding, minimax, zai, mistral, ollama-cloud, opencode-go,
opencode
Native providers ranked before re-servers. Users can override per
project.
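A sketch of the ordering described above, assuming a comparator of this shape. Candidate and compareCandidates are hypothetical names; the selector's real types are not shown in this commit message.

```typescript
interface Candidate {
  provider: string;
  modelId: string;
  score: number;
}

// Order: higher score first, then position in provider_preference
// (unlisted providers rank after all listed ones), then alphabetical.
function compareCandidates(
  a: Candidate,
  b: Candidate,
  preference: string[],
): number {
  if (a.score !== b.score) return b.score - a.score;
  const rank = (provider: string): number => {
    const index = preference.indexOf(provider);
    return index === -1 ? preference.length : index;
  };
  const byPreference = rank(a.provider) - rank(b.provider);
  if (byPreference !== 0) return byPreference;
  const keyA = `${a.provider}/${a.modelId}`;
  const keyB = `${b.provider}/${b.modelId}`;
  return keyA < keyB ? -1 : keyA > keyB ? 1 : 0;
}

const tied: Candidate[] = [
  { provider: "opencode-go", modelId: "glm-5.1", score: 84.7 },
  { provider: "zai", modelId: "glm-5.1", score: 84.7 },
];
const preference = [
  "kimi-coding", "minimax", "zai", "mistral",
  "ollama-cloud", "opencode-go", "opencode",
];
const withPreference = [...tied].sort((a, b) =>
  compareCandidates(a, b, preference),
);
const alphabeticalOnly = [...tied].sort((a, b) =>
  compareCandidates(a, b, []),
);
```

With an empty preference list the comparator degrades to the old alphabetical behaviour, so the pref is purely additive.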
Verified: after the change, dr-repo picks zai/glm-5.1 as primary for
execute-task and gate-evaluate (was opencode-go/glm-5.1), and
kimi-coding/k2p5 stays primary for completion phases with its direct
provider winning over opencode re-servers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The original "normalise by populated weight" was too aggressive: a model
with 1 strong dimension (delta-fast: human_eval=92) outranked a model
with 4 strong dimensions (beta-coder: swe_bench=85, lcb=90, he=95,
ifeval=90) because both normalised to their own small average.
Fix: multiply normalised score by a confidence factor tied to how much
of the unit's profile the model actually populated. Confidence =
populated_weight / total_profile_weight, blended 50/50 with a flat floor
so sparse-but-strong specialists still rank when no generalist covers
the profile:
score = (weighted_sum / weight_total) * (0.5 + 0.5 * confidence)
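The formula above can be sketched as follows. The profile weights are hypothetical (the real per-unit profiles live in the selector), and confidenceWeightedScore is an illustrative name; the two score sets are the delta-fast and beta-coder examples from this message.

```typescript
type Scores = Record<string, number | null | undefined>;
type Profile = Record<string, number>;

// score = (weighted_sum / populated_weight) * (0.5 + 0.5 * confidence)
// where confidence = populated_weight / total_profile_weight.
function confidenceWeightedScore(scores: Scores, profile: Profile): number {
  let weightedSum = 0;
  let populatedWeight = 0;
  let totalWeight = 0;
  for (const dim of Object.keys(profile)) {
    const weight = profile[dim] ?? 0;
    totalWeight += weight;
    const value = scores[dim];
    if (typeof value === "number") {
      weightedSum += value * weight;
      populatedWeight += weight;
    }
  }
  if (populatedWeight === 0 || totalWeight === 0) return 0;
  const confidence = populatedWeight / totalWeight;
  return (weightedSum / populatedWeight) * (0.5 + 0.5 * confidence);
}

// Hypothetical coding profile, weights sum to 1.
const codingProfile: Profile = {
  swe_bench: 0.35, lcb: 0.25, human_eval: 0.2, ifeval: 0.2,
};
const deltaFast = confidenceWeightedScore({ human_eval: 92 }, codingProfile);
const betaCoder = confidenceWeightedScore(
  { swe_bench: 85, lcb: 90, human_eval: 95, ifeval: 90 },
  codingProfile,
);
```

Under these assumed weights delta-fast normalises to 92 but covers only 20% of the profile, so it lands near 55.2, while beta-coder keeps its full 89.25. Without the confidence factor the order would be reversed.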
Net effect on dr-repo's auto-resolve:
  Unit             Before           After
  plan-milestone   glm-5.1          MiniMax-M2.5
  research-slice   codestral        mistral-large-2411
  execute-task     mistral-large    opencode-go/glm-5.1
  validate-m       magistral        MiniMax-M2.5
  subagent         mistral-large    kimi-coding/k2p5
MiniMax's broad coverage (8 populated dimensions from the M2 README)
now correctly outranks GLM-5.1's higher but narrower scores for
reasoning-heavy units. Matches user intuition that "MiniMax is really
powerful".
Also fixes findBenchmarkKey to try "<modelId>-latest" for date-suffixed
model variants — pi-ai catalogs "devstral-medium-2507" but benchmarks
only have "devstral-medium-latest"; matcher now bridges that.
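A sketch of the bridging rule, assuming the date suffix is a trailing four-digit group. The regex and the name findBenchmarkKeySketch are assumptions; the real matcher may accept other suffix shapes.

```typescript
// Look up a benchmark key: exact match first, then "<family>-latest" when
// the model ID ends in a date-like suffix such as "-2507".
function findBenchmarkKeySketch(
  modelId: string,
  benchmarks: Record<string, unknown>,
): string | undefined {
  const has = (key: string): boolean =>
    Object.prototype.hasOwnProperty.call(benchmarks, key);
  if (has(modelId)) return modelId;
  const family = modelId.replace(/-\d{4}$/, "");
  if (family !== modelId) {
    const latest = `${family}-latest`;
    if (has(latest)) return latest;
  }
  return undefined;
}

const benchmarkTable = { "devstral-medium-latest": { swe_bench: 1 } };
const bridged = findBenchmarkKeySketch("devstral-medium-2507", benchmarkTable);
const missing = findBenchmarkKeySketch("unknown-model", benchmarkTable);
```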
12 regression tests cover:
- empty candidate pool
- each profile (reasoning/coding/lightweight) picks right champion
- swe_bench ↔ swe_bench_verified equivalence
- models with all-null benchmarks score 0 but stay in fallbacks
- sparse-strong beats dense-weak (confirms confidence multiplier
doesn't over-penalise specialists)
- provider diversification in fallback chain
- deterministic tie-breaking
- unknown unit types use default coding profile
- date-suffixed model IDs match family-latest keys
Audit: 41 of 85 allow-listed models in pi-ai catalog have benchmark
data. 44 score 0 (mostly opencode Zen re-served models, ministral
small variants, pixtral vision models, legacy open-mistral). Top
picks for every dr-repo unit type DO have benchmark data — the gap
is in the long tail of fallbacks, which never matter unless the
primary and closer fallbacks all fail.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New module src/resources/extensions/sf/benchmark-selector.ts implements
benchmark-driven model selection. When models.<unit> is not pinned,
preferences-models.ts falls through to pick the highest-scoring
candidate from allowed_providers × pi-ai's model catalog, ranked
against a per-unit-type weight profile.
Weight profiles per unit type:
plan-milestone / plan-slice → agent-planning (swe_bench .25, lcb .20,
  hle .15, gpqa .15, mmlu_pro .15, aime .10)
research-* → mixed (mmlu_pro, hle, human_eval, browse_comp, simple_qa,
  gpqa)
execute-task → coding (swe_bench .35, swe_bench_v .25, lcb .20,
  human_eval .15)
execution_simple / complete-* → fast+correct (human_eval .40,
  instruction_following .35, ruler .25)
gate-evaluate → review (swe_bench .30, hle .25, gpqa .25, ifeval .20)
validate-milestone → validation (hle .30, gpqa .25, mmlu_pro .25,
  swe_bench .20)
Key design decisions:
- Missing dimensions are dropped (normalised by populated weight),
so a model with 2 strong populated scores isn't crushed by a peer
with 5 mediocre ones.
- swe_bench ↔ swe_bench_verified are fungible — some vendors publish
one, some the other; treat as equivalent.
- Provider diversification in fallbacks so one provider going 429
doesn't kill the whole chain.
- Score ties broken by coverage, then lexical — deterministic.
Also updates MiniMax-M2/M2.5/M2.7 benchmarks with real numbers from
the M2 official README (DeepWiki sourced) and MiniMax-M2.5 card
(minimax.io): swe_bench_verified 69.4→80.2, LCB 83, HLE 31.8 (w/
tools — more representative for agent work than no-tools 12.5),
AIME25 78, GPQA-D 78, MMLU-Pro 82. Context windows bumped to
weights-level: M2 400K, M2.5/M2.7 1M (endpoints may cap lower).
Verified end-to-end: with dr-repo's allow-list
(kimi-coding/minimax/zai/opencode-go/mistral) and models.* absent,
resolveModelWithFallbacksForUnit() returns:
plan-milestone → opencode-go/glm-5.1 (+3 fallbacks)
research-slice → mistral/codestral-latest
execute-task → mistral/mistral-large-latest
execution_simple → kimi-coding/k2p5
gate-evaluate → opencode-go/glm-5.1
validate-milestone → mistral/magistral-medium-latest
subagent → mistral/mistral-large-latest
Users can still pin individual units (existing models.* behaviour
unchanged) or rely fully on auto-selection by omitting them.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three related improvements that landed in the working tree after the
auto-hardening merge but hadn't been committed:
1. auth_error as a distinct error type (auth-storage + retry-handler).
Previously invalid/expired API keys would retry the same failing
credential until the retry budget exhausted. Now:
- classifyErrorType() recognizes 401s, "invalid api key",
"authentication error", "unauthorized" etc as "auth_error"
- RetryHandler triggers cross-provider fallback on auth_error just
like it does for rate_limit and quota_exhausted — switch
providers rather than burning retries on a broken key
Outcome: a stale OPENCODE_API_KEY in sops now fails over to kimi or
minimax immediately instead of stalling the unit.
2. Multi-provider search-key detection (native-search.ts).
The "Web search: Set BRAVE_API_KEY" warning fired whenever a
non-Anthropic model lacked BRAVE_API_KEY, even when the user had
TAVILY_API_KEY or OLLAMA_API_KEY available. Now: the warning is
suppressed if any of the BRAVE/TAVILY/OLLAMA keys is present, and the
warning text lists all three options. Matches the preferences-
validation allow-list for search_provider.
3. MiniMax-M2.7-highspeed benchmark entry (model-benchmarks.json).
Routes the fast-tier variant of M2.7 through the Bayesian blender
with inherited RULER scores. Lets dynamic routing consider the
highspeed model when speed matters more than peak quality.
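Item 1 can be sketched as follows. Only the auth_error patterns come from this message; the rate_limit and quota_exhausted patterns, the "other" bucket and both function names are assumptions, not the code in auth-storage or retry-handler.

```typescript
type ErrorType = "auth_error" | "rate_limit" | "quota_exhausted" | "other";

// Classify a provider failure from its HTTP status and message text.
function classifyErrorTypeSketch(
  status: number | undefined,
  message: string,
): ErrorType {
  const text = message.toLowerCase();
  if (
    status === 401 ||
    /invalid api key|authentication error|unauthorized/.test(text)
  ) {
    return "auth_error";
  }
  if (status === 429 || /rate limit/.test(text)) return "rate_limit"; // assumed pattern
  if (/quota/.test(text)) return "quota_exhausted"; // assumed pattern
  return "other";
}

// Errors that will not heal by retrying the same credential: switch
// providers instead of spending the retry budget.
function shouldSwitchProvider(type: ErrorType): boolean {
  return (
    type === "auth_error" ||
    type === "rate_limit" ||
    type === "quota_exhausted"
  );
}

const staleKey = classifyErrorTypeSketch(401, "Invalid API key provided");
const transient = classifyErrorTypeSketch(500, "upstream connect error");
```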
No regressions: the 41 pre-existing test failures in pi-coding-agent
(FallbackResolver chain-membership + LSP integration) are unchanged
relative to the prior commit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
opencode-go is already a first-class provider in pi-ai (models.generated.js
registers 7 models under the opencode-go namespace: glm-5, glm-5.1,
kimi-k2.5, mimo-v2-{omni,pro}, minimax-m2.{5,7}) and runs against
https://opencode.ai/zen/go/v1 with OPENCODE_API_KEY auth.
It was missing from key-manager's LLM provider registry, so the /sf
config wizard and onboarding flows didn't prompt users to supply
OPENCODE_API_KEY. Adding it here gives users a discoverable path to
subscribe and surface the 7 opencode-go models in list-models.
Research confirmed (DeepWiki sst/opencode + curl probes):
- /zen/go/v1/chat/completions is the OpenAI-compatible endpoint
- OPENCODE_API_KEY is the correct env var
- No /models listing endpoint — hardcoding is correct (already done
by the generate-models.ts pipeline)
- Sister /zen/go/v1/messages serves Anthropic-compat minimax variants
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New feature: allowed_providers — hard allowlist of providers that
auto-mode can dispatch to. When set, models from any other provider
are invisible to selection BEFORE models.* resolution and dynamic
routing run. This prevents routing from silently picking providers
the user doesn't have keys for — the root cause of repeated
"400 The requested model is not supported" pauses observed in
dr-repo when routing picked gpt-5.2-codex despite no GPT being
configured.
Implementation is a single filter at the top of selectAndApplyModel:
availableModels = rawAvailable.filter(m => allowed.includes(m.provider.toLowerCase()))
If the allowlist rejects everything, throw with a clear message
pointing at the pref (fail-closed — don't dispatch to whatever's
left).
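A sketch of the filter with its fail-closed branch. ModelInfo and filterByAllowedProviders are hypothetical names, and treating an unset pref as "no filter" follows the "When set" wording above; the real code sits inline in selectAndApplyModel.

```typescript
interface ModelInfo {
  provider: string;
  id: string;
}

// Keep only models whose provider is on the allowlist. If the allowlist is
// set but nothing survives, throw rather than dispatch to whatever is left.
function filterByAllowedProviders(
  rawAvailable: ModelInfo[],
  allowedProviders: string[] | undefined,
): ModelInfo[] {
  if (allowedProviders === undefined) return rawAvailable; // pref not set
  const allowed = allowedProviders.map((p) => p.toLowerCase());
  const availableModels = rawAvailable.filter((m) =>
    allowed.includes(m.provider.toLowerCase()),
  );
  if (availableModels.length === 0) {
    throw new Error(
      `allowed_providers [${allowedProviders.join(", ")}] matches no ` +
        "available model; check the allowed_providers preference",
    );
  }
  return availableModels;
}

const catalog: ModelInfo[] = [
  { provider: "openai", id: "gpt-5.2-codex" },
  { provider: "ZAI", id: "glm-5.1" },
];
const visible = filterByAllowedProviders(catalog, ["zai", "minimax"]);
```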
While wiring this I found mergePreferences was silently dropping
six more validated fields — same latent-bug class as service_tier:
- allowed_providers (new)
- stale_commit_threshold_minutes
- modelOverrides
- flat_rate_providers
- widget_mode
- safety_harness
All added to the merge function. Now: if you set it in PREFERENCES,
consumers see it.
Verified end-to-end: loadEffectiveSFPreferences() reads
allowed_providers from dr-repo's .sf/PREFERENCES.md correctly, and
auto-mode model selection honors the filter.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two related fixes surfaced from a real sf headless auto run in dr-repo.
1. Project preferences now resolve from the MAIN worktree, not the
current linked worktree. SF's auto-mode creates a git worktree per
milestone (`.sf/worktrees/M003/`). The old code called
`projectPreferencesPath()` which used `process.cwd()` — the
milestone worktree — so a pref change on main (service_tier,
dynamic_routing, model config) never reached an in-flight milestone
until the branch merged main. Observed concretely when disabling
dynamic_routing had no effect until we merged main into the
milestone branch.
New `projectPrefsRoot()` detects a linked worktree by reading
`.git` (a FILE in worktrees, pointing to
`/main/.git/worktrees/NAME`), follows the `commondir` pointer back
to the main `.git` dir, and walks up one level. Falls back to cwd
silently for non-worktree setups.
2. MCP server config now also loads from global paths
(`~/.sf/mcp.json`, `~/.sf/agent/mcp.json`) in addition to the
existing project-level (`.mcp.json`, `.sf/mcp.json`). First-hit
wins, so project configs can still shadow or augment a globally-
registered server by name. This lets the user register unauth'd
servers like the DeepWiki remote MCP once and have every SF
project pick it up without per-project `.mcp.json`.
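The worktree detection in item 1 can be sketched as follows. projectPrefsRootSketch is a hypothetical stand-in for projectPrefsRoot, and the fixture layout below is fabricated on disk only to exercise it.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// In a linked worktree ".git" is a file containing "gitdir: <admin dir>".
// The admin dir holds a "commondir" file pointing back at the main .git
// directory; its parent is the main worktree root.
function projectPrefsRootSketch(cwd: string): string {
  try {
    const dotGit = path.join(cwd, ".git");
    if (!fs.statSync(dotGit).isFile()) return cwd; // main worktree
    const match = /^gitdir:\s*(.+)$/m.exec(fs.readFileSync(dotGit, "utf8"));
    const target = match?.[1]?.trim();
    if (!target) return cwd;
    const gitDir = path.resolve(cwd, target);
    const commonDirRef = fs
      .readFileSync(path.join(gitDir, "commondir"), "utf8")
      .trim();
    return path.dirname(path.resolve(gitDir, commonDirRef));
  } catch {
    return cwd; // silent fallback for non-worktree setups
  }
}

// Throwaway fixture mimicking main/.sf/worktrees/M003 as a linked worktree.
const fixture = fs.mkdtempSync(path.join(os.tmpdir(), "prefs-root-"));
const mainRoot = path.join(fixture, "main");
const adminDir = path.join(mainRoot, ".git", "worktrees", "M003");
const linked = path.join(mainRoot, ".sf", "worktrees", "M003");
fs.mkdirSync(adminDir, { recursive: true });
fs.mkdirSync(linked, { recursive: true });
fs.writeFileSync(path.join(adminDir, "commondir"), "../..\n");
fs.writeFileSync(path.join(linked, ".git"), `gitdir: ${adminDir}\n`);

const resolvedFromLinked = projectPrefsRootSketch(linked);
const resolvedFromMain = projectPrefsRootSketch(mainRoot);
```

A submodule also has a .git file but no commondir, so it falls through the catch and keeps cwd.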
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All 10 research/planning/discuss prompts updated to put DeepWiki
first in the docs-lookup order. Context7 becomes the fallback for
package-registry-only libraries.
Rationale: Context7 free tier is capped at 1000 req/month — a
research-heavy auto loop can burn through that in a single session.
DeepWiki has no cap and covers any GitHub-hosted library with
AI-indexed answers, so it's strictly better as the default for the
typical SF research path.
Prompts touched:
system.md, discuss.md, discuss-headless.md, plan-milestone.md,
queue.md, research-milestone.md, research-slice.md,
guided-discuss-milestone.md, guided-discuss-slice.md,
guided-research-slice.md
Each references the three DeepWiki tools — ask_question,
read_wiki_structure, read_wiki_contents — and explicitly mentions the
Context7 1000-req/month cap so models don't spend it wastefully.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sf headless query and sf headless status call resolveDispatch() without
going through auto-mode startup, so the rule-registry singleton is
never initialized. The previous code caught getRegistry()'s init error
and logged a warning on every call — noise on the normal path:
[sf:dispatch] WARN: registry dispatch failed, falling back to inline
rules: RuleRegistry not initialized — call initRegistry() or
setRegistry() first.
Now: probe hasRegistry() first. When unset, skip straight to the inline
rule loop without warning (it's the intended behavior outside auto).
When the registry IS set and evaluateDispatch() genuinely throws, log
the warning so real bugs still surface.
Adds hasRegistry() as a public helper for any other hot-path caller
that wants to branch on init without try/catch overhead.
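A sketch of the control flow, with stand-in types. Registry, Dispatch, inlineRules and the warnings array are hypothetical; only hasRegistry(), setRegistry() and the warning text mirror names from this message.

```typescript
type Dispatch = { rule: string };
interface Registry {
  evaluateDispatch(input: string): Dispatch;
}

let registry: Registry | undefined;
const warnings: string[] = [];

function hasRegistry(): boolean {
  return registry !== undefined;
}
function setRegistry(next: Registry | undefined): void {
  registry = next;
}
function inlineRules(input: string): Dispatch {
  return { rule: `inline:${input}` };
}

function resolveDispatchSketch(input: string): Dispatch {
  // Outside auto-mode the registry is never initialized: that is expected,
  // so take the inline path without logging anything.
  if (!hasRegistry()) return inlineRules(input);
  try {
    return registry!.evaluateDispatch(input);
  } catch (err) {
    // The registry exists and still failed: a real bug, so surface it.
    warnings.push(
      "registry dispatch failed, falling back to inline rules: " +
        (err as Error).message,
    );
    return inlineRules(input);
  }
}

const headlessResult = resolveDispatchSketch("status");
const warningsAfterHeadless = warnings.length;

setRegistry({
  evaluateDispatch: () => {
    throw new Error("rule table corrupt");
  },
});
const failingResult = resolveDispatchSketch("status");
const warningsAfterFailure = warnings.length;
```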
Verified end-to-end: sf headless query and sf headless status in
dr-repo now run clean, no false warning. All 25 rule-registry tests
pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same class of bug as the service_tier fix: preference fields declared
in SFPreferences type and consumed by feature code, but never copied
into the validated output, so they silently become undefined when set
in PREFERENCES.md.
Found by diffing validated.<field> vs the interface declarations:
- forensics_dedup (boolean) — /sf forensics issue de-dup opt-in
- stale_commit_threshold_minutes (number) — doctor safety-commit cadence
- widget_mode ("full"|"small"|"min"|"off") — dashboard widget sizing
- slice_parallel ({ enabled?, max_workers? }) — slice-level parallelism
- modelOverrides (Record) — per-model capability patches
- safety_harness ({ enabled?, evidence_collection?, ... }) — LLM safety
Validation is kind-appropriate: primitives get type + range checks,
nested objects get object-shape guards with pass-through for now.
Consumer sites already treat missing fields as optional, so landing
shallow validation first is safe.
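
As a rough illustration of "kind-appropriate" shallow validation, the
sketch below copies four of the fields listed above into a validated
output. The field names come from this message; the function name
validateExtras, the error strings, and the non-negative range rule are
assumptions, not the actual implementation.

```typescript
// Field names come from the commit message; the validator shape, error
// strings, and range rule below are assumptions for illustration.
type WidgetMode = "full" | "small" | "min" | "off";
const WIDGET_MODES: WidgetMode[] = ["full", "small", "min", "off"];

interface ValidatedExtras {
  forensics_dedup?: boolean;
  stale_commit_threshold_minutes?: number;
  widget_mode?: WidgetMode;
  slice_parallel?: Record<string, unknown>;
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function validateExtras(
  input: Record<string, unknown>,
): { validated: ValidatedExtras; errors: string[] } {
  const validated: ValidatedExtras = {};
  const errors: string[] = [];

  // Primitive: type check only.
  const dedup = input.forensics_dedup;
  if (typeof dedup === "boolean") validated.forensics_dedup = dedup;
  else if (dedup !== undefined) errors.push("forensics_dedup must be a boolean");

  // Primitive: type check plus an assumed range check.
  const minutes = input.stale_commit_threshold_minutes;
  if (typeof minutes === "number" && isFinite(minutes) && minutes >= 0) {
    validated.stale_commit_threshold_minutes = minutes;
  } else if (minutes !== undefined) {
    errors.push("stale_commit_threshold_minutes must be a non-negative number");
  }

  // Enum: accept only the known literals.
  const mode = input.widget_mode;
  const known = WIDGET_MODES.filter((candidate) => candidate === mode);
  if (known.length === 1) validated.widget_mode = known[0];
  else if (mode !== undefined) errors.push("widget_mode must be full, small, min, or off");

  // Nested object: shape guard, then pass-through.
  const parallel = input.slice_parallel;
  if (isPlainObject(parallel)) validated.slice_parallel = parallel;
  else if (parallel !== undefined) errors.push("slice_parallel must be an object");

  return { validated, errors };
}
```

Keys that are not explicitly handled never reach the output, which is
the allow-list behavior that made the original omission silent.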
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
validatePreferences() is a strict allow-list — it copies only explicitly
handled fields from input to output. service_tier was in
KNOWN_PREFERENCE_KEYS (no unknown-key warning) but was never copied into
the validated output, so users setting service_tier: priority or flex in
PREFERENCES.md silently got undefined.
This is a latent bug that predates today's work: the new "off" value
hit it first because it was verified end-to-end, but priority and flex
had the same issue. /sf fast on correctly writes "priority" via
writeGlobalServiceTier, and then the next read silently drops it.
Now: service_tier is validated against {priority, flex, off} and copied
through. Invalid values raise an error rather than being silently lost.
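
A small sketch of the fix, assuming an allow-list validator that
collects errors rather than throwing (the message does not say which the
real code does). validateServiceTier and validatePreferencesSketch are
hypothetical names; only service_tier and its three values come from
this message.

```typescript
const SERVICE_TIERS = ["priority", "flex", "off"] as const;
type ServiceTier = (typeof SERVICE_TIERS)[number];

// Returns either a validated value or an error message. Whether the real
// validator throws or collects errors is not stated; this sketch collects.
function validateServiceTier(raw: unknown): { value?: ServiceTier; error?: string } {
  if (raw === undefined) return {};
  for (const tier of SERVICE_TIERS) {
    if (raw === tier) return { value: tier };
  }
  return { error: "service_tier must be one of: " + SERVICE_TIERS.join(", ") };
}

// Allow-list copy: only explicitly handled keys reach the output.
function validatePreferencesSketch(
  input: Record<string, unknown>,
): { preferences: { service_tier?: ServiceTier }; errors: string[] } {
  const preferences: { service_tier?: ServiceTier } = {};
  const errors: string[] = [];
  const tier = validateServiceTier(input.service_tier);
  if (tier.value !== undefined) preferences.service_tier = tier.value;
  if (tier.error !== undefined) errors.push(tier.error);
  return { preferences, errors };
}
```

The bug was the missing copy step: being listed as a known key only
suppressed the unknown-key warning, it did not carry the value through.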
Verified: dr-repo's service_tier: "off" in .sf/PREFERENCES.md now loads
correctly via loadEffectiveSFPreferences().preferences.service_tier.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three TUI-only decorations were running their full session-lifecycle
handlers even in headless mode, where there is no footer to render
into. Most visibly, the emoji extension's AI auto-assign path made a
real LLM call to pick a 🚀/✨/🎯 that nothing would ever see.
- sf-tui/emoji.ts: session_start and agent_start handlers early-return
when !ctx.hasUI. Commands stay registered so /emoji still works if
someone runs it, but the lifecycle work (state loading, AI emoji
selection, setStatus emission) is skipped.
- sf-tui/color-band.ts: session_start and session_switch handlers
early-return when !ctx.hasUI. Avoids unnecessary state-file writes
and resize-listener attachment in headless runs.
- sf-permissions/index.ts:setLevel: guards the setStatus("authority",
…) call behind ctx.hasUI. The existing session_start path was
already gated — this closes the permission-change code path.
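
The guard pattern shared by the three changes can be sketched as
follows. SessionContext is a hypothetical minimal shape (only hasUI and
setStatus are named in this message), the "emoji" status key is
illustrative, and pickEmoji is a synchronous stand-in for what is a real
LLM call in the extension.

```typescript
// Hypothetical minimal context: only hasUI and setStatus appear in the commit.
interface SessionContext {
  hasUI: boolean;
  setStatus(key: string, text: string): void;
}

// pickEmoji is a synchronous stand-in for the AI auto-assign step, which is
// a real LLM call in the extension.
function makeSessionStartHandler(
  pickEmoji: () => string,
): (ctx: SessionContext) => void {
  return (ctx) => {
    // Headless: there is no footer to render into, so skip all lifecycle work
    // before anything expensive runs.
    if (!ctx.hasUI) return;
    ctx.setStatus("emoji", pickEmoji()); // "emoji" key is illustrative
  };
}
```

Placing the early return before pickEmoji() is what saves the tokens;
filtering the status key on output alone would still pay for the call.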
Headless stderr was already filtering these keys, so the user-visible
output is unchanged. This eliminates the underlying RPC traffic and
— more importantly — stops spending LLM tokens on decorative emoji
selection in headless runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds an explicit disable state (service_tier: "off" in PREFERENCES.md)
that short-circuits every service-tier surface:
- No setStatus("sf-fast", …) footer events — RPC traffic stops, not
just the stderr filter masking it.
- No service_tier field ever injected into before_provider_request
payloads, regardless of model.
- /sf fast on and /sf fast flex refuse to write a tier while "off" is
set, instructing the user to clear the preference first.
- /sf fast status shows "(service_tier: \"off\" in preferences)" so
the explicit disable is visible at a glance.
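
A hedged sketch of the short-circuit ordering. isServiceTierDisabled is
named in this message as a new export, but its signature here is an
assumption; injectServiceTier, handleFastCommand, the payload shape, the
modelSupportsTier flag, and the returned strings are all hypothetical.

```typescript
type ServiceTierPref = "priority" | "flex" | "off" | undefined;

// Named in the commit as a new export; this signature is an assumption.
function isServiceTierDisabled(pref: ServiceTierPref): boolean {
  return pref === "off";
}

// Hypothetical before_provider_request step.
function injectServiceTier(
  payload: Record<string, unknown>,
  pref: ServiceTierPref,
  modelSupportsTier: boolean,
): Record<string, unknown> {
  // Checked first, so "off" wins regardless of model support.
  if (isServiceTierDisabled(pref)) return payload;
  if (pref === undefined || !modelSupportsTier) return payload;
  return { ...payload, service_tier: pref };
}

// Hypothetical handler for "/sf fast on" and "/sf fast flex".
function handleFastCommand(
  requested: "priority" | "flex",
  pref: ServiceTierPref,
  write: (tier: "priority" | "flex") => void,
): string {
  if (isServiceTierDisabled(pref)) {
    // Refuse to write while the explicit disable is in place.
    return 'service_tier is "off" in preferences; clear it before enabling a tier';
  }
  write(requested);
  return "service tier set to " + requested;
}
```

Checking the disable state before the model-support gate means the
outcome no longer depends on which model happens to be active.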
Rationale: setups that never run gpt-5.4 (Claude / Kimi / MiniMax /
GLM / Gemini-only shops) have no use for the feature. "off" lets them
fully turn it off rather than relying on model-support gates to
silence it.
6 regression tests added in service-tier.test.ts covering the new
isServiceTierDisabled export, hook short-circuit ordering, and the
/sf fast command refusal. 52 / 52 service-tier tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously the bundled Ollama extension probed http://localhost:11434
on every session_start, which was wasted work for users who never run
Ollama locally. It also registered slash commands, loaded the
ollama_manage tool, and (in interactive mode) set a "[phase] ollama"
status indicator that leaked into headless stderr.
Now the default export short-circuits immediately when OLLAMA_HOST is
not set — no probe, no command registration, no tool loading, no
status indicator. probeAndRegister repeats the same check, so direct
callers get the same behavior.
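
The opt-in guard might look roughly like this. OllamaApi, Env,
readOllamaHost, activate, and the "ollama" command name are hypothetical;
the environment is passed in explicitly and probe is a synchronous
stand-in for the real HTTP probe. Treating an empty OLLAMA_HOST as unset
is also an assumption of the sketch.

```typescript
// Hypothetical extension surface; only OLLAMA_HOST and probeAndRegister are
// named in the commit.
interface OllamaApi {
  registerCommand(name: string): void;
}
type Env = Record<string, string | undefined>;

// Treating an empty string as unset is an assumption of this sketch.
function readOllamaHost(env: Env): string | undefined {
  const host = env.OLLAMA_HOST;
  return host !== undefined && host.trim() !== "" ? host : undefined;
}

function probeAndRegister(
  api: OllamaApi,
  env: Env,
  probe: (host: string) => boolean,
): boolean {
  const host = readOllamaHost(env);
  if (host === undefined) return false; // repeated guard for direct callers
  if (!probe(host)) return false;
  api.registerCommand("ollama"); // command name is illustrative
  return true;
}

function activate(api: OllamaApi, env: Env, probe: (host: string) => boolean): void {
  // Opt-in: without OLLAMA_HOST there is no probe and no registration.
  if (readOllamaHost(env) === undefined) return;
  probeAndRegister(api, env, probe);
}
```

Calling activate(api, {}, probe) does nothing at all, while
activate(api, { OLLAMA_HOST: "http://localhost:11434" }, probe) probes
that host and registers only if the probe succeeds.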
ollama-cloud is unaffected: with OLLAMA_HOST=https://ollama.com and
OLLAMA_API_KEY=<key> set, everything runs as before. Self-hosted local
Ollama still works but is now opt-in: set
OLLAMA_HOST=http://localhost:11434 explicitly to re-enable the old
behavior.
3 new regression tests cover the opt-in guard. All 138 ollama tests
pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds an optional model param to SubagentParams, TaskItem, and ChainItem so
callers can override the agent's default model at dispatch time. This is
the primitive that ace-coder's Task() tool exposes via its `model` arg —
SF's subagent tool previously ignored model at the tool level, picking it
up only from the named agent's .md frontmatter.
- SubagentParams.model applies to single mode, or as a batch-level default
for tasks/chain steps that don't set their own.
- TaskItem.model and ChainItem.model override per-task / per-step.
- runSingleAgent and runSingleAgentInCmuxSplit gain a trailing
modelOverride parameter that flows into buildSubagentProcessArgs.
- buildSubagentProcessArgs uses modelOverride ?? agent.model when picking
the --model arg for the child process.
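
The precedence described above (per-item model, then batch-level model,
then the agent's own default) can be sketched as below. The interfaces
are trimmed to the model field, resolveItemModel is a hypothetical
helper, and the argument list built here is illustrative: only the
--model flag and the modelOverride ?? agent.model expression come from
this message.

```typescript
// Trimmed, hypothetical shapes: only the model fields matter for this sketch.
interface AgentConfig {
  name: string;
  model?: string; // default from the agent's .md frontmatter
}
interface ModelCarrier {
  model?: string;
}

// Per-item model (TaskItem / ChainItem) wins over the batch-level default
// (SubagentParams). Hypothetical helper name.
function resolveItemModel(params: ModelCarrier, item: ModelCarrier): string | undefined {
  return item.model ?? params.model;
}

// Illustrative argument builder; the real function takes more inputs.
function buildSubagentProcessArgs(
  agent: AgentConfig,
  task: string,
  modelOverride?: string,
): string[] {
  const args: string[] = [];
  const model = modelOverride ?? agent.model;
  if (model !== undefined) args.push("--model", model);
  args.push(task);
  return args;
}
```

Because both steps use ??, an item or batch that leaves model undefined
falls through cleanly to the agent default instead of clearing it.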
Side benefit: retroactively fixes the latent bug where
reactive_execution.subagent_model was threaded into prompt instructions
but ignored by the actual tool.
9 regression tests added in subagent/tests/model-override.test.ts.
All 53 subagent-related tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add 6 new skills under src/resources/extensions/sf/skills/
- Revert broken dispatch_model extension from auto-prompts.ts — the subagent
tool has no model-override param; skills stay as pure text injection
- Fix discuss-headless.md: advisory-partner section now correctly describes
that independent review runs via gate-evaluate/validate-milestone (Q3/Q4,
MV01-MV04) with the validation model, not inline self-review
- Include pm-planning, codebase-analysis, architecture-planning, and
feature-gap-analysis skill activations in discuss-headless Active Skills
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>