Milestone-end workflow that compares declared product intent (VISION.md,
RUNBOOKS.md, etc.) against actual code/test/deploy/docs evidence and
emits structured gaps with severity. Soft gates — adds follow-up slices
but doesn't hard-block merge.
Slim port (4 new files + 1 registration) — extracts only the audit
feature itself, not bunker's parallel rewrite of dispatch/prompts/
benchmark-selector that came with it in commit 2aa785475.
Created:
- prompts/product-audit.md — prompt verbatim, gsd_*→sf_* and .gsd→.sf
- tools/product-audit-tool.ts — slim file-write implementation,
atomicWriteAsync to .sf/active/{mid}/
PRODUCT-AUDIT.{json,md}; no DB deps
- bootstrap/product-audit-tool.ts — pi-coding-agent tool registration,
TypeBox schema for sf_product_audit
- workflow-templates/product-audit.md — workflow template
Modified:
- bootstrap/register-extension.ts — 2 lines: import + add to nonCriticalRegistrations
- workflow-templates/registry.json — registry entry
- package.json — version 2.75.0 → 2.75.1
Verdict logic (no-gaps | gaps-found | contract-underspecified) is the
load-bearing innovation: contract-underspecified forces the auditor to
flag unverifiable docs as a real gap rather than rubber-stamping
no-gaps when the product contract is silent.
Out of scope: phase enum changes, dispatch hookup. Wire-up to the phase
machine is a follow-up; the prompt + tool + template stand alone.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three small UX fixes for headless / autopilot logs:
1. Add `zz-notifications` to TUI_FOOTER_STATUS_KEYS — these are sticky
notification dots from the interactive TUI footer; they have no
meaning in headless and were spamming the log.
2. Categorize notification messages by prefix so headless output is
scannable: [mcp] for MCP-client-ready, [search] for web search status,
[parallel] for slice-parallel/subagent dispatch. Falls through to
the existing important/non-important formatting for everything else.
3. Distinguish phase transitions from generic status updates: phase:/
milestone:/slice:/task: prefixed keys get [phase]; everything else
gets [status]. Previously both used [phase], which was misleading.
Patterns based on bunker commits 14ec4d97f / c15afb45f (which were the
research source) but written fresh against our existing
TUI_FOOTER_STATUS_KEYS structure rather than cherry-picked.
The assistant-text-preview commit (cf0274c63) is a separate, larger
refactor in headless.ts and is deferred to v3.1.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
We have .serena/ configured (cache, memories, project.local.yml) but no
prompt mentioned Serena anywhere. Agents weren't using it for symbol
lookup or cross-file architecture mapping; they fell straight to rg/find.
Added a one-sentence Serena hint to the code-exploration step in:
- research-slice.md
- research-milestone.md
- plan-slice.md
- plan-milestone.md
- guided-research-slice.md
Phrased generically ("If a repo-intelligence MCP (e.g. Serena) is
configured...") so it degrades cleanly when Serena isn't set up.
Pattern based on bunker commit 4ba746888 but written fresh against our
post-rename prompt structure rather than cherry-picked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit captures uncommitted modifications that accumulated in the
working tree across multiple in-progress workstreams. It is a snapshot
to clear the deck before sf v3 work begins; individual workstreams
should land separately on top of this.
Notable additions:
- trace-collector.ts, traces.ts, src/tests/trace-export.test.ts —
trace export plumbing
- biome.json — Biome linter configuration
- .gitignore — exclude native/npm/**/*.node compiled binaries
The bulk of the diff is across src/resources/extensions/sf/ (301 files)
and src/resources/extensions/sf/tests/ (277 files), reflecting the
ongoing sf extension work. Specific feature commits should follow this
snapshot rather than being archaeology'd out of it.
The 76MB native/npm/linux-x64-gnu/forge_engine.node compiled binary
was left out of the commit — it's now gitignored and built locally.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codex-rescue output (a299c461 / bnr88iy59) — the 'Git merge approved once'
followed seconds later by 'Git merge declined by user' bug we hit on
M002 complete-milestone. Same gate, same agent run, opposite verdicts.
Single source of truth for the merge-gate state in guardrails/index.ts.
Approval is now sticky — re-asks return the cached approval until consumed
or explicitly revoked, never auto-flip to decline. Timeout converts to
pause+log instead of decline. Adds tests/safe-git-merge-gate.test.ts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: OpenAI Codex <noreply@openai.com>
Two codex-rescue tasks landed together:
1. Auto-coerce JSON-schema validator: when a tool field declares
{type:"array", items:{type:"string"}} and the model sends a single
string, wrap it in [string] before validation instead of hard-rejecting.
Fixes the recurring "keyDecisions: must be array" rejection on
sf_complete_task that wasted retries.
2. Provider_model_allow filter (proper implementation with helpers):
- resolveProviderModelAllowList / isProviderModelAllowed /
filterModelsByProviderModelAllow helpers in preferences-models
- Wired into model-registry and auto-model-selection
- New tests/provider-model-allow.test.ts
Tools coerced: sf_complete_task, sf_complete_milestone, sf_plan_milestone,
sf_plan_slice, sf_replan_slice, sf_reassess_roadmap (key list fields).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: OpenAI Codex <noreply@openai.com>
Cherry-pick of gsd-build/gsd-2 65ca5aa2e — applies the security hardening
hunks that conflicted minimally:
- mcp-server/env-writer: validate writes against a strict allowlist
- web/api/files: enforce path containment via web/lib/secure-path
- vscode-extension: read binaryPath/autoStart only from trusted
global/default scopes (resolveTrustedSfStartupConfig), avoiding
workspace-controlled override (renamed Gsd → Sf for sf naming)
- New regression tests: mcp-client-security, vscode-startup-security,
web-files-symlink
Skipped hunks (drifted): mcp-server/server.ts, mcp-client/index.ts,
mcp-server/README.md.
Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 a09e01640 — withFileLockSync now actually
acquires a proper-lockfile (was previously a no-op when proper-lockfile
wasn't required) and throws on ELOCKED contention by default. Adds
onLocked: 'skip' option for best-effort callers that tolerate dropped
entries (audit, journal). Modernizes import style (createRequire/join
from imports rather than ad-hoc require). Path-renames preserved
(gsd-pi → sf-run).
Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 53babec29 — lock-wrapped append half.
Wraps appends to .sf/journal/, .sf/audit/events.jsonl, and the
workflow-logger error log in withFileLockSync (onLocked: skip),
preserving best-effort semantics while preventing torn writes
under contention.
Companion to the atomic-write half landed in 3df56cb94. Path-renames
(gsdRoot→sfRoot, gsd-db→sf-db) preserved during conflict resolution.
Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 9340f1e9b (#4423) — doctor self-heal
detection for symlinked staging directories that can cause silent
data loss. Skips native-git-bridge.ts and git-service test (drifted).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 a4f78731f — handles worktree context fallback
and sanitizes paths in paused session resumption. Skips uok-plan-v2-wiring
test hunk (drifted in sf).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 851507913 (#4056) — defensive parsing
so a corrupt or non-array tasks blob in a milestone row doesn't crash
sf-db reads. Test hunk skipped (sf-db.test.ts has drifted).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-pick of gsd-build/gsd-2 53babec29 (Jeremy <jeremy@fluxlabs.net>)
— atomic-write half only. Eliminates torn-write risk on PROJECT.md
queue sync and reports.json/HTML index regeneration by switching
writeFileSync → atomicWriteSync (tmp+rename).
The companion lock-wrapped-append changes (journal.ts, uok/audit.ts,
workflow-logger.ts) are deferred — they need proper-lockfile +
withFileLockSync helper introduced first.
Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Generalize the code-intelligence hook to support multiple indexer
backends, with sift (rupurt/sift) as a new option next to the existing
project-rag MCP server. Backend is selected via CodebaseMapPreferences.
- code-intelligence.ts: new abstraction + sift backend (detect, resolve,
status, context-block contribution)
- preferences-types.ts: codebaseIndexer field (project-rag | sift | none)
- preferences-validation.ts: validate the new field
- bootstrap/system-context.ts, commands-codebase.ts: dispatch on backend
- tests/code-intelligence.test.ts: sift detection/resolution/status tests
(19 pass, 0 fail)
project-rag path unchanged and continues to work.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SubagentBackgroundJobManager tracks long-running subagent jobs with
status, abort support, and TTL-based eviction of completed results.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Human-oriented documentation of SF capabilities, with a script that
keeps it in sync with workflow-tools.ts and extension manifests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extracting a class method as a bare reference loses its 'this' context,
causing 'Cannot read properties of undefined' when minimax (or any
provider) triggers the flat-rate auth-mode lookup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
buildExecuteTaskPrompt was not passing memoriesSection to loadPrompt,
causing headless auto to fail with a template variable error. Also
updated plan-slice-prompt.test.ts to supply the four template variables
(memoriesSection, runtimeContext, phaseAnchorSection, gatesToClose) that
were missing from the test fixture.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The resolver guarded on context.parentURL.includes('/src/') to identify
in-repo source files, but @google/gemini-cli-core installs to
node_modules/@google/gemini-cli-core/dist/src/ which also contains '/src/'.
Relative imports from that dist package (e.g. './config/config.js') were
incorrectly rewritten to './config/config.ts', causing ERR_MODULE_NOT_FOUND
on every test that transitively imports the google-gemini provider.
Fix: add !context.parentURL.includes('/node_modules/') guard.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
blocked-models.ts (new):
Persistent per-project blocklist at .sf/runtime/blocked-models.json.
loadBlockedModels / isModelBlocked / blockModel (file-lock-safe write).
milestone-summary-classifier.ts (new):
classifyMilestoneSummaryContent → "success" | "failure" | "unknown".
isTerminalMilestoneSummaryContent: failure summaries are NOT terminal —
lets auto-mode re-enter a milestone after a failed recovery summary.
state.ts:
Phase 1 (completeMilestoneIds) and Phase 2 (registry) now check
isTerminalMilestoneSummaryContent before treating a SUMMARY as complete.
A failure SUMMARY no longer prematurely parks a milestone.
error-classifier.ts:
Add "unsupported-model" ErrorClass kind with regex detection
(model + not-supported/unavailable/no-access + account/plan/tier).
Checked before "permanent" so /account/i in PERMANENT_RE doesn't swallow it.
auto-model-selection.ts:
Wire isModelBlocked() gate in selectAndApplyModel candidate loop:
skips provider-rejected models and continues to fallbacks.
bootstrap/agent-end-recovery.ts:
Handle cls.kind === "unsupported-model": blockModel(), try fallback chain
skipping already-blocked models, pause if no usable fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ports commit 7fb35ca58 from gsd2 (PR #4769) covering four issues:
#4761 — resolveCanonicalMilestoneRoot in worktree-manager.ts routes
validate-milestone through the live worktree path instead of stale
project-root state when a milestone worktree is active.
#4762 — auditOrphanedMilestoneBranches in auto-start.ts now surfaces
in-progress milestone branches with unmerged commits ahead of main
(previously only complete milestones were audited). Gated on
isClosedStatus so parked/other closed statuses are unaffected.
#4764 — worktree-telemetry.ts: typed emit helpers (emitWorktreeCreated,
emitWorktreeMerged, emitWorktreeOrphaned, emitAutoExit, emitWorktreeSync,
emitCanonicalRootRedirect, emitSliceMerged, emitMilestoneResquash) plus
summarizeWorktreeTelemetry aggregator and nearest-rank percentile().
Wired in: worktree-resolver.ts (create/merge events), auto-start.ts
(orphan telemetry), auto.ts stopAuto (auto-exit with normalized reason),
worktree-manager.ts (canonical-root-redirect). Surfaced in forensics.ts
via detectWorktreeOrphans and Worktree Telemetry sections.
#4765 — slice-cadence.ts: mergeSliceToMain squash-merges each slice's
commits onto main as soon as the slice passes validation (opt-in via
git.collapse_cadence: "slice"). resquashMilestoneOnMain collapses N
per-slice commits into one milestone commit at completion. Wired in
auto-post-unit.ts (slice merge after complete-slice with stopAuto on
conflict/error) and worktree-resolver.ts (resquash at mergeAndExit).
AutoSession.milestoneStartShas tracks the pre-first-slice SHA.
GitPreferences and preferences-validation.ts extended with
collapse_cadence and milestone_resquash fields.
Also ports /sf scan command: commands-scan.ts with parseScanArgs,
resolveScanDocuments, buildScanOutputPaths, and handleScan dispatching
a focused codebase assessment prompt to .sf/codebase/.
journal.ts: 9 new JournalEventType values for the telemetry events.
All changes are additive; default behavior (cadence="milestone") unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
reassess-roadmap: flip default from true → false. Most reassess units
conclude "roadmap is fine" burning a session for no change; the
plan-slice prompt now carries a JIT preamble at zero cost. (#4778)
tool-execution: always prefer toolDefinition.label when non-empty,
even when label === name — allows tools to display their canonical
name explicitly. (#4758)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds support for project-local SF extension plugins dropped in
.sf/extensions/. Trust-gated (requires pi trust), symlink-escape safe.
- ecosystem/sf-extension-api.ts: SFExtensionAPI wrapper exposing
getPhase() and getActiveUnit() to third-party handlers; updateSnapshot
refreshes state before_agent_start so handlers see current phase/unit
- ecosystem/loader.ts: discovers .sf/extensions/*.js, loads them via
dynamic import, dispatches factory(api) for each
- register-extension.ts: initializes ecosystemHandlers array, wires loader
- register-hooks.ts: before_agent_start refreshes snapshot then dispatches
ecosystem handlers before returning SF system prompt
- types.ts: SFActiveUnit interface (milestoneId/sliceId/taskId + titles)
- workflow-logger.ts: "ecosystem" added to LogComponent union
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes a bug where per-unit tool narrowing poisoned the policy gate for
subsequent units, causing "Model policy denied dispatch before prompt send"
errors on complete-slice and discuss-milestone (100% Win repro).
Four-part port from gsd2@817031b2a:
- ModelPolicyDispatchBlockedError class with per-model deny reasons
- TOOL_BASELINE WeakMap + clearToolBaseline/restoreToolBaseline lifecycle
- auto-model-selection: use getRequiredWorkflowToolsForAutoUnit as requiredTools
- auto/loop: catch ModelPolicyDispatchBlockedError as non-retryable (pause)
- auto.ts: wire clearToolBaseline at startAuto (fresh only) and stopAuto
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8 fixes from 3rd-pass scan:
1. web/components/sf/tempCodeRunnerFile.tsx: remove orphan VS Code
'Code Runner' artifact (850+ lines duplicated from shell-terminal.tsx).
Unreferenced but compiled into tsc project.
2. sf/phase-anchor.ts: writePhaseAnchor used plain writeFileSync — a crash
mid-write would corrupt the handoff checkpoint that readPhaseAnchor then
silently returns null for, losing cross-phase context. Switched to
atomicWriteSync (already used by sibling files).
3. sf/forensics.ts: same non-atomic writeFileSync on active-forensics.json
marker. Race with a concurrent reader produces an empty object and the
forensics session is lost. Switched to atomicWriteSync.
4. web/auto-dashboard-service.ts: paused-session.json existence was the
intended signal but a corrupt body silently dropped the paused flag so
the UI showed active. Now reports paused on file existence regardless
of body integrity, and warns on corruption.
5. sf/visualizer-data.ts: doctor-history.jsonl parser did .map(JSON.parse)
inside an outer catch. One corrupt line discarded 19 valid entries.
Per-line try/catch preserves the valid rows.
6. sf/files.ts: three parseInt calls without radix (step, total_steps,
totalSteps) — also missing || 0 fallback for NaN.
7. cli.ts: parseInt(process.versions.node) without radix. Split on '.' and
use radix 10 explicitly.
8. sf/slice-parallel-orchestrator.ts: silent 'catch {}' around spawn()
masked worker-spawn failures as 'no workers available'. Matches sibling
parallel-orchestrator.ts pattern — now logs via logWarning.
Skipped from the scan (need a real lock mechanism, not safe as a one-line
fix):
- sf/auto-dispatch.ts:164 (UAT counter race)
- sf/captures.ts:107 (CAPTURES.md append race)
Deferred (low-value):
- preferences-models.ts, key-manager.ts, auto-timers.ts silent catches
- dead variable in visualizer-data.ts
- google-gemini-cli.ts maxTokens clamp interaction
tsc --noEmit green at root.
Real bugs from 2nd-pass scan:
1. extension-registry.ts: discoverAllManifests skipped symlinked extension
dirs because Dirent.isDirectory() returns false for symlinks. Dev-workflow
symlinks under ~/.sf/agent/extensions/ were invisible to list/enable/
disable/info. Matches the regression documented in
symlink-extension-discovery.test.ts — the test inlines the correct logic,
but this callsite still had the buggy form. Now accepts isDirectory() ||
isSymbolicLink().
2. headless.ts SIGINT handler: client.stop() failures were double-silenced
(inner .catch(()=>{}), outer try{}catch{}). Interactive mode logs stop
errors to stderr. Restored head/headless parity — still fire-and-forget
(exit code is forced via process.exit) but failures are observable.
3. openai-codex-responses.ts SSE parser: malformed data frames were silently
dropped so broken streams looked identical to clean ones. Now debug-logs
the parse error with the chunk context so broken streams are
distinguishable in logs. Stream continues on bad chunk (one bad frame
shouldn't kill the whole generation).
4. web/cleanup-service.ts generated script: bare 'catch {}' around four native
git calls (nativeBranchList, nativeDetectMainBranch, nativeBranchListMerged,
nativeForEachRef). A failed main-branch detection silently left mainBranch
undefined-shaped, then the next native call operated on garbage. Now emits
console.warn so failures surface in the subprocess log.
5. web/undo-service.ts generated script: git revert failure was silenced;
when --no-commit failed, user saw commitsReverted=0 with no reason. Now
logs the revert error before attempting --abort (abort itself remains
best-effort silent).
False positives from the same scan (investigated and dismissed):
- auto-worktree.ts #2505: code uses ':(exclude).sf/milestones' pathspec +
shelter-and-restore, which is a better fix than the 'drop --include-untracked'
approach the test comment describes. Test comment is stale; source is correct.
- Lifecycle handler unhandled rejections across 5 extensions: extensions/runner.ts
already try/catches handler invocations and routes to emitError. Wrapping the
individual handlers would be redundant.
Build sometimes copies dist/resources/extensions/ without the top-level
markdown files (observed: SF-WORKFLOW.md absent in dist/resources/ while
extensions/ was present). existsSync(distRes) was true either way, so
SF_WORKFLOW_PATH pointed at a non-existent path and /sf failed with ENOENT.
Check for the specific file instead of the directory.
The per-session branded welcome overlay was added by the SF rebrand
(9d739dfa5) as a boxed 'Press any key to continue...' splash shown once
per sf session. In practice: Enter doesn't dismiss it and typing renders
as garbled characters behind the overlay, blocking every TUI launch.
Branding was redundant with the header (installed at session_start) and
the footer (git branch + model). Shortcuts are discoverable via help.
Deleting the overlay eliminates the hang vector entirely.
Legacy-extension migration warnings (migrations.ts 'Press any key...')
are unaffected — those are vendored upstream Pi code on a different
code path and only fire when deprecated extensions are present.
resolveModelId now prefers google-gemini-cli over google (direct API) for
bare Gemini/Gemma IDs, matching the operational default after the CLI-core
re-platform. google-vertex is still honoured when it's the current provider.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
postUnitPreVerification now calls stageOnly() for execute-task units when
action=commit, setting stagedPendingCommit=true and capturing task context.
postUnitPostVerification commits the staged index after the gate passes,
using a conventional-commit message built from the task context. Failure is
non-fatal (logWarning + UI warning). 11 structural tests cover the full
deferral lifecycle.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix 2: verification gate no longer passes when no commands are
configured. Empty-commands result now returns passed=false, skipped=true.
Updated verification-gate.test.ts; added skipped-result guard in
auto-verification.ts that warns and continues (not a hard failure).
Fix 3: split auto-verification.ts try/catch into two zones. Zone 1
(gate machinery: prefs load, task lookup, runVerificationGate,
captureRuntimeErrors, runDependencyAudit) catches → pauseAuto + return
"pause". Zone 2 (ancillary: evidence writes, UOK gate, notifications)
catches → logWarning + return "continue". Added verification-fail-
closed.test.ts with 11 structural tests.
Fix 1 (infrastructure): added stageOnly() + commitStaged() to
GitServiceImpl, added stagedPendingCommit flag to AutoSession (cleared
in reset()), marked the runTurnGitAction call site in
postUnitPreVerification with TODO(fix-1-deferral) for the final wiring.
Fix 4: timeout handler in runFinalize now captures hadStagedPending and
hadCommitted before nulling currentUnit. Clears stagedPendingCommit to
prevent orphaned deferred commits. Emits a diagnostic warning for each
case so operators know whether staged-but-uncommitted changes will be
absorbed or whether a commit landed before verification was skipped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace separate dispatchHeadlessBootstrap with one flow:
- dispatchNewMilestoneDiscuss({ auto }) — auto=true uses headless
prompt + rootFiles seed, no pendingAutoStartMap; auto=false uses
discuss prompt with preparation, sets pendingAutoStartMap
- bootstrapNewMilestone() — project setup + ID reservation, called
directly from bootstrapAutoSession instead of the old wrapper
- injectTodoContext() — reads and deletes todo.md/TODO.md/SPEC.md at
project root, injects content as spec into any preamble; called
identically in auto and interactive flows
Removes dispatchHeadlessBootstrap entirely. auto-start.ts now calls
the primitives directly. All three showWorkflowEntry new-milestone
sites use dispatchNewMilestoneDiscuss({ auto: false }).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Antigravity (Google's IDE sandbox product, different from Gemini CLI) is
removed from:
src/onboarding.ts — drop from LLM_PROVIDER_IDS + OAuth-flow picker
src/pi-migration.ts — drop from LLM_PROVIDER_IDS migration list
src/web/onboarding-service.ts — drop from web-UI provider list
src/tests/integration/web-onboarding-contract.test.ts — update contract
src/resources/extensions/sf/doctor-providers.ts — drop from CLI_AUTH_PROVIDERS
src/resources/extensions/sf/key-manager.ts — drop UI listing
src/resources/extensions/sf-usage-bar/index.ts — delete entire quota fetcher block (~200 lines)
packages/pi-coding-agent/src/cli/args.ts — drop PI_AI_ANTIGRAVITY_VERSION doc
packages/pi-coding-agent/src/utils/proxy-server.ts — drop from claude provider chain
Reason: antigravity has no vendor-published core library we can embed
(unlike @google/gemini-cli-core for the Gemini CLI). Continuing to
hand-roll OAuth against daily-cloudcode-pa.sandbox.googleapis.com is
exactly the pattern Google has started banning for third-party tools.
Removing the code removes the ban risk.
pi-ai provider code, OAuth util, and models.generated entries for
google-antigravity are removed in follow-up commits (separated for
reviewability — each layer verified independently).
Build passes. Note: this is a breaking change for any user who had
google-antigravity configured — they'll need to migrate to
google-gemini-cli (OAuth), google (API key), or google-vertex.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gemini had zero benchmark entries in model-benchmarks.json despite
being served by google-gemini-cli (OAuth provider, SF native), google
(API key), google-vertex, google-antigravity, openrouter, etc. Every
gemini-* model in the pi-ai catalog scored 0 in the benchmark selector
— effectively excluded from auto-selection even when allow-listed.
Published numbers from DeepMind model cards + Vellum LLM leaderboard +
Vals AI:
gemini-3-pro-preview: SWE-Verified 76.2, HLE 37.5, AIME25 95,
GPQA-D 91.9, MMLU-Pro 81.0
gemini-3.1-pro-preview: SWE-Verified 78, HLE 41, AIME 97,
GPQA-D 93, MMLU-Pro 83 (Feb 2026)
gemini-3-flash-preview: estimated from Pro-vs-Flash delta
gemini-2.5-pro: SWE-Verified 63.8, HLE 18.8, GPQA-D 84.0,
MMLU-Pro 86
gemini-2.5-flash: estimated from Pro-vs-Flash delta
Context windows reflect Gemini's 1M-2M token capability.
LiveCodeBench Pro Elo (2439 for Gemini 3 Pro) isn't in the 0-100
percent schema — skipped rather than forced. Future: add arena_elo-
style LCB Elo dimension to the schema if we start routing on it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When two models score identically in the benchmark selector — typically
the same underlying weights served by different endpoints — the
previous alphabetical tiebreaker picked wrong. dr-repo example:
zai/glm-5.1 score 84.7
opencode-go/glm-5.1 score 84.7
Both are the exact same GLM-5.1 weights. Alphabetical comparison made
opencode-go win ("o" < "z") even though zai is the NATIVE provider.
Fix: new `provider_preference` pref, an ordered list of providers.
Listed providers rank in order, unlisted fall after alphabetically.
Applied as the tie-breaker between score and alphabetical.
Global default shipped in ~/.sf/preferences.md:
kimi-coding, minimax, zai, mistral, ollama-cloud, opencode-go,
opencode
Native providers ranked before re-servers. Users can override per
project.
Verified: after the change, dr-repo picks zai/glm-5.1 as primary for
execute-task and gate-evaluate (was opencode-go/glm-5.1), and
kimi-coding/k2p5 stays primary for completion phases with its direct
provider winning over opencode re-servers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The original "normalise by populated weight" was too aggressive: a model
with 1 strong dimension (delta-fast: human_eval=92) outranked a model
with 4 strong dimensions (beta-coder: swe_bench=85, lcb=90, he=95,
ifeval=90) because both normalised to their own small average.
Fix: multiply normalised score by a confidence factor tied to how much
of the unit's profile the model actually populated. Confidence =
populated_weight / total_profile_weight, blended 50/50 with a flat floor
so sparse-but-strong specialists still rank when no generalist covers
the profile:
score = (weighted_sum / weight_total) * (0.5 + 0.5 * confidence)
Net effect on dr-repo's auto-resolve:
Before: After:
plan-milestone glm-5.1 plan-milestone MiniMax-M2.5
research-slice codestral research-slice mistral-large-2411
execute-task mistral-large execute-task opencode-go/glm-5.1
validate-m magistral validate-m MiniMax-M2.5
subagent mistral-large subagent kimi-coding/k2p5
MiniMax's broad coverage (8 populated dimensions from the M2 README)
now correctly outranks GLM-5.1's higher but narrower scores for
reasoning-heavy units. Matches user intuition that "MiniMax is really
powerful".
Also fixes findBenchmarkKey to try "<modelId>-latest" for date-suffixed
model variants — pi-ai catalogs "devstral-medium-2507" but benchmarks
only have "devstral-medium-latest"; matcher now bridges that.
12 regression tests cover:
- empty candidate pool
- each profile (reasoning/coding/lightweight) picks right champion
- swe_bench ↔ swe_bench_verified equivalence
- models with all-null benchmarks score 0 but stay in fallbacks
- sparse-strong beats dense-weak (confirms confidence multiplier
doesn't over-penalise specialists)
- provider diversification in fallback chain
- deterministic tie-breaking
- unknown unit types use default coding profile
- date-suffixed model IDs match family-latest keys
Audit: 41 of 85 allow-listed models in pi-ai catalog have benchmark
data. 44 score 0 (mostly opencode Zen re-served models, ministral
small variants, pixtral vision models, legacy open-mistral). Top
picks for every dr-repo unit type DO have benchmark data — the gap
is in the long tail of fallbacks, which never matter unless the
primary and closer fallbacks all fail.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New module src/resources/extensions/sf/benchmark-selector.ts implements
benchmark-driven model selection. When models.<unit> is not pinned,
preferences-models.ts falls through to pick the highest-scoring
candidate from allowed_providers × pi-ai's model catalog, ranked
against a per-unit-type weight profile.
Weight profiles per unit type:
plan-milestone / plan-slice → agent-planning (swe_bench .25, lcb
.20, hle .15, gpqa .15, mmlu_pro .15,
aime .10)
research-* → mixed (mmlu_pro, hle, human_eval,
browse_comp, simple_qa, gpqa)
execute-task → coding (swe_bench .35, swe_bench_v
.25, lcb .20, human_eval .15)
execution_simple / complete-* → fast+correct (human_eval .40,
instruction_following .35, ruler .25)
gate-evaluate → review (swe_bench .30, hle .25,
gpqa .25, ifeval .20)
validate-milestone → validation (hle .30, gpqa .25,
mmlu_pro .25, swe_bench .20)
Key design decisions:
- Missing dimensions are dropped (normalised by populated weight),
so a model with 2 strong populated scores isn't crushed by a peer
with 5 mediocre ones.
- swe_bench ↔ swe_bench_verified are fungible — some vendors publish
one, some the other; treat as equivalent.
- Provider diversification in fallbacks so one provider going 429
doesn't kill the whole chain.
- Score ties broken by coverage, then lexical — deterministic.
Also updates MiniMax-M2/M2.5/M2.7 benchmarks with real numbers from
the M2 official README (DeepWiki sourced) and MiniMax-M2.5 card
(minimax.io): swe_bench_verified 69.4→80.2, LCB 83, HLE 31.8 (w/
tools — more representative for agent work than no-tools 12.5),
AIME25 78, GPQA-D 78, MMLU-Pro 82. Context windows bumped to
weights-level: M2 400K, M2.5/M2.7 1M (endpoints may cap lower).
Verified end-to-end: with dr-repo's allow-list
(kimi-coding/minimax/zai/opencode-go/mistral) and models.* absent,
resolveModelWithFallbacksForUnit() returns:
plan-milestone → opencode-go/glm-5.1 (+3 fallbacks)
research-slice → mistral/codestral-latest
execute-task → mistral/mistral-large-latest
execution_simple → kimi-coding/k2p5
gate-evaluate → opencode-go/glm-5.1
validate-milestone → mistral/magistral-medium-latest
subagent → mistral/mistral-large-latest
Users can still pin individual units (existing models.* behaviour
unchanged) or rely fully on auto-selection by omitting them.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four related improvements that landed in the working tree after the
auto-hardening merge but hadn't been committed:
1. auth_error as a distinct error type (auth-storage + retry-handler).
Previously invalid/expired API keys would retry the same failing
credential until the retry budget exhausted. Now:
- classifyErrorType() recognizes 401s, "invalid api key",
"authentication error", "unauthorized" etc as "auth_error"
- RetryHandler triggers cross-provider fallback on auth_error just
like it does for rate_limit and quota_exhausted — switch
providers rather than burning retries on a broken key
Outcome: a stale OPENCODE_API_KEY in sops now fails over to kimi or
minimax immediately instead of stalling the unit.
2. Multi-provider search-key detection (native-search.ts).
The "Web search: Set BRAVE_API_KEY" warning fired whenever a
non-Anthropic model lacked BRAVE_API_KEY, even when the user had
TAVILY_API_KEY or OLLAMA_API_KEY available. Now: the warning
suppresses if any of BRAVE/TAVILY/OLLAMA keys is present, and the
warning text lists all three options. Matches the preferences-
validation allow-list for search_provider.
3. MiniMax-M2.7-highspeed benchmark entry (model-benchmarks.json).
Routes the fast-tier variant of M2.7 through the Bayesian blender
with inherited RULER scores. Lets dynamic routing consider the
highspeed model when speed matters more than peak quality.
No regressions: the 41 pre-existing test failures in pi-coding-agent
(FallbackResolver chain-membership + LSP integration) are unchanged
relative to the prior commit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>