New module src/resources/extensions/sf/benchmark-selector.ts implements
benchmark-driven model selection. When models.<unit> is not pinned,
preferences-models.ts falls through to pick the highest-scoring
candidate from allowed_providers × pi-ai's model catalog, ranked
against a per-unit-type weight profile.
Weight profiles per unit type:
- plan-milestone / plan-slice → agent-planning (swe_bench .25, lcb .20, hle .15, gpqa .15, mmlu_pro .15, aime .10)
- research-* → mixed (mmlu_pro, hle, human_eval, browse_comp, simple_qa, gpqa)
- execute-task → coding (swe_bench .35, swe_bench_verified .25, lcb .20, human_eval .15)
- execution_simple / complete-* → fast+correct (human_eval .40, instruction_following .35, ruler .25)
- gate-evaluate → review (swe_bench .30, hle .25, gpqa .25, ifeval .20)
- validate-milestone → validation (hle .30, gpqa .25, mmlu_pro .25, swe_bench .20)
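The fall-through selection can be sketched roughly as follows. This is an illustrative sketch, not the module's actual API: `pickModel`, the catalog shape, and the precomputed `score` field are assumptions based on the description above (the real module computes the score from the unit's weight profile).

```typescript
// Hypothetical sketch of the models.<unit> fall-through.
interface CatalogModel {
  provider: string;
  id: string;
  score: number; // weighted benchmark score for the unit's profile
}

function pickModel(
  pinned: string | undefined,  // models.<unit>, if the user set it
  allowedProviders: string[],  // allowed_providers
  catalog: CatalogModel[],     // pi-ai's model catalog
): string | undefined {
  if (pinned !== undefined) return pinned; // an explicit pin always wins
  const best = catalog
    .filter((m) => allowedProviders.includes(m.provider))
    .sort((a, b) => b.score - a.score)[0];
  return best ? `${best.provider}/${best.id}` : undefined;
}
```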
Key design decisions:
- Missing dimensions are dropped (normalised by populated weight),
so a model with 2 strong populated scores isn't crushed by a peer
with 5 mediocre ones.
- swe_bench ↔ swe_bench_verified are fungible — some vendors publish
one, some the other; treat as equivalent.
- Fallback chains are diversified across providers, so one provider
  returning 429s doesn't kill the whole chain.
- Score ties are broken by coverage, then lexically, so selection is
  deterministic.
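A minimal sketch of the scoring and tie-break rules above (type and function names are illustrative, not the module's real exports):

```typescript
type Weights = Record<string, number>;
type BenchmarkScores = Record<string, number | undefined>;

// swe_bench and swe_bench_verified are treated as fungible: fall back
// to whichever one the vendor published.
function lookup(scores: BenchmarkScores, dim: string): number | undefined {
  if (scores[dim] !== undefined) return scores[dim];
  if (dim === "swe_bench") return scores["swe_bench_verified"];
  if (dim === "swe_bench_verified") return scores["swe_bench"];
  return undefined;
}

// Missing dimensions are dropped and the remaining weights renormalised,
// so a model with two strong populated scores is compared fairly against
// a peer with five mediocre ones.
function scoreModel(
  weights: Weights,
  scores: BenchmarkScores,
): { score: number; coverage: number } {
  let weighted = 0;
  let populatedWeight = 0;
  let coverage = 0;
  for (const [dim, w] of Object.entries(weights)) {
    const s = lookup(scores, dim);
    if (s === undefined) continue; // dropped dimension
    weighted += w * s;
    populatedWeight += w;
    coverage++;
  }
  return { score: populatedWeight > 0 ? weighted / populatedWeight : 0, coverage };
}

// Ties broken by coverage, then lexically by id: deterministic ordering.
function compare(
  a: { id: string; score: number; coverage: number },
  b: { id: string; score: number; coverage: number },
): number {
  if (a.score !== b.score) return b.score - a.score;
  if (a.coverage !== b.coverage) return b.coverage - a.coverage;
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}
```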
Also updates MiniMax-M2/M2.5/M2.7 benchmarks with real numbers from
the M2 official README (DeepWiki sourced) and MiniMax-M2.5 card
(minimax.io): swe_bench_verified 69.4→80.2, LCB 83, HLE 31.8 (w/
tools — more representative for agent work than no-tools 12.5),
AIME25 78, GPQA-D 78, MMLU-Pro 82. Context windows bumped to
weights-level: M2 400K, M2.5/M2.7 1M (endpoints may cap lower).
Verified end-to-end: with dr-repo's allow-list
(kimi-coding/minimax/zai/opencode-go/mistral) and models.* absent,
resolveModelWithFallbacksForUnit() returns:
plan-milestone → opencode-go/glm-5.1 (+3 fallbacks)
research-slice → mistral/codestral-latest
execute-task → mistral/mistral-large-latest
execution_simple → kimi-coding/k2p5
gate-evaluate → opencode-go/glm-5.1
validate-milestone → mistral/magistral-medium-latest
subagent → mistral/mistral-large-latest
Users can still pin individual units (existing models.* behaviour
unchanged) or rely fully on auto-selection by omitting them.
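Illustratively, a partial pin looks like this. The exact config schema is an assumption here; only the key names (allowed_providers, models.*) come from this change:

```typescript
// Hypothetical config shape: pin one unit, auto-select the rest.
const config = {
  allowed_providers: ["kimi-coding", "minimax", "zai", "opencode-go", "mistral"],
  models: {
    // Only this unit is pinned; all other units fall through to
    // benchmark-driven auto-selection.
    "execute-task": "mistral/mistral-large-latest",
  } as Record<string, string>,
};
```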
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>