singularity-forge/src/resources/extensions
Mikael Hugo 0b8a1c246f auto-benchmark model selection: pick best-scoring per unit type
New module src/resources/extensions/sf/benchmark-selector.ts implements
benchmark-driven model selection. When models.<unit> is not pinned,
preferences-models.ts falls through to pick the highest-scoring
candidate from allowed_providers × pi-ai's model catalog, ranked
against a per-unit-type weight profile.
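
The fall-through described above could look roughly like the sketch below. Every name in it (`Preferences`, `resolveModel`, the `autoSelect` callback) is hypothetical and chosen for illustration; the actual code in preferences-models.ts may be structured differently.

```typescript
// Hypothetical types and names; not the real preferences-models.ts API.
type UnitType = string;

interface Preferences {
  // Optional per-unit pins, e.g. { "execute-task": "mistral/mistral-large-latest" }.
  models?: Record<UnitType, string>;
}

// Returns the pinned model when one exists, otherwise defers to auto-selection.
function resolveModel(
  prefs: Preferences,
  unit: UnitType,
  autoSelect: (unit: UnitType) => string | undefined,
): string | undefined {
  const pinned = prefs.models?.[unit];
  if (pinned !== undefined) return pinned; // explicit pin always wins
  return autoSelect(unit); // no pin: benchmark-driven choice
}
```

Keeping the pin check first means existing configurations behave exactly as before, and auto-selection only applies where the user has said nothing.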

Weight profiles per unit type:
  plan-milestone / plan-slice  → agent-planning (swe_bench .25, lcb
                                  .20, hle .15, gpqa .15, mmlu_pro .15,
                                  aime .10)
  research-*                    → mixed (mmlu_pro, hle, human_eval,
                                  browse_comp, simple_qa, gpqa)
  execute-task                  → coding (swe_bench .35, swe_bench_v
                                  .25, lcb .20, human_eval .15)
  execution_simple / complete-* → fast+correct (human_eval .40,
                                  instruction_following .35, ruler .25)
  gate-evaluate                 → review (swe_bench .30, hle .25,
                                  gpqa .25, ifeval .20)
  validate-milestone            → validation (hle .30, gpqa .25,
                                  mmlu_pro .25, swe_bench .20)
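
As a rough illustration, the profiles with fully listed weights could be held as plain data. The weights below are copied from the list above; the object layout and the names `PROFILES`, `UNIT_TO_PROFILE`, `profileFor`, and `totalWeight` are assumptions, not the module's actual exports. The research profile is omitted because its weights are not listed.

```typescript
// Hypothetical layout; weights are taken from the commit message.
type WeightProfile = Record<string, number>;

const PROFILES: Record<string, WeightProfile> = {
  "agent-planning": {
    swe_bench: 0.25, lcb: 0.2, hle: 0.15, gpqa: 0.15, mmlu_pro: 0.15, aime: 0.1,
  },
  review: { swe_bench: 0.3, hle: 0.25, gpqa: 0.25, ifeval: 0.2 },
  validation: { hle: 0.3, gpqa: 0.25, mmlu_pro: 0.25, swe_bench: 0.2 },
};

const UNIT_TO_PROFILE: Record<string, string> = {
  "plan-milestone": "agent-planning",
  "plan-slice": "agent-planning",
  "gate-evaluate": "review",
  "validate-milestone": "validation",
};

// Looks up the weight profile for a unit type, if one is defined.
function profileFor(unit: string): WeightProfile | undefined {
  const name = UNIT_TO_PROFILE[unit];
  return name === undefined ? undefined : PROFILES[name];
}

// Sum of all weights in a profile; useful as a sanity check.
function totalWeight(profile: WeightProfile): number {
  return Object.values(profile).reduce((sum, w) => sum + w, 0);
}
```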

Key design decisions:
  - Missing dimensions are dropped (normalised by populated weight),
    so a model with 2 strong populated scores isn't crushed by a peer
    with 5 mediocre ones.
  - swe_bench and swe_bench_verified are treated as interchangeable:
    some vendors publish one, some the other.
  - Fallbacks are diversified across providers, so one provider
    returning HTTP 429 doesn't kill the whole chain.
  - Score ties are broken by coverage, then lexical order, so
    selection is deterministic.
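
The four decisions above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code in benchmark-selector.ts: the types, the function names, the choice of `provider/id` as the lexical tie-break key, and the "at most one model per provider" reading of diversification are all hypothetical.

```typescript
// Hypothetical shapes; the real benchmark-selector.ts may differ.
type Scores = Record<string, number | undefined>;
interface Candidate { provider: string; id: string; scores: Scores }
interface Ranked { candidate: Candidate; score: number; coverage: number }

// swe_bench and swe_bench_verified stand in for each other.
const ALIASES: Record<string, string> = {
  swe_bench: "swe_bench_verified",
  swe_bench_verified: "swe_bench",
};

function lookup(scores: Scores, dim: string): number | undefined {
  const direct = scores[dim];
  if (direct !== undefined) return direct;
  const alias = ALIASES[dim];
  return alias === undefined ? undefined : scores[alias];
}

// Weighted mean over populated dimensions only; missing ones are dropped.
function scoreCandidate(c: Candidate, profile: Record<string, number>): Ranked {
  let weighted = 0;
  let populated = 0;
  let coverage = 0;
  for (const [dim, weight] of Object.entries(profile)) {
    const value = lookup(c.scores, dim);
    if (value === undefined) continue;
    weighted += weight * value;
    populated += weight;
    coverage += 1;
  }
  const score = populated > 0 ? weighted / populated : 0;
  return { candidate: c, score, coverage };
}

// Sort by score, then coverage, then "provider/id" (assumed key), all deterministic.
function rank(cands: Candidate[], profile: Record<string, number>): Ranked[] {
  const key = (r: Ranked) => `${r.candidate.provider}/${r.candidate.id}`;
  return cands
    .map((c) => scoreCandidate(c, profile))
    .sort((a, b) => {
      const ka = key(a);
      const kb = key(b);
      return b.score - a.score || b.coverage - a.coverage ||
        (ka < kb ? -1 : ka > kb ? 1 : 0);
    });
}

// Primary plus up to maxFallbacks entries, at most one per provider (assumption).
function withFallbacks(ranked: Ranked[], maxFallbacks: number): Candidate[] {
  const chain: Candidate[] = [];
  const seen = new Set<string>();
  for (const r of ranked) {
    if (chain.length > maxFallbacks) break;
    if (seen.has(r.candidate.provider)) continue;
    seen.add(r.candidate.provider);
    chain.push(r.candidate);
  }
  return chain;
}
```

Dividing by the populated weight rather than the full profile weight is what keeps a sparsely benchmarked model comparable to a fully benchmarked one; the coverage tie-break then favours the better-documented model when the scores are equal.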

Also updates MiniMax-M2/M2.5/M2.7 benchmarks with real numbers from
the M2 official README (DeepWiki sourced) and MiniMax-M2.5 card
(minimax.io): swe_bench_verified 69.4→80.2, LCB 83, HLE 31.8 (w/
tools — more representative for agent work than no-tools 12.5),
AIME25 78, GPQA-D 78, MMLU-Pro 82. Context windows raised to what the
weights support: M2 400K, M2.5/M2.7 1M (endpoints may cap lower).

Verified end-to-end: with dr-repo's allow-list
(kimi-coding/minimax/zai/opencode-go/mistral) and models.* absent,
resolveModelWithFallbacksForUnit() returns:
  plan-milestone     → opencode-go/glm-5.1 (+3 fallbacks)
  research-slice     → mistral/codestral-latest
  execute-task       → mistral/mistral-large-latest
  execution_simple   → kimi-coding/k2p5
  gate-evaluate      → opencode-go/glm-5.1
  validate-milestone → mistral/magistral-medium-latest
  subagent           → mistral/mistral-large-latest

Users can still pin individual units (existing models.* behaviour
unchanged) or rely fully on auto-selection by omitting them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 09:43:26 +02:00
async-jobs Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
aws-auth Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
bg-shell Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
browser-tools Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
claude-code-cli fix(pi-ai): wire thinking:{type} field and extend adaptive-thinking model coverage (#4392) 2026-04-18 13:38:30 +02:00
cmux Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
context7 Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
genai-proxy Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
github-sync Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
google-search Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
guardrails Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
mac-tools Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
mcp-client preferences + mcp-client: resolve from main worktree and add global MCP config 2026-04-19 08:53:27 +02:00
ollama ollama: make extension opt-in via OLLAMA_HOST 2026-04-19 05:53:45 +02:00
remote-questions Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
search-the-web core + search + benchmarks: auth-error recovery, multi-provider search, M2.7-highspeed entry 2026-04-19 09:24:54 +02:00
sf auto-benchmark model selection: pick best-scoring per unit type 2026-04-19 09:43:26 +02:00
sf-notify Fix all 26 failing tests (22 rebrand artifacts + 4 RTK seam bugs) 2026-04-18 13:07:09 +02:00
sf-permissions sf-tui + sf-permissions: gate footer-indicator side-effects on ctx.hasUI 2026-04-19 07:59:36 +02:00
sf-tui sf-tui + sf-permissions: gate footer-indicator side-effects on ctx.hasUI 2026-04-19 07:59:36 +02:00
sf-usage-bar Fix all 26 failing tests (22 rebrand artifacts + 4 RTK seam bugs) 2026-04-18 13:07:09 +02:00
shared Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
slash-commands Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
subagent subagent: add per-call model override (Phase 1 of skill dispatch) 2026-04-19 05:22:07 +02:00
ttsr Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
universal-config Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
voice Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
ask-user-questions.ts Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
get-secrets-from-user.ts Rename @sf-run/* → @singularity-forge/* package scope 2026-04-15 22:56:33 +02:00
package.json Improve startup performance with lazy extension loading (#1336) 2026-03-19 07:38:50 -06:00