New module src/resources/extensions/sf/benchmark-selector.ts implements
benchmark-driven model selection. When models.<unit> is not pinned,
preferences-models.ts falls through to pick the highest-scoring
candidate from allowed_providers × pi-ai's model catalog, ranked
against a per-unit-type weight profile.
Weight profiles per unit type:
- plan-milestone / plan-slice → agent-planning (swe_bench .25, lcb .20, hle .15, gpqa .15, mmlu_pro .15, aime .10)
- research-* → mixed (mmlu_pro, hle, human_eval, browse_comp, simple_qa, gpqa)
- execute-task → coding (swe_bench .35, swe_bench_verified .25, lcb .20, human_eval .15)
- execution_simple / complete-* → fast+correct (human_eval .40, instruction_following .35, ruler .25)
- gate-evaluate → review (swe_bench .30, hle .25, gpqa .25, ifeval .20)
- validate-milestone → validation (hle .30, gpqa .25, mmlu_pro .25, swe_bench .20)
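The fall-through selection can be sketched roughly as follows. This is an illustrative sketch, not the module's actual API: `pickModel`, the catalog shape, and the precomputed `score` field are assumptions based on the description above (the real module computes the score from the unit's weight profile).

```typescript
// Hypothetical sketch of the models.<unit> fall-through.
interface CatalogModel {
  provider: string;
  id: string;
  score: number; // weighted benchmark score for the unit's profile
}

function pickModel(
  pinned: string | undefined,  // models.<unit>, if the user set it
  allowedProviders: string[],  // allowed_providers
  catalog: CatalogModel[],     // pi-ai's model catalog
): string | undefined {
  if (pinned !== undefined) return pinned; // an explicit pin always wins
  const best = catalog
    .filter((m) => allowedProviders.includes(m.provider))
    .sort((a, b) => b.score - a.score)[0];
  return best ? `${best.provider}/${best.id}` : undefined;
}
```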
Key design decisions:
- Missing dimensions are dropped (normalised by populated weight),
so a model with 2 strong populated scores isn't crushed by a peer
with 5 mediocre ones.
- swe_bench ↔ swe_bench_verified are fungible — some vendors publish
one, some the other; treat as equivalent.
- Fallback chains are diversified across providers, so one provider
  returning 429s doesn't kill the whole chain.
- Score ties are broken by coverage, then lexically, so selection is
  deterministic.
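A minimal sketch of the scoring and tie-break rules above (type and function names are illustrative, not the module's real exports):

```typescript
type Weights = Record<string, number>;
type BenchmarkScores = Record<string, number | undefined>;

// swe_bench and swe_bench_verified are treated as fungible: fall back
// to whichever one the vendor published.
function lookup(scores: BenchmarkScores, dim: string): number | undefined {
  if (scores[dim] !== undefined) return scores[dim];
  if (dim === "swe_bench") return scores["swe_bench_verified"];
  if (dim === "swe_bench_verified") return scores["swe_bench"];
  return undefined;
}

// Missing dimensions are dropped and the remaining weights renormalised,
// so a model with two strong populated scores is compared fairly against
// a peer with five mediocre ones.
function scoreModel(
  weights: Weights,
  scores: BenchmarkScores,
): { score: number; coverage: number } {
  let weighted = 0;
  let populatedWeight = 0;
  let coverage = 0;
  for (const [dim, w] of Object.entries(weights)) {
    const s = lookup(scores, dim);
    if (s === undefined) continue; // dropped dimension
    weighted += w * s;
    populatedWeight += w;
    coverage++;
  }
  return { score: populatedWeight > 0 ? weighted / populatedWeight : 0, coverage };
}

// Ties broken by coverage, then lexically by id: deterministic ordering.
function compare(
  a: { id: string; score: number; coverage: number },
  b: { id: string; score: number; coverage: number },
): number {
  if (a.score !== b.score) return b.score - a.score;
  if (a.coverage !== b.coverage) return b.coverage - a.coverage;
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}
```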
Also updates MiniMax-M2/M2.5/M2.7 benchmarks with real numbers from
the M2 official README (DeepWiki sourced) and MiniMax-M2.5 card
(minimax.io): swe_bench_verified 69.4→80.2, LCB 83, HLE 31.8 (w/
tools — more representative for agent work than no-tools 12.5),
AIME25 78, GPQA-D 78, MMLU-Pro 82. Context windows bumped to
weights-level: M2 400K, M2.5/M2.7 1M (endpoints may cap lower).
Verified end-to-end: with dr-repo's allow-list
(kimi-coding/minimax/zai/opencode-go/mistral) and models.* absent,
resolveModelWithFallbacksForUnit() returns:
plan-milestone → opencode-go/glm-5.1 (+3 fallbacks)
research-slice → mistral/codestral-latest
execute-task → mistral/mistral-large-latest
execution_simple → kimi-coding/k2p5
gate-evaluate → opencode-go/glm-5.1
validate-milestone → mistral/magistral-medium-latest
subagent → mistral/mistral-large-latest
Users can still pin individual units (existing models.* behaviour
unchanged) or rely fully on auto-selection by omitting them.
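Illustratively, a partial pin looks like this. The exact config schema is an assumption here; only the key names (allowed_providers, models.*) come from this change:

```typescript
// Hypothetical config shape: pin one unit, auto-select the rest.
const config = {
  allowed_providers: ["kimi-coding", "minimax", "zai", "opencode-go", "mistral"],
  models: {
    // Only this unit is pinned; all other units fall through to
    // benchmark-driven auto-selection.
    "execute-task": "mistral/mistral-large-latest",
  } as Record<string, string>,
};
```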
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>