Commit graph

2509 commits

Author SHA1 Message Date
Mikael Hugo
a8cf2cd941 feat(workflow): add product-audit (slim port)
Milestone-end workflow that compares declared product intent (VISION.md,
RUNBOOKS.md, etc.) against actual code/test/deploy/docs evidence and
emits structured gaps with severity. Soft gates — adds follow-up slices
but doesn't hard-block merge.

Slim port (4 new files + 1 registration) — extracts only the audit
feature itself, not bunker's parallel rewrite of dispatch/prompts/
benchmark-selector that came with it in commit 2aa785475.

Created:
- prompts/product-audit.md         — prompt verbatim, gsd_*→sf_* and .gsd→.sf
- tools/product-audit-tool.ts      — slim file-write implementation,
                                     atomicWriteAsync to .sf/active/{mid}/
                                     PRODUCT-AUDIT.{json,md}; no DB deps
- bootstrap/product-audit-tool.ts  — pi-coding-agent tool registration,
                                     TypeBox schema for sf_product_audit
- workflow-templates/product-audit.md — workflow template

Modified:
- bootstrap/register-extension.ts  — 2 lines: import + add to nonCriticalRegistrations
- workflow-templates/registry.json — registry entry
- package.json — version 2.75.0 → 2.75.1

Verdict logic (no-gaps | gaps-found | contract-underspecified) is the
load-bearing innovation: contract-underspecified forces the auditor to
flag unverifiable docs as a real gap rather than rubber-stamping
no-gaps when the product contract is silent.

Out of scope: phase enum changes, dispatch hookup. Wire-up to the phase
machine is a follow-up; the prompt + tool + template stand alone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 13:55:23 +02:00
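The three-way verdict described in a8cf2cd941 can be illustrated with a minimal sketch. The type names, the `verifiableClaims` counter, and the decision order are assumptions made for illustration; the actual schema of `sf_product_audit` is not shown in this log.

```typescript
// Hypothetical shapes for illustration only.
type AuditVerdict = "no-gaps" | "gaps-found" | "contract-underspecified";

interface AuditGap {
  area: string;
  severity: "low" | "medium" | "high";
}

interface AuditInput {
  // Number of claims in the product docs that can be checked against evidence.
  verifiableClaims: number;
  gaps: AuditGap[];
}

function decideAuditVerdict(input: AuditInput): AuditVerdict {
  // A silent contract cannot be audited, so it must not be reported as clean.
  if (input.verifiableClaims === 0) return "contract-underspecified";
  return input.gaps.length > 0 ? "gaps-found" : "no-gaps";
}
```

The point of the third value is that "nothing to check" and "checked, nothing wrong" stay distinguishable in the output.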
Mikael Hugo
2eebeccb93 feat(search): add MiniMax web search provider
New search backend alongside tavily/brave/serper/exa/ollama. API key
resolution: MINIMAX_CODE_PLAN_KEY → MINIMAX_CODING_API_KEY →
MINIMAX_API_KEY (fallback order matches MiniMax's documented aliases).

Wired through every existing seam:
- type union: SearchProvider = 'tavily' | 'minimax' | 'brave' | 'ollama'
- VALID_PREFERENCES set + selection logic in provider.ts
- native-search routing (Anthropic native web_search delegates correctly)
- /search-provider CLI command (tab completion, select UI, parser)
- tool-search.ts: search execution path
- tool-llm-context.ts: prefetch / context-builder path
- preferences-types + preferences-validation
- configuration.md user docs
- extension-manifest description

Tests not added in this commit — the bunker reference tests don't match
our preferences/provider export shape (we have serper/exa/combosearch
that bunker doesn't). Tests for getMiniMaxSearchApiKey priority order,
resolveSearchProvider returning "minimax", /search-provider minimax CLI
behavior, no-key error messages, and executeMiniMaxSearch request shape
are TODO.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 13:55:04 +02:00
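The key fallback order listed in 2eebeccb93 could look roughly like the sketch below. `pickMiniMaxKey` and `EnvLike` are hypothetical names, not the commit's `getMiniMaxSearchApiKey`, and treating blank values as unset is an assumption.

```typescript
type EnvLike = Record<string, string | undefined>;

// Highest-priority variable first, as stated in the commit message.
const MINIMAX_KEY_ORDER = [
  "MINIMAX_CODE_PLAN_KEY",
  "MINIMAX_CODING_API_KEY",
  "MINIMAX_API_KEY",
] as const;

function pickMiniMaxKey(env: EnvLike): string | undefined {
  for (const varName of MINIMAX_KEY_ORDER) {
    const value = env[varName]?.trim();
    // Assumption: empty or whitespace-only values count as "not set".
    if (value) return value;
  }
  return undefined;
}
```

Taking the environment as a parameter rather than reading `process.env` directly keeps the priority order easy to unit-test, which is one of the TODO items the commit lists.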
Mikael Hugo
dff0df5fdc fix(headless): suppress notification spam, categorize messages, distinguish phase vs status
Three small UX fixes for headless / autopilot logs:

1. Add `zz-notifications` to TUI_FOOTER_STATUS_KEYS — these are sticky
   notification dots from the interactive TUI footer; they have no
   meaning in headless and were spamming the log.

2. Categorize notification messages by prefix so headless output is
   scannable: [mcp] for MCP-client-ready, [search] for web search status,
   [parallel] for slice-parallel/subagent dispatch. Falls through to
   the existing important/non-important formatting for everything else.

3. Distinguish phase transitions from generic status updates: phase:/
   milestone:/slice:/task: prefixed keys get [phase]; everything else
   gets [status]. Previously both used [phase], which was misleading.

Patterns based on bunker commits 14ec4d97f / c15afb45f (which were the
research source) but written fresh against our existing
TUI_FOOTER_STATUS_KEYS structure rather than cherry-picked.

The assistant-text-preview commit (cf0274c63) is a separate, larger
refactor in headless.ts and is deferred to v3.1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 13:43:40 +02:00
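Fixes 1 and 3 of dff0df5fdc amount to a small key-based filter and tagger. The sketch below assumes a hypothetical `formatStatusLine` helper and output format; only the key prefixes, the `zz-notifications` key, and the `[phase]` / `[status]` tags come from the commit message.

```typescript
const PHASE_KEY_PREFIXES = ["phase:", "milestone:", "slice:", "task:"];

// Keys that only make sense in the interactive TUI footer.
const HIDDEN_STATUS_KEYS = new Set(["zz-notifications"]);

function formatStatusLine(key: string, text: string): string | null {
  // Suppress footer-only keys entirely in headless output.
  if (HIDDEN_STATUS_KEYS.has(key)) return null;
  const isPhase = PHASE_KEY_PREFIXES.some((prefix) => key.startsWith(prefix));
  const tag = isPhase ? "[phase]" : "[status]";
  return `${tag} ${key}: ${text}`;
}
```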
Mikael Hugo
c41912ff55 fix(prompts): tell agents about Serena (repo-intelligence MCP) for code exploration
We have .serena/ configured (cache, memories, project.local.yml) but no
prompt mentioned Serena anywhere. Agents weren't using it for symbol
lookup or cross-file architecture mapping; they fell straight to rg/find.

Added a one-sentence Serena hint to the code-exploration step in:
- research-slice.md
- research-milestone.md
- plan-slice.md
- plan-milestone.md
- guided-research-slice.md

Phrased generically ("If a repo-intelligence MCP (e.g. Serena) is
configured...") so it degrades cleanly when Serena isn't set up.

Pattern based on bunker commit 4ba746888 but written fresh against our
post-rename prompt structure rather than cherry-picked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 13:41:33 +02:00
Mikael Hugo
b24f426f2b batch: snapshot of in-flight v2 work
This commit captures uncommitted modifications that accumulated in the
working tree across multiple in-progress workstreams. It is a snapshot
to clear the deck before sf v3 work begins; individual workstreams
should land separately on top of this.

Notable additions:
- trace-collector.ts, traces.ts, src/tests/trace-export.test.ts —
  trace export plumbing
- biome.json — Biome linter configuration
- .gitignore — exclude native/npm/**/*.node compiled binaries

The bulk of the diff is across src/resources/extensions/sf/ (301 files)
and src/resources/extensions/sf/tests/ (277 files), reflecting the
ongoing sf extension work. Specific feature commits should follow this
snapshot rather than being archaeology'd out of it.

The 76MB native/npm/linux-x64-gnu/forge_engine.node compiled binary
was left out of the commit — it's now gitignored and built locally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 12:42:31 +02:00
Mikael Hugo
6eaf5926ad sf snapshot: uncommitted changes after 248m inactivity
2026-04-28 21:10:17 +02:00
Mikael Hugo
d30d91bf2f sf snapshot: uncommitted changes after 41m inactivity
2026-04-28 17:01:26 +02:00
Mikael Hugo
5d3c204006 fix(git-merge): no auto-flip from approved to declined; cached approval is sticky
Codex-rescue output (a299c461 / bnr88iy59) — the 'Git merge approved once'
followed seconds later by 'Git merge declined by user' bug we hit on
M002 complete-milestone. Same gate, same agent run, opposite verdicts.

Single source of truth for the merge-gate state in guardrails/index.ts.
Approval is now sticky — re-asks return the cached approval until consumed
or explicitly revoked, never auto-flip to decline. Timeout converts to
pause+log instead of decline. Adds tests/safe-git-merge-gate.test.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: OpenAI Codex <noreply@openai.com>
2026-04-28 16:20:08 +02:00
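The sticky-approval behaviour described in 5d3c204006 can be modelled as a tiny state holder. This is a sketch under assumptions: the class, its method names, and the state names are hypothetical, and the real gate in guardrails/index.ts is not shown in this log.

```typescript
type MergeGateState = "unasked" | "approved" | "paused";

class MergeGateSketch {
  private current: MergeGateState = "unasked";

  approve(): void {
    this.current = "approved";
  }

  // Explicit revocation is the only way to drop an approval without using it.
  revoke(): void {
    this.current = "unasked";
  }

  // A timeout pauses instead of declining, and never overrides an approval.
  onTimeout(): void {
    if (this.current !== "approved") this.current = "paused";
  }

  // Re-asking returns the cached answer without changing it.
  peek(): MergeGateState {
    return this.current;
  }

  // Consuming uses up the approval exactly once.
  consume(): boolean {
    const ok = this.current === "approved";
    if (ok) this.current = "unasked";
    return ok;
  }
}
```

Keeping the state in one object is what rules out the "approved, then declined seconds later" sequence: no code path moves from approved to a negative verdict on its own.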
Mikael Hugo
d38e5ea092 fix(schema): auto-coerce string → [string] for sf_* list fields + provider_model_allow tests
Two codex-rescue tasks landed together:

1. Auto-coerce JSON-schema validator: when a tool field declares
   {type:"array", items:{type:"string"}} and the model sends a single
   string, wrap it in [string] before validation instead of hard-rejecting.
   Fixes the recurring "keyDecisions: must be array" rejection on
   sf_complete_task that wasted retries.

2. provider_model_allow filter (proper implementation with helpers):
   - resolveProviderModelAllowList / isProviderModelAllowed /
     filterModelsByProviderModelAllow helpers in preferences-models
   - Wired into model-registry and auto-model-selection
   - New tests/provider-model-allow.test.ts

Tools coerced: sf_complete_task, sf_complete_milestone, sf_plan_milestone,
sf_plan_slice, sf_replan_slice, sf_reassess_roadmap (key list fields).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: OpenAI Codex <noreply@openai.com>
2026-04-28 12:30:55 +02:00
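The string → [string] coercion from d38e5ea092 can be sketched as a pre-validation pass over the tool arguments. `FieldSchema` and `coerceStringLists` are hypothetical names, and the sketch only handles top-level fields.

```typescript
interface FieldSchema {
  type: string;
  items?: { type: string };
}

function coerceStringLists(
  schema: Record<string, FieldSchema>,
  args: Record<string, unknown>,
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...args };
  for (const field of Object.keys(schema)) {
    const spec = schema[field];
    const wantsStringList =
      spec !== undefined &&
      spec.type === "array" &&
      spec.items !== undefined &&
      spec.items.type === "string";
    const value = out[field];
    // Wrap a lone string so validation sees the array it expects.
    if (wantsStringList && typeof value === "string") {
      out[field] = [value];
    }
  }
  return out;
}
```

For example, `{ keyDecisions: "use sqlite" }` becomes `{ keyDecisions: ["use sqlite"] }` when the schema declares `keyDecisions` as an array of strings; values that are already arrays pass through unchanged.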
Mikael Hugo
f98a1e360e batch: codex-rescue session output (multiple in-flight tasks)
Combined output of multiple parallel codex-rescue runs that produced
working-tree edits but didn't commit. Tasks contributing:

- prefs: per-provider model allow-list (provider_model_allow) — manual
- TUI scroll + unresponsive (a7884d1a / bt3fpn4y2)
- planningMeeting required (aa09e904 / br127l763)
- Logs UX 4-pack (a5c65314 / btcplhu7f)
- Gate auto-resolve + completion nudge (ae4c8b64 / bw1w1fjkp)
- sf_task_complete atomic + retry (a7a079b4 / b20cy5owv)
- Multi-model meeting + minimax M2.7 + draft promotion (a756faac / task-moifjknd-lwjc98)
- Per-role slice prompts (a94c3e1a)
- Per-role vision-meeting prompts (afd165a0 / task-moifple5-lcwtjl)
- Schema sweep (ac994b1e / task-moifq7pu-83coqz)
- Flow audit (ad26ecfd / bttj4vrqm)

Typecheck passes. Tests not run as a full suite — spot-check after merge.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: OpenAI Codex <noreply@openai.com>
2026-04-28 11:52:42 +02:00
Mikael Hugo
66ff949c11 cherry-pick(security): harden project-controlled surfaces (PR #4755 partial)
Cherry-pick of gsd-build/gsd-2 65ca5aa2e — applies the security hardening
hunks that conflicted minimally:

- mcp-server/env-writer: validate writes against a strict allowlist
- web/api/files: enforce path containment via web/lib/secure-path
- vscode-extension: read binaryPath/autoStart only from trusted
  global/default scopes (resolveTrustedSfStartupConfig), avoiding
  workspace-controlled override (renamed Gsd → Sf for sf naming)
- New regression tests: mcp-client-security, vscode-startup-security,
  web-files-symlink

Skipped hunks (drifted): mcp-server/server.ts, mcp-client/index.ts,
mcp-server/README.md.

Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 05:37:07 +02:00
Mikael Hugo
bf727173e7 cherry-pick(file-lock): make file-lock actually lock and throw on contention
Cherry-pick of gsd-build/gsd-2 a09e01640 — withFileLockSync now actually
acquires a proper-lockfile (was previously a no-op when proper-lockfile
wasn't required) and throws on ELOCKED contention by default. Adds
onLocked: 'skip' option for best-effort callers that tolerate dropped
entries (audit, journal). Modernizes import style (createRequire/join
from imports rather than ad-hoc require). Path-renames preserved
(gsd-pi → sf-run).

Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 05:28:36 +02:00
Mikael Hugo
22d4579690 cherry-pick(state): lock-wrapped appends for journal, audit, workflow-logger
Cherry-pick of gsd-build/gsd-2 53babec29 — lock-wrapped append half.
Wraps appends to .sf/journal/, .sf/audit/events.jsonl, and the
workflow-logger error log in withFileLockSync (onLocked: skip),
preserving best-effort semantics while preventing torn writes
under contention.

Companion to the atomic-write half landed in 3df56cb94. Path-renames
(gsdRoot→sfRoot, gsd-db→sf-db) preserved during conflict resolution.

Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 05:27:44 +02:00
Mikael Hugo
f1f4b840e1 cherry-pick(doctor): self-heal symlinked .sf staging to prevent silent data loss
Cherry-pick of gsd-build/gsd-2 9340f1e9b (#4423) — doctor self-heal
detection for symlinked staging directories that can cause silent
data loss. Skips native-git-bridge.ts and git-service test (drifted).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 05:25:56 +02:00
Mikael Hugo
7fd4672e55 cherry-pick(auto): handle worktree context fallback + sanitize paused session paths
Cherry-pick of gsd-build/gsd-2 a4f78731f — handles worktree context fallback
and sanitizes paths in paused session resumption. Skips uok-plan-v2-wiring
test hunk (drifted in sf).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 05:25:40 +02:00
Mikael Hugo
93402643f4 cherry-pick(sf-db): tolerate corrupt task arrays in milestone rows
Cherry-pick of gsd-build/gsd-2 851507913 (#4056) — defensive parsing
so a corrupt or non-array tasks blob in a milestone row doesn't crash
sf-db reads. Test hunk skipped (sf-db.test.ts has drifted).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 05:25:21 +02:00
Mikael Hugo
3df56cb94f cherry-pick(state): atomic-writes for guided-flow-queue and reports
Cherry-pick of gsd-build/gsd-2 53babec29 (Jeremy <jeremy@fluxlabs.net>)
— atomic-write half only. Eliminates torn-write risk on PROJECT.md
queue sync and reports.json/HTML index regeneration by switching
writeFileSync → atomicWriteSync (tmp+rename).

The companion lock-wrapped-append changes (journal.ts, uok/audit.ts,
workflow-logger.ts) are deferred — they need proper-lockfile +
withFileLockSync helper introduced first.

Co-Authored-By: Jeremy <jeremy@fluxlabs.net>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 05:16:39 +02:00
Mikael Hugo
8e827147c9 feat(code-intelligence): add sift indexer backend alongside project-rag
Generalize the code-intelligence hook to support multiple indexer
backends, with sift (rupurt/sift) as a new option next to the existing
project-rag MCP server. Backend is selected via CodebaseMapPreferences.

- code-intelligence.ts: new abstraction + sift backend (detect, resolve,
  status, context-block contribution)
- preferences-types.ts: codebaseIndexer field (project-rag | sift | none)
- preferences-validation.ts: validate the new field
- bootstrap/system-context.ts, commands-codebase.ts: dispatch on backend
- tests/code-intelligence.test.ts: sift detection/resolution/status tests
  (19 pass, 0 fail)

project-rag path unchanged and continues to work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 05:05:26 +02:00
Mikael Hugo
0606983d97 feat(subagent): add background job manager and tests
SubagentBackgroundJobManager tracks long-running subagent jobs with
status, abort support, and TTL-based eviction of completed results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 04:18:17 +02:00
Mikael Hugo
efd5e14e0a feat: add FEATURES.md capability map and generator
Human-oriented documentation of SF capabilities, with a script that
keeps it in sync with workflow-tools.ts and extension manifests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 04:18:12 +02:00
Mikael Hugo
0d286b991b sf snapshot: pre-dispatch, uncommitted changes after 2902m inactivity
2026-04-27 23:42:51 +02:00
Mikael Hugo
f0da5b6d21 fix: bind getProviderAuthMode to registry instance to avoid undefined 'this'
Extracting a class method as a bare reference loses its 'this' context,
causing 'Cannot read properties of undefined' when minimax (or any
provider) triggers the flat-rate auth-mode lookup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 19:22:39 +02:00
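The failure mode fixed in f0da5b6d21 is standard JavaScript behaviour and is easy to reproduce. In the sketch below the class and its contents are hypothetical; only the method name `getProviderAuthMode` comes from the commit.

```typescript
class ProviderRegistrySketch {
  private modes: Record<string, string> = { minimax: "flat-rate" };

  getProviderAuthMode(provider: string): string | undefined {
    return this.modes[provider];
  }
}

const registrySketch = new ProviderRegistrySketch();

// Buggy: a bare method reference is no longer tied to the instance,
// so `this.modes` fails when the function is called later.
const unboundLookup = registrySketch.getProviderAuthMode;

// Fixed: bind the method to its instance before passing it around.
const boundLookup = registrySketch.getProviderAuthMode.bind(registrySketch);
```

An arrow wrapper such as `(p: string) => registrySketch.getProviderAuthMode(p)` would work equally well; `bind` is simply the variant the commit subject names.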
Mikael Hugo
7289933909 fix: populate memoriesSection in execute-task prompt and fix stale dist
buildExecuteTaskPrompt was not passing memoriesSection to loadPrompt,
causing headless auto to fail with a template variable error. Also
updated plan-slice-prompt.test.ts to supply the four template variables
(memoriesSection, runtimeContext, phaseAnchorSection, gatesToClose) that
were missing from the test fixture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 18:46:55 +02:00
Mikael Hugo
a30a7692e3 fix: dist-redirect.mjs incorrectly rewrites .js→.ts for node_modules paths containing /src/
The resolver guarded on context.parentURL.includes('/src/') to identify
in-repo source files, but @google/gemini-cli-core installs to
node_modules/@google/gemini-cli-core/dist/src/ which also contains '/src/'.
Relative imports from that dist package (e.g. './config/config.js') were
incorrectly rewritten to './config/config.ts', causing ERR_MODULE_NOT_FOUND
on every test that transitively imports the google-gemini provider.

Fix: add !context.parentURL.includes('/node_modules/') guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 18:04:23 +02:00
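The guard added in a30a7692e3 can be expressed as a small predicate. This is a sketch: `shouldRewriteJsToTs` is a hypothetical name, and the relative-specifier and `.js` suffix checks are assumptions about the rest of the resolver, which this log does not show.

```typescript
function shouldRewriteJsToTs(specifier: string, parentURL: string | undefined): boolean {
  if (!parentURL) return false;
  const isRelative = specifier.startsWith("./") || specifier.startsWith("../");
  // "/src/" alone is not enough: packages may ship a dist/src/ directory too.
  const inRepoSource =
    parentURL.includes("/src/") && !parentURL.includes("/node_modules/");
  return isRelative && inRepoSource && specifier.endsWith(".js");
}
```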
Mikael Hugo
2e32c96fa0 Port gsd2 functional parity: turn-epoch, abandon-detect, reapplyThinking, exec chain, memory chain, onboarding-state
- auto/turn-epoch.ts: AsyncLocalStorage-backed stale-write dropping for timeout recovery
- journal.ts: isStaleWrite() guard drops superseded turn writes
- auto/run-unit.ts: wrap agent_end Promise.race in runWithTurnGeneration
- auto/session.ts: ThinkingLevelSnapshot type + autoModeStartThinkingLevel/originalThinkingLevel fields
- auto-model-selection.ts: reapplyThinkingLevel() called after every successful setModel()
- auto/phases.ts: pass autoModeStartThinkingLevel to selectAndApplyModel + hook override restore
- abandon-detect.ts: two-signal milestone abandon detection in rewrite-docs overrides
- auto-post-unit.ts: use detectAbandonMilestone + parkMilestone in rewrite-docs handler
- preferences-types.ts: ContextModeConfig + isContextModeEnabled
- exec-sandbox.ts: sandboxed bash/node/python subprocess with .sf/exec/ persistence
- exec-history.ts: read-side scan of .sf/exec/*.meta.json
- compaction-snapshot.ts: ≤2 KB markdown digest written before context compaction
- tools/exec-tool.ts: sf_exec MCP tool executor
- tools/exec-search-tool.ts: sf_exec_search MCP tool executor
- tools/resume-tool.ts: sf_resume MCP tool executor
- bootstrap/exec-tools.ts: registers sf_exec/sf_exec_search/sf_resume
- memory-relations.ts: knowledge-graph edges between memories (traverseGraph)
- tools/memory-tools.ts: capture_thought/memory_query/sf_graph executors
- bootstrap/memory-tools.ts: registers capture_thought/memory_query/sf_graph
- bootstrap/register-extension.ts: wire exec-tools + memory-tools into registration
- onboarding-state.ts: onboarding completion record at ~/.sf/agent/onboarding.json

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 10:58:39 +02:00
Mikael Hugo
5887ea3fd1 port gsd2: blocked-models gate, milestone-summary classifier, unsupported-model recovery
blocked-models.ts (new):
  Persistent per-project blocklist at .sf/runtime/blocked-models.json.
  loadBlockedModels / isModelBlocked / blockModel (file-lock-safe write).

milestone-summary-classifier.ts (new):
  classifyMilestoneSummaryContent → "success" | "failure" | "unknown".
  isTerminalMilestoneSummaryContent: failure summaries are NOT terminal —
  lets auto-mode re-enter a milestone after a failed recovery summary.

state.ts:
  Phase 1 (completeMilestoneIds) and Phase 2 (registry) now check
  isTerminalMilestoneSummaryContent before treating a SUMMARY as complete.
  A failure SUMMARY no longer prematurely parks a milestone.

error-classifier.ts:
  Add "unsupported-model" ErrorClass kind with regex detection
  (model + not-supported/unavailable/no-access + account/plan/tier).
  Checked before "permanent" so /account/i in PERMANENT_RE doesn't swallow it.

auto-model-selection.ts:
  Wire isModelBlocked() gate in selectAndApplyModel candidate loop:
  skips provider-rejected models and continues to fallbacks.

bootstrap/agent-end-recovery.ts:
  Handle cls.kind === "unsupported-model": blockModel(), try fallback chain
  skipping already-blocked models, pause if no usable fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 10:13:27 +02:00
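The ordering concern in the error-classifier.ts part of 5887ea3fd1 is easiest to see with two overlapping patterns. Both regexes below are illustrative stand-ins, not the real `PERMANENT_RE` or the real unsupported-model pattern, and the function name is hypothetical.

```typescript
type ErrorKindSketch = "unsupported-model" | "permanent" | "unknown";

// Illustrative patterns only.
const UNSUPPORTED_MODEL_RE_SKETCH =
  /model.*(not supported|unavailable|no access).*(account|plan|tier)/i;
const PERMANENT_RE_SKETCH = /account|unauthorized|forbidden/i;

function classifyErrorSketch(message: string): ErrorKindSketch {
  // Order matters: the narrow pattern runs first, because the broad
  // "permanent" pattern also matches any message mentioning an account.
  if (UNSUPPORTED_MODEL_RE_SKETCH.test(message)) return "unsupported-model";
  if (PERMANENT_RE_SKETCH.test(message)) return "permanent";
  return "unknown";
}
```

With the checks swapped, a message such as "The model is not supported for your account" would be classified as permanent and the fallback chain would never run.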
Mikael Hugo
6cb6de4fd2 perf: parallelize I/O, add runtime cache, extend nix devenv
- unit-context-composer: resolve artifact keys in parallel (Promise.all)
- unit-runtime: add in-memory cache to avoid repeated disk reads per dispatch
- auto-timers: share 15s idle watchdog tick with context-pressure check
- auto-prompts: 1s TTL budget cache to coalesce repeated loadEffectiveSFPreferences calls
- native-git-bridge: extend nativeHasChanges TTL 10s→30s
- auto-dashboard: remove pulsing dot animation (CPU churn, no UX value)
- flake.nix: add nodePackages.typescript to dev shell

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 10:12:32 +02:00
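Several items in 6cb6de4fd2 rely on short-lived caching (the 1s budget cache, the extended `nativeHasChanges` TTL). A minimal single-value TTL cache could look like this; the class name and the injectable clock are assumptions made so the sketch can be tested deterministically.

```typescript
class TtlCacheSketch<T> {
  private entry: { value: T; expiresAt: number } | undefined;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value while it is fresh, otherwise reloads it.
  get(load: () => T): T {
    const t = this.now();
    if (this.entry !== undefined && t < this.entry.expiresAt) {
      return this.entry.value;
    }
    const value = load();
    this.entry = { value, expiresAt: t + this.ttlMs };
    return value;
  }
}
```

Repeated calls inside the TTL window coalesce into one load, which is the effect the commit describes for repeated `loadEffectiveSFPreferences` calls.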
Mikael Hugo
12aabd863e port gsd2 #4769: worktree telemetry, slice-cadence, canonical-root fix + /sf scan
Ports commit 7fb35ca58 from gsd2 (PR #4769) covering four issues:

#4761 — resolveCanonicalMilestoneRoot in worktree-manager.ts routes
validate-milestone through the live worktree path instead of stale
project-root state when a milestone worktree is active.

#4762 — auditOrphanedMilestoneBranches in auto-start.ts now surfaces
in-progress milestone branches with unmerged commits ahead of main
(previously only complete milestones were audited). Gated on
isClosedStatus so parked/other closed statuses are unaffected.

#4764 — worktree-telemetry.ts: typed emit helpers (emitWorktreeCreated,
emitWorktreeMerged, emitWorktreeOrphaned, emitAutoExit, emitWorktreeSync,
emitCanonicalRootRedirect, emitSliceMerged, emitMilestoneResquash) plus
summarizeWorktreeTelemetry aggregator and nearest-rank percentile().
Wired in: worktree-resolver.ts (create/merge events), auto-start.ts
(orphan telemetry), auto.ts stopAuto (auto-exit with normalized reason),
worktree-manager.ts (canonical-root-redirect). Surfaced in forensics.ts
via detectWorktreeOrphans and Worktree Telemetry sections.

#4765 — slice-cadence.ts: mergeSliceToMain squash-merges each slice's
commits onto main as soon as the slice passes validation (opt-in via
git.collapse_cadence: "slice"). resquashMilestoneOnMain collapses N
per-slice commits into one milestone commit at completion. Wired in
auto-post-unit.ts (slice merge after complete-slice with stopAuto on
conflict/error) and worktree-resolver.ts (resquash at mergeAndExit).
AutoSession.milestoneStartShas tracks the pre-first-slice SHA.
GitPreferences and preferences-validation.ts extended with
collapse_cadence and milestone_resquash fields.

Also ports /sf scan command: commands-scan.ts with parseScanArgs,
resolveScanDocuments, buildScanOutputPaths, and handleScan dispatching
a focused codebase assessment prompt to .sf/codebase/.

journal.ts: 9 new JournalEventType values for the telemetry events.
All changes are additive; default behavior (cadence="milestone") unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 09:03:56 +02:00
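12aabd863e mentions a nearest-rank `percentile()` in worktree-telemetry.ts. The nearest-rank method itself is standard; the sketch below shows it under a hypothetical name and with assumed handling of empty input and out-of-range `p`, since the real signature is not shown here.

```typescript
function percentileNearestRank(values: number[], p: number): number | undefined {
  if (values.length === 0 || p < 0 || p > 100) return undefined;
  const sorted = values.slice().sort((a, b) => a - b);
  // Nearest rank: the smallest value with at least p% of the data at or below it.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}
```

Unlike interpolating variants, nearest-rank always returns a value that actually occurs in the input, which suits duration telemetry.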
Mikael Hugo
2911d3b93d port gsd2: reassess-roadmap opt-in (ADR-003 §4) + prefer toolDefinition.label
reassess-roadmap: flip default from true → false. Most reassess units
conclude "roadmap is fine", burning a session for no change; the
plan-slice prompt now carries a JIT preamble at zero cost. (#4778)

tool-execution: always prefer toolDefinition.label when non-empty,
even when label === name — allows tools to display their canonical
name explicitly. (#4758)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 08:33:50 +02:00
Mikael Hugo
d4cdcb582d port gsd2 #3338: ecosystem plugin loader for .sf/extensions/
Adds support for project-local SF extension plugins dropped in
.sf/extensions/. Trust-gated (requires pi trust), symlink-escape safe.

- ecosystem/sf-extension-api.ts: SFExtensionAPI wrapper exposing
  getPhase() and getActiveUnit() to third-party handlers; updateSnapshot
  refreshes state before_agent_start so handlers see current phase/unit
- ecosystem/loader.ts: discovers .sf/extensions/*.js, loads them via
  dynamic import, dispatches factory(api) for each
- register-extension.ts: initializes ecosystemHandlers array, wires loader
- register-hooks.ts: before_agent_start refreshes snapshot then dispatches
  ecosystem handlers before returning SF system prompt
- types.ts: SFActiveUnit interface (milestoneId/sliceId/taskId + titles)
- workflow-logger.ts: "ecosystem" added to LogComponent union

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 08:27:55 +02:00
Mikael Hugo
6c36d62f35 port gsd2 #4961: stop using active-tool snapshot as model-policy gate
Fixes a bug where per-unit tool narrowing poisoned the policy gate for
subsequent units, causing "Model policy denied dispatch before prompt send"
errors on complete-slice and discuss-milestone (100% Win repro).

Five-part port from gsd2@817031b2a:
- ModelPolicyDispatchBlockedError class with per-model deny reasons
- TOOL_BASELINE WeakMap + clearToolBaseline/restoreToolBaseline lifecycle
- auto-model-selection: use getRequiredWorkflowToolsForAutoUnit as requiredTools
- auto/loop: catch ModelPolicyDispatchBlockedError as non-retryable (pause)
- auto.ts: wire clearToolBaseline at startAuto (fresh only) and stopAuto

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 08:15:04 +02:00
Mikael Hugo
4fdd8700a3 port gsd2 upstream features: scope classifier, composer v2, GPT-5.5, test timeout
- milestone-scope-classifier: add getMilestonePipelineVariant + milestoneRowToScopeInput
  wired into auto-dispatch trivial-skip for research/validation phases (#4781)
- auto-prompts: rename GSD→SF identifiers, add isSummaryCleanForSkip, prefs param
  on checkNeedsReassessment, buildExtractionStepsBlock from commands-extract-learnings
- unit-context-manifest + unit-context-composer: port v2 typed computed artifacts (#4924)
- skill-manifest: per-unit-type skill filter resolver (#4788, #4792)
- escalation: stub for ADR-011 mid-execution escalation (full port deferred)
- auto-start: extract decideSurvivorAction for testability (#4832)
- models: add gpt-5.5 + gpt-5.4-mini to cost table, router, and models.generated.ts
- types: EscalationArtifact, context_window_override, skip_clean_reassess,
  mid_execution_escalation, sketch_scope on SliceRow
- tool-execution: add visibleWidth import (was undefined)
- package.json: add --test-timeout=30000 to prevent parallel tests from freezing machine

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 08:08:11 +02:00
Mikael Hugo
e2147c0694 sf snapshot: pre-dispatch, uncommitted changes after 43m inactivity
2026-04-25 06:34:49 +02:00
Mikael Hugo
7b6c9dd099 sf snapshot: pre-dispatch, uncommitted changes after 4703m inactivity
2026-04-25 05:51:29 +02:00
ace-pm
c744bdf6c1 fix: atomic writes, parse radix, lossy json, silent worker spawn
8 fixes from 3rd-pass scan:

1. web/components/sf/tempCodeRunnerFile.tsx: remove orphan VS Code
   'Code Runner' artifact (850+ lines duplicated from shell-terminal.tsx).
   Unreferenced but compiled into tsc project.

2. sf/phase-anchor.ts: writePhaseAnchor used plain writeFileSync — a crash
   mid-write would corrupt the handoff checkpoint that readPhaseAnchor then
   silently returns null for, losing cross-phase context. Switched to
   atomicWriteSync (already used by sibling files).

3. sf/forensics.ts: same non-atomic writeFileSync on active-forensics.json
   marker. Race with a concurrent reader produces an empty object and the
   forensics session is lost. Switched to atomicWriteSync.

4. web/auto-dashboard-service.ts: paused-session.json existence was the
   intended signal but a corrupt body silently dropped the paused flag so
   the UI showed active. Now reports paused on file existence regardless
   of body integrity, and warns on corruption.

5. sf/visualizer-data.ts: doctor-history.jsonl parser did .map(JSON.parse)
   inside an outer catch. One corrupt line discarded 19 valid entries.
   Per-line try/catch preserves the valid rows.

6. sf/files.ts: three parseInt calls without radix (step, total_steps,
   totalSteps) — also missing || 0 fallback for NaN.

7. cli.ts: parseInt(process.versions.node) without radix. Split on '.' and
   use radix 10 explicitly.

8. sf/slice-parallel-orchestrator.ts: silent 'catch {}' around spawn()
   masked worker-spawn failures as 'no workers available'. Matches sibling
   parallel-orchestrator.ts pattern — now logs via logWarning.

Skipped from the scan (need a real lock mechanism, not safe as a one-line
fix):
- sf/auto-dispatch.ts:164 (UAT counter race)
- sf/captures.ts:107 (CAPTURES.md append race)

Deferred (low-value):
- preferences-models.ts, key-manager.ts, auto-timers.ts silent catches
- dead variable in visualizer-data.ts
- google-gemini-cli.ts maxTokens clamp interaction

tsc --noEmit green at root.
2026-04-21 02:13:10 +02:00
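Fixes 5 and 6 of c744bdf6c1 follow two common defensive-parsing patterns. The sketch below uses hypothetical function names and is not the code from visualizer-data.ts or files.ts.

```typescript
// Parse JSONL one line at a time so a corrupt line only costs that line.
function parseJsonLinesSketch(text: string): { rows: unknown[]; skipped: number } {
  const rows: unknown[] = [];
  let skipped = 0;
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      rows.push(JSON.parse(line));
    } catch {
      skipped += 1;
    }
  }
  return { rows, skipped };
}

// Always pass a radix, and fall back to 0 when the input is not a number.
function parseStepSketch(raw: string): number {
  return parseInt(raw, 10) || 0;
}
```

Returning the skipped count alongside the rows lets a caller warn about corruption without losing the valid entries.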
ace-pm
51b65fd490 fix: symlink extensions + silent catches masking real errors
Real bugs from 2nd-pass scan:

1. extension-registry.ts: discoverAllManifests skipped symlinked extension
   dirs because Dirent.isDirectory() returns false for symlinks. Dev-workflow
   symlinks under ~/.sf/agent/extensions/ were invisible to list/enable/
   disable/info. Matches the regression documented in
   symlink-extension-discovery.test.ts — the test inlines the correct logic,
   but this callsite still had the buggy form. Now accepts isDirectory() ||
   isSymbolicLink().

2. headless.ts SIGINT handler: client.stop() failures were double-silenced
   (inner .catch(()=>{}), outer try{}catch{}). Interactive mode logs stop
   errors to stderr. Restored head/headless parity — still fire-and-forget
   (exit code is forced via process.exit) but failures are observable.

3. openai-codex-responses.ts SSE parser: malformed data frames were silently
   dropped so broken streams looked identical to clean ones. Now debug-logs
   the parse error with the chunk context so broken streams are
   distinguishable in logs. Stream continues on bad chunk (one bad frame
   shouldn't kill the whole generation).

4. web/cleanup-service.ts generated script: bare 'catch {}' around four native
   git calls (nativeBranchList, nativeDetectMainBranch, nativeBranchListMerged,
   nativeForEachRef). A failed main-branch detection silently left mainBranch
   undefined-shaped, then the next native call operated on garbage. Now emits
   console.warn so failures surface in the subprocess log.

5. web/undo-service.ts generated script: git revert failure was silenced;
   when --no-commit failed, user saw commitsReverted=0 with no reason. Now
   logs the revert error before attempting --abort (abort itself remains
   best-effort silent).

False positives from the same scan (investigated and dismissed):
- auto-worktree.ts #2505: code uses ':(exclude).sf/milestones' pathspec +
  shelter-and-restore, which is a better fix than the 'drop --include-untracked'
  approach the test comment describes. Test comment is stale; source is correct.
- Lifecycle handler unhandled rejections across 5 extensions: extensions/runner.ts
  already try/catches handler invocations and routes to emitError. Wrapping the
  individual handlers would be redundant.
2026-04-21 02:01:41 +02:00
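Fix 1 of 51b65fd490 hinges on `Dirent.isDirectory()` returning false for symlinks. The filter can be sketched against a minimal interface; `DirentLike` and `extensionDirCandidates` are hypothetical names, and a real implementation would presumably still need to confirm that a symlink's target is a directory.

```typescript
// Minimal stand-in for the parts of fs.Dirent this check needs.
interface DirentLike {
  name: string;
  isDirectory(): boolean;
  isSymbolicLink(): boolean;
}

function extensionDirCandidates(entries: DirentLike[]): string[] {
  return entries
    // Accept symlinks too: isDirectory() alone skips dev-workflow links.
    .filter((entry) => entry.isDirectory() || entry.isSymbolicLink())
    .map((entry) => entry.name);
}
```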
ace-pm
0f94341b43 fix(loader): fall back to src/resources when SF-WORKFLOW.md missing from dist
Build sometimes copies dist/resources/extensions/ without the top-level
markdown files (observed: SF-WORKFLOW.md absent in dist/resources/ while
extensions/ was present). existsSync(distRes) was true either way, so
SF_WORKFLOW_PATH pointed at a non-existent path and /sf failed with ENOENT.

Check for the specific file instead of the directory.
2026-04-21 01:39:18 +02:00
ace-pm
485e8f608e chore: init sf
2026-04-21 01:38:02 +02:00
ace-pm
e6676692fc fix(sf-tui): remove welcome overlay that hangs on enter
The per-session branded welcome overlay was added by the SF rebrand
(9d739dfa5) as a boxed 'Press any key to continue...' splash shown once
per sf session. In practice: Enter doesn't dismiss it and typing renders
as garbled characters behind the overlay, blocking every TUI launch.

Branding was redundant with the header (installed at session_start) and
the footer (git branch + model). Shortcuts are discoverable via help.
Deleting the overlay eliminates the hang vector entirely.

Legacy-extension migration warnings (migrations.ts 'Press any key...')
are unaffected — those are vendored upstream Pi code on a different
code path and only fire when deprecated extensions are present.
2026-04-21 00:44:28 +02:00
Mikael Hugo
38d3bd55da sf: route Gemini family models to google-gemini-cli by default
resolveModelId now prefers google-gemini-cli over google (direct API) for
bare Gemini/Gemma IDs, matching the operational default after the CLI-core
re-platform. google-vertex is still honoured when it's the current provider.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 20:33:43 +02:00
Mikael Hugo
822791fad3 sf: wire Fix 1 deferred-commit (stage-before-verify, commit-after-verify)
postUnitPreVerification now calls stageOnly() for execute-task units when
action=commit, setting stagedPendingCommit=true and capturing task context.
postUnitPostVerification commits the staged index after the gate passes,
using a conventional-commit message built from the task context. Failure is
non-fatal (logWarning + UI warning). 11 structural tests cover the full
deferral lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 20:33:39 +02:00
Mikael Hugo
315c2c49ca sf: fail-closed verification gate + deferred-commit infrastructure
Fix 2: verification gate no longer passes when no commands are
configured. Empty-commands result now returns passed=false, skipped=true.
Updated verification-gate.test.ts; added skipped-result guard in
auto-verification.ts that warns and continues (not a hard failure).
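
A minimal sketch of the fail-closed behaviour in Fix 2, assuming a simplified result shape. `GateResult`, `runGate`, `handleGateResult`, and the `run` callback are hypothetical names, not the signatures in verification-gate.ts or auto-verification.ts; only the `passed=false, skipped=true` result and the warn-and-continue guard are taken from the message above.

```typescript
interface GateResult {
  passed: boolean;
  skipped: boolean;
  failures: string[];
}

// `run` stands in for whatever executes one verification command.
function runGate(commands: string[], run: (cmd: string) => boolean): GateResult {
  if (commands.length === 0) {
    // Fail closed: nothing was verified, so success is not reported.
    return { passed: false, skipped: true, failures: [] };
  }
  const failures = commands.filter((cmd) => !run(cmd));
  return { passed: failures.length === 0, skipped: false, failures };
}

// Skipped-result guard: warn and continue rather than hard-fail.
// What happens on a real failure is outside this sketch.
function handleGateResult(
  result: GateResult,
  warn: (msg: string) => void,
): "continue" | "failed" {
  if (result.skipped) {
    warn("verification skipped: no commands configured");
    return "continue";
  }
  return result.passed ? "continue" : "failed";
}
```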

Fix 3: split auto-verification.ts try/catch into two zones. Zone 1
(gate machinery: prefs load, task lookup, runVerificationGate,
captureRuntimeErrors, runDependencyAudit) catches → pauseAuto + return
"pause". Zone 2 (ancillary: evidence writes, UOK gate, notifications)
catches → logWarning + return "continue". Added
verification-fail-closed.test.ts with 11 structural tests.

Fix 1 (infrastructure): added stageOnly() + commitStaged() to
GitServiceImpl, added stagedPendingCommit flag to AutoSession (cleared
in reset()), marked the runTurnGitAction call site in
postUnitPreVerification with TODO(fix-1-deferral) for the final wiring.

Fix 4: timeout handler in runFinalize now captures hadStagedPending and
hadCommitted before nulling currentUnit. Clears stagedPendingCommit to
prevent orphaned deferred commits. Emits a diagnostic warning for each
case so operators know whether staged-but-uncommitted changes will be
absorbed or whether a commit landed before verification was skipped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 19:32:47 +02:00
Mikael Hugo
c940ebc16f sf: unify milestone discuss dispatch + todo.md seed injection
Replace separate dispatchHeadlessBootstrap with one flow:
- dispatchNewMilestoneDiscuss({ auto }) — auto=true uses headless
  prompt + rootFiles seed, no pendingAutoStartMap; auto=false uses
  discuss prompt with preparation, sets pendingAutoStartMap
- bootstrapNewMilestone() — project setup + ID reservation, called
  directly from bootstrapAutoSession instead of the old wrapper
- injectTodoContext() — reads and deletes todo.md/TODO.md/SPEC.md at
  project root, injects content as spec into any preamble; called
  identically in auto and interactive flows

Removes dispatchHeadlessBootstrap entirely. auto-start.ts now calls
the primitives directly. All three showWorkflowEntry new-milestone
sites use dispatchNewMilestoneDiscuss({ auto: false }).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 19:04:12 +02:00
Mikael Hugo
67d25f95f2 sf: add gemini cli preflight token counting
2026-04-19 13:25:07 +02:00
Mikael Hugo
59806f8cc5 rip out antigravity from SF + pi-coding-agent UI/config layer
Antigravity (Google's IDE sandbox product, different from Gemini CLI) is
removed from:

  src/onboarding.ts                         — drop from LLM_PROVIDER_IDS + OAuth-flow picker
  src/pi-migration.ts                       — drop from LLM_PROVIDER_IDS migration list
  src/web/onboarding-service.ts             — drop from web-UI provider list
  src/tests/integration/web-onboarding-contract.test.ts — update contract
  src/resources/extensions/sf/doctor-providers.ts — drop from CLI_AUTH_PROVIDERS
  src/resources/extensions/sf/key-manager.ts      — drop UI listing
  src/resources/extensions/sf-usage-bar/index.ts  — delete entire quota fetcher block (~200 lines)
  packages/pi-coding-agent/src/cli/args.ts        — drop PI_AI_ANTIGRAVITY_VERSION doc
  packages/pi-coding-agent/src/utils/proxy-server.ts — drop from claude provider chain

Reason: antigravity has no vendor-published core library we can embed
(unlike @google/gemini-cli-core for the Gemini CLI). Continuing to
hand-roll OAuth against daily-cloudcode-pa.sandbox.googleapis.com is
exactly the pattern Google has started banning for third-party tools.
Removing the code removes the ban risk.

pi-ai provider code, OAuth util, and models.generated entries for
google-antigravity are removed in follow-up commits (separated for
reviewability — each layer verified independently).

Build passes. Note: this is a breaking change for any user who had
google-antigravity configured — they'll need to migrate to
google-gemini-cli (OAuth), google (API key), or google-vertex.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 10:39:36 +02:00
Mikael Hugo
0f0dcbf8c7 benchmarks: add Gemini 2.5/3/3.1 Pro + Flash entries
Gemini had zero benchmark entries in model-benchmarks.json despite
being served by google-gemini-cli (OAuth provider, SF native), google
(API key), google-vertex, google-antigravity, openrouter, etc. Every
gemini-* model in the pi-ai catalog scored 0 in the benchmark selector
— effectively excluded from auto-selection even when allow-listed.

Published numbers from DeepMind model cards + Vellum LLM leaderboard +
Vals AI:

  gemini-3-pro-preview:    SWE-Verified 76.2, HLE 37.5, AIME25 95,
                            GPQA-D 91.9, MMLU-Pro 81.0
  gemini-3.1-pro-preview:  SWE-Verified 78, HLE 41, AIME 97,
                            GPQA-D 93, MMLU-Pro 83 (Feb 2026)
  gemini-3-flash-preview:  estimated from Pro-vs-Flash delta
  gemini-2.5-pro:          SWE-Verified 63.8, HLE 18.8, GPQA-D 84.0,
                            MMLU-Pro 86
  gemini-2.5-flash:        estimated from Pro-vs-Flash delta

Context windows reflect Gemini's 1M-2M token capability.

LiveCodeBench Pro Elo (2439 for Gemini 3 Pro) isn't in the 0-100
percent schema — skipped rather than forced. Future: add arena_elo-
style LCB Elo dimension to the schema if we start routing on it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 10:11:45 +02:00
Mikael Hugo
e413cf4a3f preferences: add provider_preference for benchmark tie-breaking
When two models score identically in the benchmark selector — typically
the same underlying weights served by different endpoints — the
previous alphabetical tiebreaker picked wrong. dr-repo example:

  zai/glm-5.1       score 84.7
  opencode-go/glm-5.1 score 84.7

Both are the exact same GLM-5.1 weights. Alphabetical comparison made
opencode-go win ("o" < "z") even though zai is the NATIVE provider.

Fix: new `provider_preference` pref, an ordered list of providers.
Listed providers rank in order, unlisted fall after alphabetically.
Applied as the tie-breaker between score and alphabetical.
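
The ordering described above can be sketched as a three-level comparator. The `Candidate` shape and the function names are assumptions for illustration, not the selector's actual types; the rule itself (score, then `provider_preference` rank, then alphabetical) is the one stated in this message.

```typescript
interface Candidate {
  provider: string;
  modelId: string;
  score: number;
}

// Listed providers rank by position; unlisted ones share the last rank.
function preferenceRank(provider: string, preference: string[]): number {
  const index = preference.indexOf(provider);
  return index === -1 ? preference.length : index;
}

function compareCandidates(a: Candidate, b: Candidate, preference: string[]): number {
  // 1. Higher score first.
  if (a.score !== b.score) return b.score - a.score;
  // 2. Earlier entry in provider_preference first.
  const byPreference =
    preferenceRank(a.provider, preference) - preferenceRank(b.provider, preference);
  if (byPreference !== 0) return byPreference;
  // 3. Alphabetical on "provider/model" as the final deterministic fallback.
  const keyA = `${a.provider}/${a.modelId}`;
  const keyB = `${b.provider}/${b.modelId}`;
  return keyA < keyB ? -1 : keyA > keyB ? 1 : 0;
}
```

With the two GLM-5.1 entries above tied at 84.7, an empty preference list falls through to the alphabetical step and opencode-go sorts first; with zai listed ahead of opencode-go, zai sorts first.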

Global default shipped in ~/.sf/preferences.md:
  kimi-coding, minimax, zai, mistral, ollama-cloud, opencode-go,
  opencode

Native providers ranked before re-servers. Users can override per
project.

Verified: after the change, dr-repo picks zai/glm-5.1 as primary for
execute-task and gate-evaluate (was opencode-go/glm-5.1), and
kimi-coding/k2p5 stays primary for completion phases with its direct
provider winning over opencode re-servers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 10:09:42 +02:00
Mikael Hugo
345f9586dd benchmark-selector: coverage-confidence multiplier + 12 regression tests
The original "normalise by populated weight" was too aggressive: a model
with 1 strong dimension (delta-fast: human_eval=92) outranked a model
with 4 strong dimensions (beta-coder: swe_bench=85, lcb=90, he=95,
ifeval=90) because each score was averaged only over the dimensions
that model had populated, so breadth of coverage counted for nothing.

Fix: multiply normalised score by a confidence factor tied to how much
of the unit's profile the model actually populated. Confidence =
populated_weight / total_profile_weight, blended 50/50 with a flat floor
so sparse-but-strong specialists still rank when no generalist covers
the profile:

  score = (weighted_sum / weight_total) * (0.5 + 0.5 * confidence)
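
The formula can be sketched in code as follows. The `Scores` and `Profile` types and the function name are assumptions; `weight_total` is read here as the populated weight, which is how the surrounding text describes the normalisation.

```typescript
type Scores = Record<string, number | null | undefined>;
type Profile = Record<string, number>;

function scoreModel(scores: Scores, profile: Profile): number {
  let weightedSum = 0;
  let populatedWeight = 0;
  let totalWeight = 0;
  for (const [dimension, weight] of Object.entries(profile)) {
    totalWeight += weight;
    const value = scores[dimension];
    if (typeof value === "number") {
      weightedSum += value * weight;
      populatedWeight += weight;
    }
  }
  // No populated dimension (or an empty profile) scores 0.
  if (populatedWeight === 0 || totalWeight === 0) return 0;
  const confidence = populatedWeight / totalWeight;
  // Normalised score, scaled by a 50/50 blend of a flat floor and coverage.
  return (weightedSum / populatedWeight) * (0.5 + 0.5 * confidence);
}
```

Under a hypothetical profile that weights swe_bench, lcb, human_eval, and ifeval at 0.25 each (not one of the shipped profiles), delta-fast scores 92 × 0.625 = 57.5 and beta-coder scores 90 × 1.0 = 90, so the broader model ranks first.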

Net effect on dr-repo's auto-resolve:

  Before:                          After:
  plan-milestone   glm-5.1         plan-milestone   MiniMax-M2.5
  research-slice   codestral       research-slice   mistral-large-2411
  execute-task     mistral-large   execute-task     opencode-go/glm-5.1
  validate-m       magistral       validate-m       MiniMax-M2.5
  subagent         mistral-large   subagent         kimi-coding/k2p5

MiniMax's broad coverage (8 populated dimensions from the M2 README)
now correctly outranks GLM-5.1's higher but narrower scores for
reasoning-heavy units. Matches user intuition that "MiniMax is really
powerful".

Also fixes findBenchmarkKey to try "<modelId>-latest" for date-suffixed
model variants — pi-ai catalogs "devstral-medium-2507" but benchmarks
only have "devstral-medium-latest"; matcher now bridges that.
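
One way the date-suffix bridging could work is sketched below. The four-digit suffix pattern and the function signature are assumptions; the actual findBenchmarkKey may match differently.

```typescript
// Assumption: a "date suffix" is a trailing "-NNNN" such as "-2507".
const DATE_SUFFIX = /-\d{4}$/;

function findBenchmarkKey(
  modelId: string,
  benchmarks: Record<string, unknown>,
): string | undefined {
  const has = (key: string): boolean =>
    Object.prototype.hasOwnProperty.call(benchmarks, key);
  // An exact match always wins.
  if (has(modelId)) return modelId;
  // Otherwise swap the date suffix for "-latest" and try again.
  if (DATE_SUFFIX.test(modelId)) {
    const latestKey = modelId.replace(DATE_SUFFIX, "-latest");
    if (has(latestKey)) return latestKey;
  }
  return undefined;
}
```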

12 regression tests cover:
  - empty candidate pool
  - each profile (reasoning/coding/lightweight) picks right champion
  - swe_bench ↔ swe_bench_verified equivalence
  - models with all-null benchmarks score 0 but stay in fallbacks
  - sparse-strong beats dense-weak (confirms confidence multiplier
    doesn't over-penalise specialists)
  - provider diversification in fallback chain
  - deterministic tie-breaking
  - unknown unit types use default coding profile
  - date-suffixed model IDs match family-latest keys

Audit: 41 of 85 allow-listed models in pi-ai catalog have benchmark
data. 44 score 0 (mostly opencode Zen re-served models, ministral
small variants, pixtral vision models, legacy open-mistral). Top
picks for every dr-repo unit type DO have benchmark data — the gap
is in the long tail of fallbacks, which never matter unless the
primary and closer fallbacks all fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 09:58:10 +02:00
Mikael Hugo
0b8a1c246f auto-benchmark model selection: pick best-scoring per unit type
New module src/resources/extensions/sf/benchmark-selector.ts implements
benchmark-driven model selection. When models.<unit> is not pinned,
preferences-models.ts falls through to pick the highest-scoring
candidate from allowed_providers × pi-ai's model catalog, ranked
against a per-unit-type weight profile.

Weight profiles per unit type:
  plan-milestone / plan-slice  → agent-planning (swe_bench .25, lcb
                                  .20, hle .15, gpqa .15, mmlu_pro .15,
                                  aime .10)
  research-*                    → mixed (mmlu_pro, hle, human_eval,
                                  browse_comp, simple_qa, gpqa)
  execute-task                  → coding (swe_bench .35, swe_bench_v
                                  .25, lcb .20, human_eval .15)
  execution_simple / complete-* → fast+correct (human_eval .40,
                                  instruction_following .35, ruler .25)
  gate-evaluate                 → review (swe_bench .30, hle .25,
                                  gpqa .25, ifeval .20)
  validate-milestone            → validation (hle .30, gpqa .25,
                                  mmlu_pro .25, swe_bench .20)

Key design decisions:
  - Missing dimensions are dropped (normalised by populated weight),
    so a model with 2 strong populated scores isn't crushed by a peer
    with 5 mediocre ones.
  - swe_bench ↔ swe_bench_verified are fungible — some vendors publish
    one, some the other; treat as equivalent.
  - Provider diversification in fallbacks so one provider going 429
    doesn't kill the whole chain.
  - Score ties broken by coverage, then lexical — deterministic.

Also updates MiniMax-M2/M2.5/M2.7 benchmarks with real numbers from
the M2 official README (DeepWiki sourced) and MiniMax-M2.5 card
(minimax.io): swe_bench_verified 69.4→80.2, LCB 83, HLE 31.8 (w/
tools — more representative for agent work than no-tools 12.5),
AIME25 78, GPQA-D 78, MMLU-Pro 82. Context windows bumped to
weights-level: M2 400K, M2.5/M2.7 1M (endpoints may cap lower).

Verified end-to-end: with dr-repo's allow-list
(kimi-coding/minimax/zai/opencode-go/mistral) and models.* absent,
resolveModelWithFallbacksForUnit() returns:
  plan-milestone     → opencode-go/glm-5.1 (+3 fallbacks)
  research-slice     → mistral/codestral-latest
  execute-task       → mistral/mistral-large-latest
  execution_simple   → kimi-coding/k2p5
  gate-evaluate      → opencode-go/glm-5.1
  validate-milestone → mistral/magistral-medium-latest
  subagent           → mistral/mistral-large-latest

Users can still pin individual units (existing models.* behaviour
unchanged) or rely fully on auto-selection by omitting them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 09:43:26 +02:00
Mikael Hugo
6450b37025 core + search + benchmarks: auth-error recovery, multi-provider search, M2.7-highspeed entry
Three related improvements that landed in the working tree after the
auto-hardening merge but hadn't been committed:

1. auth_error as a distinct error type (auth-storage + retry-handler).
   Previously invalid/expired API keys would retry the same failing
   credential until the retry budget exhausted. Now:
     - classifyErrorType() recognizes 401s, "invalid api key",
       "authentication error", "unauthorized", etc. as "auth_error"
     - RetryHandler triggers cross-provider fallback on auth_error just
       like it does for rate_limit and quota_exhausted — switch
       providers rather than burning retries on a broken key
   Outcome: a stale OPENCODE_API_KEY in sops now fails over to kimi or
   minimax immediately instead of stalling the unit.

2. Multi-provider search-key detection (native-search.ts).
   The "Web search: Set BRAVE_API_KEY" warning fired whenever a
   non-Anthropic model lacked BRAVE_API_KEY, even when the user had
   TAVILY_API_KEY or OLLAMA_API_KEY available. Now the warning is
   suppressed if any of the BRAVE/TAVILY/OLLAMA keys is present, and
   the warning text lists all three options. This matches the
   preferences-validation allow-list for search_provider.

3. MiniMax-M2.7-highspeed benchmark entry (model-benchmarks.json).
   Routes the fast-tier variant of M2.7 through the Bayesian blender
   with inherited RULER scores. Lets dynamic routing consider the
   highspeed model when speed matters more than peak quality.
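
The auth_error handling in item 1 can be sketched as below. The signature, the `ErrorType` union, and the rate-limit and quota checks are assumptions added so the example is complete; only the 401 and message patterns and the three fallback-triggering types come from the text above.

```typescript
type ErrorType = "auth_error" | "rate_limit" | "quota_exhausted" | "other";

const AUTH_PATTERNS = [/invalid api key/i, /authentication error/i, /unauthorized/i];

// Hypothetical signature; the real classifyErrorType may take an error object.
function classifyErrorType(status: number | undefined, message: string): ErrorType {
  if (status === 401 || AUTH_PATTERNS.some((p) => p.test(message))) {
    return "auth_error";
  }
  if (status === 429) return "rate_limit"; // assumption: 429 means rate limited
  if (/quota/i.test(message)) return "quota_exhausted"; // assumption
  return "other";
}

// These three types switch providers instead of retrying the same credential.
function shouldFallBackToOtherProvider(type: ErrorType): boolean {
  return type === "auth_error" || type === "rate_limit" || type === "quota_exhausted";
}
```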

No regressions: the 41 pre-existing test failures in pi-coding-agent
(FallbackResolver chain-membership + LSP integration) are unchanged
relative to the prior commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 09:24:54 +02:00