Cycle 2 (the 13-node coding-agent mega-cycle) closed via two changes:
1. scripts/check-circular-deps.mjs — track function-body depth and
skip require()/import() calls inside function bodies. They run on
call, not at module evaluation, and therefore cannot cause
module-graph cycles — same reasoning as the existing dynamic
`await import()` skip. Generic improvement; benefits any pattern
that uses lazy CommonJS require() to break a static cycle.
2. packages/coding-agent/src/core/extensions/loader.ts — removed the
static `import * as _bundledCodingAgent from "../../index.js"`
self-reference, which was the cycle-closer. It only populated
STATIC_BUNDLED_MODULES for the Bun virtualModules path
(`isBunBinary` branch in getJitiOptions), and SF is Node-26-only
per operator policy (no Bun) — so the Bun branch is dead at
runtime and dropping the static self-reference is safe. The two
map entries that referenced it (@singularity-forge/coding-agent
and the @mariozechner alias) are commented out at the same site
with a pointer to the top-of-file note.
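
The depth-tracking rule from change 1 can be illustrated with a minimal sketch. This is not the code in scripts/check-circular-deps.mjs: the function names are hypothetical, brace depth stands in for function-body depth, and braces inside strings, comments, or template literals are not handled.

```typescript
// Hypothetical sketch: brace depth approximates "inside a function body".
// Braces in strings, comments, or template literals are NOT handled here.
function braceDepthAt(source: string, index: number): number {
  let depth = 0;
  for (let i = 0; i < index; i++) {
    if (source[i] === "{") depth++;
    else if (source[i] === "}") depth--;
  }
  return depth;
}

// Returns the specifiers of require() calls that run at module evaluation.
function findEagerRequires(source: string): string[] {
  const callRe = /\brequire\s*\(\s*["']([^"']+)["']\s*\)/g;
  const eager: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = callRe.exec(source)) !== null) {
    // Depth > 0 means the call only runs when the enclosing body is invoked.
    if (braceDepthAt(source, match.index) === 0) eager.push(match[1]);
  }
  return eager;
}

const sample = [
  'const a = require("./a");',
  "function lazy() {",
  '  return require("./b");',
  "}",
  'const c = require("./c");',
].join("\n");

console.log(findEagerRequires(sample));
```

Only "./a" and "./c" are reported for the sample; the require of "./b" sits inside a function body and is skipped, so it contributes no edge to the module graph.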
Net effect across the full session:
start of session: 9 cycles
walker false-positive cleanups landed: dropped 6 type-only + dynamic-import false positives
tui ↔ overlay-layout: CURSOR_MARKER moved to overlay-types.ts
SF autonomous-rollback chain (3 targeted cuts):
  experimental → preferences-serializer,
  classifier → lazy rollback import,
  preferences-models → runaway-defaults.js
this commit: coding-agent loader self-reference dropped
Final state: ✅ zero circular dependencies in 1193 scanned files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes (one walker, one real code):
1. scripts/check-circular-deps.mjs — skip type-only imports.
`import type { X } from "..."` and `export type { X } from "..."`
are erased by tsc at compile time and cannot cause runtime cycles.
Walker now drops them, matching the precedent set by skipping
dynamic `await import(...)`. Net effect on full-repo scan:
before: 9 cycles
after: 3 cycles (the 6 that disappeared were all `import type`
false-positives — none were real runtime cycles).
2. packages/tui — break the last 2-file cycle.
tui.ts and overlay-layout.ts had a real RUNTIME cycle:
- tui.ts → overlay-layout.ts: applyLineResets, compositeOverlays,
extractCursorPosition, isOverlayVisible (4 fns)
- overlay-layout.ts → tui.ts: CURSOR_MARKER (1 const)
Both files already imported `./overlay-types.ts` (no cycle there).
Moved CURSOR_MARKER from tui.ts into overlay-types.ts and re-exported
from tui.ts so existing `from "./tui.js"` call sites keep working.
No behavior change.
Remaining cycles after both fixes (3 real-runtime ones, separate slices):
- safety/autonomous-rollback chain (9 files, SF extension)
- packages/coding-agent core mega-cycle (12 files)
- (one more, see `npm run check:circular`)
These are foundational refactors worth their own commits, not bundled
into this one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes addressing codex's adversarial review of the earlier orphan-
recovery / graceful-shutdown landing:
(1) Codex point B — single shutdown path. Removed the parallel
installGracefulShutdown() handler in rpc-mode.ts that was adding
a second SIGTERM listener and racing forceShutdown()'s teardown.
The drain is now the FIRST step inside forceShutdown() (before
killTrackedDetachedChildren / extension session_shutdown / etc.)
so DB writes complete cleanly while child processes are still
alive to flush. Race-free against the existing shutdown ordering.
(2) Codex point D — recovery-before-each-drain. Cloud-volume mtime
visibility lags between containers can mean an orphan `.draining`
file from a previous container isn't visible during the startup
scan but appears moments later. drainQueuedSfFeedbackCommands()
now runs recoverOrphanedFeedbackDrains() as its first step, so
each dispatch's drain sees the latest filesystem state.
(3) Codex point E — healthz returns 503 during shutdown. New module
src/web/shutdown-state.ts holds a per-process flag, auto-registers
SIGTERM/SIGINT/SIGHUP handlers on first read, and exposes a
snapshot (signal, startedAt, elapsedMs) for diagnostics. The
healthz route imports isShuttingDown() and returns 503 when set,
so k8s readinessProbe / Forgejo blue-green probes drain traffic
BEFORE we actually stop responding.
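
A minimal sketch of the per-process flag from fix (3), under stated assumptions: only isShuttingDown() and the snapshot fields (signal, startedAt, elapsedMs) come from the description above. The names markShuttingDown, shutdownSnapshot, and healthzStatus are hypothetical, and the automatic signal-handler registration is left out so the logic stays self-contained.

```typescript
interface ShutdownSnapshot {
  signal: string | null; // null when shutdown was marked manually
  startedAt: number;
  elapsedMs: number;
}

let shutdownState: { signal: string | null; startedAt: number } | null = null;

// Idempotent: only the first mark is recorded.
function markShuttingDown(signal: string | null = null): void {
  if (shutdownState === null) {
    shutdownState = { signal, startedAt: Date.now() };
  }
}

function isShuttingDown(): boolean {
  return shutdownState !== null;
}

function shutdownSnapshot(): ShutdownSnapshot | null {
  if (shutdownState === null) return null;
  return {
    signal: shutdownState.signal,
    startedAt: shutdownState.startedAt,
    elapsedMs: Date.now() - shutdownState.startedAt,
  };
}

// Readiness probes treat a 503 as "stop routing traffic here".
function healthzStatus(): number {
  return isShuttingDown() ? 503 : 200;
}
```

Returning 503 as soon as the flag is set gives the probe time to take the instance out of rotation while in-flight work is still draining.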
Tests:
- rpc-mode-orphan-recovery.test.ts: 8/8 still green
- web-shutdown-state.test.ts: 5/5 new — default false, mark sets
flag, idempotent, signal exposed via snapshot, null signal for
manual mark
Deferred to a follow-up commit (codex didn't flag, but noted for
completeness): a SIGTERM-drain child-process integration test that
spawns rpc-mode + sends a real signal. The 5 unit tests cover the
flag logic; the integration test would cover the full process tree
and is bulkier than the current commit warrants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tsgo rejects `.ts` extensions in imports without allowImportingTsExtensions.
Updated the test to import from "./feedback-queue-recovery.js" which is
both ESM-compatible and matches the rest of the package convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes to make blue/green upgrades (per
scripts/upgrade-vega-source-server.mjs) safe for in-flight self-feedback writes.
1. Startup orphan recovery (feedback-queue-recovery.ts, extracted module).
Scans .sf/runtime/ for sf-feedback-queue.jsonl.<pid>(.<sid>)?.draining
files left by previous processes. For each:
- if our own session id: leave alone (live drain)
- if PID is alive: leave alone (foreign drainer)
- else: rename back to queue (only if no active queue file exists)
Crash safety: when both an orphan AND an active queue exist, we DEFER
recovery rather than merge — appending then unlinking would risk
duplicate replay on crash. The next restart's recovery picks it up
once the queue is naturally drained. Supports legacy filenames
(.<pid>.draining, pre-session-id) for backward compat.
Added SF_DRAIN_SESSION_ID (per-process 6-byte hex) stamped into the
.draining filename. PID reuse across container restarts is normally
safe because /proc clears, but the session id is a stronger guarantee
that we don't trample a foreign drainer that happens to land on the
same PID.
2. SIGTERM/SIGINT drain-then-exit handler (installGracefulShutdown).
Drains the queue once on signal, then exits. Bounded by
SF_RPC_SHUTDOWN_GRACE_MS (default 600_000 = 10 min). Rationale: if
a drain is in flight, it MUST finish — losing self-feedback writes
across a server upgrade is worse than a long wait. Normal drains
complete in <1s; the 10-min ceiling is for pathological lock
contention. Operator overrides via env var, or docker kill /
kubectl delete --force for hard bypass.
Upgrader script bumped to docker stop --timeout 610 (10s safety
margin past the grace). k8s deployments must set
terminationGracePeriodSeconds ≥ 610 for the rolling-update path.
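
The per-file recovery decision from item 1 can be sketched as a pure function. This is an illustration, not the contents of feedback-queue-recovery.ts: the type and function names are hypothetical, and the liveness check is injected so the logic can run without touching the filesystem or process table.

```typescript
type DrainAction = "ignore" | "leave-own" | "leave-live" | "defer" | "recover";

interface RecoveryContext {
  ownSessionId: string;
  activeQueueExists: boolean;
  isPidAlive: (pid: number) => boolean;
}

// Matches sf-feedback-queue.jsonl.<pid>.draining (legacy) and
// sf-feedback-queue.jsonl.<pid>.<sid>.draining (with session id).
const DRAINING_RE = /^sf-feedback-queue\.jsonl\.(\d+)(?:\.([0-9a-f]+))?\.draining$/;

function classifyDrainFile(name: string, ctx: RecoveryContext): DrainAction {
  const m = DRAINING_RE.exec(name);
  if (m === null) return "ignore"; // malformed filename
  const pid = Number(m[1]);
  const sid: string | undefined = m[2];
  if (sid !== undefined && sid === ctx.ownSessionId) return "leave-own";
  if (ctx.isPidAlive(pid)) return "leave-live";
  // Dead owner: rename back only when no active queue would be clobbered.
  return ctx.activeQueueExists ? "defer" : "recover";
}
```

Keeping "defer" as a distinct outcome makes the crash-safety rule explicit: when both an orphan and an active queue exist, nothing is merged and the next restart tries again.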
Tests: rpc-mode-orphan-recovery.test.ts — 7 cases covering empty,
no-orphans, dead-PID single recovery, both-files-deferred (codex's
crash-safety fix), live-PID untouched, multiple-dead-PIDs, malformed-
filename ignored.
Refs sf-mpa5kdpu (drainer orphans never recovered), sf-mpa4g46x
(original RPC hang). Codex adversarial-reviewed; the PID-reuse hardening
and crash-safety deferral landed per its feedback. Open follow-ups:
shutdown-aware /api/healthz returning 503 (codex point E), integrate
with existing forceShutdown ordering (codex point C).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The drainer was scheduled via setTimeout(0) with timer.unref(). An
unref'd timer does not keep the Node.js event loop alive on its own.
That is fine in a long-running rpc-mode child, where the process has
plenty of other event-loop handles, but fatal in the packaged-standalone
path, where the rpc subprocess has nothing else to keep it alive. The
process exited before the timer fired, so the queue file was renamed to
.<pid>.draining and then stranded forever.
Removed timer.unref(). The setTimeout(0) still lets the RPC response go
back to the caller first (no synchronous blocking on the drain), but the
timer now keeps the process alive until the drain handler runs, and the
drain's own async I/O keeps it alive until done.
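
A small sketch of the scheduling shape described above, assuming hypothetical respond and drain callbacks; it is not the rpc-mode code itself.

```typescript
// Hypothetical sketch of "respond first, drain on the next timer tick".
function respondThenDrain(respond: () => void, drain: () => void): void {
  // setTimeout(0) defers the drain until after the current call stack,
  // so the response is written before any drain work starts.
  setTimeout(drain, 0);
  // Deliberately no timer.unref(): a ref'd timer keeps the Node.js event
  // loop (and therefore the process) alive until the callback has run.
  respond();
}

const order: string[] = [];
respondThenDrain(
  () => { order.push("response"); },
  () => { order.push("drain"); },
);
```

Right after the call returns, order holds only "response"; the drain entry is appended once the timer fires.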
Refs sf-mpa6wuhm-wwddd1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hashline read/edit tool wrappers were folded into Edit({match}) and
Read({format}) modes in commit ffdec0fee. The two rows in FILE-SYSTEM-MAP.md
pointed to files that no longer exist. Updated the surviving hashline.ts row
to note its new consumer relationship with Edit/Read.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per operator-direction 2026-05-17 (R089 — Migrate Voice IVR / ElevenLabs
On-Call Paging Infrastructure out of SF). Migration target landed in
centralcloud monorepo:
- centralcloud_core/lib/centralcloud_core/voice.ex (TwiML + ElevenLabs)
- centralcloud_staff/lib/.../controllers/voice_controller.ex (Phoenix)
- centralcloud_staff/lib/.../controllers/voice_prompt_controller.ex
- centralcloud_staff/lib/.../router.ex (/twilio scope)
SF removal:
- web/app/api/voice/route.ts
- web/app/api/voice/prompt/route.ts
- web/app/api/voice/ directory
- src/tests/integration/web-voice-ivr-contract.test.ts
Operator-paging infra was historical drift in SF (per-project compiler);
belongs in centralcloud (org-level ops). R088 (Pre-Removal Test-Import
Safety Gate) not yet built — operator manually verified safety scan:
TWILIO_/ELEVENLABS_ env vars only referenced in the deleted files; no
internal SF callers; centralcloud version verified present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the native AST primitives from @singularity-forge/native/{edit,ast} as
LLM tools so agents can do tree-sitter-anchored code edits instead of
substring-based Edit or line-anchor hashline.
- replace-symbol.ts (+117): wraps replaceSymbol(file, symbolPath, newBody);
matches function/class/method declarations via tree-sitter, returns
matched=false sentinel when the symbol isn't located.
- insert-around-symbol.ts (+122): wraps insertAroundSymbol with position
enum BeforeDecl/AfterDecl/AtBodyStart/AtBodyEnd.
- ast-grep.ts (+152): wraps astGrep for pattern matching across files with
$VAR/$$$ARGS meta-variables; returns ranked matches with byte/line/column
+ captured meta-variable bindings.
Each tool:
- typebox schema matching the existing AgentTool pattern (edit.ts)
- notifyFileChanged() into the LSP layer on write ops
- resolveToCwd() for path normalization
- catches native errors + returns isError result with the
NativeUnavailableError message pointing operators to
`nix develop` + `node rust-engine/scripts/build.js --dev`
Wire-in:
- tools/index.ts: re-exports + imports + entries in `allTools` map and
createAllTools() factory.
- extension-manifest.json: ReplaceSymbol / InsertAroundSymbol / AstGrep
appended to provides.tools so SF extension agents see them.
Higher value than substring/line-anchor for code in tree-sitter-supported
languages (TS/JS/TSX/Python/Rust). Edit + hashline remain for non-code
files. PascalCase names per the Claude-Code-aligned convention from
sf-mp9w20y1-nld9hc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Stderr banner on fallback now multi-line with concrete fix steps
(nix develop → node rust-engine/scripts/build.js --dev) so an operator
scanning a 280MB cycle log can't miss it. The old single-line warning
was easy to overlook (today's "WHY HAS NOBODY SEEN IF LOUD" check).
- Structured load record per process at .sf/runtime/native-engine-load.jsonl:
{ts, pid, platformTag, source, binaryPath, sha256, loaded, errors?}.
Lets operators audit which binary each SF process loaded — and detect
ABI mismatches across daemon↔worker boundaries when different sha256
values appear for the same platformTag (the "rare but real" concern
flagged earlier today).
- Proxy error message now points to the build/install commands instead
of just saying "not available". NativeUnavailableError is named for
consumer try/catch chains.
- Fixed _loadedSuccessfully ordering — was set true BEFORE the require,
leaving stale-true after a failed first attempt.
- New helpers isNativeLoaded(), nativeBinaryPath(), nativeBinarySha256()
for diagnostic surfaces (sf headless query, doctor checks).
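
One way an operator script could use the load records to spot mismatched binaries, as a hedged sketch. The field names follow the record shape listed above, but the field types, both function names, and the sample values in the usage line are assumptions.

```typescript
// Field names come from the record description; the types are assumed.
interface NativeLoadRecord {
  ts: string;
  pid: number;
  platformTag: string;
  source: string;
  binaryPath: string;
  sha256: string;
  loaded: boolean;
  errors?: string[];
}

// Parses JSONL text (one JSON object per line), skipping blank lines.
function parseLoadRecords(jsonl: string): NativeLoadRecord[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as NativeLoadRecord);
}

// Returns platformTag -> distinct sha256 values, only where more than one
// distinct hash was successfully loaded for the same platformTag.
function findShaMismatches(records: NativeLoadRecord[]): Record<string, string[]> {
  const byTag: Record<string, string[]> = {};
  for (const rec of records) {
    if (!rec.loaded) continue;
    if (byTag[rec.platformTag] === undefined) byTag[rec.platformTag] = [];
    const seen = byTag[rec.platformTag];
    if (seen.indexOf(rec.sha256) === -1) seen.push(rec.sha256);
  }
  const mismatches: Record<string, string[]> = {};
  for (const tag of Object.keys(byTag)) {
    if (byTag[tag].length > 1) mismatches[tag] = byTag[tag];
  }
  return mismatches;
}
```

An empty result means every process that loaded the native engine for a given platformTag saw the same binary.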
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps version across the workspace (root + 10 @singularity-forge/*
packages) and lands the pending dependency refresh that had been
sitting uncommitted:
@anthropic-ai/sdk 0.95.1 → 0.96.0
@anthropic-ai/vertex-sdk 0.14.4 → 0.16.0
@google/genai 2.0 → 2.3
@logtape/{file,logtape,pretty,redaction} 2.0.7 → 2.0.9
@smithy/node-http-handler 4.7.0 → 4.7.3
@clack/prompts 1.3 → 1.4
@types/mime-types 2.1 → 3.0
Inter-package refs in packages/{daemon,ai}/package.json bumped to
^2.75.4 so the workspace stays self-consistent. package-lock.json
regenerated via `npm install --package-lock-only --legacy-peer-deps`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The phaseWatchdog at 10s fired "STUCK phase=session.prompt" on every
healthy LLM call longer than 10 seconds. Verified via strace on the
running dogfood sf: bytes were actively flowing on the TLS socket
(fd 29) to the LLM provider while STUCK was being logged. The
session.prompt call was never actually stuck; the watchdog is
diagnostic-only and oblivious to stream activity.
The noOutputTimeoutMs watchdog (set to 60s for triage in commit
d80060fec) is the actual kill mechanism. It is already event-aware:
every meaningful subagent event resets the timer via armNoOutputTimer
+ isMeaningfulSubagentOutputEvent. The 10s STUCK warning was added
in commit 67e5ac9db as investigation infrastructure for the
sf-mp8e02m1-zpk903 family of bugs, but now it is just noise that
makes legitimate 30-200s LLM responses look broken.
The 10s STUCK watchdog stays in place for the three setup phases
(resourceLoader.reload, createAgentSession, bindExtensions), where
10s of silence is a real hang signal: those phases normally finish
in under a second.
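
The difference between a fixed-deadline warning and an event-aware timeout can be shown with a small model that takes the clock as an argument. The class and function names are hypothetical; the real code uses armNoOutputTimer and isMeaningfulSubagentOutputEvent, whose internals are not reproduced here.

```typescript
// Event-aware watchdog: every meaningful event pushes the deadline out.
class NoOutputWatchdog {
  private lastActivityMs: number;
  private readonly timeoutMs: number;

  constructor(timeoutMs: number, startMs: number) {
    this.timeoutMs = timeoutMs;
    this.lastActivityMs = startMs;
  }

  // Call for each meaningful event; this is the "re-arm" step.
  recordActivity(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  isExpired(nowMs: number): boolean {
    return nowMs - this.lastActivityMs >= this.timeoutMs;
  }
}

// A fixed-deadline check, by contrast, ignores activity entirely.
function fixedDeadlineExceeded(startMs: number, nowMs: number, limitMs: number): boolean {
  return nowMs - startMs >= limitMs;
}
```

With a 60s event-aware timeout and an event at t=25s, the watchdog is still quiet at t=30s, while a fixed 10s deadline measured from t=0 has already been exceeded.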
Also includes:
- biome.json: bump $schema URL from 2.4.14 to 2.4.15 to match the
current biome CLI (clears the deserialize warning)
- scripts/check-test-imports.{,test.}mjs: format + drop a useless
regex escape that biome flagged in landed code
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure formatting / lint-fix pass that ran during `npm run build:core`
in the session that landed the agent-runner / quota / coverage /
phase-2 routing work. No logic changes — indentation, trailing
commas, import sort, etc. Captured separately so the actual feature
commits stay scoped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds visible diagnostics to runSubagent so the next time the
"session initialized but no LLM call" bug fires, the log identifies
which setup phase hangs.
Phases instrumented:
- resourceLoader.reload()
- createAgentSession()
- bindExtensions(runLifecycle=...)
- session.prompt() entry → return
Output format (stderr, prefixed with [subagent:<name>]):
phase=resourceLoader.reload 23ms
phase=createAgentSession 142ms
phase=bindExtensions 89ms runLifecycle=true
phase=session.prompt-entered taskLen=8421 timeoutMs=480000 noOutputMs=180000
phase=session.prompt-returned 16234ms ← normal completion
STUCK phase=<X> 10000ms (no completion signal ...) ← when watchdog fires
Each phase has a soft 10s watchdog that emits a STUCK line if the
await doesn't complete in time. The watchdog never aborts anything; it
only reports. The existing timeoutMs / noOutputTimeoutMs handle actual
termination.
This is investigation infrastructure for the third prompt-never-sent
seam (coding-agent/subagent-runner). The agent-runner.js seam
(sf-mp8g4rcd-w01tkh) was fixed in commit 8ee4d8358 with bounded
retries. This commit doesn't fix the underlying bug — it makes the
bug self-reporting next time it fires so operator and autonomous
loop both get actionable signal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add resolveRpcInitTimeoutMs() helper and wire it into RpcClient.init().
Default init timeout increased from 30s to 120s. Override via env var.
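
A hedged sketch of what such a helper might look like. The message does not name the env var, so the raw value is passed in as a parameter; the validation rules (positive, finite number, otherwise fall back) are an assumption.

```typescript
const DEFAULT_RPC_INIT_TIMEOUT_MS = 120_000;

// `raw` is the env-var value; the variable's name is not given in this message.
// Assumed policy: anything that is not a positive finite number falls back.
function resolveRpcInitTimeoutMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_RPC_INIT_TIMEOUT_MS;
  const parsed = Number(raw);
  return isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_RPC_INIT_TIMEOUT_MS;
}
```

Falling back on invalid input keeps a typo in the override from turning into a zero or NaN timeout.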
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tests were picking up the developer's real
~/.sf/agent/discovery-cache.json and seeing unexpected models in
output. Pin tests to a guaranteed-missing path via the new
_discoveryCacheFilePath option so the env they observe is solely
what the test constructs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The memory extraction system has infrastructure (DB tables, LLM prompts,
unit closeout wiring, embedding backfill) but zero processed units and
only self-feedback-resolution memories. This suggests extraction is
failing silently.
Add debugLog() calls throughout extractMemoriesFromUnit() so we can
observe:
- Skip reasons (mutex busy, rate limited, already processed, file too small)
- Start/done lifecycle per unit
- LLM call and parse outcomes
- Error messages on failure and retry
This makes the extraction pipeline observable via --debug or the
journal/debug log without changing behavior.
Tests: 185 files / 1993 tests pass.
Type check: clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In-process swarm workers get a fresh headless AgentSession whose permission
extension defaults to read-only minimal. This blocks normal autonomous edits
(e.g., write_file, edit) even when the parent session runs at normal or
trusted level.
- run-unit.js: add legacyPermissionLevelForProfile mapping and include
executorPermissionLevel in the dispatch envelope.
- swarm-dispatch.js: forward executorPermissionLevel from envelope to
runAgentTurn as permissionLevel.
- agent-runner.js: accept permissionLevel option and pass it to
runSubagent config.
- subagent-runner.ts: add permissionLevel to SubagentConfig; when set,
temporarily set SF_PERMISSION_LEVEL env and run extension lifecycle so
the permission extension reads the level before tool hooks execute.
- Tests for envelope field, dispatch forwarding, and run-unit integration.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 7 dogfood failed with "0 tool calls — context exhaustion" even
though the swarm worker's session DID call tools. Root cause: the
phases-unit.js zero-tool-call guard reads from the PARENT session's
message ledger via snapshotUnitMetrics. The swarm worker runs in an
ISOLATED subagent session — its tool calls never appear in the
parent's messages, so the guard always sees 0 and fires a false-
positive context-exhaustion retry.
Fix:
- runUnitViaSwarm now returns swarmToolCallCount on the UnitResult,
surfacing the real worker tool call count from the onEvent stream
(collectedToolCalls.length, accurate end-to-end).
- phases-unit.js zero-tool-call guard checks
unitResult._via === "swarm" && swarmToolCallCount > 0 and bypasses
the false-positive retry, logging "zero-tool-calls-swarm-bypass".
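
The guard logic from the fix can be sketched as a pure predicate. The _via and swarmToolCallCount fields come from the description above; the interface and function names are hypothetical.

```typescript
interface UnitResultLike {
  _via?: string;
  swarmToolCallCount?: number;
}

// Returns true when the zero-tool-call retry should fire.
function shouldRetryForZeroToolCalls(
  parentToolCalls: number,
  unitResult: UnitResultLike,
): boolean {
  if (parentToolCalls > 0) return false;
  // The swarm worker's calls never reach the parent ledger, so trust the
  // count reported on the result instead.
  const swarmDidWork =
    unitResult._via === "swarm" && (unitResult.swarmToolCallCount ?? 0) > 0;
  return !swarmDidWork;
}
```

A swarm result that reports zero tool calls still triggers the retry, so the bypass only applies when the worker demonstrably did something.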
Also adds a debug stderr line in subagent-runner.ts printing the tool
count after bindExtensions, confirming the worker session HAS the
full tool set (checkpoint + built-ins) — Hypotheses 1 and 2 from the
Round 8 brief ruled out by direct observation.
Tests: 3 new (swarmToolCallCount = 0 / N / 1-on-checkpoint-only);
2518 tests pass total, 0 regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Forward onEvent through swarm-dispatch → agent-runner → runSubagent
- Collect toolcall_end events in runUnitViaSwarm to build real tool-use blocks
- Detect checkpoint tool outcome for accurate unit completion signal
- Add headless.ts graceful shutdown (async signal handler, 2.5s timeout)
- RPC client stop() now awaits flush and propagates stop to child sessions
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add RunSubagentOptions.onEvent callback so callers (TUI live update panel
for /delegate, /rubber-duck, etc.) get every session event without polling.
Errors from the callback are caught so a buggy caller cannot crash the agent.
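
A minimal sketch of the "caller errors cannot crash the agent" rule, assuming a simplified event type and a hypothetical emitSafely helper; the real event shape is richer than this.

```typescript
// Simplified stand-in for the session event type.
type SessionEvent = { type: string };

// Invokes the caller's callback, swallowing anything it throws.
function emitSafely(
  onEvent: ((event: SessionEvent) => void) | undefined,
  event: SessionEvent,
): void {
  if (onEvent === undefined) return;
  try {
    onEvent(event);
  } catch {
    // A buggy listener must not take the agent down with it.
  }
}

const seen: string[] = [];
emitSafely((e) => { seen.push(e.type); }, { type: "toolcall_end" });
emitSafely(() => { throw new Error("buggy caller"); }, { type: "toolcall_end" });
```

Both calls return normally; the second listener's exception is contained inside emitSafely.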
Chain caller-supplied AbortSignal through a local AbortController in
runSingleAgent and register it in a new liveSubagentControllers set so
stopLiveSubagents aborts in-process subagents alongside the legacy spawn-based
processes (cmux split, sift codebase_search).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Create web/middleware.ts to authenticate all API routes via bearer token
and origin checks (previously unauthenticated due to missing middleware file)
- Fix path traversal in browse-directories: replace startsWith with
realpathSync + relative + isAbsolute containment checks
- Fix XSS in session HTML export: escape raw HTML blocks via marked renderer
- Fix PTY process leak: destroy session on SSE stream cancellation
- Fix unhandled exception in terminal sessions POST: wrap getOrCreateSession
in try/catch with structured JSON error response
- Fix silent child-process failure in headless dispatch: add exit handler
to write failed claim when sf headless triage exits non-zero
- Fix TypeError on malformed claim JSON: add Array.isArray guard before
accessing claim.ids.length
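
The path-traversal fix in the list above relies on a containment check rather than a string prefix. A minimal sketch of that check follows, with a hypothetical function name; the realpathSync step is only mentioned in a comment because it needs real files on disk.

```typescript
import * as path from "node:path";

// Sketch only: in real use, pass both paths through fs.realpathSync first so
// that symlinks are resolved before the containment check.
function isInsideRoot(root: string, candidate: string): boolean {
  const rel = path.relative(path.resolve(root), path.resolve(candidate));
  if (rel === "") return true; // candidate is the root itself
  if (path.isAbsolute(rel)) return false; // e.g. a different drive on Windows
  return rel.split(path.sep)[0] !== "..";
}

// The naive prefix check accepts a sibling directory that shares the prefix.
const naiveResult = "/srv/data-evil/x".startsWith("/srv/data");
const safeResult = isInsideRoot("/srv/data", "/srv/data-evil/x");
console.log({ naiveResult, safeResult });
```

Here the prefix check accepts /srv/data-evil/x while the relative-path check rejects it, which is the gap that startsWith leaves open.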
All changes type-check cleanly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds an optional wireModelId field to the Model interface and a
resolveWireModelId helper. Forge's canonical model.id stays stable for
selection, capability scoring, policy, and history; providers now send
model.wireModelId on the wire when set, model.id otherwise.
Use cases: Azure deployment names, vendor model slugs that differ
from Forge's canonical identity, A/B routing where the operator wants
canonical history but a specific deployment.
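
A minimal sketch of the fallback rule, assuming a trimmed-down Model interface. Whether the real helper treats an empty string as "set" is not stated, so this sketch only falls back on undefined or null; the ids in the usage lines are made up.

```typescript
// Trimmed-down stand-in for the Model interface.
interface ModelLike {
  id: string;
  wireModelId?: string;
}

// The id sent to the provider: wireModelId when set, model.id otherwise.
function resolveWireModelId(model: ModelLike): string {
  return model.wireModelId ?? model.id;
}

// Hypothetical ids, for illustration only.
const canonical: ModelLike = { id: "example-model" };
const deployed: ModelLike = { id: "example-model", wireModelId: "my-deployment-name" };
console.log(resolveWireModelId(canonical), resolveWireModelId(deployed));
```

Both models share the same canonical id for selection and history; only the string sent on the wire differs.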
Wired through every provider in @singularity-forge/ai (anthropic,
amazon-bedrock, azure-openai-responses, google, google-vertex,
google-gemini-cli, mistral, openai-codex-responses, openai-completions,
openai-responses) plus @singularity-forge/coding-agent's
ModelRegistry (model definitions + per-model overrides).
Tests: openai-completions wireModelId payload coverage +
model-registry-auth-mode coverage for the override + definition fields.
Full pi-ai + coding-agent suite: 956/956 ✓ (7 unrelated skipped).
This realizes the model-registry contract drafted in 1d753af6b.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the @singularity-forge/google-gemini-cli-provider package layout
for the codex CLI integration boundary. The new package owns:
- CodexAppServerClient (the JSON-RPC subprocess client; previously
packages/ai/src/providers/codex-app-server-client.ts, no pi-ai
internal coupling)
- snapshotCodexCliAccount / discoverCodexCliModels (reads
~/.codex/models_cache.json with visibility=list ∧ supported_in_api
filter; previously inline in src/resources/extensions/sf/openai-codex-catalog.js)
openai-codex-responses.ts (the stream-shaping provider) intentionally
stays in @singularity-forge/ai because it depends on pi-ai stream-event
internals and is not reusable outside the provider — same scope as
google-gemini-cli.ts vs google-gemini-cli-provider.
The SF extension's openai-codex-catalog.js is now a thin SF-side cache
writer that delegates to discoverCodexCliModels, mirroring how
gemini-catalog.js delegates to discoverGeminiCliModels. readCodexAvailableModels
became async to match the dynamic-import path; tests updated.
Closes sf-mp4u5fcz-wh6ac9 (with documented AC2 narrowing — see
resolution).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a machine-readable headless surface for live LLM-provider usage and
unifies the gemini-cli quota fetch through one helper, removing the
duplication that existed between usage-bar.js and the new package.
1. snapshotGeminiCliAccount in @singularity-forge/google-gemini-cli-provider
- Single source of truth for { projectId, userTierId, userTierName,
paidTier, models[] } via setupUser + retrieveUserQuota.
- Dedups buckets per modelId, keeping the worst (lowest remainingFraction)
so consumers always see the most-restrictive window. Code Assist
sometimes returns multiple buckets per model; the pessimistic choice
is what every consumer needs.
- discoverGeminiCliModels(cwd?) wraps it for catalog-cache callers that
only need the IDs.
2. sf headless usage subcommand
- New src/headless-usage.ts handler. text (default) and --json output.
Uses the package's snapshot directly — no RPC child, no jiti
gymnastics — matching the shape of headless-uok-status / headless-doctor.
- Wired into src/headless.ts after the doctor block.
- Help text adds the command line.
3. usage-bar.js refactored to delegate
- fetchGeminiUsage no longer imports gemini-cli-core directly. It calls
snapshotGeminiCliAccount and reshapes the result into the existing
{ provider, displayName, windows[] } UI contract.
- Eliminates the duplicate setupUser + retrieveUserQuota code path.
- The fast existsSync(~/.gemini/oauth_creds.json) pre-flight stays
so unauth'd users get a friendly message without paying for OAuth
bootstrap.
4. Model registry refactor (separate track committed alongside)
- src/resources/extensions/sf/model-registry.ts (new) consolidates
canonical model identity, capability tier, and generation tags into
one source of truth that auto-model-selection, benchmark-selector,
and model-router now consume instead of maintaining parallel maps.
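
The pessimistic bucket dedup described in item 1 can be sketched as follows. The modelId and remainingFraction fields come from the text; the interface name, function name, and the second model id in the usage lines are hypothetical.

```typescript
interface QuotaBucket {
  modelId: string;
  remainingFraction: number;
}

// Keeps one bucket per modelId: the one with the lowest remainingFraction.
function keepWorstBucketPerModel(buckets: QuotaBucket[]): QuotaBucket[] {
  const worst: Record<string, QuotaBucket> = {};
  for (const bucket of buckets) {
    const current = worst[bucket.modelId];
    if (current === undefined || bucket.remainingFraction < current.remainingFraction) {
      worst[bucket.modelId] = bucket;
    }
  }
  return Object.keys(worst).map((id) => worst[id]);
}

const deduped = keepWorstBucketPerModel([
  { modelId: "gemini-3-flash-preview", remainingFraction: 0.8 },
  { modelId: "gemini-3-flash-preview", remainingFraction: 0.1 },
  { modelId: "other-model", remainingFraction: 0.5 },
]);
console.log(deduped);
```

For the sample input, gemini-3-flash-preview is reported with remainingFraction 0.1, the more restrictive of its two windows.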
All 1487 tests pass (151 files); typecheck clean for both the package
and the SF extensions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related fixes for the google-gemini-cli provider, both motivated by today's
dogfood diagnosis: SF was pinned to a single model (gemini-3-flash-preview)
even though the AI Ultra account has access to seven (verified via the live
gemini-cli-core probe), and a transient "No capacity available for model X
on the server" was classified as `unknown` so SF gave up instead of retrying.
1. Account snapshot + model discovery in @singularity-forge/google-gemini-cli-provider
- Add `snapshotGeminiCliAccount(cwd?)` returning { projectId, userTierId,
userTierName, paidTier, models } where `models[]` carries each modelId
with usedFraction, remainingFraction, and resetTime. Built on the same
setupUser + CodeAssistServer.retrieveUserQuota path usage-bar.js
already uses, but extracted to the dedicated package so any consumer
(model picker, capacity diagnostics, catalog cache) can call one helper.
- Add `discoverGeminiCliModels(cwd?)` as a thin "just the IDs" wrapper.
- Both are best-effort: any failure (OAuth expired, no project, network)
returns null silently — never throws.
2. SF-side cache writer at src/resources/extensions/sf/gemini-catalog.js
- Delegates discovery to the package; only handles cache file path,
6-hour TTL, and the session_start lifecycle hook.
- Cache lands at .sf/runtime/model-catalog/google-gemini-cli.json with
the same shape as the generic model-catalog-cache, so getKnownModelIds
and the model picker pick it up transparently.
- Wired into bootstrap/register-hooks.js session_start in parallel with
the existing scheduleModelCatalogRefresh (the generic REST + API-key
path can't reach gemini-cli's OAuth-only Code Assist endpoint).
3. Capacity error classification fix
- error-classifier.js SERVER_RE now matches "no capacity (available|left)",
"capacity (unavailable|exhausted)", and "no capacity ... on the server".
Previously these fell through to kind=unknown, which is not transient,
so agent-end-recovery never retried — even though the same handler
already caps gemini-cli rate-limit backoff at 30s for exactly this
class of transient. With the pattern matched as `server`, the existing
retry-with-backoff path covers it.
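
A hedged sketch of the capacity patterns from item 3. The real SERVER_RE in error-classifier.js covers other server errors as well and may be written differently; this regex only encodes the three phrasings quoted above, and the function name is hypothetical.

```typescript
// Only the three capacity phrasings named in this message; not the full SERVER_RE.
const CAPACITY_RE =
  /no capacity (?:available|left)|capacity (?:unavailable|exhausted)|no capacity\b.*\bon the server/i;

type ErrorKind = "server" | "unknown";

// "server" is treated as transient and retried; "unknown" is not.
function classifyCapacityError(message: string): ErrorKind {
  return CAPACITY_RE.test(message) ? "server" : "unknown";
}

console.log(classifyCapacityError("No capacity available for model X on the server"));
```

The sample message from the dogfood run classifies as "server", which routes it into the existing retry-with-backoff path.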
The full extension test suite (1386 tests) passes. Typecheck clean for both
the package and the SF extensions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>