singularity/singularity-forge

Fork 0

Mikael Hugo 67d25f95f2 sf: add gemini cli preflight token counting

2026-04-19 13:25:07 +02:00

12 KiB

Raw Blame History

Gemini CLI Core Follow-Up TODO

Last updated: 2026-04-19

Context:

Transport re-platform landed in commit d83a59fb1 (pi-ai/google-gemini-cli: re-platform transport on @google/gemini-cli-core).
Model list wiring landed in commit 8abfc98fd (pi-ai: source google-gemini-cli model list from cli-core's VALID_GEMINI_MODELS).
The remaining work is to exploit more of @google/gemini-cli-core instead of keeping SF-specific handwritten logic around quota, retry classification, and model naming.

Relevant files:

Recommended order

Token counting via cli-core
Quota surfacing via cli-core
Structured Gemini error classification
Gemini model aliasing
Lower-priority follow-ups (grounding, tests, keychain, telemetry)

1. Token counting via `server.countTokens(...)`

Status:

Not integrated.
SF currently has generic token estimation in src/resources/extensions/sf/token-counter.ts, mostly tiktoken or chars/4.
There is already a before_provider_request hook path in src/resources/extensions/sf/bootstrap/register-hooks.ts, which is the cleanest place to attach pre-dispatch counting for Gemini requests.

Why this matters:

Gives an accurate preflight token count for Gemini before dispatch.
Lets SF warn before sending oversized prompts instead of only reacting to provider overflow errors.
Enables future auto-compaction logic to use provider-native counts instead of heuristics.

Concrete implementation:

Add a small helper in packages/pi-ai/src/providers/google-gemini-cli.ts or a neighboring helper module that builds a CodeAssistServer and exposes countTokens.
Thread enough auth/project context into the hook path so Gemini requests can be counted without re-implementing auth discovery.
In before_provider_request, when the target model/provider is google-gemini-cli, call cli-core countTokens on the request payload before send.
Preserve graceful fallback to current heuristics when auth is missing, cli-core throws, or the request shape is unsupported.
Surface the result to the UI/logging path where it is actually actionable:
- warning banner for obviously large prompts,
- optional metrics/session recording,
- optional compact-context trigger later.

Likely touchpoints:

packages/pi-ai/src/providers/google-gemini-cli.ts
src/resources/extensions/sf/bootstrap/register-hooks.ts
src/resources/extensions/sf/token-counter.ts
possibly packages/pi-coding-agent/src/core/sdk.ts if the hook plumbing needs richer request/model context.

Acceptance criteria:

Gemini requests can report an accurate pre-dispatch token count without sending an inference request.
Failures in counting do not block dispatch.
Existing token-counter tests remain green; add focused tests for Gemini preflight behavior.

Open question:

Decide whether the counted value belongs in ephemeral UI only, metrics only, or both. Do not wire it into multiple places until one storage/display path is chosen.

2. Quota surfacing via `server.retrieveUserQuota(...)` and `refreshAvailableCredits()`

Status:

src/resources/extensions/sf-usage-bar/index.ts still manually reads OAuth tokens and POSTs directly to https://cloudcode-pa.googleapis.com/v1internal:retrieveUserQuota.
That code bypasses cli-core even though cli-core already knows the right request/response path.

Why this matters:

Replaces fragile handwritten Gemini usage fetching.
Avoids relying on duplicated credential discovery and raw fetch behavior.
Makes the usage bar consistent with the same client stack the provider now uses.
Gives a better foundation for distinguishing rate limit exhaustion vs billing/quota exhaustion.

Concrete implementation:

Extract Gemini auth/client creation into a reusable helper instead of duplicating token parsing in the usage bar.
Replace the raw fetchGeminiUsage() request with cli-core quota APIs.
Normalize cli-core quota response into the existing UsageSnapshot / RateWindow shape used by the usage bar.
Keep current UI semantics where possible:
- show Pro and Flash windows if quota exists,
- preserve non-fatal error display,
- keep timeouts bounded.
If cli-core returns richer quota detail, prefer surfacing remaining counts instead of only percentages when it fits the existing footer.

Likely touchpoints:

src/resources/extensions/sf-usage-bar/index.ts
packages/pi-ai/src/providers/google-gemini-cli.ts or a new shared Gemini cli-core helper
auth plumbing if the usage bar should stop reading raw files directly

Acceptance criteria:

No direct raw POST to v1internal:retrieveUserQuota remains in sf-usage-bar.
Gemini usage bar continues to render successfully with valid OAuth credentials.
Missing/expired credentials degrade to a clear non-crashing error state.

Follow-up if this lands cleanly:

Remove duplicated token loading from auth.json and ~/.gemini/oauth_creds.json wherever cli-core can own it.

3. Structured Gemini error classification

Status:

packages/pi-coding-agent/src/core/retry-handler.ts still decides retryability through large regexes in isRetryableError() and downstream string classification.
Claude already identified cli-core exports such as getErrorStatus, ModelNotFoundError, and googleQuotaErrors as a better source of truth for Gemini-specific failures.

Why this matters:

Current retry behavior is broad and stringly-typed.
Regex classification is hard to audit and easy to drift from upstream wording changes.
Structured classification would let Gemini-specific retry/backoff behavior be smarter without perturbing other providers.

Concrete implementation:

Add a Gemini-specific error mapping layer that:
- consumes cli-core error helpers when the active provider is google-gemini-cli,
- converts those into SF retry classes (rate_limit, quota_exhausted, auth_error, model_not_found, non-retryable, etc.).
Keep non-Gemini providers on the existing generic path unless there is a broader retry refactor planned.
Update retry decisions so Gemini can distinguish:
- temporary rate-limit/cooldown,
- quota exhaustion / credits exhausted,
- auth problems,
- invalid model / model not found.
Reduce duplicated regex patterns once the structured path is proven.

Likely touchpoints:

packages/pi-coding-agent/src/core/retry-handler.ts
possibly packages/pi-ai/src/providers/google-gemini-cli.ts if provider-specific error wrapping is needed
Gemini retry tests and any SF extension tests asserting rate-limit recovery behavior

Acceptance criteria:

Retry behavior for google-gemini-cli is driven by cli-core-aware classification rather than raw message regex where feasible.
Existing non-Gemini retry behavior does not regress.
Tests cover at least one case each for rate limit, quota exhaustion, auth failure, and model-not-found.

Important constraint:

Do not over-generalize this into a cross-provider retry framework in the same change. Keep the blast radius Gemini-scoped first.

4. Gemini model aliasing (`auto`, `pro`, `flash`, `flash-lite`)

Status:

Concrete model list wiring exists via cli-core VALID_GEMINI_MODELS.
User-facing model resolution still expects explicit IDs in places like src/resources/extensions/sf/auto-model-selection.ts.

Why this matters:

Lets preferences stay stable even as Google revs concrete model names.
Reduces churn in PREFERENCES.md and docs when preview/stable IDs change.
Complements the static model import already added from cli-core.

Concrete implementation:

Define Gemini aliases accepted by SF config and resolution:
- gemini:auto
- gemini:pro
- gemini:flash
- gemini:flash-lite
- decide whether equivalent slash or bare forms should also resolve.
Resolve aliases through cli-core’s model alias resolution rather than SF-local hardcoding where possible.
Hook alias resolution into the same places where model IDs are interpreted from preferences and dynamic routing.
Preserve current explicit-ID behavior exactly.
Add user-facing documentation once semantics are stable.

Likely touchpoints:

src/resources/extensions/sf/auto-model-selection.ts
model resolution helpers in SF preference/bootstrap code
possibly packages/pi-coding-agent/src/core/model-resolver.ts if alias resolution belongs closer to the registry layer

Acceptance criteria:

A user can specify Gemini aliases in preferences and have them resolve to a valid available model.
Explicit full model IDs still win when provided.
Unknown aliases fail clearly instead of silently picking an arbitrary Gemini model.

Risk to manage:

Avoid leaking alias behavior into non-Gemini providers or into provider/model parsing in a way that breaks existing provider/model resolution semantics.

5. Lower-priority follow-ups

5a. Google Search grounding tool

Potential value:

Native Gemini grounding could replace some Brave/Tavily usage for Gemini-dispatched work.

Why this is lower priority:

It touches provider tools, routing, and citation UX, which is a bigger behavioral change than quota/counting.

Suggested approach:

Investigate only after quota/counting/error classification are stable.
Keep the change Gemini-scoped and opt-in first.

5b. Test harness migration

Potential value:

makeFakeConfig, createMockWorkspaceContext, and MockMessageBus can make provider tests more realistic and less dependent on raw network mocking.

Suggested approach:

Fold this into whichever of items 1-4 lands first rather than doing it standalone.

5c. MCP keychain storage

Potential value:

Better OAuth token storage for MCP auth flows.

Why this is separate:

It is an auth-storage architecture change, not a Gemini provider follow-up.

5d. IDE detection, terminal serialization, experiment flags, telemetry

Potential value:

Nice integrations and better fidelity with upstream CLI behavior.

Why this is last:

None of them unblock current Gemini provider correctness or UX.

Proposed task breakdown for delegation

If these are delegated as separate rescue tasks, use this split:

gemini-token-counting

Implement cli-core-backed Gemini preflight counting.
Own hook integration and tests only.
Do not touch usage bar or retry logic.

gemini-quota-usage-bar

Replace Gemini raw quota fetch with cli-core quota APIs.
Own sf-usage-bar integration and Gemini auth helper reuse.
Do not touch retry handler or model resolution.

gemini-retry-classification

Introduce cli-core-backed Gemini error classification in retry handling.
Own retry tests and minimal provider-side wrapping if needed.
Do not change quota UI or aliasing.

gemini-model-aliases

Add auto/pro/flash/flash-lite alias resolution for Gemini preferences.
Own resolution/documentation/tests only.

Suggested definition of done

The follow-up is complete when:

Gemini transport, quota lookup, and retry classification all rely on cli-core where upstream already provides that behavior.
SF no longer carries duplicated handwritten Gemini logic for areas cli-core already owns.
User-facing configuration for Gemini is more stable than concrete preview model IDs.
Each change lands with focused tests and without broad cross-provider refactors.

12 KiB Raw Blame History Unescape Escape

Gemini CLI Core Follow-Up TODO

Recommended order

1. Token counting via server.countTokens(...)

2. Quota surfacing via server.retrieveUserQuota(...) and refreshAvailableCredits()

3. Structured Gemini error classification

4. Gemini model aliasing (auto, pro, flash, flash-lite)

5. Lower-priority follow-ups

5a. Google Search grounding tool

5b. Test harness migration

5c. MCP keychain storage

5d. IDE detection, terminal serialization, experiment flags, telemetry

Proposed task breakdown for delegation

Suggested definition of done

12 KiB

Raw Blame History

1. Token counting via `server.countTokens(...)`

2. Quota surfacing via `server.retrieveUserQuota(...)` and `refreshAvailableCredits()`

4. Gemini model aliasing (`auto`, `pro`, `flash`, `flash-lite`)