# Ollama Extension — First-Class Local LLM Support
## Status: DRAFT — Awaiting approval
## Problem
Ollama support in GSD2 currently requires manual `models.json` configuration. Users must:
1. Know the OpenAI-compatibility endpoint (`localhost:11434/v1`)
2. Manually list every model they want to use
3. Set compat flags (`supportsDeveloperRole: false`, etc.)
4. Use a dummy API key
There's an `ollama-cloud` provider for hosted Ollama, and a discovery adapter that can list models, but no first-class **local Ollama** extension that "just works."
## Goal
Make Ollama the easiest way to use GSD2 — zero config when Ollama is running locally. All Ollama functionality lives in a single extension: `src/resources/extensions/ollama/`.
## Architecture
Everything is a self-contained extension under `src/resources/extensions/ollama/`. The extension:
- Auto-detects Ollama on startup via health check
- Discovers and registers local models with the model registry
- Provides native Ollama API streaming (not OpenAI shim)
- Exposes `/ollama` slash commands for model management
- Registers an LLM-callable tool for model pull/status
Minimal core changes — only `KnownProvider` and `KnownApi` type additions in `pi-ai`, and `env-api-keys.ts` for key resolution. Everything else is in the extension.
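How the entry point ties these pieces together is sketched below. This is a minimal sketch, assuming the host exposes a `session_start` hook, the `pi.registerCommand()` / `pi.registerTool()` registration calls referenced later in this plan, and a `ctx.modelRegistry`; the `ExtensionHost` shape and every signature here are assumptions, not the real GSD2 extension API.
```typescript
// index.ts: minimal wiring sketch. Host API names and signatures are assumed, not confirmed.
import { OllamaClient } from "./ollama-client";
import { discoverModels } from "./ollama-discovery";
import { registerOllamaCommands } from "./ollama-commands";
import { registerOllamaTool } from "./ollama-tool";
import { registerOllamaProvider } from "./ollama-provider";

// Assumed shape of the extension host; the real type comes from the GSD2 extension API.
type ExtensionHost = {
  registerCommand: (name: string, run: (args: string[], ctx: any) => Promise<string | void>) => void;
  registerTool: (tool: object) => void;
  on: (event: "session_start", handler: (ctx: any) => Promise<void>) => void;
};

export default function activate(pi: ExtensionHost) {
  registerOllamaCommands(pi); // /ollama status, pull, list, remove, ps (Phase 3)
  registerOllamaTool(pi);     // ollama_manage LLM tool (Phase 3)
  registerOllamaProvider(pi); // native /api/chat provider (Phase 2)

  pi.on("session_start", async (ctx) => {
    const client = new OllamaClient(process.env.OLLAMA_HOST);
    if (!(await client.isRunning())) return; // Ollama absent: stay silent

    const models = await discoverModels(client);
    for (const model of models) ctx.modelRegistry.register(model);
    ctx.ui.setWidget("ollama", `Ollama detected: ${models.length} local models`);
  });
}
```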
## File Structure
```
src/resources/extensions/ollama/
├── index.ts               # Extension entry — wires everything on session_start
├── ollama-client.ts       # HTTP client for Ollama REST API (/api/*)
├── ollama-discovery.ts    # Model discovery + capability detection
├── ollama-provider.ts     # Native /api/chat streaming provider (registers with pi-ai)
├── ollama-commands.ts     # /ollama slash commands (status, pull, list, remove, ps)
├── ollama-tool.ts         # LLM-callable tool for model management
├── model-capabilities.ts  # Known model capability table (context window, vision, reasoning)
└── types.ts               # Shared types for Ollama API responses
```
## Scope
### Phase 1: Auto-Discovery + OpenAI-Compat Routing
**What:** Extension that auto-detects Ollama, discovers models, registers them using the existing `openai-completions` API provider. Zero config needed.
**Extension files:**
- `ollama/index.ts` — Main entry. On `session_start`:
1. Probe `localhost:11434` (or `OLLAMA_HOST`) with 1.5s timeout
2. If reachable, discover models via `/api/tags`
3. Register discovered models with `ctx.modelRegistry` using correct defaults
4. Show status widget if Ollama is detected
- `ollama/ollama-client.ts` — Low-level HTTP client (a sketch follows this list):
- `isRunning()` → `GET /` health check
- `getVersion()` → `GET /api/version`
- `listModels()` → `GET /api/tags`
- `showModel(name)` → `POST /api/show` (details, template, parameters, size)
- `getRunningModels()` → `GET /api/ps` (loaded models, VRAM usage)
- `pullModel(name, onProgress)` → `POST /api/pull` (streaming progress)
- `deleteModel(name)` → `DELETE /api/delete`
- `copyModel(source, dest)` → `POST /api/copy`
- Respects `OLLAMA_HOST` env var for non-default endpoints
- `ollama/ollama-discovery.ts` — Enhanced model discovery:
- Calls `/api/tags` to get model list
- Calls `/api/show` per model (batch, cached) to get:
- `details.parameter_size` → estimate context window
- `details.families` → detect vision (clip), reasoning (deepseek-r1)
- `modelfile` → extract default parameters
- Returns enriched `DiscoveredModel[]` with proper capabilities
- `ollama/model-capabilities.ts` — Known model lookup table (a sketch follows the core-changes list below):
- Maps well-known model families to capabilities
- e.g., `llama3.1` → `{ contextWindow: 131072, input: ["text"] }`
- e.g., `llava` → `{ contextWindow: 4096, input: ["text", "image"] }`
- e.g., `deepseek-r1` → `{ reasoning: true, contextWindow: 131072 }`
- e.g., `qwen2.5-coder` → `{ contextWindow: 131072, input: ["text"] }`
- Fallback: estimate from parameter count if not in table
- `ollama/types.ts` — Ollama API response types
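The client methods above map one-to-one onto Ollama's REST endpoints. A minimal sketch of two of them, assuming plain `fetch` and a hypothetical `OllamaModelSummary` type in `types.ts`:
```typescript
// ollama-client.ts: minimal sketch. Endpoint paths follow the public Ollama REST API.
import type { OllamaModelSummary } from "./types"; // hypothetical shared type

export class OllamaClient {
  // Note: a real implementation would normalize OLLAMA_HOST values that lack a scheme.
  constructor(private baseUrl = process.env.OLLAMA_HOST ?? "http://localhost:11434") {}

  /** Health check: GET / with a short timeout so startup never blocks on a missing daemon. */
  async isRunning(timeoutMs = 1500): Promise<boolean> {
    try {
      const res = await fetch(this.baseUrl, { signal: AbortSignal.timeout(timeoutMs) });
      return res.ok;
    } catch {
      return false;
    }
  }

  /** List locally pulled models via GET /api/tags. */
  async listModels(): Promise<OllamaModelSummary[]> {
    const res = await fetch(`${this.baseUrl}/api/tags`);
    if (!res.ok) throw new Error(`Ollama /api/tags failed: ${res.status}`);
    const body = (await res.json()) as { models: OllamaModelSummary[] };
    return body.models;
  }
}
```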
**Core changes (minimal):**
- `packages/pi-ai/src/types.ts` — Add `"ollama"` to `KnownProvider`
- `packages/pi-ai/src/env-api-keys.ts` — Add `"ollama"` key resolution (returns `"ollama"` placeholder — no real key needed)
- `src/onboarding.ts` — Add `"ollama"` to provider selection list
- `src/wizard.ts` — Add `ollama` entry (no key required)
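The known-model table might look like the following sketch. The entries mirror the examples listed above; the parameter-size fallback heuristic and its numbers are illustrative, not decided:
```typescript
// model-capabilities.ts: sketch. Table entries mirror this plan; the fallback numbers are illustrative.
export interface ModelCapabilities {
  contextWindow: number;
  input: ("text" | "image")[];
  reasoning?: boolean;
}

const KNOWN_FAMILIES: Record<string, ModelCapabilities> = {
  "llama3.1":      { contextWindow: 131072, input: ["text"] },
  "llava":         { contextWindow: 4096,   input: ["text", "image"] },
  "deepseek-r1":   { contextWindow: 131072, input: ["text"], reasoning: true },
  "qwen2.5-coder": { contextWindow: 131072, input: ["text"] },
};

export function lookupCapabilities(modelId: string, parameterSize?: string): ModelCapabilities {
  const family = Object.keys(KNOWN_FAMILIES).find((f) => modelId.startsWith(f));
  if (family) return KNOWN_FAMILIES[family];
  // Fallback: estimate conservatively from /api/show's details.parameter_size ("8.0B", "70B", ...).
  const billions = parameterSize ? parseFloat(parameterSize) : 0;
  return { contextWindow: billions >= 7 ? 32768 : 8192, input: ["text"] };
}
```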
**Model registration details:**
Each discovered model registers as:
```typescript
{
  id: "llama3.1:8b",          // from /api/tags
  name: "Llama 3.1 8B",       // humanized
  api: "openai-completions",  // uses existing provider
  provider: "ollama",
  baseUrl: "http://localhost:11434/v1",
  cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
  reasoning: false,           // from capabilities table
  input: ["text"],            // from capabilities table
  contextWindow: 131072,      // from capabilities table or /api/show
  maxTokens: 16384,           // conservative default
  compat: {
    supportsDeveloperRole: false,
    supportsReasoningEffort: false,
    supportsUsageInStreaming: false,
    maxTokensField: "max_tokens",
  },
}
```
**Behavior:**
- `gsd --list-models` shows all locally-pulled Ollama models automatically
- `/model ollama/llama3.1:8b` works without any config file
- If Ollama isn't running, the extension stays silent — no errors, no models listed
- `models.json` overrides still work (user config wins over auto-discovery)
### Phase 2: Native Ollama API Provider (`/api/chat`)
**What:** A dedicated streaming provider that talks Ollama's native protocol instead of the OpenAI compatibility shim.
**Extension files:**
- `ollama/ollama-provider.ts` — Native `/api/chat` streaming:
- Registers `"ollama-chat"` API with `registerApiProvider()`
- Implements `stream()` and `streamSimple()`:
- Maps GSD `Context` → Ollama messages format
- Maps GSD `Tool[]` → Ollama tool format
- Streams NDJSON responses, maps back to `AssistantMessage` events
- Extracts `<think>` blocks for reasoning models (deepseek-r1, qwq)
- Ollama-specific options:
- `keep_alive` — control model memory retention (default: "5m")
- `num_ctx` — pass through model's context window
- `num_predict` — max output tokens
- Temperature, top_p, top_k
- Response metadata:
- `eval_count` / `eval_duration` → tokens/sec in usage stats
- `total_duration`, `load_duration` → performance visibility
- Vision support: converts image content to base64 for multimodal models
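A sketch of the NDJSON handling at the core of this provider, assuming Ollama's documented `/api/chat` streaming format (one JSON object per line, with a final `done: true` object carrying `eval_count` and `eval_duration` in nanoseconds). The yielded event shapes are placeholders for whatever `registerApiProvider()` actually expects:
```typescript
// ollama-provider.ts: stream-parsing sketch. The yielded event objects are placeholders.
export async function* streamChat(baseUrl: string, request: Record<string, unknown>) {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ ...request, stream: true }),
  });
  if (!res.ok || !res.body) throw new Error(`Ollama /api/chat failed: ${res.status}`);

  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += value;
    let newline: number;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (!line) continue;
      const chunk = JSON.parse(line);
      if (chunk.done) {
        // The final chunk carries timing metadata: eval_count / eval_duration(ns) -> tokens/sec.
        const tokensPerSec = chunk.eval_count / (chunk.eval_duration / 1e9);
        yield { type: "done", usage: { outputTokens: chunk.eval_count, tokensPerSec } };
      } else if (chunk.message?.content) {
        yield { type: "text", delta: chunk.message.content };
      }
    }
  }
}
```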
**Core changes:**
- `packages/pi-ai/src/types.ts` — Add `"ollama-chat"` to `KnownApi`
**Phase 1 models switch to `api: "ollama-chat"` by default.** Users can force OpenAI-compat via `models.json` override if needed.
**Why native over OpenAI-compat:**
- Full `keep_alive` / `num_ctx` control
- Better error messages (Ollama-native vs generic OpenAI)
- More reliable tool calling on Ollama's native format
- Performance metrics in response (tokens/sec)
- Foundation for model management commands
### Phase 3: Local LLM Management UX
**What:** `/ollama` slash commands and an LLM tool for model management.
**Extension files:**
- `ollama/ollama-commands.ts` — Slash commands registered via `pi.registerCommand()`:
- `/ollama` — Status overview:
```
Ollama v0.5.7 — running (localhost:11434)
Loaded:
  llama3.1:8b         4.7 GB VRAM    idle 3m
Available:
  llama3.1:8b         (4.7 GB)
  qwen2.5-coder:7b    (4.4 GB)
  deepseek-r1:8b      (4.9 GB)
```
- `/ollama pull <model>` — Pull with streaming progress via `ctx.ui.setWidget()`
- `/ollama list` — List all local models with sizes and families
- `/ollama remove <model>` — Delete a model (with confirmation)
- `/ollama ps` — Running models + VRAM usage
- `ollama/ollama-tool.ts` — LLM-callable tool registered via `pi.registerTool()`:
- `ollama_manage` tool — lets the agent pull/list/check models
- Parameters: `{ action: "list" | "pull" | "status" | "ps", model?: string }`
- Use case: agent detects it needs a model, pulls it automatically
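A sketch of the Phase 3 registrations. The `pi.registerCommand()` and `pi.registerTool()` calls come from this plan, but their signatures, the `ctx` surface, and the tool-schema format are assumptions (as is the `ExtensionHost` type, sketched under Architecture above):
```typescript
// ollama-commands.ts / ollama-tool.ts: registration sketch. Host API signatures are assumed.
import { OllamaClient } from "./ollama-client";

export function registerOllamaCommands(pi: ExtensionHost) {
  pi.registerCommand("ollama", async (args, ctx) => {
    const client = new OllamaClient();
    const [subcommand, model] = args;
    if (subcommand === "pull" && model) {
      // Streaming pull progress rendered through the status widget.
      await client.pullModel(model, (progress: string) =>
        ctx.ui.setWidget("ollama-pull", `Pulling ${model}… ${progress}`),
      );
      return `✓ ${model} ready`;
    }
    // status / list / remove / ps would dispatch to the other client methods the same way.
  });
}

export function registerOllamaTool(pi: ExtensionHost) {
  pi.registerTool({
    name: "ollama_manage",
    description: "List, pull, or inspect local Ollama models",
    parameters: {
      action: { type: "string", enum: ["list", "pull", "status", "ps"] },
      model: { type: "string", optional: true },
    },
    execute: async ({ action, model }: { action: string; model?: string }) => {
      const client = new OllamaClient();
      if (action === "pull" && model) await client.pullModel(model, () => {});
      return action === "ps" ? await client.getRunningModels() : await client.listModels();
    },
  });
}
```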
**UX Flow:**
```
$ gsd
> /ollama
Ollama v0.5.7 — running (localhost:11434)
Loaded:
  llama3.1:8b — 4.7 GB VRAM, idle 3m
Available:
  llama3.1:8b (4.7 GB)
  qwen2.5-coder:7b (4.4 GB)
  deepseek-r1:8b (4.9 GB)
> /ollama pull codestral:22b
Pulling codestral:22b...
████████████████████████████░░░░ 78% (14.2 GB / 18.1 GB)
✓ codestral:22b ready
> /model ollama/codestral:22b
Switched to codestral:22b (local, Ollama)
```
## Implementation Order
1. **Phase 1** — Auto-discovery with OpenAI-compat routing. Biggest user impact, smallest risk.
2. **Phase 3** — Management UX (`/ollama` commands). Valuable even before native API.
3. **Phase 2** — Native `/api/chat` provider. Optimization over OpenAI-compat; do last.
## Core Changes Summary (minimal)
| File | Change |
|------|--------|
| `packages/pi-ai/src/types.ts` | Add `"ollama"` to `KnownProvider`, `"ollama-chat"` to `KnownApi` (Phase 2) |
| `packages/pi-ai/src/env-api-keys.ts` | Add `"ollama"` → always returns `"ollama"` placeholder |
| `src/onboarding.ts` | Add `"ollama"` to provider picker |
| `src/wizard.ts` | Add `"ollama"` key mapping (no key required) |
Everything else lives in `src/resources/extensions/ollama/`.
## Risks & Mitigations
| Risk | Mitigation |
|------|------------|
| Ollama not running — startup probe latency | 1.5s timeout; cache result; probe async so it doesn't block TUI paint |
| Model capabilities unknown | Known-model table + `/api/show` fallback + parameter_size estimation |
| Tool calling unreliable on small models | Detect param count; warn on <7B models |
| Ollama API changes between versions | Version detect via `/api/version`; stable endpoints only |
| Conflicts with `models.json` Ollama config | User config always wins; auto-discovered models merge beneath manual config |
| Extension disabled by the user | Extension is additive; disabling it removes all Ollama features cleanly with no impact on core |
## Testing Strategy
- Unit tests: `ollama-client.ts` with mocked fetch responses
- Unit tests: `ollama-discovery.ts` model capability parsing
- Unit tests: `ollama-provider.ts` message format mapping + NDJSON stream parsing
- Unit tests: `model-capabilities.ts` known model lookups
- Integration test: mock HTTP server simulating Ollama `/api/tags`, `/api/chat`, `/api/pull`
- Manual test: real Ollama instance with llama3.1, qwen2.5-coder, deepseek-r1
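One possible shape for the client unit tests, assuming vitest and a stubbed global `fetch`; the test runner is an assumption, this plan does not prescribe one:
```typescript
// ollama-client.test.ts: mocked-fetch unit test sketch (test runner assumed, not prescribed)
import { describe, expect, it, vi } from "vitest";
import { OllamaClient } from "./ollama-client";

describe("OllamaClient.listModels", () => {
  it("parses /api/tags into model summaries", async () => {
    vi.stubGlobal("fetch", vi.fn().mockResolvedValue(
      new Response(JSON.stringify({ models: [{ name: "llama3.1:8b", size: 4_700_000_000 }] })),
    ));
    const models = await new OllamaClient().listModels();
    expect(models[0].name).toBe("llama3.1:8b");
    vi.unstubAllGlobals();
  });
});
```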
## Open Questions
1. **Startup probe** — Probe Ollama on `session_start` (adds ~1.5s if not running) or lazy on first `/model`? **Recommendation: async probe on session_start (non-blocking), eager if `OLLAMA_HOST` is set.**
2. **Auto-start** — Try to launch Ollama if installed but not running? **Recommendation: no — too invasive. Show helpful message in `/ollama` status.**
3. **Vision support** — Support multimodal models (llava, etc.) in Phase 2 native API? **Recommendation: yes, detected via capabilities table.**
4. **Model refresh** — How often to re-probe Ollama for new models? **Recommendation: on `/ollama list`, on `/model` command, and every 5 min (existing TTL).**