242 lines
11 KiB
Markdown
242 lines
11 KiB
Markdown
|
|
# Ollama Extension — First-Class Local LLM Support
|
||
|
|
|
||
|
|
## Status: DRAFT — Awaiting approval
|
||
|
|
|
||
|
|
## Problem
|
||
|
|
|
||
|
|
Ollama support in GSD2 currently requires manual `models.json` configuration. Users must:
|
||
|
|
1. Know the OpenAI-compatibility endpoint (`localhost:11434/v1`)
|
||
|
|
2. Manually list every model they want to use
|
||
|
|
3. Set compat flags (`supportsDeveloperRole: false`, etc.)
|
||
|
|
4. Use a dummy API key
|
||
|
|
|
||
|
|
There's an `ollama-cloud` provider for hosted Ollama, and a discovery adapter that can list models, but no first-class **local Ollama** extension that "just works."
|
||
|
|
|
||
|
|
## Goal
|
||
|
|
|
||
|
|
Make Ollama the easiest way to use GSD2 — zero config when Ollama is running locally. All Ollama functionality lives in a single extension: `src/resources/extensions/ollama/`.
|
||
|
|
|
||
|
|
## Architecture
|
||
|
|
|
||
|
|
Everything is a self-contained extension under `src/resources/extensions/ollama/`. The extension:
|
||
|
|
- Auto-detects Ollama on startup via health check
|
||
|
|
- Discovers and registers local models with the model registry
|
||
|
|
- Provides native Ollama API streaming (not OpenAI shim)
|
||
|
|
- Exposes `/ollama` slash commands for model management
|
||
|
|
- Registers an LLM-callable tool for model pull/status
|
||
|
|
|
||
|
|
Minimal core changes — only `KnownProvider` and `KnownApi` type additions in `pi-ai`, and `env-api-keys.ts` for key resolution. Everything else is in the extension.
|
||
|
|
|
||
|
|
## File Structure
|
||
|
|
|
||
|
|
```
|
||
|
|
src/resources/extensions/ollama/
|
||
|
|
├── index.ts # Extension entry — wires everything on session_start
|
||
|
|
├── ollama-client.ts # HTTP client for Ollama REST API (/api/*)
|
||
|
|
├── ollama-discovery.ts # Model discovery + capability detection
|
||
|
|
├── ollama-provider.ts # Native /api/chat streaming provider (registers with pi-ai)
|
||
|
|
├── ollama-commands.ts # /ollama slash commands (status, pull, list, remove, ps)
|
||
|
|
├── ollama-tool.ts # LLM-callable tool for model management
|
||
|
|
├── model-capabilities.ts # Known model capability table (context window, vision, reasoning)
|
||
|
|
└── types.ts # Shared types for Ollama API responses
|
||
|
|
```
|
||
|
|
|
||
|
|
## Scope
|
||
|
|
|
||
|
|
### Phase 1: Auto-Discovery + OpenAI-Compat Routing
|
||
|
|
|
||
|
|
**What:** Extension that auto-detects Ollama, discovers models, registers them using the existing `openai-completions` API provider. Zero config needed.
|
||
|
|
|
||
|
|
**Extension files:**
|
||
|
|
- `ollama/index.ts` — Main entry. On `session_start`:
|
||
|
|
1. Probe `localhost:11434` (or `OLLAMA_HOST`) with 1.5s timeout
|
||
|
|
2. If reachable, discover models via `/api/tags`
|
||
|
|
3. Register discovered models with `ctx.modelRegistry` using correct defaults
|
||
|
|
4. Show status widget if Ollama is detected
|
||
|
|
- `ollama/ollama-client.ts` — Low-level HTTP client:
|
||
|
|
- `isRunning()` — `GET /` health check
|
||
|
|
- `getVersion()` — `GET /api/version`
|
||
|
|
- `listModels()` — `GET /api/tags`
|
||
|
|
- `showModel(name)` — `POST /api/show` (details, template, parameters, size)
|
||
|
|
- `getRunningModels()` — `GET /api/ps` (loaded models, VRAM usage)
|
||
|
|
- `pullModel(name, onProgress)` — `POST /api/pull` (streaming progress)
|
||
|
|
- `deleteModel(name)` — `DELETE /api/delete`
|
||
|
|
- `copyModel(source, dest)` — `POST /api/copy`
|
||
|
|
- Respects `OLLAMA_HOST` env var for non-default endpoints
|
||
|
|
- `ollama/ollama-discovery.ts` — Enhanced model discovery:
|
||
|
|
- Calls `/api/tags` to get model list
|
||
|
|
- Calls `/api/show` per model (batch, cached) to get:
|
||
|
|
- `details.parameter_size` → estimate context window
|
||
|
|
- `details.families` → detect vision (clip), reasoning (deepseek-r1)
|
||
|
|
- `modelfile` → extract default parameters
|
||
|
|
- Returns enriched `DiscoveredModel[]` with proper capabilities
|
||
|
|
- `ollama/model-capabilities.ts` — Known model lookup table:
|
||
|
|
- Maps well-known model families to capabilities
|
||
|
|
- e.g., `llama3.1` → `{ contextWindow: 131072, input: ["text"] }`
|
||
|
|
- e.g., `llava` → `{ contextWindow: 4096, input: ["text", "image"] }`
|
||
|
|
- e.g., `deepseek-r1` → `{ reasoning: true, contextWindow: 131072 }`
|
||
|
|
- e.g., `qwen2.5-coder` → `{ contextWindow: 131072, input: ["text"] }`
|
||
|
|
- Fallback: estimate from parameter count if not in table
|
||
|
|
- `ollama/types.ts` — Ollama API response types
|
||
|
|
|
||
|
|
**Core changes (minimal):**
|
||
|
|
- `packages/pi-ai/src/types.ts` — Add `"ollama"` to `KnownProvider`
|
||
|
|
- `packages/pi-ai/src/env-api-keys.ts` — Add `"ollama"` key resolution (returns `"ollama"` placeholder — no real key needed)
|
||
|
|
- `src/onboarding.ts` — Add `"ollama"` to provider selection list
|
||
|
|
- `src/wizard.ts` — Add `ollama` entry (no key required)
|
||
|
|
|
||
|
|
**Model registration details:**
|
||
|
|
Each discovered model registers as:
|
||
|
|
```typescript
|
||
|
|
{
|
||
|
|
id: "llama3.1:8b", // from /api/tags
|
||
|
|
name: "Llama 3.1 8B", // humanized
|
||
|
|
api: "openai-completions", // uses existing provider
|
||
|
|
provider: "ollama",
|
||
|
|
baseUrl: "http://localhost:11434/v1",
|
||
|
|
cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
|
||
|
|
reasoning: false, // from capabilities table
|
||
|
|
input: ["text"], // from capabilities table
|
||
|
|
contextWindow: 131072, // from capabilities table or /api/show
|
||
|
|
maxTokens: 16384, // conservative default
|
||
|
|
compat: {
|
||
|
|
supportsDeveloperRole: false,
|
||
|
|
supportsReasoningEffort: false,
|
||
|
|
supportsUsageInStreaming: false,
|
||
|
|
maxTokensField: "max_tokens",
|
||
|
|
},
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Behavior:**
|
||
|
|
- `gsd --list-models` shows all locally-pulled Ollama models automatically
|
||
|
|
- `/model ollama/llama3.1:8b` works without any config file
|
||
|
|
- If Ollama isn't running, extension is silent — no errors, no models listed
|
||
|
|
- `models.json` overrides still work (user config wins over auto-discovery)
|
||
|
|
|
||
|
|
### Phase 2: Native Ollama API Provider (`/api/chat`)
|
||
|
|
|
||
|
|
**What:** A dedicated streaming provider that talks Ollama's native protocol instead of the OpenAI compatibility shim.
|
||
|
|
|
||
|
|
**Extension files:**
|
||
|
|
- `ollama/ollama-provider.ts` — Native `/api/chat` streaming:
|
||
|
|
- Registers `"ollama-chat"` API with `registerApiProvider()`
|
||
|
|
- Implements `stream()` and `streamSimple()`:
|
||
|
|
- Maps GSD `Context` → Ollama messages format
|
||
|
|
- Maps GSD `Tool[]` → Ollama tool format
|
||
|
|
- Streams NDJSON responses, maps back to `AssistantMessage` events
|
||
|
|
- Extracts `<think>` blocks for reasoning models (deepseek-r1, qwq)
|
||
|
|
- Ollama-specific options:
|
||
|
|
- `keep_alive` — control model memory retention (default: "5m")
|
||
|
|
- `num_ctx` — pass through model's context window
|
||
|
|
- `num_predict` — max output tokens
|
||
|
|
- Temperature, top_p, top_k
|
||
|
|
- Response metadata:
|
||
|
|
- `eval_count` / `eval_duration` → tokens/sec in usage stats
|
||
|
|
- `total_duration`, `load_duration` → performance visibility
|
||
|
|
- Vision support: converts image content to base64 for multimodal models
|
||
|
|
|
||
|
|
**Core changes:**
|
||
|
|
- `packages/pi-ai/src/types.ts` — Add `"ollama-chat"` to `KnownApi`
|
||
|
|
|
||
|
|
**Phase 1 models switch to `api: "ollama-chat"` by default.** Users can force OpenAI-compat via `models.json` override if needed.
|
||
|
|
|
||
|
|
**Why native over OpenAI-compat:**
|
||
|
|
- Full `keep_alive` / `num_ctx` control
|
||
|
|
- Better error messages (Ollama-native vs generic OpenAI)
|
||
|
|
- More reliable tool calling on Ollama's native format
|
||
|
|
- Performance metrics in response (tokens/sec)
|
||
|
|
- Foundation for model management commands
|
||
|
|
|
||
|
|
### Phase 3: Local LLM Management UX
|
||
|
|
|
||
|
|
**What:** `/ollama` slash commands and an LLM tool for model management.
|
||
|
|
|
||
|
|
**Extension files:**
|
||
|
|
- `ollama/ollama-commands.ts` — Slash commands registered via `pi.registerCommand()`:
|
||
|
|
- `/ollama` — Status overview:
|
||
|
|
```
|
||
|
|
Ollama v0.5.7 — running (localhost:11434)
|
||
|
|
|
||
|
|
Loaded:
|
||
|
|
llama3.1:8b 4.7 GB VRAM idle 3m
|
||
|
|
|
||
|
|
Available:
|
||
|
|
llama3.1:8b (4.7 GB)
|
||
|
|
qwen2.5-coder:7b (4.4 GB)
|
||
|
|
deepseek-r1:8b (4.9 GB)
|
||
|
|
```
|
||
|
|
- `/ollama pull <model>` — Pull with streaming progress via `ctx.ui.setWidget()`
|
||
|
|
- `/ollama list` — List all local models with sizes and families
|
||
|
|
- `/ollama remove <model>` — Delete a model (with confirmation)
|
||
|
|
- `/ollama ps` — Running models + VRAM usage
|
||
|
|
- `ollama/ollama-tool.ts` — LLM-callable tool registered via `pi.registerTool()`:
|
||
|
|
- `ollama_manage` tool — lets the agent pull/list/check models
|
||
|
|
- Parameters: `{ action: "list" | "pull" | "status" | "ps", model?: string }`
|
||
|
|
- Use case: agent detects it needs a model, pulls it automatically
|
||
|
|
|
||
|
|
**UX Flow:**
|
||
|
|
```
|
||
|
|
$ gsd
|
||
|
|
> /ollama
|
||
|
|
Ollama v0.5.7 — running (localhost:11434)
|
||
|
|
Loaded:
|
||
|
|
llama3.1:8b — 4.7 GB VRAM, idle 3m
|
||
|
|
Available:
|
||
|
|
llama3.1:8b (4.7 GB)
|
||
|
|
qwen2.5-coder:7b (4.4 GB)
|
||
|
|
deepseek-r1:8b (4.9 GB)
|
||
|
|
|
||
|
|
> /ollama pull codestral:22b
|
||
|
|
Pulling codestral:22b...
|
||
|
|
████████████████████████████░░░░ 78% (14.2 GB / 18.1 GB)
|
||
|
|
✓ codestral:22b ready
|
||
|
|
|
||
|
|
> /model ollama/codestral:22b
|
||
|
|
Switched to codestral:22b (local, Ollama)
|
||
|
|
```
|
||
|
|
|
||
|
|
## Implementation Order
|
||
|
|
|
||
|
|
1. **Phase 1** — Auto-discovery with OpenAI-compat routing. Biggest user impact, smallest risk.
|
||
|
|
2. **Phase 3** — Management UX (`/ollama` commands). Valuable even before native API.
|
||
|
|
3. **Phase 2** — Native `/api/chat` provider. Optimization over OpenAI-compat; do last.
|
||
|
|
|
||
|
|
## Core Changes Summary (minimal)
|
||
|
|
|
||
|
|
| File | Change |
|
||
|
|
|------|--------|
|
||
|
|
| `packages/pi-ai/src/types.ts` | Add `"ollama"` to `KnownProvider`, `"ollama-chat"` to `KnownApi` (Phase 2) |
|
||
|
|
| `packages/pi-ai/src/env-api-keys.ts` | Add `"ollama"` → always returns `"ollama"` placeholder |
|
||
|
|
| `src/onboarding.ts` | Add `"ollama"` to provider picker |
|
||
|
|
| `src/wizard.ts` | Add `"ollama"` key mapping (no key required) |
|
||
|
|
|
||
|
|
Everything else lives in `src/resources/extensions/ollama/`.
|
||
|
|
|
||
|
|
## Risks & Mitigations
|
||
|
|
|
||
|
|
| Risk | Mitigation |
|
||
|
|
|------|------------|
|
||
|
|
| Ollama not running — startup probe latency | 1.5s timeout; cache result; probe async so it doesn't block TUI paint |
|
||
|
|
| Model capabilities unknown | Known-model table + `/api/show` fallback + parameter_size estimation |
|
||
|
|
| Tool calling unreliable on small models | Detect param count; warn on <7B models |
|
||
|
|
| Ollama API changes between versions | Version detect via `/api/version`; stable endpoints only |
|
||
|
|
| Conflicts with `models.json` Ollama config | User config always wins; auto-discovered models merge beneath manual config |
|
||
|
|
| Extension disabled — no impact on core | Extension is additive; disabling removes all Ollama features cleanly |
|
||
|
|
|
||
|
|
## Testing Strategy
|
||
|
|
|
||
|
|
- Unit tests: `ollama-client.ts` with mocked fetch responses
|
||
|
|
- Unit tests: `ollama-discovery.ts` model capability parsing
|
||
|
|
- Unit tests: `ollama-provider.ts` message format mapping + NDJSON stream parsing
|
||
|
|
- Unit tests: `model-capabilities.ts` known model lookups
|
||
|
|
- Integration test: mock HTTP server simulating Ollama `/api/tags`, `/api/chat`, `/api/pull`
|
||
|
|
- Manual test: real Ollama instance with llama3.1, qwen2.5-coder, deepseek-r1
|
||
|
|
|
||
|
|
## Open Questions
|
||
|
|
|
||
|
|
1. **Startup probe** — Probe Ollama on `session_start` (adds ~1.5s if not running) or lazy on first `/model`? **Recommendation: async probe on session_start (non-blocking), eager if `OLLAMA_HOST` is set.**
|
||
|
|
2. **Auto-start** — Try to launch Ollama if installed but not running? **Recommendation: no — too invasive. Show helpful message in `/ollama` status.**
|
||
|
|
3. **Vision support** — Support multimodal models (llava, etc.) in Phase 2 native API? **Recommendation: yes, detected via capabilities table.**
|
||
|
|
4. **Model refresh** — How often to re-probe Ollama for new models? **Recommendation: on `/ollama list`, on `/model` command, and every 5 min (existing TTL).**
|