# Ollama Extension — First-Class Local LLM Support
## Status: DRAFT — Awaiting approval
## Problem
Ollama support in GSD2 currently requires manual `models.json` configuration. Users must:
1. Know the OpenAI-compatibility endpoint (`localhost:11434/v1`)
2. Manually list every model they want to use
3. Set compat flags (`supportsDeveloperRole: false`, etc.)
4. Use a dummy API key
There's an `ollama-cloud` provider for hosted Ollama, and a discovery adapter that can list models, but no first-class **local Ollama** extension that "just works."
## Goal
Make Ollama the easiest way to use GSD2 — zero config when Ollama is running locally. All Ollama functionality lives in a single extension: `src/resources/extensions/ollama/`.
## Architecture
Everything is a self-contained extension under `src/resources/extensions/ollama/`. The extension:
- Auto-detects Ollama on startup via health check
- Discovers and registers local models with the model registry
- Provides native Ollama API streaming (not OpenAI shim)
- Exposes `/ollama` slash commands for model management
- Registers an LLM-callable tool for model pull/status
Minimal core changes — only `KnownProvider` and `KnownApi` type additions in `pi-ai`, and `env-api-keys.ts` for key resolution. Everything else is in the extension.
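How the entry point ties these pieces together is sketched below. This is a minimal sketch, assuming the host exposes a `session_start` hook, the `pi.registerCommand()` / `pi.registerTool()` registration calls referenced later in this plan, and a `ctx.modelRegistry`; the `ExtensionHost` shape and every signature here are assumptions, not the real GSD2 extension API.
```typescript
// index.ts: minimal wiring sketch. Host API names and signatures are assumed, not confirmed.
import { OllamaClient } from "./ollama-client";
import { discoverModels } from "./ollama-discovery";
import { registerOllamaCommands } from "./ollama-commands";
import { registerOllamaTool } from "./ollama-tool";
import { registerOllamaProvider } from "./ollama-provider";

// Assumed shape of the extension host; the real type comes from the GSD2 extension API.
type ExtensionHost = {
  registerCommand: (name: string, run: (args: string[], ctx: any) => Promise<string | void>) => void;
  registerTool: (tool: object) => void;
  on: (event: "session_start", handler: (ctx: any) => Promise<void>) => void;
};

export default function activate(pi: ExtensionHost) {
  registerOllamaCommands(pi); // /ollama status, pull, list, remove, ps (Phase 3)
  registerOllamaTool(pi);     // ollama_manage LLM tool (Phase 3)
  registerOllamaProvider(pi); // native /api/chat provider (Phase 2)

  pi.on("session_start", async (ctx) => {
    const client = new OllamaClient(process.env.OLLAMA_HOST);
    if (!(await client.isRunning())) return; // Ollama absent: stay silent

    const models = await discoverModels(client);
    for (const model of models) ctx.modelRegistry.register(model);
    ctx.ui.setWidget("ollama", `Ollama detected: ${models.length} local models`);
  });
}
```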
## File Structure
```
src/resources/extensions/ollama/
├── index.ts               # Extension entry — wires everything on session_start
├── ollama-client.ts       # HTTP client for Ollama REST API (/api/*)
├── ollama-discovery.ts    # Model discovery + capability detection
├── ollama-provider.ts     # Native /api/chat streaming provider (registers with pi-ai)
├── ollama-commands.ts     # /ollama slash commands (status, pull, list, remove, ps)
├── ollama-tool.ts         # LLM-callable tool for model management
├── model-capabilities.ts  # Known model capability table (context window, vision, reasoning)
└── types.ts               # Shared types for Ollama API responses
```
## Scope
### Phase 1: Auto-Discovery + OpenAI-Compat Routing
**What:** Extension that auto-detects Ollama, discovers models, registers them using the existing `openai-completions` API provider. Zero config needed.
**Extension files:**
- `ollama/index.ts` — Main entry. On `session_start`:
1. Probe `localhost:11434` (or `OLLAMA_HOST`) with 1.5s timeout
2. If reachable, discover models via `/api/tags`
3. Register discovered models with `ctx.modelRegistry` using correct defaults
4. Show status widget if Ollama is detected
- `ollama/ollama-client.ts` — Low-level HTTP client (a sketch follows this list):
- `isRunning()` → `GET /` health check
- `getVersion()` → `GET /api/version`
- `listModels()` → `GET /api/tags`
- `showModel(name)` → `POST /api/show` (details, template, parameters, size)
- `getRunningModels()` → `GET /api/ps` (loaded models, VRAM usage)
- `pullModel(name, onProgress)` → `POST /api/pull` (streaming progress)
- `deleteModel(name)` → `DELETE /api/delete`
- `copyModel(source, dest)` → `POST /api/copy`
- Respects `OLLAMA_HOST` env var for non-default endpoints
- `ollama/ollama-discovery.ts` — Enhanced model discovery:
- Calls `/api/tags` to get model list
- Calls `/api/show` per model (batch, cached) to get:
- `details.parameter_size` → estimate context window
- `details.families` → detect vision (clip), reasoning (deepseek-r1)
- `modelfile` → extract default parameters
- Returns enriched `DiscoveredModel[]` with proper capabilities
- `ollama/model-capabilities.ts` — Known model lookup table (a sketch follows the core-changes list below):
- Maps well-known model families to capabilities
- e.g., `llama3.1` → `{ contextWindow: 131072, input: ["text"] }`
- e.g., `llava` → `{ contextWindow: 4096, input: ["text", "image"] }`
- e.g., `deepseek-r1` → `{ reasoning: true, contextWindow: 131072 }`
- e.g., `qwen2.5-coder` → `{ contextWindow: 131072, input: ["text"] }`
- Fallback: estimate from parameter count if not in table
- `ollama/types.ts` — Ollama API response types
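The client methods above map one-to-one onto Ollama's REST endpoints. A minimal sketch of two of them, assuming plain `fetch` and a hypothetical `OllamaModelSummary` type in `types.ts`:
```typescript
// ollama-client.ts: minimal sketch. Endpoint paths follow the public Ollama REST API.
import type { OllamaModelSummary } from "./types"; // hypothetical shared type

export class OllamaClient {
  // Note: a real implementation would normalize OLLAMA_HOST values that lack a scheme.
  constructor(private baseUrl = process.env.OLLAMA_HOST ?? "http://localhost:11434") {}

  /** Health check: GET / with a short timeout so startup never blocks on a missing daemon. */
  async isRunning(timeoutMs = 1500): Promise<boolean> {
    try {
      const res = await fetch(this.baseUrl, { signal: AbortSignal.timeout(timeoutMs) });
      return res.ok;
    } catch {
      return false;
    }
  }

  /** List locally pulled models via GET /api/tags. */
  async listModels(): Promise<OllamaModelSummary[]> {
    const res = await fetch(`${this.baseUrl}/api/tags`);
    if (!res.ok) throw new Error(`Ollama /api/tags failed: ${res.status}`);
    const body = (await res.json()) as { models: OllamaModelSummary[] };
    return body.models;
  }
}
```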
**Core changes (minimal):**
- `packages/pi-ai/src/types.ts` — Add `"ollama"` to `KnownProvider`
- `packages/pi-ai/src/env-api-keys.ts` — Add `"ollama"` key resolution (returns `"ollama"` placeholder — no real key needed)
- `src/onboarding.ts` — Add `"ollama"` to provider selection list
- `src/wizard.ts` — Add `ollama` entry (no key required)
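The known-model table might look like the following sketch. The entries mirror the examples listed above; the parameter-size fallback heuristic and its numbers are illustrative, not decided:
```typescript
// model-capabilities.ts: sketch. Table entries mirror this plan; the fallback numbers are illustrative.
export interface ModelCapabilities {
  contextWindow: number;
  input: ("text" | "image")[];
  reasoning?: boolean;
}

const KNOWN_FAMILIES: Record<string, ModelCapabilities> = {
  "llama3.1":      { contextWindow: 131072, input: ["text"] },
  "llava":         { contextWindow: 4096,   input: ["text", "image"] },
  "deepseek-r1":   { contextWindow: 131072, input: ["text"], reasoning: true },
  "qwen2.5-coder": { contextWindow: 131072, input: ["text"] },
};

export function lookupCapabilities(modelId: string, parameterSize?: string): ModelCapabilities {
  const family = Object.keys(KNOWN_FAMILIES).find((f) => modelId.startsWith(f));
  if (family) return KNOWN_FAMILIES[family];
  // Fallback: estimate conservatively from /api/show's details.parameter_size ("8.0B", "70B", ...).
  const billions = parameterSize ? parseFloat(parameterSize) : 0;
  return { contextWindow: billions >= 7 ? 32768 : 8192, input: ["text"] };
}
```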
**Model registration details:**
Each discovered model registers as:
```typescript
{
  id: "llama3.1:8b",          // from /api/tags
  name: "Llama 3.1 8B",       // humanized
  api: "openai-completions",  // uses existing provider
  provider: "ollama",
  baseUrl: "http://localhost:11434/v1",
  cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
  reasoning: false,           // from capabilities table
  input: ["text"],            // from capabilities table
  contextWindow: 131072,      // from capabilities table or /api/show
  maxTokens: 16384,           // conservative default
  compat: {
    supportsDeveloperRole: false,
    supportsReasoningEffort: false,
    supportsUsageInStreaming: false,
    maxTokensField: "max_tokens",
  },
}
```
**Behavior:**
- `gsd --list-models` shows all locally-pulled Ollama models automatically
- `/model ollama/llama3.1:8b` works without any config file
- If Ollama isn't running, the extension stays silent — no errors, no models listed
- `models.json` overrides still work (user config wins over auto-discovery)
### Phase 2: Native Ollama API Provider (`/api/chat`)
**What:** A dedicated streaming provider that talks Ollama's native protocol instead of the OpenAI compatibility shim.
**Extension files:**
- `ollama/ollama-provider.ts` — Native `/api/chat` streaming:
- Registers `"ollama-chat"` API with `registerApiProvider()`
- Implements `stream()` and `streamSimple()`:
- Maps GSD `Context` → Ollama messages format
- Maps GSD `Tool[]` → Ollama tool format
- Streams NDJSON responses, maps back to `AssistantMessage` events
- Extracts `<think>` blocks for reasoning models (deepseek-r1, qwq)
- Ollama-specific options:
- `keep_alive` — control model memory retention (default: "5m")
- `num_ctx` — pass through model's context window
- `num_predict` — max output tokens
- Temperature, top_p, top_k
- Response metadata:
- `eval_count` / `eval_duration` → tokens/sec in usage stats
- `total_duration`, `load_duration` → performance visibility
- Vision support: converts image content to base64 for multimodal models
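A sketch of the NDJSON handling at the core of this provider, assuming Ollama's documented `/api/chat` streaming format (one JSON object per line, with a final `done: true` object carrying `eval_count` and `eval_duration` in nanoseconds). The yielded event shapes are placeholders for whatever `registerApiProvider()` actually expects:
```typescript
// ollama-provider.ts: stream-parsing sketch. The yielded event objects are placeholders.
export async function* streamChat(baseUrl: string, request: Record<string, unknown>) {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ ...request, stream: true }),
  });
  if (!res.ok || !res.body) throw new Error(`Ollama /api/chat failed: ${res.status}`);

  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += value;
    let newline: number;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (!line) continue;
      const chunk = JSON.parse(line);
      if (chunk.done) {
        // The final chunk carries timing metadata: eval_count / eval_duration(ns) -> tokens/sec.
        const tokensPerSec = chunk.eval_count / (chunk.eval_duration / 1e9);
        yield { type: "done", usage: { outputTokens: chunk.eval_count, tokensPerSec } };
      } else if (chunk.message?.content) {
        yield { type: "text", delta: chunk.message.content };
      }
    }
  }
}
```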
**Core changes:**
- `packages/pi-ai/src/types.ts` — Add `"ollama-chat"` to `KnownApi`
**Phase 1 models switch to `api: "ollama-chat"` by default.** Users can force OpenAI-compat via `models.json` override if needed.
**Why native over OpenAI-compat:**
- Full `keep_alive` / `num_ctx` control
- Better error messages (Ollama-native vs generic OpenAI)
- More reliable tool calling on Ollama's native format
- Performance metrics in response (tokens/sec)
- Foundation for model management commands
### Phase 3: Local LLM Management UX
**What:** `/ollama` slash commands and an LLM tool for model management.
**Extension files:**
- `ollama/ollama-commands.ts` — Slash commands registered via `pi.registerCommand()`:
- `/ollama` — Status overview:
```
Ollama v0.5.7 — running (localhost:11434)
Loaded:
  llama3.1:8b         4.7 GB VRAM    idle 3m
Available:
  llama3.1:8b         (4.7 GB)
  qwen2.5-coder:7b    (4.4 GB)
  deepseek-r1:8b      (4.9 GB)
```
- `/ollama pull <model>` — Pull with streaming progress via `ctx.ui.setWidget()`
- `/ollama list` — List all local models with sizes and families
- `/ollama remove <model>` — Delete a model (with confirmation)
- `/ollama ps` — Running models + VRAM usage
- `ollama/ollama-tool.ts` — LLM-callable tool registered via `pi.registerTool()`:
- `ollama_manage` tool — lets the agent pull/list/check models
- Parameters: `{ action: "list" | "pull" | "status" | "ps", model?: string }`
- Use case: agent detects it needs a model, pulls it automatically
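A sketch of the Phase 3 registrations. The `pi.registerCommand()` and `pi.registerTool()` calls come from this plan, but their signatures, the `ctx` surface, and the tool-schema format are assumptions (as is the `ExtensionHost` type, sketched under Architecture above):
```typescript
// ollama-commands.ts / ollama-tool.ts: registration sketch. Host API signatures are assumed.
import { OllamaClient } from "./ollama-client";

export function registerOllamaCommands(pi: ExtensionHost) {
  pi.registerCommand("ollama", async (args, ctx) => {
    const client = new OllamaClient();
    const [subcommand, model] = args;
    if (subcommand === "pull" && model) {
      // Streaming pull progress rendered through the status widget.
      await client.pullModel(model, (progress: string) =>
        ctx.ui.setWidget("ollama-pull", `Pulling ${model}… ${progress}`),
      );
      return `✓ ${model} ready`;
    }
    // status / list / remove / ps would dispatch to the other client methods the same way.
  });
}

export function registerOllamaTool(pi: ExtensionHost) {
  pi.registerTool({
    name: "ollama_manage",
    description: "List, pull, or inspect local Ollama models",
    parameters: {
      action: { type: "string", enum: ["list", "pull", "status", "ps"] },
      model: { type: "string", optional: true },
    },
    execute: async ({ action, model }: { action: string; model?: string }) => {
      const client = new OllamaClient();
      if (action === "pull" && model) await client.pullModel(model, () => {});
      return action === "ps" ? await client.getRunningModels() : await client.listModels();
    },
  });
}
```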
**UX Flow:**
```
$ gsd
> /ollama
Ollama v0.5.7 — running (localhost:11434)
Loaded:
  llama3.1:8b — 4.7 GB VRAM, idle 3m
Available:
  llama3.1:8b (4.7 GB)
  qwen2.5-coder:7b (4.4 GB)
  deepseek-r1:8b (4.9 GB)
> /ollama pull codestral:22b
Pulling codestral:22b...
████████████████████████████░░░░ 78% (14.2 GB / 18.1 GB)
✓ codestral:22b ready
> /model ollama/codestral:22b
Switched to codestral:22b (local, Ollama)
```
## Implementation Order
1. **Phase 1** — Auto-discovery with OpenAI-compat routing. Biggest user impact, smallest risk.
2. **Phase 3** — Management UX (`/ollama` commands). Valuable even before native API.
3. **Phase 2** — Native `/api/chat` provider. Optimization over OpenAI-compat; do last.
## Core Changes Summary (minimal)
| File | Change |
|------|--------|
| `packages/pi-ai/src/types.ts` | Add `"ollama"` to `KnownProvider`, `"ollama-chat"` to `KnownApi` (Phase 2) |
| `packages/pi-ai/src/env-api-keys.ts` | Add `"ollama"` → always returns `"ollama"` placeholder |
| `src/onboarding.ts` | Add `"ollama"` to provider picker |
| `src/wizard.ts` | Add `"ollama"` key mapping (no key required) |
Everything else lives in `src/resources/extensions/ollama/`.
## Risks & Mitigations
| Risk | Mitigation |
|------|------------|
| Ollama not running — startup probe latency | 1.5s timeout; cache result; probe async so it doesn't block TUI paint |
| Model capabilities unknown | Known-model table + `/api/show` fallback + parameter_size estimation |
| Tool calling unreliable on small models | Detect param count; warn on <7B models |
| Ollama API changes between versions | Version detect via `/api/version`; stable endpoints only |
| Conflicts with `models.json` Ollama config | User config always wins; auto-discovered models merge beneath manual config |
| Extension disabled by the user | Extension is additive; disabling it removes all Ollama features cleanly with no impact on core |
## Testing Strategy
- Unit tests: `ollama-client.ts` with mocked fetch responses
- Unit tests: `ollama-discovery.ts` model capability parsing
- Unit tests: `ollama-provider.ts` message format mapping + NDJSON stream parsing
- Unit tests: `model-capabilities.ts` known model lookups
- Integration test: mock HTTP server simulating Ollama `/api/tags`, `/api/chat`, `/api/pull`
- Manual test: real Ollama instance with llama3.1, qwen2.5-coder, deepseek-r1
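One possible shape for the client unit tests, assuming vitest and a stubbed global `fetch`; the test runner is an assumption, this plan does not prescribe one:
```typescript
// ollama-client.test.ts: mocked-fetch unit test sketch (test runner assumed, not prescribed)
import { describe, expect, it, vi } from "vitest";
import { OllamaClient } from "./ollama-client";

describe("OllamaClient.listModels", () => {
  it("parses /api/tags into model summaries", async () => {
    vi.stubGlobal("fetch", vi.fn().mockResolvedValue(
      new Response(JSON.stringify({ models: [{ name: "llama3.1:8b", size: 4_700_000_000 }] })),
    ));
    const models = await new OllamaClient().listModels();
    expect(models[0].name).toBe("llama3.1:8b");
    vi.unstubAllGlobals();
  });
});
```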
## Open Questions
1. **Startup probe** — Probe Ollama on `session_start` (adds ~1.5s if not running) or lazy on first `/model`? **Recommendation: async probe on session_start (non-blocking), eager if `OLLAMA_HOST` is set.**
2. **Auto-start** — Try to launch Ollama if installed but not running? **Recommendation: no — too invasive. Show helpful message in `/ollama` status.**
3. **Vision support** — Support multimodal models (llava, etc.) in Phase 2 native API? **Recommendation: yes, detected via capabilities table.**
4. **Model refresh** — How often to re-probe Ollama for new models? **Recommendation: on `/ollama list`, on `/model` command, and every 5 min (existing TTL).**