From f432eea5a78917248238a4e1f490106a5788962a Mon Sep 17 00:00:00 2001
From: Tom Boucher
Date: Tue, 17 Mar 2026 20:00:08 -0400
Subject: [PATCH] fix: replace ambiguous compound question in reflection step (#963) (#1002)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Add CI/CD pipeline design spec

Three-stage promotion pipeline (Dev → Test → Prod) using npm dist-tags, GitHub Environments, Docker images, and an LLM fixture recording system.

Co-Authored-By: Claude Opus 4.6

* fix: replace ambiguous compound question in reflection step (#963)

The reflection prompt 'Did I get that right, or did I miss something?' is a compound question where 'yes' maps to both possible answers. Replaced with 'Does that capture it? If not, tell me what I missed.' — one closed question plus an instruction, removing ambiguity.

---------

Co-authored-by: Claude Opus 4.6
---
 .../specs/2026-03-17-cicd-pipeline-design.md | 269 ++++++++++++++++++
 .../extensions/gsd/prompts/discuss.md | 2 +-
 2 files changed, 270 insertions(+), 1 deletion(-)
 create mode 100644 docs/superpowers/specs/2026-03-17-cicd-pipeline-design.md

diff --git a/docs/superpowers/specs/2026-03-17-cicd-pipeline-design.md b/docs/superpowers/specs/2026-03-17-cicd-pipeline-design.md
new file mode 100644
index 000000000..50d821f78
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-17-cicd-pipeline-design.md
@@ -0,0 +1,269 @@
+# CI/CD Pipeline Design — GSD 2
+
+## Overview
+
+A three-stage promotion pipeline for GSD 2 that moves merged PRs through Dev → Test → Prod using npm dist-tags as environment markers, GitHub Environments for approval gates, and Docker images for both CI acceleration and end-user distribution.
+
+## Goals
+
+1. Every merged PR is immediately installable via `npx gsd-pi@dev`
+2. Verified builds auto-promote to `@next` for early adopters
+3. Production releases require manual approval and optional live-LLM validation
+4. CI builds are fast and reproducible via pre-built Docker builder image
+5. End users can run GSD via Docker as an alternative to npm
+6. LLM-dependent behavior is testable without API calls via recorded fixtures
+
+## Non-Goals
+
+- Replacing the existing PR gate workflow (`ci.yml`)
+- Replacing the native binary cross-compilation workflow (`build-native.yml`)
+- Hosting GSD as a web service
+- Automated prompt regression testing (future work)
+
+## Pipeline Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                      PR Merged to main                      │
+└──────────────────────────┬──────────────────────────────────┘
+                           ▼
+┌──────────────────────────────────────────────────────────────┐
+│  STAGE: DEV                                Environment: dev  │
+│                                                              │
+│  1. Build all packages (TS + Rust native)                    │
+│  2. Run existing unit + integration tests                    │
+│  3. Typecheck extensions                                     │
+│  4. Package validation (validate-pack)                       │
+│  5. npm publish gsd-pi@-dev. --tag dev                       │
+│  6. Smoke test: npx gsd-pi@dev --version                     │
+│                                                              │
+│  Docker: Build CI builder image (only if Dockerfile changed) │
+└──────────────────────────┬──────────────────────────────────┘
+                           ▼ (auto-promote if all green)
+┌──────────────────────────────────────────────────────────────┐
+│  STAGE: TEST                              Environment: test  │
+│                                                              │
+│  1. Install gsd-pi@dev from registry                         │
+│  2. CLI smoke tests (--version, init, help, config)          │
+│  3. Dry-run fixture suite (recorded LLM conversations)       │
+│     - Agent session replay with fixture provider             │
+│     - Tool use round-trips verified                          │
+│     - Extension loading validated                            │
+│  4. npm dist-tag add gsd-pi@ next                            │
+│                                                              │
+│  Docker: Build + push runtime image to GHCR as :next         │
+└──────────────────────────┬──────────────────────────────────┘
+                           ▼ (manual approval required)
+┌──────────────────────────────────────────────────────────────┐
+│  STAGE: PROD                              Environment: prod  │
+│                                                              │
+│  1. (Optional) Real LLM integration tests                    │
+│     - Gated behind workflow input flag                       │
+│     - Uses ANTHROPIC_API_KEY / OPENAI_API_KEY secrets        │
+│     - Budget-capped: small models, short conversations       │
+│  2. npm dist-tag add gsd-pi@ latest                          │
+│  3. GitHub Release created with changelog                    │
+│  4. Docker: tag runtime image as :latest + :v                │
+│  5. Post-publish smoke test against @latest                  │
+└──────────────────────────────────────────────────────────────┘
+```
+
+### Version Strategy
+
+| Dist-tag | When published | Version format | Risk level |
+|----------|---------------|----------------|------------|
+| `@dev` | Every merged PR | `1.5.0-dev.a3f2c1b` | Bleeding edge |
+| `@next` | Auto-promoted from Dev | Same version, new tag | Candidate |
+| `@latest` | Manually approved from Test | Same version, new tag | Production |
+
+### Relationship to Existing Workflows
+
+| File | Trigger | Purpose | Status |
+|------|---------|---------|--------|
+| `ci.yml` | PR opened/updated | Pre-merge gate: build, test, typecheck | **Unchanged** |
+| `build-native.yml` | `v*` tag or manual dispatch | Cross-compile native binaries for 5 platforms | **Unchanged** |
+| `pipeline.yml` | Push to `main` | Post-merge promotion: Dev → Test → Prod | **New** |
+
+## Docker Images
+
+### Multi-Stage Dockerfile
+
+Two images from a single `Dockerfile` at the repo root.
+
+#### CI Builder Image
+
+- **Name:** `ghcr.io/gsd-build/gsd-ci-builder`
+- **Base:** `node:22-bookworm`
+- **Contains:** Node 22, Rust stable toolchain, `aarch64-linux-gnu` cross-compiler, Playwright system deps
+- **Size:** ~2.5 GB
+- **Rebuilt:** Only when `Dockerfile` changes
+- **Purpose:** Eliminates 3-5 min toolchain install on every CI run
+
+#### Runtime Image
+
+- **Name:** `ghcr.io/gsd-build/gsd-pi`
+- **Base:** `node:22-slim`
+- **Contains:** Node 22, git, `gsd-pi` installed globally
+- **Size:** ~250 MB
+- **Tags:** `:latest`, `:next`, `:v1.2.3`
+- **Published:** On every Prod promotion
+- **Purpose:** `docker run ghcr.io/gsd-build/gsd-pi` as alternative to `npx`
+
+### Why These Base Images
+
+- **Bookworm for CI:** The Rust native crates depend on vendored `libgit2`, image processing, and cross-compilation to ARM64. Debian Bookworm provides the full toolchain via apt. Alpine breaks due to musl vs glibc incompatibilities with N-API bindings.
+- **Slim for runtime:** Only needs Node + git. Native `.node` binaries are prebuilt and bundled in the npm package — no Rust toolchain needed at runtime.
+
+## LLM Fixture Recording & Replay System
+
+### Architecture
+
+The fixture system hooks into the `pi-ai` provider abstraction layer to capture and replay LLM conversations without hitting real APIs.
+
+```
+Agent Session
+    │
+    ▼
+pi-ai provider abstraction
+    │
+    ▼
+FixtureProvider (intercept layer)
+    │
+    ├── record mode → Real API + save to fixture JSON
+    │
+    └── replay mode → Load fixture JSON (no API call)
+```
+
+### Modes
+
+| Mode | Trigger | Behavior |
+|------|---------|----------|
+| **Record** | `GSD_FIXTURE_MODE=record GSD_FIXTURE_DIR=./fixtures` | Proxies to real API, saves request/response pairs |
+| **Replay** | `GSD_FIXTURE_MODE=replay GSD_FIXTURE_DIR=./fixtures` | Matches by turn index, returns saved response |
+| **Off** | Default (no env vars) | Normal operation |
+
+### Fixture Format
+
+One JSON file per recorded session:
+
+```json
+{
+  "name": "agent-creates-file",
+  "recorded": "2026-03-17T00:00:00Z",
+  "provider": "anthropic",
+  "model": "claude-sonnet-4-6",
+  "turns": [
+    {
+      "request": {
+        "messages": [{ "role": "user", "content": "Create hello.ts" }],
+        "tools": ["Write", "Read"],
+        "model": "claude-sonnet-4-6"
+      },
+      "response": {
+        "content": [
+          { "type": "text", "text": "I'll create hello.ts for you." },
+          { "type": "tool_use", "name": "Write", "input": { "file_path": "hello.ts", "content": "console.log('hello')" } }
+        ],
+        "stopReason": "toolUse",
+        "usage": { "input": 150, "output": 45 }
+      }
+    }
+  ]
+}
+```
+
+### Matching Strategy
+
+Turn-index based. Response N is served for request N in sequence. If the conversation diverges from the fixture, the test fails explicitly.
+
+Why not request-body hashing: request bodies contain timestamps, random IDs, and system prompt variations that cause brittle mismatches.
+
+Why not a generic HTTP VCR: The `pi-ai` layer abstracts 20+ providers with different wire formats. Intercepting above the transport means fixtures are provider-agnostic.
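
Editorial sketch (not part of the patch): the turn-index matching described above could look roughly like the following. Every name here (`FixtureTurn`, `Fixture`, `ReplayCursor`, `assertExhausted`) is hypothetical and not taken from `pi-ai` or the GSD codebase. It also assumes that "diverges" means the session issues more or fewer requests than the fixture recorded.

```typescript
// Hypothetical types and names for illustration only; not the real pi-ai API.
interface FixtureTurn {
  request: unknown;
  response: unknown;
}

interface Fixture {
  name: string;
  turns: FixtureTurn[];
}

class ReplayCursor {
  private readonly fixture: Fixture;
  private index = 0;

  constructor(fixture: Fixture) {
    this.fixture = fixture;
  }

  // Serve the response recorded at the current turn index, then advance.
  // Throws when the session asks for more turns than were recorded.
  next(): unknown {
    const turn = this.fixture.turns[this.index];
    if (turn === undefined) {
      throw new Error(
        `Fixture "${this.fixture.name}" has ${this.fixture.turns.length} turn(s), ` +
          `but request ${this.index + 1} was made`,
      );
    }
    this.index += 1;
    return turn.response;
  }

  // Call at the end of a test: throws if recorded turns were left unused.
  assertExhausted(): void {
    if (this.index !== this.fixture.turns.length) {
      throw new Error(
        `Fixture "${this.fixture.name}" stopped after ${this.index} of ` +
          `${this.fixture.turns.length} turn(s)`,
      );
    }
  }
}
```

The request body is deliberately ignored when choosing a response, which is what keeps this approach insensitive to timestamps and random IDs. The trade-off is that a changed request with the same turn count would go unnoticed unless a separate check is added.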
+
+### What Gets Tested via Fixtures
+
+- Agent session lifecycle (start → tool calls → completion)
+- Tool dispatch and response handling
+- Multi-turn conversation flow
+- Extension loading and routing
+- Error handling paths (fixtures can include error responses)
+
+### What Does NOT Get Tested (Deferred to Live Gate)
+
+- Model output quality
+- Prompt regression
+- New tool compatibility with live APIs
+
+### Fixture Storage
+
+Committed to repo under `tests/fixtures/recordings/`. Each fixture is 5-50KB of JSON. Recording is a manual developer action, not automated in CI.
+
+## New Files & Scripts
+
+### Directory Structure
+
+```
+tests/
+├── smoke/                           # CLI smoke tests (Stage: Test)
+│   ├── run.mjs
+│   ├── test-version.mjs
+│   ├── test-help.mjs
+│   └── test-init.mjs
+│
+├── fixtures/                        # Recorded LLM replay tests (Stage: Test)
+│   ├── run.ts                       # Test runner
+│   ├── record.ts                    # Recording helper
+│   ├── provider.ts                  # FixtureProvider intercept layer
+│   └── recordings/
+│       ├── agent-creates-file.json
+│       ├── agent-reads-and-edits.json
+│       ├── agent-handles-error.json
+│       └── agent-multi-turn-tools.json
+│
+├── live/                            # Real LLM tests (Stage: Prod, optional)
+│   ├── run.ts
+│   ├── test-anthropic-roundtrip.ts
+│   └── test-openai-roundtrip.ts
+│
+scripts/
+├── version-stamp.mjs                # Stamps -dev.
+
+Dockerfile                           # Multi-stage: builder + runtime
+.github/workflows/pipeline.yml       # Promotion pipeline
+```
+
+### New npm Scripts
+
+```json
+{
+  "test:smoke": "node tests/smoke/run.mjs",
+  "test:fixtures": "node --experimental-strip-types tests/fixtures/run.ts",
+  "test:fixtures:record": "GSD_FIXTURE_MODE=record node --experimental-strip-types tests/fixtures/record.ts",
+  "test:live": "GSD_LIVE_TESTS=1 node --experimental-strip-types tests/live/run.ts",
+  "pipeline:version-stamp": "node scripts/version-stamp.mjs",
+  "docker:build-runtime": "docker build --target runtime -t ghcr.io/gsd-build/gsd-pi .",
+  "docker:build-builder": "docker build --target builder -t ghcr.io/gsd-build/gsd-ci-builder ."
+}
+```
+
+## GitHub Configuration
+
+| Setting | Value |
+|---------|-------|
+| Environment: `dev` | No protection rules |
+| Environment: `test` | No protection rules (auto-promote) |
+| Environment: `prod` | Required reviewers: maintainers |
+| Secret: `NPM_TOKEN` | All environments |
+| Secret: `ANTHROPIC_API_KEY` | Prod only |
+| Secret: `OPENAI_API_KEY` | Prod only |
+| GHCR | Enabled for org |
+
+## Success Criteria
+
+1. A merged PR is installable via `npx gsd-pi@dev` within 10 minutes
+2. Fixture replay tests complete in under 60 seconds with zero API calls
+3. The full Dev → Test promotion completes without human intervention
+4. Prod promotion is blocked until a maintainer explicitly approves
+5. `docker run ghcr.io/gsd-build/gsd-pi --version` returns the correct version
+6. Existing `ci.yml` and `build-native.yml` workflows continue to work unchanged
+7. CI builder image reduces toolchain setup from ~3-5 min to ~30s pull
diff --git a/src/resources/extensions/gsd/prompts/discuss.md b/src/resources/extensions/gsd/prompts/discuss.md
index 88d07dd03..282c8aaa4 100644
--- a/src/resources/extensions/gsd/prompts/discuss.md
+++ b/src/resources/extensions/gsd/prompts/discuss.md
@@ -11,7 +11,7 @@ After the user describes their idea, **do not ask questions yet**. First, prove
 1. Summarize what you understood in your own words — concretely, not abstractly.
 2. Give an honest size read: roughly how many milestones, roughly how many slices in the first one. Base this on the actual work involved, not a classification label. A config change might be 1 milestone with 1 slice. A social network might be 5 milestones with 8+ slices each. Use your judgment.
 3. Include scope honesty — a bullet list of the major capabilities you're hearing: "Here's what I'm hearing: [bullet list of major capabilities]."
-4. Ask: "Did I get that right, or did I miss something?" — plain text, not `ask_user_questions`. Let them correct freely.
+4. Ask: "Does that capture it? If not, tell me what I missed." — plain text, not `ask_user_questions`. Let them correct freely.
 
 This prevents runaway questioning by forcing comprehension proof before anything else. Do not skip this step. Do not combine it with the first question round.