fix: replace ambiguous compound question in reflection step (#963) (#1002)

* Add CI/CD pipeline design spec

Three-stage promotion pipeline (Dev → Test → Prod) using npm dist-tags,
GitHub Environments, Docker images, and an LLM fixture recording system.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: replace ambiguous compound question in reflection step (#963)

The reflection prompt 'Did I get that right, or did I miss something?'
is a compound question where 'yes' maps to both possible answers.

Replaced with 'Does that capture it? If not, tell me what I missed.'
— one closed question plus an instruction, removing ambiguity.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Tom Boucher 2026-03-17 20:00:08 -04:00 committed by GitHub
parent 01e28bc345
commit f432eea5a7
2 changed files with 270 additions and 1 deletions


@@ -0,0 +1,269 @@
# CI/CD Pipeline Design — GSD 2
## Overview
A three-stage promotion pipeline for GSD 2 that moves merged PRs through Dev → Test → Prod using npm dist-tags as environment markers, GitHub Environments for approval gates, and Docker images for both CI acceleration and end-user distribution.
## Goals
1. Every merged PR is immediately installable via `npx gsd-pi@dev`
2. Verified builds auto-promote to `@next` for early adopters
3. Production releases require manual approval and optional live-LLM validation
4. CI builds are fast and reproducible via pre-built Docker builder image
5. End users can run GSD via Docker as an alternative to npm
6. LLM-dependent behavior is testable without API calls via recorded fixtures
## Non-Goals
- Replacing the existing PR gate workflow (`ci.yml`)
- Replacing the native binary cross-compilation workflow (`build-native.yml`)
- Hosting GSD as a web service
- Automated prompt regression testing (future work)
## Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ PR Merged to main │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────────────┐
│ STAGE: DEV Environment: dev │
│ │
│ 1. Build all packages (TS + Rust native) │
│ 2. Run existing unit + integration tests │
│ 3. Typecheck extensions │
│ 4. Package validation (validate-pack) │
│ 5. npm publish gsd-pi@<version>-dev.<sha> --tag dev │
│ 6. Smoke test: npx gsd-pi@dev --version │
│ │
│ Docker: Build CI builder image (only if Dockerfile changed) │
└──────────────────────────┬──────────────────────────────────┘
▼ (auto-promote if all green)
┌──────────────────────────────────────────────────────────────┐
│ STAGE: TEST Environment: test │
│ │
│ 1. Install gsd-pi@dev from registry │
│ 2. CLI smoke tests (--version, init, help, config) │
│ 3. Dry-run fixture suite (recorded LLM conversations) │
│ - Agent session replay with fixture provider │
│ - Tool use round-trips verified │
│ - Extension loading validated │
│ 4. npm dist-tag add gsd-pi@<version> next │
│ │
│ Docker: Build + push runtime image to GHCR as :next │
└──────────────────────────┬──────────────────────────────────┘
▼ (manual approval required)
┌──────────────────────────────────────────────────────────────┐
│ STAGE: PROD Environment: prod │
│ │
│ 1. (Optional) Real LLM integration tests │
│ - Gated behind workflow input flag │
│ - Uses ANTHROPIC_API_KEY / OPENAI_API_KEY secrets │
│ - Budget-capped: small models, short conversations │
│ 2. npm dist-tag add gsd-pi@<version> latest │
│ 3. GitHub Release created with changelog │
│ 4. Docker: tag runtime image as :latest + :v<version>        │
│ 5. Post-publish smoke test against @latest                   │
└──────────────────────────────────────────────────────────────┘
```
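The stage chaining and approval gates above map directly onto GitHub Actions jobs. A minimal skeleton of how `pipeline.yml` might wire them together (job names and steps here are illustrative assumptions, not the final workflow):

```yaml
# Sketch of .github/workflows/pipeline.yml — stage chaining and
# environment gates only; the real steps will differ.
name: pipeline
on:
  push:
    branches: [main]
jobs:
  dev:
    runs-on: ubuntu-latest
    environment: dev                 # no protection rules
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test      # build + unit/integration tests
      - run: npm publish --tag dev   # gsd-pi@<version>-dev.<sha>
  test:
    needs: dev                       # auto-promote if all green
    runs-on: ubuntu-latest
    environment: test                # no protection rules
    steps:
      - run: npm run test:smoke      # CLI smoke tests against gsd-pi@dev
      - run: npm run test:fixtures   # recorded-LLM replay suite
  prod:
    needs: test
    runs-on: ubuntu-latest
    environment: prod                # required-reviewers gate blocks here
    steps:
      - run: echo "dist-tag promote + GitHub Release + Docker :latest"
```

The `environment: prod` line is what makes the manual approval work: GitHub pauses the job until a configured reviewer approves, with no extra logic in the workflow itself.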
### Version Strategy
| Dist-tag | When published | Version format | Risk level |
|----------|---------------|----------------|------------|
| `@dev` | Every merged PR | `1.5.0-dev.a3f2c1b` | Bleeding edge |
| `@next` | Auto-promoted from Dev | Same version, new tag | Candidate |
| `@latest` | Manually approved from Test | Same version, new tag | Production |
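The `@dev` version format in the table is a stamped pre-release. A sketch of the stamping logic (hypothetical; the real `scripts/version-stamp.mjs` may differ):

```javascript
// Sketch of the version-stamp helper (hypothetical; cf. scripts/version-stamp.mjs).
// Produces the @dev pre-release format from the table above:
// <version>-dev.<short-sha>, e.g. 1.5.0-dev.a3f2c1b.
function stampDevVersion(baseVersion, commitSha) {
  const shortSha = commitSha.slice(0, 7); // 7-char short SHA, git's default
  return `${baseVersion}-dev.${shortSha}`;
}

console.log(stampDevVersion('1.5.0', 'a3f2c1b0de4f1122'));
```

Because `@next` and `@latest` reuse the same version with a new dist-tag, the stamp happens exactly once, at the Dev stage.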
### Relationship to Existing Workflows
| File | Trigger | Purpose | Status |
|------|---------|---------|--------|
| `ci.yml` | PR opened/updated | Pre-merge gate: build, test, typecheck | **Unchanged** |
| `build-native.yml` | `v*` tag or manual dispatch | Cross-compile native binaries for 5 platforms | **Unchanged** |
| `pipeline.yml` | Push to `main` | Post-merge promotion: Dev → Test → Prod | **New** |
## Docker Images
### Multi-Stage Dockerfile
Two images from a single `Dockerfile` at the repo root.
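For illustration, the two targets might be laid out like this (a sketch under assumptions — package names and install steps are placeholders, not the actual Dockerfile):

```dockerfile
# Sketch of the multi-stage layout (assumed; the real Dockerfile may differ).

# --- CI builder: full toolchain, rebuilt only when this file changes ---
FROM node:22-bookworm AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc-aarch64-linux-gnu \
    && rm -rf /var/lib/apt/lists/*
RUN curl https://sh.rustup.rs -sSf | sh -s -- -y --default-toolchain stable
# Playwright system deps, ARM64 Rust target, etc. would be added here

# --- Runtime: Node + git + globally installed gsd-pi ---
FROM node:22-slim AS runtime
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
RUN npm install -g gsd-pi
ENTRYPOINT ["gsd-pi"]
```

The `--target builder` / `--target runtime` flags in the npm scripts below select one stage from this single file.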
#### CI Builder Image
- **Name:** `ghcr.io/gsd-build/gsd-ci-builder`
- **Base:** `node:22-bookworm`
- **Contains:** Node 22, Rust stable toolchain, `aarch64-linux-gnu` cross-compiler, Playwright system deps
- **Size:** ~2.5 GB
- **Rebuilt:** Only when `Dockerfile` changes
- **Purpose:** Eliminates 3-5 min toolchain install on every CI run
#### Runtime Image
- **Name:** `ghcr.io/gsd-build/gsd-pi`
- **Base:** `node:22-slim`
- **Contains:** Node 22, git, `gsd-pi` installed globally
- **Size:** ~250 MB
- **Tags:** `:latest`, `:next`, `:v1.2.3`
- **Published:** On every Prod promotion
- **Purpose:** `docker run ghcr.io/gsd-build/gsd-pi` as alternative to `npx`
### Why These Base Images
- **Bookworm for CI:** The Rust native crates depend on vendored `libgit2`, image processing, and cross-compilation to ARM64. Debian Bookworm provides the full toolchain via apt. Alpine breaks due to musl vs glibc incompatibilities with N-API bindings.
- **Slim for runtime:** Only needs Node + git. Native `.node` binaries are prebuilt and bundled in the npm package — no Rust toolchain needed at runtime.
## LLM Fixture Recording & Replay System
### Architecture
The fixture system hooks into the `pi-ai` provider abstraction layer to capture and replay LLM conversations without hitting real APIs.
```
Agent Session
pi-ai provider abstraction
FixtureProvider (intercept layer)
├── record mode → Real API + save to fixture JSON
└── replay mode → Load fixture JSON (no API call)
```
### Modes
| Mode | Trigger | Behavior |
|------|---------|----------|
| **Record** | `GSD_FIXTURE_MODE=record GSD_FIXTURE_DIR=./fixtures` | Proxies to real API, saves request/response pairs |
| **Replay** | `GSD_FIXTURE_MODE=replay GSD_FIXTURE_DIR=./fixtures` | Matches by turn index, returns saved response |
| **Off** | Default (no env vars) | Normal operation |
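The mode dispatch above can be sketched as a thin wrapper around the real provider (hypothetical shapes — the actual `tests/fixtures/provider.ts` wraps the `pi-ai` provider interface, whose method names may differ):

```javascript
// Minimal sketch of the FixtureProvider intercept layer (hypothetical;
// cf. tests/fixtures/provider.ts). `inner` stands in for a real pi-ai provider.
class FixtureProvider {
  constructor(inner, { mode = process.env.GSD_FIXTURE_MODE, fixture } = {}) {
    this.inner = inner;                    // real provider; only used off/record
    this.mode = mode;                      // 'record' | 'replay' | undefined (off)
    this.fixture = fixture ?? { turns: [] };
    this.turnIndex = 0;                    // replay matches by turn index
  }

  async complete(request) {
    if (this.mode === 'replay') {
      const turn = this.fixture.turns[this.turnIndex++];
      if (!turn) {
        // Conversation diverged from the fixture: fail explicitly.
        throw new Error(`no recorded turn ${this.turnIndex - 1}; fixture diverged`);
      }
      return turn.response;                // no API call
    }
    const response = await this.inner.complete(request);
    if (this.mode === 'record') {
      this.fixture.turns.push({ request, response }); // serialized to JSON later
    }
    return response;
  }
}
```

In off mode the wrapper is a pass-through, so production code paths are identical whether or not fixtures are in play.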
### Fixture Format
One JSON file per recorded session:
```json
{
"name": "agent-creates-file",
"recorded": "2026-03-17T00:00:00Z",
"provider": "anthropic",
"model": "claude-sonnet-4-6",
"turns": [
{
"request": {
"messages": [{ "role": "user", "content": "Create hello.ts" }],
"tools": ["Write", "Read"],
"model": "claude-sonnet-4-6"
},
"response": {
"content": [
{ "type": "text", "text": "I'll create hello.ts for you." },
{ "type": "tool_use", "name": "Write", "input": { "file_path": "hello.ts", "content": "console.log('hello')" } }
],
"stopReason": "toolUse",
"usage": { "input": 150, "output": 45 }
}
}
]
}
```
### Matching Strategy
Turn-index based. Response N is served for request N in sequence. If the conversation diverges from the fixture, the test fails explicitly.
**Why not request-body hashing:** request bodies contain timestamps, random IDs, and system-prompt variations that cause brittle mismatches.
**Why not a generic HTTP VCR:** the `pi-ai` layer abstracts 20+ providers with different wire formats; intercepting above the transport keeps fixtures provider-agnostic.
### What Gets Tested via Fixtures
- Agent session lifecycle (start → tool calls → completion)
- Tool dispatch and response handling
- Multi-turn conversation flow
- Extension loading and routing
- Error handling paths (fixtures can include error responses)
### What Does NOT Get Tested (Deferred to Live Gate)
- Model output quality
- Prompt regression
- New tool compatibility with live APIs
### Fixture Storage
Committed to repo under `tests/fixtures/recordings/`. Each fixture is 5-50KB of JSON. Recording is a manual developer action, not automated in CI.
## New Files & Scripts
### Directory Structure
```
tests/
├── smoke/ # CLI smoke tests (Stage: Test)
│ ├── run.mjs
│ ├── test-version.mjs
│ ├── test-help.mjs
│ └── test-init.mjs
├── fixtures/ # Recorded LLM replay tests (Stage: Test)
│ ├── run.ts # Test runner
│ ├── record.ts # Recording helper
│ ├── provider.ts # FixtureProvider intercept layer
│ └── recordings/
│ ├── agent-creates-file.json
│ ├── agent-reads-and-edits.json
│ ├── agent-handles-error.json
│ └── agent-multi-turn-tools.json
└── live/                        # Real LLM tests (Stage: Prod, optional)
    ├── run.ts
    ├── test-anthropic-roundtrip.ts
    └── test-openai-roundtrip.ts
scripts/
└── version-stamp.mjs            # Stamps <version>-dev.<sha>
Dockerfile                       # Multi-stage: builder + runtime
.github/workflows/pipeline.yml # Promotion pipeline
```
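The smoke tests are thin checks on CLI stdout. A sketch of the version check (hypothetical — the real `tests/smoke/test-version.mjs` would spawn the published CLI via `child_process` and assert on its output):

```javascript
// Sketch of the version smoke check (hypothetical; cf. tests/smoke/test-version.mjs).
// The real test would run e.g.
//   execFileSync('npx', ['gsd-pi@dev', '--version'], { encoding: 'utf8' })
// and feed its stdout through a pure validator like this one.

// Accepts stable versions (1.5.0) and dev-stamped ones (1.5.0-dev.a3f2c1b).
function isValidVersionOutput(stdout) {
  return /^\d+\.\d+\.\d+(-dev\.[0-9a-f]{7})?$/.test(stdout.trim());
}

console.log(isValidVersionOutput('1.5.0-dev.a3f2c1b')); // true
console.log(isValidVersionOutput('command not found')); // false
```

Keeping the assertion pure makes the same check reusable against `@dev`, `@next`, and `@latest` outputs in the post-publish smoke test.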
### New npm Scripts
```json
{
"test:smoke": "node tests/smoke/run.mjs",
"test:fixtures": "node --experimental-strip-types tests/fixtures/run.ts",
"test:fixtures:record": "GSD_FIXTURE_MODE=record node --experimental-strip-types tests/fixtures/record.ts",
"test:live": "GSD_LIVE_TESTS=1 node --experimental-strip-types tests/live/run.ts",
"pipeline:version-stamp": "node scripts/version-stamp.mjs",
"docker:build-runtime": "docker build --target runtime -t ghcr.io/gsd-build/gsd-pi .",
"docker:build-builder": "docker build --target builder -t ghcr.io/gsd-build/gsd-ci-builder ."
}
```
## GitHub Configuration
| Setting | Value |
|---------|-------|
| Environment: `dev` | No protection rules |
| Environment: `test` | No protection rules (auto-promote) |
| Environment: `prod` | Required reviewers: maintainers |
| Secret: `NPM_TOKEN` | All environments |
| Secret: `ANTHROPIC_API_KEY` | Prod only |
| Secret: `OPENAI_API_KEY` | Prod only |
| GHCR | Enabled for org |
## Success Criteria
1. A merged PR is installable via `npx gsd-pi@dev` within 10 minutes
2. Fixture replay tests complete in under 60 seconds with zero API calls
3. The full Dev → Test promotion completes without human intervention
4. Prod promotion is blocked until a maintainer explicitly approves
5. `docker run ghcr.io/gsd-build/gsd-pi --version` returns the correct version
6. Existing `ci.yml` and `build-native.yml` workflows continue to work unchanged
7. CI builder image reduces toolchain setup from ~3-5 min to ~30s pull


@@ -11,7 +11,7 @@ After the user describes their idea, **do not ask questions yet**. First, prove
1. Summarize what you understood in your own words — concretely, not abstractly.
2. Give an honest size read: roughly how many milestones, roughly how many slices in the first one. Base this on the actual work involved, not a classification label. A config change might be 1 milestone with 1 slice. A social network might be 5 milestones with 8+ slices each. Use your judgment.
3. Include scope honesty — a bullet list of the major capabilities you're hearing: "Here's what I'm hearing: [bullet list of major capabilities]."
-4. Ask: "Did I get that right, or did I miss something?" — plain text, not `ask_user_questions`. Let them correct freely.
+4. Ask: "Does that capture it? If not, tell me what I missed." — plain text, not `ask_user_questions`. Let them correct freely.
This prevents runaway questioning by forcing comprehension proof before anything else. Do not skip this step. Do not combine it with the first question round.