From 88781cb722ad3a6d3ebd3afba249466dfebcf6d0 Mon Sep 17 00:00:00 2001
From: Lex Christopherson <lex@glittercowboy.com>
Date: Wed, 11 Mar 2026 16:20:39 -0600
Subject: [PATCH] =?UTF-8?q?docs:=20queue=20M003=20=E2=80=94=20AI-Driven=20?=
 =?UTF-8?q?Test=20Flows?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .gsd/DECISIONS.md                    |  33 +++++
 .gsd/PROJECT.md                      |  36 +++++
 .gsd/QUEUE.md                        |   7 +
 .gsd/REQUIREMENTS.md                 | 205 +++++++++++++++++++++++++++
 .gsd/milestones/M003/M003-CONTEXT.md | 133 +++++++++++++++++
 5 files changed, 414 insertions(+)
 create mode 100644 .gsd/DECISIONS.md
 create mode 100644 .gsd/PROJECT.md
 create mode 100644 .gsd/QUEUE.md
 create mode 100644 .gsd/REQUIREMENTS.md
 create mode 100644 .gsd/milestones/M003/M003-CONTEXT.md

diff --git a/.gsd/DECISIONS.md b/.gsd/DECISIONS.md
new file mode 100644
index 000000000..ae0071c20
--- /dev/null
+++ b/.gsd/DECISIONS.md
@@ -0,0 +1,33 @@
+# Decisions Register
+
+<!-- Append-only. Never edit or remove existing rows.
+     To reverse a decision, add a new row that supersedes it.
+     Read this file at the start of any planning or research phase. -->
+
+| # | When | Scope | Decision | Choice | Rationale | Revisable? |
+|---|------|-------|----------|--------|-----------|------------|
+| D001 | M001 | arch | Embedding strategy | SDK (`createAgentSession` + `InteractiveMode`) | Type-safe, no subprocess management, full control over storage/resources, cleanest branded app path per pi docs | No |
+| D002 | M001 | arch | State storage location | `~/.gsd/` (agent: `~/.gsd/agent/`, sessions: `~/.gsd/sessions/`) | Complete isolation from `~/.pi/`, clear brand identity, follows pi doc recommendation for branded apps | No |
+| D003 | M001 | arch | Branding mechanism | `PI_PACKAGE_DIR` env var set before pi internals load, pointing to gsd package root; gsd `package.json` declares `piConfig: { name: "gsd", configDir: ".gsd" }` | `config.js` reads `APP_NAME` from `piConfig.name` in the package.json found at `PI_PACKAGE_DIR`. Only mechanism that renames the TUI header without patching pi source. | Yes — if pi adds a dedicated `createAgentSession` appName option |
+| D004 | M001 | arch | Extension delivery | Copy extension `.ts` source into `src/resources/extensions/` at dev time; load via `DefaultResourceLoader.additionalExtensionPaths`; pi's jiti handles JIT compilation at runtime | Preserves pi's JIT compilation model, no separate build step for extensions, extensions stay readable source | Yes — if extension count grows large enough to warrant pre-compilation |
+| D005 | M001 | scope | Skills in M001 | Excluded — extensions only | User decision during discussion | Yes — M002 candidate |
+| D006 | M001 | scope | Plugin/install system | Deferred | Not MVP; bundled-only product for M001 | Yes — M002 candidate |
+| D007 | M001 | arch | pi interop | None — GSD never reads or writes `~/.pi/` | GSD is a product, not a pi config. Interop would blur the brand boundary. | No |
+| D008 | M001/S01 | verification | S01 verification strategy | Shell commands + real TTY launch (no test framework) | S01 is a pure binary launch / TUI branding check. The only meaningful assertion is whether the binary launches with "gsd" in the header — no unit-testable logic to isolate. Shell verification commands cover all must-haves. Test framework deferred to S02+ if needed. | Yes — add test framework in S02 if extension loading logic warrants it |
+| D009 | M001/S01 | arch | `files` array in package.json | Set in T03 during S01 (`["dist", "package.json", "README.md"]`) | Correct npm publish manifest must be in place before S04 pack/publish. Setting it early avoids a late-stage surprise. | No |
+| D010 | M001/S01/T02 | impl | ModelRegistry instantiation | Constructor `new ModelRegistry(authStorage)` — not a static factory | SDK types show no `.create()` on ModelRegistry; authStorage is passed directly to the constructor. The other managers (AuthStorage, SettingsManager, SessionManager) each use a static, synchronous `.create()`. | No |
+| D011 | M001/S01/T02 | impl | InteractiveMode.run() | Instance method: `new InteractiveMode(session); mode.run()` — not static | SDK type declarations confirm `run()` is an instance method; static call would fail at runtime. | No |
+| D012 | M001/S01/T02 | impl | skipLibCheck in tsconfig | `skipLibCheck: true` added | `@google/genai` published types reference `@modelcontextprotocol/sdk` which is not installed as a type dep — causes transitive TS2307 error unrelated to gsd code. skipLibCheck is the standard fix for third-party type declaration issues. | Yes — remove if MCP types are added as a dep in the future |
+| D013 | M001/S01/T03 | arch | `PI_PACKAGE_DIR` shim directory (`pkg/`) | Added `pkg/` dir with `package.json` (piConfig) + `dist/modes/interactive/theme/` (pi theme JSONs) as the `PI_PACKAGE_DIR` target | `config.js::getThemesDir()` uses `getPackageDir()` (= PI_PACKAGE_DIR) and checks if `<dir>/src` exists; if yes, uses `src/modes/interactive/theme/` instead of `dist/`. Our project has a real `src/` dir, causing themes to resolve to the wrong path. Pointing PI_PACKAGE_DIR at `pkg/` (which has no `src/`) avoids the collision while still providing `piConfig` for branding. `pkg/dist/modes/interactive/theme/` is populated by `npm run copy-themes` (build script). | Yes — if pi adds a dedicated `appName` option to createAgentSession making PI_PACKAGE_DIR unnecessary |
+| D014 | M001/S02 | verification | S02 verification strategy | Shell commands + real TTY launch with stderr capture, no test framework | Extension loading is a runtime integration concern — no unit-testable logic to isolate. The meaningful assertions are: zero extension errors in stderr on launch, correct env vars in compiled loader.js, absence of `~/.pi/` refs in patched files. Shell commands cover all must-haves. Test framework deferred per D008. | Yes — add test framework if extension loading logic grows complex |
+| D015 | M001/S02 | arch | subagent spawn approach | `spawn(process.execPath, [GSD_BIN_PATH, ...extensionArgs, ...args])` — no `pi` binary in PATH | Patched subagent spawns node directly with the gsd dist/loader.js entrypoint. This ensures spawned subagents always use the bundled gsd extensions, regardless of what `pi` is in PATH. `GSD_BIN_PATH` = `process.argv[1]` from loader.ts. | Yes — if pi adds a native subagent spawn API |
+| D016 | M001/S02 | arch | shared/ is a library, not an extension entry point | `shared/` is NOT added to `additionalExtensionPaths` | `shared/ui.ts`, `shared/next-action-ui.ts` etc. are cross-extension imports, not independently registered extensions. They are discovered by jiti when gsd and ask-user-questions import them via `../shared/*.js`. Adding shared/ as an extension entry point would attempt to register it as an extension (which it isn't). | No |
+| D017 | M001/S02 | arch | AGENTS.md first-run write | `initResources()` writes bundled AGENTS.md to `~/.gsd/agent/AGENTS.md` on first launch | pi's `loadProjectContextFiles` discovers AGENTS.md from `agentDir` (`~/.gsd/agent/`). On fresh install this file doesn't exist. One-time write on launch (behind existsSync check) ensures spawned subagents always pick up GSD's hard rules and execution heuristics. | No |
+| D018 | M001/S03 | arch | Wizard injection point | Pre-session: before `createAgentSession()`, not via `session_start` event hook | Running wizard before `createAgentSession()` ensures Anthropic key is in `authStorage` before `modelRegistry.getAvailable()` runs — avoids "No models available" fallback warning. S01 forward intelligence mentioned session_start hook; pre-session approach is strictly better because the session starts clean with a valid model. | Yes — if pi adds a native `beforeStart` or `authMissing` hook to `createAgentSession` |
+| D019 | M001/S03 | verification | S03 verification strategy | Shell script (`scripts/verify-s03.sh`) for automated non-TTY/skip checks + interactive UAT for masked input and TUI launch | Wizard involves TTY interaction that cannot be meaningfully automated (masked stdin, TUI launch). Automated shell script covers all non-interactive assertions (exit codes, error text, env hydration). Interactive UAT covers the remaining visual/interactive behaviors. No test framework added — consistent with D008/D014. | Yes — add test framework if wizard logic grows complex |
+| D020 | M001/S03 | arch | Wizard scope | Optional tool keys only (Brave/Context7/Jina) — Anthropic auth is pi's responsibility via OAuth | Wizard collecting Anthropic key was redundant (pi already handles it) and interfered with verify script automation. Optional-key scope satisfies R006. | Yes — if pi adds a native "no Anthropic key" callback hook |
+| D021 | M001/S04 | arch | GSD_BUNDLED_EXTENSION_PATHS target | agentDir-based paths, not src/resources paths | When subagent spawns a child gsd process via --extension flags, the child also runs initResources + buildResourceLoader from agentDir. src/resources paths ≠ agentDir paths → pi deduplication fails → duplicate tool registration errors. Pointing to agentDir paths means both the --extension args and agentDir scan resolve identically → deduplication works. Safe because subagent spawning only happens after initResources has synced on first launch. | No |
+| D022 | M001/S04 | verification | S04 verification strategy | 10-check `scripts/verify-s04.sh` for tarball install path; registry publish check automated; interactive UAT for wizard fire from clean install | Tarball install + launch is automatable (env isolation, background kill). Registry install check is automatable (prefix install + stderr check). Wizard TTY interaction is UAT-only. Consistent with D008/D014/D019 — shell scripts, no test framework. | Yes — add test framework if automated E2E is needed later |
+| D023 | M003 | arch | Test flow execution model | Intent-based YAML specs, not deterministic scripts — agent interprets verify blocks with full adaptive intelligence | Evaluated Maestro (JVM dep, deterministic scripting, mobile-first) and decided against embedding or cloning it. GSD's advantage is AI-in-the-loop. Flows describe what to verify; the agent decides how. Faster iteration, better flakiness handling, plays to GSD's strength. | Yes — could add deterministic fast-path for simple assertions later |
+| D024 | M003 | arch | Test browser isolation | test-flows runs its own Playwright instance, separate from browser-tools | Test execution must not be polluted by development browser state (cookies, auth, DOM mutations). Running two Playwright instances in one process is supported. This keeps the test-flows extension fully decoupled from browser-tools. | No |
+| D025 | M003 | arch | Maestro integration | Not embedded — optional external tool if user installs it | Maestro requires JVM, adds ~200MB+ footprint, its YAML format is deterministic scripts not intent specs. GSD builds its own testing arm. Maestro MCP could be wired in later as an optional extension for users who want it. | Yes — could add maestro MCP wrapper extension later |
diff --git a/.gsd/PROJECT.md b/.gsd/PROJECT.md
new file mode 100644
index 000000000..d6afd5428
--- /dev/null
+++ b/.gsd/PROJECT.md
@@ -0,0 +1,36 @@
+# Project
+
+## What This Is
+
+GSD 2.0 is a branded npm CLI (`npm install -g gsd-pi`) that ships the full GSD coding agent experience as a standalone product. It embeds `@mariozechner/pi-coding-agent` via SDK, stores state in `~/.gsd/`, bundles the GSD extension, all supporting extensions, agents, and AGENTS.md context, and runs pi's `InteractiveMode` under the `gsd` brand. Users run `gsd` — not `pi`.
+
+## Core Value
+
+A single `npm install -g gsd-pi` gives any developer a fully configured, GSD-branded coding agent with the GSD extension, all supporting tools (browser, search, context7, subagent, bg-shell, etc.), and a first-run setup wizard that collects API keys — ready to use in under two minutes.
+
+## Current State
+
+M001/S01, S02, and S03 complete. `gsd` binary compiles and launches with "gsd" TUI branding. All 11 bundled extensions load without errors. State goes to `~/.gsd/`. `~/.pi/` is untouched. AGENTS.md auto-deployed to `~/.gsd/agent/` on first launch. First-run wizard fires for missing optional keys (Brave/Context7/Jina), stores them with masked input, and skips on subsequent launches. Only S04 (npm publish and install smoke test) remains.
+
+Key structural artifact: `pkg/` shim directory — `PI_PACKAGE_DIR` points here (not project root) to avoid pi's `getThemesDir()` collision with our real `src/` dir. Committed; `pkg/dist/modes/interactive/theme/` populated by `npm run copy-themes` at build time.
+
+## Architecture / Key Patterns
+
+- **SDK embedding**: `@mariozechner/pi-coding-agent` imported as a library via `createAgentSession` + `InteractiveMode`
+- **Branded app directories**: state lives in `~/.gsd/agent/`, sessions in `~/.gsd/sessions/` (constants in `src/app-paths.ts`)
+- **Branding via `PI_PACKAGE_DIR`**: env var set in `src/loader.ts` before any pi SDK loads; points to `pkg/` shim; `pkg/package.json` declares `piConfig: { name: "gsd", configDir: ".gsd" }`
+- **Two-file loader pattern**: `loader.ts` (sets env vars, zero SDK imports, dynamic-imports `cli.js`) → `cli.ts` (static SDK imports, wires all managers)
+- **pkg/ shim**: lean subdirectory — only `package.json` (piConfig) and `dist/modes/interactive/theme/` (pi theme assets). No `src/`. Avoids `getThemesDir()` src-check collision.
+- **Bundled extensions**: GSD extension + 10 supporting extensions in `src/resources/extensions/`; loaded via `buildResourceLoader()` → `DefaultResourceLoader.additionalExtensionPaths`; all 11 load clean on launch
+- **Bundled agents + AGENTS.md**: scout, researcher, worker in `src/resources/agents/`; `initResources()` writes bundled AGENTS.md to `~/.gsd/agent/` on first launch (existsSync guard)
+- **4 GSD_ env vars**: set in loader.ts before cli.js loads — `GSD_CODING_AGENT_DIR`, `GSD_BIN_PATH`, `GSD_WORKFLOW_PATH`, `GSD_BUNDLED_EXTENSION_PATHS`
+- **First-run wizard**: `src/wizard.ts` — detects missing optional keys (Brave/Context7/Jina), prompts with masked TTY input, writes to `~/.gsd/agent/auth.json`; `loadStoredEnvKeys` hydrates env on every launch before extensions load
+
+## Capability Contract
+
+See `.gsd/REQUIREMENTS.md` for the explicit capability contract, requirement status, and coverage mapping.
+
+## Milestone Sequence
+
+- [ ] M001: MVP CLI — `npm install -g gsd-pi` installs, launches, and runs with all bundled extensions and first-run setup
+- [ ] M003: AI-Driven Test Flows — intent-based YAML test specs the agent writes during development and executes autonomously at UAT time (browser, mac, api targets)
diff --git a/.gsd/QUEUE.md b/.gsd/QUEUE.md
new file mode 100644
index 000000000..3160d3feb
--- /dev/null
+++ b/.gsd/QUEUE.md
@@ -0,0 +1,7 @@
+# Queue
+
+<!-- Append-only log of queued milestones. -->
+
+| # | Queued | Milestone | Title | Depends On | Notes |
+|---|--------|-----------|-------|------------|-------|
+| 1 | 2026-03-11 | M003 | AI-Driven Test Flows | M001 (bundled extension infrastructure) | Intent-based YAML test specs — browser, mac, api targets — with flow-driven UAT type for autonomous execution at slice completion |
diff --git a/.gsd/REQUIREMENTS.md b/.gsd/REQUIREMENTS.md
new file mode 100644
index 000000000..cb4f7cf99
--- /dev/null
+++ b/.gsd/REQUIREMENTS.md
@@ -0,0 +1,205 @@
+# Requirements
+
+This file is the explicit capability and coverage contract for GSD 2.0.
+
+## Active
+
+### R001 — Single-command install
+
+- Class: primary-user-loop
+- Status: validated
+- Description: `npm install -g gsd-pi` installs the gsd CLI and all bundled resources in a single command with no additional manual steps required
+- Why it matters: The whole product promise is zero-friction install. If install requires manual steps, the product fails its core pitch.
+- Source: user
+- Primary owning slice: M001/S01
+- Supporting slices: M001/S04
+- Validation: S04 — npm install -g gsd-pi from registry installs working binary; zero extension load errors; R001 fully validated
+
+### R002 — Branded identity
+
+- Class: differentiator
+- Status: validated
+- Description: The CLI is named `gsd`, state lives in `~/.gsd/`, the TUI header shows "gsd", and no pi branding is visible to the user in normal operation
+- Why it matters: GSD 2.0 is a product, not a pi config. Users should experience a coherent branded tool.
+- Source: user
+- Primary owning slice: M001/S01
+- Supporting slices: none
+- Validation: S01 — TUI header confirmed "gsd" via live runtime launch; piConfig.name=gsd, piConfig.configDir=.gsd verified; ~/.gsd/ confirmed created
+
+### R003 — Bundled GSD extension
+
+- Class: core-capability
+- Status: validated
+- Description: The `/gsd` command, auto-mode, GSD dashboard (Ctrl+Alt+G), and all GSD workflow commands work out of the box with no additional configuration
+- Why it matters: The GSD extension is the primary reason users install this tool.
+- Source: user
+- Primary owning slice: M001/S02
+- Supporting slices: none
+- Validation: S02 — gsd extension loads without errors on launch (zero stderr extension errors confirmed); interactive /gsd command use deferred to S04 UAT
+
+### R004 — Bundled supporting extensions
+
+- Class: core-capability
+- Status: validated
+- Description: All extensions from `~/.pi/agent/extensions/` ship bundled: browser-tools, search-the-web, context7, subagent, bg-shell, worktree, plan-mode, slash-commands, ask-user-questions, get-secrets-from-user
+- Why it matters: These extensions are what make the agent useful as a coding agent. GSD without browser tools, web search, and subagent is significantly less capable.
+- Source: user
+- Primary owning slice: M001/S02
+- Supporting slices: none
+- Validation: S02 — all 10 supporting extensions load without errors (zero stderr extension errors on launch); functional tool use (browser launch, web search) deferred to S04 UAT
+
+### R005 — Bundled agents and AGENTS.md
+
+- Class: core-capability
+- Status: validated
+- Description: The scout, researcher, and worker agents are bundled and available. The AGENTS.md hard rules and execution heuristics are loaded as the default agent context.
+- Why it matters: Agents and AGENTS.md define how the model behaves. Without them, subagent delegation and model discipline don't work.
+- Source: user
+- Primary owning slice: M001/S02
+- Supporting slices: none
+- Validation: S02 — scout.md, researcher.md, worker.md present in src/resources/agents/; AGENTS.md (15,070 bytes) written to ~/.gsd/agent/ on first launch via initResources()
+
+### R006 — First-run setup wizard
+
+- Class: launchability
+- Status: validated
+- Description: On first run, if optional tool API keys (Brave, Context7, Jina) are missing, a wizard prompts for them with masked input. Keys are stored in `~/.gsd/agent/auth.json` and hydrated into process.env on every launch. Wizard does not run on subsequent starts if keys are already configured. Anthropic auth is handled by pi's OAuth/API key flow — not the wizard.
+- Why it matters: Without API keys, nothing works. A wizard that detects and collects missing keys turns a broken first run into a successful one.
+- Source: user
+- Primary owning slice: M001/S03
+- Supporting slices: none
+- Validation: S03 — automated verify script (6/6 pass) + interactive UAT; wizard fires for missing optional keys, stores them, TUI launches, rerun skips wizard
+
+### R007 — Isolated state in ~/.gsd/
+
+- Class: quality-attribute
+- Status: validated
+- Description: All GSD state (auth, sessions, settings, logs) lives in `~/.gsd/`, completely separate from `~/.pi/`. Installing gsd must not modify or read a user's existing pi configuration.
+- Why it matters: Users may have an existing pi installation. GSD must not corrupt or interfere with it.
+- Source: inferred
+- Primary owning slice: M001/S01
+- Supporting slices: none
+- Validation: S01 — ~/.gsd/agent/ and ~/.gsd/sessions/ created after launch; ~/.pi/agent/sessions/ count unchanged (28/28) before and after gsd run
+
+### R008 — npm update workflow
+
+- Class: continuity
+- Status: validated
+- Description: `npm update -g gsd-pi` installs a new version with updated bundled resources. The update is clean — no stale extension files from old versions.
+- Why it matters: Software that can't update cleanly accumulates technical debt and breaks silently.
+- Source: user
+- Primary owning slice: M001/S04
+- Supporting slices: none
+- Validation: S04 — cpSync force:true in initResources ensures npm update -g replaces bundled resources; tarball smoke test confirms clean install path
+
+### R009 — Observable failure state
+
+- Class: failure-visibility
+- Status: validated
+- Description: If optional tool API keys are missing in a non-interactive run, the warning is actionable: it names the missing providers. Extension load failures are surfaced, not silently swallowed.
+- Why it matters: Silent failures are debugging nightmares. A future agent or user must be able to localize what broke without guessing.
+- Source: inferred
+- Primary owning slice: M001/S03
+- Supporting slices: M001/S02
+- Validation: S03 — non-TTY warning names all three missing providers (Brave Search, Context7, Jina); cat ~/.gsd/agent/auth.json shows stored state; extension load failure surface from S02 confirmed intact
+
+### R010 — Test flow execution
+
+- Class: core-capability
+- Status: active
+- Description: The agent can write YAML test specifications during development and execute them against browser, mac, and api targets via `run_test_flow` and `run_test_suite` tools. Flows use intent-based verification blocks (verify/given/expect) that the agent interprets adaptively. Browser tests run in a fresh isolated Playwright session.
+- Why it matters: Closes the gap between "agent builds a feature" and "agent proves it works" — durable, re-runnable test artifacts that survive context wipes.
+- Source: user
+- Primary owning slice: M003 (TBD)
+- Supporting slices: none
+- Validation: unmapped
+
+### R011 — Flow-driven UAT
+
+- Class: core-capability
+- Status: active
+- Description: GSD auto-mode recognizes `flow-driven` as a UAT type. At slice completion, the UAT pipeline automatically executes all flow files in the slice's `flows/` directory and writes structured pass/fail results to the UAT result file.
+- Why it matters: Makes UAT fully autonomous for slices with test flows — no human intervention needed for UI/API verification.
+- Source: user
+- Primary owning slice: M003 (TBD)
+- Supporting slices: none
+- Validation: unmapped
+
+## Deferred
+
+### R020 — Plugin system
+
+- Class: differentiator
+- Status: deferred
+- Description: Allow users to install additional pi packages on top of GSD via `gsd install npm:pkg`
+- Why it matters: Makes GSD extensible beyond what ships in the box
+- Source: inferred
+- Primary owning slice: none
+- Supporting slices: none
+- Validation: unmapped
+- Notes: Deferred — M001 ships bundled-only. Plugin support is explicitly post-MVP.
+
+### R021 — Skills bundle
+
+- Class: core-capability
+- Status: deferred
+- Description: Ship the skills from `~/.pi/agent/skills/` as bundled GSD skills
+- Why it matters: Skills provide specialized workflows
+- Source: user
+- Primary owning slice: none
+- Supporting slices: none
+- Validation: unmapped
+- Notes: User explicitly excluded skills from M001. Can add in M002.
+
+## Out of Scope
+
+### R030 — pi compatibility / interoperability
+
+- Class: anti-feature
+- Status: out-of-scope
+- Description: GSD does not read from or write to `~/.pi/`. There is no migration from pi to gsd. No `pi install npm:gsd` target.
+- Why it matters: Prevents scope confusion. GSD is a product, not a pi extension.
+- Source: user
+- Primary owning slice: none
+- Supporting slices: none
+- Validation: n/a
+- Notes: Explicitly out of scope by architecture decision.
+
+### R031 — Web/desktop UI
+
+- Class: constraint
+- Status: out-of-scope
+- Description: GSD 2.0 is terminal-only. No web UI, no Electron wrapper, no RPC mode.
+- Why it matters: Keeps scope focused on the CLI product.
+- Source: inferred
+- Primary owning slice: none
+- Supporting slices: none
+- Validation: n/a
+- Notes: `pi-web-ui` and RPC mode explicitly not used.
+
+## Traceability
+
+| ID   | Class              | Status       | Primary owner | Supporting | Proof    |
+| ---- | ------------------ | ------------ | ------------- | ---------- | -------- |
+| R001 | primary-user-loop  | validated    | M001/S01      | M001/S04   | S04 — npm install -g gsd-pi from registry; zero extension errors; binary confirmed |
+| R002 | differentiator     | validated    | M001/S01      | none       | S01 — TUI shows "gsd", piConfig confirmed, ~/.gsd/ confirmed |
+| R003 | core-capability    | validated    | M001/S02      | none       | S02 — gsd extension loads clean; interactive /gsd use deferred to S04 |
+| R004 | core-capability    | validated    | M001/S02      | none       | S02 — all 10 supporting extensions load without errors; functional use deferred to S04 |
+| R005 | core-capability    | validated    | M001/S02      | none       | S02 — agents present; AGENTS.md (15,070 bytes) written to ~/.gsd/agent/ on first launch |
+| R006 | launchability      | validated    | M001/S03      | none       | S03 — optional-key wizard fires, stores, skips on rerun |
+| R007 | quality-attribute  | validated    | M001/S01      | none       | S01 — ~/.gsd/ created; ~/.pi/ sessions unchanged (28/28) |
+| R008 | continuity         | validated    | M001/S04      | none       | S04 — cpSync force:true; tarball smoke confirms clean install path |
+| R009 | failure-visibility | validated    | M001/S03      | M001/S02   | S03 — non-TTY warning names missing providers; extension errors surface confirmed |
+| R010 | core-capability    | active       | M003 (TBD)    | none       | unmapped |
+| R011 | core-capability    | active       | M003 (TBD)    | none       | unmapped |
+| R020 | differentiator     | deferred     | none          | none       | unmapped |
+| R021 | core-capability    | deferred     | none          | none       | unmapped |
+| R030 | anti-feature       | out-of-scope | none          | none       | n/a      |
+| R031 | constraint         | out-of-scope | none          | none       | n/a      |
+
+## Coverage Summary
+
+- Active requirements: 11
+- Mapped to slices: 9
+- Validated: 9 (R001, R002, R003, R004, R005, R006, R007, R008, R009)
+- Unmapped active requirements: 2 (R010, R011 — pending M003 planning)
diff --git a/.gsd/milestones/M003/M003-CONTEXT.md b/.gsd/milestones/M003/M003-CONTEXT.md
new file mode 100644
index 000000000..9363a614d
--- /dev/null
+++ b/.gsd/milestones/M003/M003-CONTEXT.md
@@ -0,0 +1,133 @@
+# M003: AI-Driven Test Flows — Context
+
+**Gathered:** 2026-03-11
+**Status:** Queued — pending auto-mode execution
+
+## Project Description
+
+A new GSD extension (`test-flows`) that introduces intent-based YAML test specifications the agent writes during development and executes autonomously at UAT time. Flows describe **what to verify** (not mechanical step-by-step scripts), and the agent interprets each verification block using its full adaptive intelligence — choosing selectors, handling flakiness, retrying intelligently, and diagnosing failures.
+
+Supports three target surfaces: **browser** (web apps via Playwright), **mac** (native macOS apps via Accessibility APIs), and **api** (HTTP request/response verification).
+
+This is GSD's testing arm — the thing that closes the loop between "agent builds a feature" and "agent proves it works."
+
+## Why This Milestone
+
+GSD's current UAT pipeline has a gap: `artifact-driven` UAT runs shell commands and file checks, while `live-runtime` and `human-experience` UAT punt to the human. There is no way for the agent to write durable, re-runnable UI/API tests during development that execute automatically at UAT time.
+
+The agent already has the tools (`browser_*`, `mac_*`, `bash` for HTTP) — what's missing is a structured format for persisting test intent and a runner that orchestrates execution against fresh isolated sessions. This milestone fills that gap.
+
+The insight from Maestro evaluation: don't compete with Maestro as a standalone deterministic test runner. Instead, leverage what GSD is uniquely good at — AI-driven adaptive execution of test specifications. The YAML files are intent specs, not scripts. The AI handles the "how."
+
+## User-Visible Outcome
+
+### When this milestone is complete, the user can:
+
+- See the agent write `.yaml` test flow files during slice development that describe what to verify
+- Have UAT run automatically at slice completion — the agent executes all flow files and writes a structured pass/fail report
+- Read `S01-UAT-RESULT.md` with per-flow, per-verification pass/fail results, timing, screenshots on failure, and diagnostic context
+- Manually trigger test flows via the agent calling `run_test_flow` or `run_test_suite` tools at any time
+- Test web apps (browser target), macOS apps (mac target), and APIs (api target) from the same flow format
+
+### Entry point / environment
+
+- Entry point: LLM tool calls (`run_test_flow`, `run_test_suite`) + GSD auto-mode UAT pipeline
+- Environment: local dev (macOS terminal running `gsd`)
+- Live dependencies involved: Playwright (bundled), mac-tools Swift CLI (bundled), HTTP via Node fetch (built-in)
+
+## Completion Class
+
+- Contract complete means: flow YAML parser validates correctly, runner executes all three targets (browser/mac/api) and returns structured results, `flow-driven` UAT type is recognized by the auto-mode pipeline
+- Integration complete means: agent writes flows during development, auto-mode UAT dispatches `run_test_suite`, results appear in `S01-UAT-RESULT.md`, failures include screenshots and diagnostics
+- Operational complete means: the full loop works end-to-end in a real GSD auto-mode session — agent builds a web feature, writes test flows, completes the slice, UAT runs the flows, report is written
+
+## Final Integrated Acceptance
+
+To call this milestone complete, we must prove:
+
+- Agent can write a browser-target flow YAML during development, and `run_test_flow` executes it against a running local web app with correct pass/fail results
+- Agent can write a mac-target flow YAML, and it executes against a real macOS app (e.g., TextEdit) with correct pass/fail results
+- Agent can write an api-target flow YAML with HTTP request/response checks, and it executes correctly
+- `flow-driven` UAT type triggers automatic test suite execution at slice completion in auto-mode, with results written to the UAT result file
+- Test execution uses a fresh isolated browser session, not the agent's development browser
+- Failures include actionable diagnostics: screenshots, console logs (browser), element state (mac), response bodies (api)
+
+## Risks and Unknowns
+
+- **Inter-extension isolation** — The test-flows extension must run its own Playwright browser instance, separate from browser-tools' instance. Two Playwright instances in the same process should work (Playwright supports it), but this needs verification. If they conflict, the runner may need to use a subprocess.
+- **Mac-tools CLI access** — The test-flows extension needs to call the mac-tools Swift CLI binary directly. The binary is compiled on first use by the mac-tools extension. test-flows must either wait for mac-tools to compile it first, or handle compilation itself. Need to determine the right approach.
+- **Agent flow authoring quality** — The value depends on Claude writing good test specifications during development. If the generated flows are too vague or too brittle, the system fails in practice. This is a prompt engineering challenge, not a code challenge. The system prompt guidelines for the tool must be excellent.
+- **Adaptive execution reliability** — Each `verify` block is interpreted by the LLM. Non-determinism means a flow might pass on one run and fail on the next. Need to design the execution model to minimize this (clear verify/expect structure, retries, good diagnostics on failure).
+- **Execution model for verify blocks** — The runner tool receives a YAML flow and must execute each verify block. Since extensions can't call other extensions' tools, the runner must use Playwright/mac-tools/fetch directly (not via `browser_*` tools). This means reimplementing some of the smart waiting/settling logic from browser-tools. Alternatively, each verify block could be dispatched as an LLM sub-turn — but that's expensive and slow. The right balance needs to be found.
+
+## Existing Codebase / Prior Art
+
+- `src/resources/extensions/browser-tools/index.ts` — Full Playwright browser automation extension (~4990 lines). Reference for Playwright patterns, adaptive settling, assertion evaluation, screenshot capture. The test-flows runner will import Playwright directly rather than calling these tools.
+- `src/resources/extensions/browser-tools/core.js` — Runtime-neutral helpers: action timeline, assertion evaluation (`evaluateAssertionChecks`), compact state diffing. May be importable by test-flows.
+- `src/resources/extensions/mac-tools/index.ts` — macOS Accessibility API automation via Swift CLI. Reference for how to invoke the Swift CLI binary (`execFileSync` with JSON protocol).
+- `src/resources/extensions/gsd/auto.ts` — GSD auto-mode engine. Contains `checkNeedsRunUat()`, `buildRunUatPrompt()`, UAT dispatch logic. Must be modified to support `flow-driven` UAT type.
+- `src/resources/extensions/gsd/files.ts` — Contains `extractUatType()` which classifies UAT types from markdown content. Must be extended with `flow-driven`.
+- `src/resources/extensions/gsd/prompts/run-uat.md` — UAT execution prompt template. Must be extended with `flow-driven` instructions.
+- `src/resources/extensions/gsd/templates/uat.md` — UAT file template. Must include `flow-driven` as a valid UAT mode.
+- Maestro (external, not embedded) — Inspiration for YAML flow format and "arm's length" testing philosophy. Not a dependency. Key takeaways: declarative YAML syntax, smart waiting, accessibility-layer interaction, cross-platform unified format.
+
+> See `.gsd/DECISIONS.md` for all architectural and pattern decisions — it is an append-only register; read it during planning, append to it during execution.
+
+## Relevant Requirements
+
+- R003 (Bundled GSD extension) — This extends the GSD extension's UAT pipeline with a new type
+- R004 (Bundled supporting extensions) — This adds a new bundled extension (`test-flows`)
+- New requirement candidates:
+  - R010 — Test flow execution: agent can write and execute YAML test specifications against browser, mac, and api targets
+  - R011 — Flow-driven UAT: auto-mode recognizes `flow-driven` UAT type and executes test suites automatically at slice completion
+
+## Scope
+
+### In Scope
+
+- New `test-flows` extension in `src/resources/extensions/test-flows/`
+- YAML flow format: header (name, target, url/app/endpoint) + verification blocks (verify/given/expect)
+- Flow parser with validation and clear error messages
+- Browser target runner: own Playwright instance, fresh context per flow, smart waiting, screenshot capture
+- Mac target runner: direct Swift CLI invocation, element resolution, screenshot capture
+- API target runner: HTTP requests via Node fetch, status/header/body assertions
+- Two LLM tools: `run_test_flow` (single flow) and `run_test_suite` (directory of flows)
+- Structured result output: per-flow, per-verification pass/fail, timing, screenshots, diagnostics
+- New `flow-driven` UAT type in GSD extension (`files.ts`, `auto.ts`, `run-uat.md`, `uat.md`)
+- System prompt guidelines that teach the agent when and how to write good test flows
+- Flow files stored alongside slices: `.gsd/milestones/M00X/slices/S0X/flows/*.yaml`
+
+### Out of Scope / Non-Goals
+
+- Maestro compatibility (not a goal — different format, different execution model)
+- Visual regression testing / image diffing (future enhancement)
+- Parallel flow execution / sharding (future enhancement)
+- CI/CD integration or headless-only mode (future enhancement)
+- Flow recording / interactive flow authoring UI (future enhancement — Maestro Studio equivalent)
+- Mobile device/simulator testing (would require Maestro or Appium — out of scope)
+
+## Technical Constraints
+
+- Must be a pi extension following existing patterns (`export default function(pi: ExtensionAPI)`)
+- Must use TypeBox for tool parameter schemas, StringEnum for enums
+- Must truncate tool output to stay within context limits
+- Browser runner must use a separate Playwright instance from browser-tools (test isolation)
+- Mac runner must invoke the Swift CLI binary at the known path (`src/resources/extensions/mac-tools/swift-cli/.build/release/mac-agent`)
+- No new npm dependencies beyond what's already bundled (Playwright, yaml parsing via existing means)
+- Extension loads via `additionalExtensionPaths` — same mechanism as all other bundled extensions
+
+## Integration Points
+
+- `browser-tools` extension — Shares Playwright dependency but NOT browser state. test-flows runs its own Playwright instance.
+- `mac-tools` extension — test-flows calls the same Swift CLI binary but independently. Must handle the case where the binary hasn't been compiled yet.
+- `gsd` extension — UAT pipeline integration: `files.ts` (extractUatType), `auto.ts` (checkNeedsRunUat, buildRunUatPrompt), `prompts/run-uat.md`, `templates/uat.md`
+- `src/loader.ts` / `src/cli.ts` — test-flows must be added to `GSD_BUNDLED_EXTENSION_PATHS` and `initResources()` file sync
+- Playwright — Direct import for browser automation (already a dependency of the project)
+- Node.js `fetch` — For API target HTTP requests (built into Node 18+)
+
+## Open Questions
+
+- **Verify block execution model** — Should each `verify` block be executed by deterministic code (parse expect clauses, run Playwright assertions) or by sending the block to the LLM as a sub-task? Deterministic execution is faster and cheaper but less adaptive. An LLM sub-task is more flexible but slower and non-deterministic. A hybrid approach (deterministic for simple assertions, LLM for complex "verify this looks right" blocks) may be the sweet spot. Needs a design decision during planning.
+- **YAML parsing** — Use `js-yaml` (which would have to be added as a dependency) or parse the simple format manually? The format is simple enough that a hand-rolled parser might suffice, and it would avoid a new dependency, in line with the "no new npm dependencies" constraint.
+- **Mac binary compilation timing** — If test-flows needs the mac-tools binary and it hasn't been compiled yet, should test-flows trigger compilation or just fail with a clear message? Triggering compilation would duplicate logic from mac-tools extension.
+- **Flow file discovery for UAT** — When `run_test_suite` is called for a slice's flows, should it discover files by convention (all `.yaml` in the `flows/` dir) or should the UAT file explicitly list which flows to run?