docs: update documentation for v2.27.0 release (#959)
README.md: - Added provider error recovery section (item 5) covering transient vs permanent error classification and auto-resume behavior - Updated crash recovery to mention headless auto-restart with backoff - Fixed numbered list (was 1-11 with duplicate 7, now 1-12 sequential) docs/parallel-orchestration.md: - Updated worker crash recovery section to reflect v2.27 persistent state — workers now survive crashes via disk state recovery
This commit is contained in:
parent
d2a9ee6024
commit
87122e0b7a
2 changed files with 12 additions and 8 deletions
18
README.md
18
README.md
|
|
@ -141,21 +141,23 @@ Auto mode is a state machine driven by files on disk. It reads `.gsd/STATE.md`,
|
|||
|
||||
3. **Git worktree isolation** — Each milestone runs in its own git worktree with a `milestone/<MID>` branch. All slice work commits sequentially — no branch switching, no merge conflicts. When the milestone completes, it's squash-merged to main as one clean commit.
|
||||
|
||||
4. **Crash recovery** — A lock file tracks the current unit. If the session dies, the next `/gsd auto` reads the surviving session file, synthesizes a recovery briefing from every tool call that made it to disk, and resumes with full context. Parallel orchestrator state is persisted to disk with PID liveness detection, so multi-worker sessions survive crashes too.
|
||||
4. **Crash recovery** — A lock file tracks the current unit. If the session dies, the next `/gsd auto` reads the surviving session file, synthesizes a recovery briefing from every tool call that made it to disk, and resumes with full context. Parallel orchestrator state is persisted to disk with PID liveness detection, so multi-worker sessions survive crashes too. In headless mode, crashes trigger automatic restart with exponential backoff (default 3 attempts).
|
||||
|
||||
5. **Stuck detection** — If the same unit dispatches twice (the LLM didn't produce the expected artifact), it retries once with a deep diagnostic. If it fails again, auto mode stops with the exact file it expected.
|
||||
5. **Provider error recovery** — Transient provider errors (rate limits, 500/503 server errors, overloaded) auto-resume after a delay. Permanent errors (auth, billing) pause for manual review. The model fallback chain retries transient network errors before switching models.
|
||||
|
||||
6. **Timeout supervision** — Soft timeout warns the LLM to wrap up. Idle watchdog detects stalls. Hard timeout pauses auto mode. Recovery steering nudges the LLM to finish durable output before giving up.
|
||||
6. **Stuck detection** — If the same unit dispatches twice (the LLM didn't produce the expected artifact), it retries once with a deep diagnostic. If it fails again, auto mode stops with the exact file it expected.
|
||||
|
||||
7. **Cost tracking** — Every unit's token usage and cost is captured, broken down by phase, slice, and model. The dashboard shows running totals and projections. Budget ceilings can pause auto mode before overspending.
|
||||
7. **Timeout supervision** — Soft timeout warns the LLM to wrap up. Idle watchdog detects stalls. Hard timeout pauses auto mode. Recovery steering nudges the LLM to finish durable output before giving up.
|
||||
|
||||
8. **Adaptive replanning** — After each slice completes, the roadmap is reassessed. If the work revealed new information that changes the plan, slices are reordered, added, or removed before continuing.
|
||||
8. **Cost tracking** — Every unit's token usage and cost is captured, broken down by phase, slice, and model. The dashboard shows running totals and projections. Budget ceilings can pause auto mode before overspending.
|
||||
|
||||
9. **Verification enforcement** — Configure shell commands (`npm run lint`, `npm run test`, etc.) that run automatically after task execution. Failures trigger auto-fix retries before advancing. Configurable via `verification_commands`, `verification_auto_fix`, and `verification_max_retries` preferences.
|
||||
9. **Adaptive replanning** — After each slice completes, the roadmap is reassessed. If the work revealed new information that changes the plan, slices are reordered, added, or removed before continuing.
|
||||
|
||||
10. **Milestone validation** — After all slices complete, a `validate-milestone` gate compares roadmap success criteria against actual results before sealing the milestone.
|
||||
10. **Verification enforcement** — Configure shell commands (`npm run lint`, `npm run test`, etc.) that run automatically after task execution. Failures trigger auto-fix retries before advancing. Configurable via `verification_commands`, `verification_auto_fix`, and `verification_max_retries` preferences.
|
||||
|
||||
11. **Escape hatch** — Press Escape to pause. The conversation is preserved. Interact with the agent, inspect what happened, or just `/gsd auto` to resume from disk state.
|
||||
11. **Milestone validation** — After all slices complete, a `validate-milestone` gate compares roadmap success criteria against actual results before sealing the milestone.
|
||||
|
||||
12. **Escape hatch** — Press Escape to pause. The conversation is preserved. Interact with the agent, inspect what happened, or just `/gsd auto` to resume from disk state.
|
||||
|
||||
### `/gsd` and `/gsd next` — Step Mode
|
||||
|
||||
|
|
|
|||
|
|
@ -292,6 +292,8 @@ All milestones are either complete or blocked by dependencies. Check `/gsd queue
|
|||
|
||||
### Worker crashed — how to recover
|
||||
|
||||
Workers now persist their state to disk automatically. If a worker process dies, the coordinator detects the dead PID via heartbeat expiry and marks the worker as crashed. On restart, the worker picks up from disk state — crash recovery, worktree re-entry, and completed-unit tracking carry over from the crashed session.
|
||||
|
||||
1. Run `/gsd doctor --fix` to clean up stale sessions
|
||||
2. Run `/gsd parallel status` to see current state
|
||||
3. Re-run `/gsd parallel start` to spawn new workers for remaining milestones
|
||||
|
|
|
|||
Loading…
Add table
Reference in a new issue