singularity-forge/packages
Jeremy McSpadden 0e0f47ef9f fix: failure recovery & resume safeguards (all 4 waves) (#956)
* fix: prevent data loss on crash with atomic writes, file locking, and error handling

Wave 1 of failure recovery safeguards:

1. Atomic session file rewrites (tmp+rename) — _rewriteFile() and forkFrom()
   now use atomicWriteFileSync to prevent session file corruption on crash
2. Atomic auto.lock writes — crash-recovery.ts writeLock() uses tmp+rename
   so the crash detection system itself can't be corrupted
3. unhandledRejection handler — catches silent process death from unhandled
   promise rejections in OAuth, extensions, LSP, or MCP connections
4. try/catch in emitToolCall — matches pattern used by emitUserBash,
   emitContext, and emitToolResult to prevent extension handler crashes
   from killing the entire agent turn
5. File locking on session appends — prevents concurrent pi instances from
   interleaving partial JSON lines in session JSONL files using the same
   proper-lockfile pattern established in auth-storage.ts and settings-manager.ts

* fix: add OAuth timeouts, RPC exit detection, and command context guards

Wave 2 of failure recovery safeguards:

1. OAuth fetch timeouts — all fetch() calls across all OAuth providers
   (Anthropic, OpenAI Codex, Google Antigravity, Google Gemini CLI,
   GitHub Copilot) now have 30-second AbortSignal.timeout() to prevent
   indefinite hangs when OAuth servers are unresponsive
2. RPC subprocess exit detection — pending requests are now rejected
   when the agent subprocess exits unexpectedly, preventing indefinite
   hangs in the RPC client
3. Extension command context guards — default handlers for newSession,
   fork, navigateTree, switchSession, and reload now throw explicit
   errors instead of silently returning success when called before
   bindCommandContext()
4. OAuth error detail preservation — token refresh errors now preserve
   the original error as `cause` for better diagnostics

* fix: resource cleanup, LSP retry, and crash detection on session resume

Wave 3 of failure recovery safeguards:

1. Atomic completed-units.json cleanup — milestone completion writes
   now use tmp+rename pattern for consistency with auto-recovery.ts
2. Bash temp file cleanup — track temp files created for large output
   and register a process exit handler to clean them up
3. Settings write queue flush on shutdown — call settingsManager.flush()
   during interactive mode shutdown so queued writes aren't lost
4. LSP initialization retry — wrap getOrCreateClient with up to 2 retries
   with exponential backoff (1s, 2s) for transient spawn failures
5. Crash detection on session resume — wasInterrupted() checks if last
   assistant turn had tool calls without results, shows warning on resume

* fix: blob garbage collection and LSP debug logging

Wave 4 of failure recovery safeguards:

1. Blob garbage collection — BlobStore.gc(referencedHashes) removes
   orphaned blobs not referenced by any session file, plus totalSize()
   for monitoring blob directory growth
2. LSP JSON parse error logging — malformed LSP messages are now logged
   at debug level (when DEBUG env is set) instead of being silently dropped
2026-03-17 16:03:49 -06:00
..
native perf: optimize discovery and interactive hot paths 2026-03-14 16:03:44 -05:00
pi-agent-core feat: render native web search in TUI + PREFER_BRAVE_SEARCH toggle (#806) 2026-03-16 23:35:20 -06:00
pi-ai fix: failure recovery & resume safeguards (all 4 waves) (#956) 2026-03-17 16:03:49 -06:00
pi-coding-agent fix: failure recovery & resume safeguards (all 4 waves) (#956) 2026-03-17 16:03:49 -06:00
pi-tui fix: single ENTER submits slash command argument autocomplete (#944) (#953) 2026-03-17 15:28:40 -06:00