Three fixes addressing Codex's adversarial review of the earlier orphan-
recovery / graceful-shutdown landing:
(1) Codex point B — single shutdown path. Removed the parallel
installGracefulShutdown() handler in rpc-mode.ts that was adding
a second SIGTERM listener and racing forceShutdown()'s teardown.
The drain is now the FIRST step inside forceShutdown() (before
killTrackedDetachedChildren, extension session_shutdown, etc.),
so DB writes complete cleanly while child processes are still
alive to flush. With one handler, the drain can no longer race
the existing shutdown ordering.
(2) Codex point D — recovery-before-each-drain. Cloud-volume mtime
visibility can lag between containers, so an orphan `.draining`
file from a previous container may be missing from the startup
scan and appear moments later. drainQueuedSfFeedbackCommands()
now runs recoverOrphanedFeedbackDrains() as its first step, so
each dispatch's drain sees the latest filesystem state.
(3) Codex point E — healthz returns 503 during shutdown. New module
src/web/shutdown-state.ts holds a per-process flag, auto-registers
SIGTERM/SIGINT/SIGHUP handlers on first read, and exposes a
snapshot (signal, startedAt, elapsedMs) for diagnostics. The
healthz route imports isShuttingDown() and returns 503 when set,
so k8s readinessProbe / Forgejo blue-green probes drain traffic
BEFORE we actually stop responding.
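A minimal sketch of the flag logic described in (3), not the actual
contents of src/web/shutdown-state.ts. markShuttingDown, shutdownSnapshot,
installSignalHandlers, healthzStatus and the SignalSource interface are
hypothetical names chosen for illustration; only isShuttingDown() and the
snapshot fields (signal, startedAt, elapsedMs) come from this commit. For
testability the sketch takes the signal source as a parameter instead of
auto-registering on first read:

```typescript
// Hypothetical sketch; names other than isShuttingDown() are assumptions.
type ShutdownSignal = "SIGTERM" | "SIGINT" | "SIGHUP";

interface ShutdownSnapshot {
  shuttingDown: boolean;
  signal: ShutdownSignal | null; // null when marked manually
  startedAt: number | null;      // epoch ms of the first mark
  elapsedMs: number | null;      // time since the first mark
}

// Minimal shape of an event source such as Node's `process`.
interface SignalSource {
  once(event: string, listener: () => void): unknown;
}

// Per-process state: set once, never cleared.
let startedAt: number | null = null;
let signal: ShutdownSignal | null = null;

function markShuttingDown(sig: ShutdownSignal | null = null): void {
  if (startedAt !== null) return; // idempotent: the first mark wins
  startedAt = Date.now();
  signal = sig;
}

function isShuttingDown(): boolean {
  return startedAt !== null;
}

function shutdownSnapshot(now: number = Date.now()): ShutdownSnapshot {
  return {
    shuttingDown: startedAt !== null,
    signal,
    startedAt,
    elapsedMs: startedAt === null ? null : now - startedAt,
  };
}

// Flip the flag as soon as any of the three signals arrives.
function installSignalHandlers(source: SignalSource): void {
  const signals: ShutdownSignal[] = ["SIGTERM", "SIGINT", "SIGHUP"];
  for (const sig of signals) {
    source.once(sig, () => markShuttingDown(sig));
  }
}

// Status code a healthz route could return: 503 once shutdown has begun.
function healthzStatus(): number {
  return isShuttingDown() ? 503 : 200;
}
```

In Node, installSignalHandlers(process) would wire the real signals,
since process.once() matches the SignalSource shape.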
Tests:
- rpc-mode-orphan-recovery.test.ts: 8/8 still green
- web-shutdown-state.test.ts: 5/5 new — default false, mark sets
flag, idempotent, signal exposed via snapshot, null signal for
manual mark
Deferred to a follow-up commit (Codex did not flag this, but it is
noted for completeness): a SIGTERM-drain child-process integration
test that spawns rpc-mode and sends a real signal. The 5 unit tests
cover the flag logic; the integration test would cover the full
process tree and is bulkier than the current commit warrants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>