singularity-forge/scripts
Mikael Hugo d54f18c95f feat(rpc): orphan-recovery + 10-min graceful shutdown for safe container swap
Two related changes to make blue/green upgrades (per scripts/upgrade-vega-
source-server.mjs) safe for in-flight self-feedback writes.

1. Startup orphan recovery (feedback-queue-recovery.ts, extracted module).
   Scans .sf/runtime/ for sf-feedback-queue.jsonl.<pid>(.<sid>)?.draining
   files left by previous processes. For each:
     - if our own session id: leave alone (live drain)
     - if PID is alive: leave alone (foreign drainer)
     - else: rename back to queue (only if no active queue file exists)
   Crash safety: when both an orphan AND an active queue exist, we DEFER
   recovery rather than merge — appending then unlinking would risk
   duplicate replay on crash. The next restart's recovery picks it up
   once the queue is naturally drained. Supports legacy filenames
   (.<pid>.draining, pre-session-id) for backward compat.

   Added SF_DRAIN_SESSION_ID (per-process 6-byte hex) stamped into the
   .draining filename. PID reuse across container restarts is normally
   safe because /proc clears, but the session id is a stronger guarantee
   that we don't trample a foreign drainer that happens to land on the
   same PID.
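
The recovery rules in item 1 can be sketched as a filename parser plus a pure decision function. This is a minimal illustration, not the code in feedback-queue-recovery.ts: `parseDrainingName`, `isPidAlive`, `decideOrphanAction`, and the `OrphanAction` labels are hypothetical names, and the lowercase-hex pattern for the session id is an assumption based on the "6-byte hex" description.

```typescript
import { randomBytes } from "node:crypto";

// Per-process session id: 6 random bytes rendered as 12 hex characters.
const SF_DRAIN_SESSION_ID: string = randomBytes(6).toString("hex");

// Hypothetical labels for the four outcomes listed in the commit message.
type OrphanAction = "skip-own-session" | "skip-live-pid" | "defer" | "recover";

interface DrainingFile {
  pid: number;
  sessionId: string | undefined; // undefined for legacy .<pid>.draining names
}

// Matches sf-feedback-queue.jsonl.<pid>(.<sid>)?.draining
const DRAINING_RE = /^sf-feedback-queue\.jsonl\.(\d+)(?:\.([0-9a-f]+))?\.draining$/;

function parseDrainingName(name: string): DrainingFile | null {
  const m = DRAINING_RE.exec(name);
  if (m === null) return null; // malformed filenames are ignored
  return { pid: Number(m[1]), sessionId: m[2] };
}

function isPidAlive(pid: number): boolean {
  if (pid <= 0) return false;
  try {
    process.kill(pid, 0); // signal 0 probes for existence without signalling
    return true;
  } catch (err) {
    // EPERM: the process exists but is owned by another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

function decideOrphanAction(
  file: DrainingFile,
  ownSessionId: string,
  activeQueueExists: boolean,
  pidAlive: (pid: number) => boolean = isPidAlive,
): OrphanAction {
  if (file.sessionId !== undefined && file.sessionId === ownSessionId) {
    return "skip-own-session"; // our own live drain
  }
  if (pidAlive(file.pid)) return "skip-live-pid"; // foreign drainer
  if (activeQueueExists) return "defer"; // never merge into an active queue
  return "recover"; // safe to rename back to the queue file
}
```

Keeping the decision separate from the filesystem calls lets each case (own session, live PID, deferred, recovered) be checked without creating real files or processes.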

2. SIGTERM/SIGINT drain-then-exit handler (installGracefulShutdown).
   Drains the queue once on signal, then exits. Bounded by
   SF_RPC_SHUTDOWN_GRACE_MS (default 600_000 = 10 min). Rationale: if
   a drain is in flight, it MUST finish — losing self-feedback writes
   across a server upgrade is worse than a long wait. Normal drains
   complete in <1s; the 10-min ceiling is for pathological lock
   contention. Operators can override the ceiling via the env var, or
   bypass it entirely with docker kill / kubectl delete --force.

   Upgrader script bumped to docker stop --timeout 610 (10s safety
   margin past the grace period). k8s deployments must set
   terminationGracePeriodSeconds ≥ 610 for the rolling-update path.
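
The drain-then-exit behaviour in item 2 and the 610-second stop timeout can be sketched as follows. This is an illustration under stated assumptions, not the actual installGracefulShutdown: the signature taking a `drainOnce` callback, the helpers `resolveGraceMs` and `stopTimeoutSeconds`, the fallback to the default on unparseable values, and the exit codes are all hypothetical.

```typescript
const DEFAULT_GRACE_MS = 600_000; // 10 minutes, the default named above

// Assumption: missing, empty, or non-numeric values fall back to the default.
function resolveGraceMs(env: Record<string, string | undefined>): number {
  const raw = env["SF_RPC_SHUTDOWN_GRACE_MS"];
  const parsed = raw === undefined || raw.trim() === "" ? NaN : Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_GRACE_MS;
}

// Hypothetical shape: drain once on the first signal, exit when the drain
// settles, and force an exit if the grace period elapses first.
function installGracefulShutdown(
  drainOnce: () => Promise<void>,
  graceMs: number,
): void {
  let shuttingDown = false;
  const onSignal = (signal: string): void => {
    if (shuttingDown) return; // repeated signals do not restart the drain
    shuttingDown = true;
    const timer = setTimeout(() => {
      console.error(`drain exceeded ${graceMs}ms after ${signal}; exiting`);
      process.exit(1);
    }, graceMs);
    drainOnce()
      .then(
        () => 0,
        (err: unknown) => {
          console.error("drain failed during shutdown", err);
          return 1;
        },
      )
      .then((code) => {
        clearTimeout(timer);
        process.exit(code);
      });
  };
  process.on("SIGTERM", () => onSignal("SIGTERM"));
  process.on("SIGINT", () => onSignal("SIGINT"));
}

// Container stop timeout: grace period in whole seconds plus a safety margin.
function stopTimeoutSeconds(graceMs: number, marginSeconds = 10): number {
  return Math.ceil(graceMs / 1000) + marginSeconds;
}
```

With the default grace, `stopTimeoutSeconds(resolveGraceMs(process.env))` evaluates to 610, which matches both the docker stop timeout and the terminationGracePeriodSeconds floor. A deployment that overrides the env var would presumably need to adjust both values by the same arithmetic.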

Tests: rpc-mode-orphan-recovery.test.ts — 7 cases covering empty,
no-orphans, dead-PID single recovery, both-files-deferred (codex's
crash-safety fix), live-PID untouched, multiple-dead-PIDs, malformed-
filename ignored.

Refs sf-mpa5kdpu (drainer orphans never recovered), sf-mpa4g46x
(original RPC hang). Codex adversarial-reviewed; the PID-reuse hardening
and crash-safety deferral landed per its feedback. Open follow-ups:
make /api/healthz shutdown-aware so it returns 503 (codex point E), and
integrate with the existing forceShutdown ordering (codex point C).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:29:24 +02:00
tmp-check-test-imports sf snapshot: uncommitted changes after 246m inactivity 2026-05-17 08:28:04 +02:00
base64-scan.sh chore: purge bun from internal toolchain 2026-05-02 08:38:20 +02:00
build-web-if-stale.cjs feat: make sf server the operator entrypoint 2026-05-17 17:23:46 +02:00
bump-version.mjs refactor: rename pi-* packages to forge-native names (Phase 1) 2026-05-10 11:28:01 +02:00
check-circular-deps.mjs fix(lint): fix all pre-existing lint failures 2026-05-11 04:02:31 +02:00
check-protected-deletions.mjs fix: block extension declaration deletions 2026-05-05 18:28:07 +02:00
check-sf-extension-inventory.mjs feat: make sf server the operator entrypoint 2026-05-17 17:23:46 +02:00
check-skill-references.mjs chore: commit current workspace state 2026-05-05 14:46:18 +02:00
check-test-imports.mjs fix(scripts/check-test-imports): filter TS keywords + local declarations 2026-05-17 00:46:10 +02:00
check-test-imports.test.mjs fix(subagent-runner): drop spurious 10s STUCK warning on session.prompt 2026-05-16 23:49:43 +02:00
check-versioned-json.mjs sf snapshot: uncommitted changes after 43m inactivity 2026-05-05 21:39:56 +02:00
check-versioned-json.test.mjs sf snapshot: uncommitted changes after 43m inactivity 2026-05-05 21:39:56 +02:00
ci_monitor.cjs chore: commit current workspace state 2026-05-05 14:46:18 +02:00
ci_monitor.md feat: add GitHub Workflows skill with CI workflow and ci_monitor tool (#294) 2026-03-13 22:31:17 -06:00
copy-export-html.cjs refactor: rename review gate agent 2026-05-14 19:43:01 +02:00
copy-resources.cjs fix(auto): abort swarm checkpoint loops 2026-05-15 10:55:37 +02:00
copy-themes.cjs refactor: rename review gate agent 2026-05-14 19:43:01 +02:00
dev-cli.js chore: node 24 native APIs, import.meta.dirname, parsers rename, dep updates 2026-05-02 06:18:25 +02:00
dev-server.js feat(dev-server): auto-reload on SF extension + coding-agent + git upgrades 2026-05-17 21:03:49 +02:00
dev.js chore: commit current workspace state 2026-05-05 14:46:18 +02:00
docs-prompt-injection-scan.sh feat(ci): skip build/test for docs-only PRs and add prompt injection scan (#1699) 2026-03-21 08:39:03 -06:00
ensure-source-resources.cjs fix: auto-rebuild workspace packages when src is newer than dist 2026-05-17 17:19:25 +02:00
ensure-workspace-builds.cjs sf snapshot: uncommitted changes after 131m inactivity 2026-05-09 02:53:47 +02:00
generate-changelog.mjs chore: commit current workspace state 2026-05-05 14:46:18 +02:00
generate-features-inventory.mjs fix(lint): fix all pre-existing lint failures 2026-05-11 04:02:31 +02:00
generate-release-manifest.mjs feat: add source-mounted sf server self-deploy 2026-05-17 22:00:01 +02:00
install-hooks.mjs fix: block extension declaration deletions 2026-05-05 18:28:07 +02:00
install-hooks.sh refactor: update log prefixes and string values from gsd- to sf- namespace 2026-04-15 15:37:12 +02:00
install-pi-global.js sf snapshot: uncommitted changes after 131m inactivity 2026-05-09 02:53:47 +02:00
link-workspace-packages.cjs feat: replace launchd with systemd user-unit install path 2026-05-17 17:33:34 +02:00
model-smoke-benchmark.mjs refactor: rename pi-* packages to forge-native names (Phase 1) 2026-05-10 11:28:01 +02:00
parallel-monitor.mjs feat(sf): align uok task state and steering 2026-05-08 06:57:59 +02:00
postinstall.js feat: replace launchd with systemd user-unit install path 2026-05-17 17:33:34 +02:00
pr-risk-check.mjs sf snapshot: uncommitted changes after 43m inactivity 2026-05-05 21:39:56 +02:00
prepublish-check.mjs style: format repository with biome 2026-05-05 14:31:16 +02:00
preview-dashboard.ts refactor: rename pi-* packages to forge-native names (Phase 1) 2026-05-10 11:28:01 +02:00
recover-sf-1364.sh sf snapshot: uncommitted changes after 49m inactivity 2026-05-08 01:07:24 +02:00
require-tests.sh chore: sync workspace state after rebrand 2026-04-15 14:54:20 +02:00
rtk-benchmark.mjs chore: commit current workspace state 2026-05-05 14:46:18 +02:00
run-vega-source-server.mjs fix: use shared sf webserver project config 2026-05-17 22:09:28 +02:00
secret-scan.mjs feat(secret-scan): SF_SECURITY_FAST contract for the regex-only fast path 2026-05-14 07:57:02 +02:00
secret-scan.sh feat(secret-scan): SF_SECURITY_FAST contract for the regex-only fast path 2026-05-14 07:57:02 +02:00
stage-web-standalone.cjs style: format repository with biome 2026-05-05 14:31:16 +02:00
sync-pkg-version.cjs chore: commit current workspace state 2026-05-05 14:46:18 +02:00
test-replace.txt chore: sync workspace state after rebrand 2026-04-15 14:54:20 +02:00
test-reporter-compact.mjs style: format repository with biome 2026-05-05 14:31:16 +02:00
test-write.txt chore: sync workspace state after rebrand 2026-04-15 14:54:20 +02:00
uninstall-pi-global.js sf snapshot: uncommitted changes after 131m inactivity 2026-05-09 02:53:47 +02:00
update-changelog.mjs chore: commit current workspace state 2026-05-05 14:46:18 +02:00
upgrade-vega-source-server.mjs feat(rpc): orphan-recovery + 10-min graceful shutdown for safe container swap 2026-05-17 22:29:24 +02:00
validate-model-cost-table.mjs chore: formatter / linter touch-up (230 files) 2026-05-16 21:19:53 +02:00
validate-pack.js refactor(sf): separate daemon from server identity 2026-05-17 19:18:33 +02:00
validate-pack.sh refactor: update log prefixes and string values from gsd- to sf- namespace 2026-04-15 15:37:12 +02:00
verify-s03.sh refactor: rebrand gsd_ tool names and references to sf_ namespace 2026-04-15 15:51:38 +02:00
verify-s04.sh sf snapshot: pre-dispatch, uncommitted changes after 53m inactivity 2026-04-30 19:10:38 +02:00
version-stamp.mjs chore: commit current workspace state 2026-05-05 14:46:18 +02:00
watch-resources.js chore: commit current workspace state 2026-05-05 14:46:18 +02:00
with-env.mjs style: format repository with biome 2026-05-05 14:31:16 +02:00