singularity-forge/infra
Mikael Hugo d23b998194 feat(infra/srv): Traefik fronting for zero-downtime sf-server upgrades
New /infra/srv/ tree: a production-style Docker Compose setup that puts
Traefik in front of sf-server. Closes the orchestration gaps the
bare-docker upgrader (scripts/upgrade-vega-source-server.mjs) couldn't
address:

  1. Health-check-driven drain. Traefik polls /api/healthz every 2s.
     The moment SF receives SIGTERM, src/web/shutdown-state.ts flips
     the in-process flag and the route returns 503 (landed in
     f8e53840d). ~4s later Traefik removes the replica from the pool
     — new traffic stops, in-flight requests finish.

  2. Sticky sessions via the `sf-aff` cookie. /api/session/events SSE
     streams (and any other long-lived per-replica state) survive
     client reconnects within the upgrade window because the cookie
     pins each client to the same replica until that replica is gone.

  3. Blue/green via the `sf-candidate` service. Guarded by the Docker
     Compose `candidate` profile so production traffic keeps flowing
     to `sf` until the operator promotes. Image swap is then atomic
     from a client perspective: the old replica goes 503 and the new
     replica picks up traffic before the old container actually stops.

  4. stop_grace_period: 610s, matching SF_RPC_SHUTDOWN_GRACE_MS=600000
     (600s) plus 10s of headroom. If a self-feedback queue drain is in
     flight when SIGTERM lands, it MUST finish: losing writes across
     an upgrade is worse than the wait. The operator can still
     hard-bypass with `docker kill`; the .draining file is then
     recovered on the next start by feedback-queue-recovery's startup
     scan.
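
A minimal sketch of the flag-and-503 behaviour described in item 1, for
illustration only. The function names (markShuttingDown, isShuttingDown,
healthzResponse) and the response bodies are hypothetical; the actual
contents of src/web/shutdown-state.ts are not shown in this message.

```typescript
// Hypothetical sketch: an in-process shutdown flag plus the health
// response derived from it. Names are assumptions, not the real API.
let shuttingDown = false;

/** Flip the in-process flag; meant to be called from a SIGTERM handler. */
function markShuttingDown(): void {
  shuttingDown = true;
}

/** Report whether shutdown has started. */
function isShuttingDown(): boolean {
  return shuttingDown;
}

/** Status and body that a /api/healthz route could return. */
function healthzResponse(): { status: number; body: string } {
  return shuttingDown
    ? { status: 503, body: "draining" } // poller should drop this replica
    : { status: 200, body: "ok" };
}

// Wiring, assuming a Node.js runtime:
//   process.once("SIGTERM", markShuttingDown);
```

In this sketch the flag never resets, so every health poll after SIGTERM
gets a 503 until the process exits.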

infra/srv/README.md documents the runbook: bring-up, upgrade flow,
env vars, TLS notes, and what this does NOT replace (the existing
Dockerfile, k8s/Forgejo CI flow, and the source-server upgrader).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:38:29 +02:00
srv feat(infra/srv): Traefik fronting for zero-downtime sf-server upgrades 2026-05-17 22:38:29 +02:00