New /infra/srv/ tree: production-style Docker Compose setup that puts Traefik
in front of sf-server. Closes the orchestration gaps the bare-docker
upgrader (scripts/upgrade-vega-source-server.mjs) couldn't address:
1. Health-check-driven drain. Traefik polls /api/healthz every 2s.
The moment SF receives SIGTERM, src/web/shutdown-state.ts flips
the in-process flag and the route returns 503 (landed in
f8e53840d). ~4s later Traefik removes the replica from the pool:
new traffic stops while in-flight requests finish.
2. Sticky sessions via the `sf-aff` cookie. /api/session/events SSE
streams (and any other long-lived per-replica state) survive
client reconnects within the upgrade window because Traefik
pins the cookie to the same replica until that replica is gone.
3. Blue/green via the `sf-candidate` service. Guarded by the Docker
Compose `candidate` profile, so production traffic keeps flowing to
`sf` until the operator promotes. The image swap is then atomic from
the client's perspective: the old replica returns 503 and the new
replica picks up traffic before the old container actually stops.
4. stop_grace_period: 610s, matching SF_RPC_SHUTDOWN_GRACE_MS=600000
(600s) with 10s of headroom. If a self-feedback queue drain is in
flight when SIGTERM lands, it MUST finish: losing writes across an
upgrade is worse than the wait. The operator can hard-bypass the
wait with `docker kill`; the .draining file is then recovered on
the next start by feedback-queue-recovery's startup scan.
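
For reviewers, a minimal sketch of the drain contract described in items 1
and 4. This is illustrative only, not the code in src/web/shutdown-state.ts:
the names createShutdownState, healthzStatus, and graceHeadroomSeconds are
hypothetical, and the real module may be shaped differently.

```typescript
// Hypothetical sketch of an in-process shutdown flag; not the real module.
interface ShutdownState {
  isShuttingDown(): boolean;
  markShuttingDown(): void;
}

// Factory keeps the flag private so callers can only read it or set it once.
function createShutdownState(): ShutdownState {
  let shuttingDown = false;
  return {
    isShuttingDown: () => shuttingDown,
    markShuttingDown: () => {
      shuttingDown = true;
    },
  };
}

// Status the /api/healthz route would report: 200 while serving,
// 503 once the flag is set, so the load balancer stops routing here.
function healthzStatus(state: ShutdownState): 200 | 503 {
  return state.isShuttingDown() ? 503 : 200;
}

// Headroom (in seconds) between the container stop grace period and the
// in-app shutdown grace window, which is configured in milliseconds.
function graceHeadroomSeconds(
  stopGracePeriodSeconds: number,
  shutdownGraceMs: number,
): number {
  return stopGracePeriodSeconds - shutdownGraceMs / 1000;
}
```

In a Node process the flag would typically be set from a
process.on("SIGTERM", ...) handler. With the values in this commit,
graceHeadroomSeconds(610, 600000) leaves the in-app drain 10s to finish
before Docker escalates to SIGKILL.
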
infra/srv/README.md documents the runbook: bring-up, upgrade flow,
env vars, TLS notes, and what this does NOT replace (the existing
Dockerfile, k8s/Forgejo CI flow, and the source-server upgrader).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>