singularity-forge/infra/srv
Mikael Hugo d23b998194 feat(infra/srv): Traefik fronting for zero-downtime sf-server upgrades
New /infra/srv/ tree: production-style Docker compose that puts Traefik
in front of sf-server. Closes the orchestration gaps the bare-docker
upgrader (scripts/upgrade-vega-source-server.mjs) couldn't address:

  1. Health-check-driven drain. Traefik polls /api/healthz every 2s.
     The moment SF receives SIGTERM, src/web/shutdown-state.ts flips
     the in-process flag and the route returns 503 (landed in
     f8e53840d). ~4s later Traefik removes the replica from the pool
     — new traffic stops, in-flight requests finish.

  2. Sticky sessions via the `sf-aff` cookie. /api/session/events SSE
     streams (and any other long-lived per-replica state) survive
     client reconnects within the upgrade window because Traefik
     pins the cookie to the same replica until that replica is gone.

  3. Blue/green via the `sf-candidate` service. Guarded by Docker
     compose profile=candidate so production traffic keeps flowing to
     `sf` until the operator promotes. Image swap is then atomic from
     a client perspective — old replica goes 503, new replica picks
     up traffic before old container actually stops.

  4. stop_grace_period: 610s matching SF_RPC_SHUTDOWN_GRACE_MS=600000.
     If a self-feedback queue drain is in flight when SIGTERM lands,
     it MUST finish. Losing writes across an upgrade is worse than the
     wait. Hard-bypass via `docker kill` if the operator chooses; the
     .draining file then gets recovered on the next start via
     feedback-queue-recovery's startup scan.

infra/srv/README.md documents the runbook: bring-up, upgrade flow,
env vars, TLS notes, and what this does NOT replace (the existing
Dockerfile, k8s/Forgejo CI flow, and the source-server upgrader).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:38:29 +02:00

SF server infra: Traefik + sf-server with zero-downtime upgrades

Production deployment of sf-server behind a Traefik reverse proxy. Closes the orchestration gaps in the bare-docker upgrader (scripts/upgrade-vega-source-server.mjs) by adding:

  • Health-check-driven traffic drain. Traefik polls /api/healthz every 2s. The moment SF receives SIGTERM, src/web/shutdown-state.ts flips the flag and the route returns 503. After ~4s Traefik removes the container from the load-balancer pool.
  • Cookie-based sticky sessions. /api/session/events SSE streams survive client reconnects within an upgrade window because Traefik routes the same sf-aff cookie to the same replica until that replica is gone.
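  • Blue/green via a candidate service. The sf-candidate service runs alongside sf under a separate Traefik service. The operator shifts traffic to the candidate, the old container drains, and the old container is then removed. Gradual weight-based shifting needs Traefik's weighted round robin, which the docker-labels-only setup does not provide (see step 4 of the upgrade flow).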
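
The drain described in the first bullet can be watched from the outside by polling the health route until it flips to 503. The following is a minimal sketch, assuming curl is installed; `wait_for_status` and its arguments are hypothetical names chosen for illustration and are not part of the repository.

```shell
# Hypothetical helper (not part of the repo): poll a URL until it returns
# the expected HTTP status code.
# Usage: wait_for_status URL EXPECTED [MAX_ATTEMPTS] [SLEEP_SECONDS] [HOST_HEADER]
wait_for_status() {
  wfs_url=$1
  wfs_expected=$2
  wfs_max=${3:-30}
  wfs_sleep=${4:-1}
  wfs_host=${5:-}
  wfs_code=""
  wfs_i=0
  # Build the curl arguments once: silent, discard body, print only the status.
  set -- -s -o /dev/null -w '%{http_code}'
  if [ -n "$wfs_host" ]; then
    set -- "$@" -H "Host: $wfs_host"
  fi
  while [ "$wfs_i" -lt "$wfs_max" ]; do
    # A connection failure is treated like any other non-matching status.
    wfs_code=$(curl "$@" "$wfs_url" || true)
    if [ "$wfs_code" = "$wfs_expected" ]; then
      echo "status $wfs_expected after $wfs_i retries"
      return 0
    fi
    wfs_i=$((wfs_i + 1))
    sleep "$wfs_sleep"
  done
  echo "gave up after $wfs_max attempts (last status: $wfs_code)" >&2
  return 1
}
```

For example, `wait_for_status http://localhost/api/healthz 503 30 1 "$SF_HOSTNAME"` run right after sending SIGTERM should return within a few polls. Through Traefik, a 503 can also mean that no healthy replica is left in the pool, so pointing the helper at the replica's own port gives the cleaner signal.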

Files

| File | Purpose |
|---|---|
| docker-compose.yaml | Traefik + sf + sf-candidate services with full label set |
| (this README) | Operator runbook |

Quick start (local dev / single-host prod)

# 1. Set required env (see `Environment` below)
export SF_IMAGE=ghcr.io/singularity-ng/sf-server:$(git rev-parse HEAD)
export SF_HOSTNAME=sf.localhost           # or your real hostname
export SF_WORKSPACE_DIR=/var/lib/sf/workspace

# 2. Bring everything up
docker compose -f infra/srv/docker-compose.yaml up -d

# 3. Sanity check
curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/healthz
curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/ready
curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/version

Zero-downtime upgrade

# 1. Pull the new image (this step assumes it has already been built and pushed)
export SF_CANDIDATE_IMAGE=ghcr.io/singularity-ng/sf-server:$(git rev-parse HEAD)
docker pull "${SF_CANDIDATE_IMAGE}"

# 2. Bring up the candidate (profile=candidate gates it off by default)
docker compose -f infra/srv/docker-compose.yaml --profile candidate up -d sf-candidate

# 3. Verify candidate health BEFORE flipping traffic
docker exec sf-server-candidate curl -fsS http://localhost:4000/api/healthz
docker exec sf-server-candidate curl -fsS http://localhost:4000/api/ready

# 4. Flip Traefik to send traffic to the candidate by promoting it to the
#    primary service. The cleanest path is to relabel the candidate's
#    routers to match `sf`'s rule, OR use a Traefik weighted round robin
#    service (see https://doc.traefik.io/traefik/routing/services/#weighted-round-robin;
#    this requires the file provider for dynamic configuration, NOT the
#    docker-labels-only path).
#    For now: stop the old `sf`; step 5 restarts it with the candidate's image.
docker compose -f infra/srv/docker-compose.yaml stop sf
# Traefik now has only the candidate in its pool → traffic flows there.

# 5. Replace `sf` with the new image and start it
SF_IMAGE=${SF_CANDIDATE_IMAGE} \
  docker compose -f infra/srv/docker-compose.yaml up -d sf

# 6. Traefik picks up the new `sf` automatically (via docker label
#    discovery); both services exist for ~2-4s while health-checks
#    converge, then `sf-candidate` can be retired.
docker compose -f infra/srv/docker-compose.yaml --profile candidate down sf-candidate
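
Steps 3 and 4 can be tied together so that `sf` is only stopped once the candidate has passed both checks. The sketch below assumes the container name `sf-server-candidate` and port 4000 from step 3; `candidate_ok` and `promote_candidate` are hypothetical helper names, not scripts shipped with this compose.

```shell
# Hypothetical gate between step 3 and step 4: refuse to stop `sf` unless the
# candidate answers /api/healthz and /api/ready from inside its container.
candidate_ok() {
  co_container=${1:-sf-server-candidate}
  for co_path in /api/healthz /api/ready; do
    # curl -f turns HTTP errors (such as 503) into a non-zero exit status.
    if ! docker exec "$co_container" curl -fsS "http://localhost:4000$co_path" >/dev/null; then
      echo "candidate failed $co_path; leaving sf untouched" >&2
      return 1
    fi
  done
  return 0
}

# Stop the primary only after the candidate has passed both checks.
promote_candidate() {
  pc_compose=${1:-infra/srv/docker-compose.yaml}
  candidate_ok || return 1
  docker compose -f "$pc_compose" stop sf
}
```

Running `promote_candidate` in place of the bare `stop sf` command keeps a failed candidate from ever taking production traffic; steps 5 and 6 then proceed unchanged.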

Environment

| Variable | Default | Purpose |
|---|---|---|
| SF_IMAGE | ghcr.io/singularity-ng/sf-server:latest | Primary container image |
| SF_CANDIDATE_IMAGE | ghcr.io/singularity-ng/sf-server:candidate | Blue/green candidate image |
| SF_HOSTNAME | sf.localhost | Public hostname Traefik routes by |
| SF_WORKSPACE_DIR | ./workspace | Bind-mounted to /workspace inside SF |
| SF_TRAEFIK_HTTP_PORT | 80 | Host port for Traefik HTTP entrypoint |
| SF_TRAEFIK_HTTPS_PORT | 443 | Host port for Traefik HTTPS entrypoint |
| SF_RPC_SHUTDOWN_GRACE_MS | 600000 | SF graceful-shutdown drain budget (10 min default). Matches docker-compose.yaml's stop_grace_period: 610s. The operator can shorten it via env for fast iteration. |
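
Docker sends SIGKILL once stop_grace_period expires, so that period has to outlast the drain budget. When SF_RPC_SHUTDOWN_GRACE_MS is changed, the relationship can be checked with a little arithmetic. This is a minimal sketch; `grace_covers_drain` is a hypothetical helper name.

```shell
# Hypothetical check: stop_grace_period (seconds) must exceed the SF drain
# budget (milliseconds), otherwise Docker may kill SF in the middle of a drain.
# Usage: grace_covers_drain STOP_GRACE_SECONDS [DRAIN_BUDGET_MS]
grace_covers_drain() {
  gcd_stop_s=$1
  gcd_drain_ms=${2:-${SF_RPC_SHUTDOWN_GRACE_MS:-600000}}
  # Round the drain budget up to whole seconds before comparing.
  gcd_drain_s=$(( (gcd_drain_ms + 999) / 1000 ))
  [ "$gcd_stop_s" -gt "$gcd_drain_s" ]
}
```

With the defaults, `grace_covers_drain 610` succeeds because 610s is greater than 600s. Raising the budget above 610000 ms would make it fail until stop_grace_period is raised as well.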

Why a 10-min stop_grace_period?

If a self-feedback queue drain is in flight when SIGTERM lands, it MUST finish before exit. Losing operator/agent feedback writes across an upgrade silently corrupts the queue invariant. The 10-min ceiling handles pathological lock contention; normal drains finish in <1s.

An operator can bypass the wait with docker kill sf-server (which sends SIGKILL and cuts the drain short), but that strands .draining files on the sf-state volume. The next container's startup recovers them (see recoverOrphanedFeedbackDrains in packages/coding-agent/src/modes/rpc/feedback-queue-recovery.ts).
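
After a forced kill it can be useful to see what was left behind before the next start. The sketch below assumes the sf-state volume is reachable at some directory on the host or in a helper container, and that stranded drains are files whose names end in `.draining`; both the path and the helper name `list_stranded_drains` are assumptions made for illustration.

```shell
# Hypothetical inspection helper: list files ending in .draining under a
# directory where the sf-state volume is assumed to be mounted.
# Usage: list_stranded_drains STATE_DIR
list_stranded_drains() {
  lsd_dir=$1
  # Sorted output keeps the listing stable between runs.
  find "$lsd_dir" -type f -name '*.draining' | sort
}
```

An empty listing after the next container has started suggests the startup scan picked everything up. This helper only reads the directory and does not attempt any recovery itself.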

TLS / ACME

This compose intentionally exposes only HTTP for localhost demos. For real deployments, add Traefik command flags for the ACME resolver:

command:
  - "--certificatesresolvers.letsencrypt.acme.email=ops@example.com"
  - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
  - "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"
  - "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"

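…and add the per-router label traefik.http.routers.sf.tls.certresolver=letsencrypt to each router that should serve TLS. The /letsencrypt path should sit on a persistent volume so issued certificates survive Traefik restarts.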

What this does NOT replace

  • Dockerfile.sf-server — the SF container build is unchanged. This compose consumes the image, not the source.
  • .forgejo/workflows/self-deploy.yml — CI builds, pushes, and rolls k8s deployments. Forgejo's blue-green path uses kubectl rollout, not docker compose. The labels/strategy here are designed to mirror k8s readinessProbe + sessionAffinity for parity.
  • The scripts/upgrade-vega-source-server.mjs script — that script manages the source-server local-dev variant directly via docker run. This compose is for the production-style deployment.