SF server infra: Traefik + sf-server with zero-downtime upgrades
Production deployment of sf-server behind a Traefik reverse-proxy. Closes
the orchestration gaps in the bare-docker upgrader (`scripts/upgrade-vega-source-server.mjs`) by adding:
- Health-check-driven traffic drain. Traefik polls `/api/healthz` every 2s. The moment SF receives SIGTERM, `src/web/shutdown-state.ts` flips the flag and the route returns 503. After ~4s Traefik removes the container from the load-balancer pool.
- Cookie-based sticky sessions. `/api/session/events` SSE streams survive client reconnects within an upgrade window because Traefik routes the same `sf-aff` cookie to the same replica until that replica is gone.
- Blue/green via weighted services. The `sf-candidate` service runs alongside `sf` under a separate Traefik service. Operator flips weights to roll traffic gradually; old container drains; old removed.
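To watch the drain described in the first bullet from the outside, one option is to poll the health route while a container is being stopped. The following is a minimal sketch, not part of the repository: the helper names `http_status` and `watch_health` are hypothetical, and it assumes Traefik is reachable on the local HTTP port and that `SF_HOSTNAME` is the hostname Traefik routes by.

```shell
# Hypothetical helpers (not part of the repo) for observing the health route.

# Print the HTTP status code for a URL; curl prints 000 if the request fails.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Host: ${SF_HOSTNAME:-sf.localhost}" "$1" || true
}

# Poll a URL COUNT times, INTERVAL seconds apart, printing one line per poll.
watch_health() {
  local url="$1" count="${2:-10}" interval="${3:-2}" i=1
  while [ "$i" -le "$count" ]; do
    echo "poll=$i status=$(http_status "$url")"
    i=$((i + 1))
    sleep "$interval"
  done
}
```

For example, `watch_health http://localhost/api/healthz 15 1` in a second terminal while a container is stopping. Polled through Traefik, the status should stay at 200 as long as another healthy replica is in the pool; a run of non-200 codes would suggest that no healthy backend is left.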
Files
| File | Purpose |
|---|---|
| `docker-compose.yaml` | Traefik + sf + sf-candidate services with full label set |
| (this README) | Operator runbook |
Quick start (local dev / single-host prod)
# 1. Set required env (see `Environment` below)
export SF_IMAGE=ghcr.io/singularity-ng/sf-server:$(git rev-parse HEAD)
export SF_HOSTNAME=sf.localhost # or your real hostname
export SF_WORKSPACE_DIR=/var/lib/sf/workspace
# 2. Bring everything up
docker compose -f infra/srv/docker-compose.yaml up -d
# 3. Sanity check
curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/healthz
curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/ready
curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/version
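Because every variable has a default (see the Environment table), a forgotten export does not fail loudly; the deployment simply falls back to the default, such as the `latest` image tag. A small preflight check can catch that before `docker compose up`. This is a sketch only: `missing_env` and `require_env` are hypothetical helper names, not part of the repository.

```shell
# Hypothetical preflight helpers (not part of the repo).

# Print the name of every listed variable that is unset or empty.
missing_env() {
  local name value
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Return non-zero, listing the gaps on stderr, if any listed variable is missing.
require_env() {
  local missing
  missing="$(missing_env "$@")"
  [ -z "$missing" ] && return 0
  echo "missing required env: $(echo "$missing" | tr '\n' ' ')" >&2
  return 1
}
```

A possible use is `require_env SF_IMAGE SF_HOSTNAME SF_WORKSPACE_DIR && docker compose -f infra/srv/docker-compose.yaml up -d`.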
Zero-downtime upgrade
# 1. Pull the new image (it must already be built and pushed for this commit)
export SF_CANDIDATE_IMAGE=ghcr.io/singularity-ng/sf-server:$(git rev-parse HEAD)
docker pull "${SF_CANDIDATE_IMAGE}"
# 2. Bring up the candidate (profile=candidate gates it off by default)
docker compose -f infra/srv/docker-compose.yaml --profile candidate up -d sf-candidate
# 3. Verify candidate health BEFORE flipping traffic
docker exec sf-server-candidate curl -fsS http://localhost:4000/api/healthz
docker exec sf-server-candidate curl -fsS http://localhost:4000/api/ready
# 4. Flip Traefik to send traffic to the candidate by promoting it to the
# primary service. The cleanest path is to relabel the candidate's
# routers to match `sf`'s rule, OR use a Traefik weighted round robin
# service (see https://doc.traefik.io/traefik/routing/services/#weighted-round-robin;
# it requires the dynamic-config provider, NOT the docker-labels-only path).
# For now: stop the old, start it as the new with candidate's image.
docker compose -f infra/srv/docker-compose.yaml stop sf
# Traefik now has only the candidate in its pool → traffic flows there.
# 5. Replace `sf` with the new image and start it
SF_IMAGE=${SF_CANDIDATE_IMAGE} \
docker compose -f infra/srv/docker-compose.yaml up -d sf
# 6. Traefik picks up the new `sf` automatically (via docker label
# discovery); both services exist for ~2-4s while health-checks
# converge, then `sf-candidate` can be retired.
docker compose -f infra/srv/docker-compose.yaml --profile candidate down sf-candidate
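Step 3 above is a one-shot check, but a freshly started candidate may need a few seconds before its endpoints answer. One way to gate the traffic flip is to retry the check with a bounded number of attempts. The sketch below is illustrative: `retry` and `candidate_ready` are hypothetical helper names, while the container name and port are taken from step 3.

```shell
# Hypothetical helpers (not part of the repo).

# Run a command until it succeeds, up to ATTEMPTS times, DELAY seconds apart.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Succeeds only if both candidate endpoints answer without an HTTP error
# (curl -f exits non-zero on 4xx/5xx responses).
candidate_ready() {
  docker exec sf-server-candidate curl -fsS -o /dev/null http://localhost:4000/api/healthz &&
    docker exec sf-server-candidate curl -fsS -o /dev/null http://localhost:4000/api/ready
}
```

With these in place, step 4 could be written as `retry 15 2 candidate_ready && docker compose -f infra/srv/docker-compose.yaml stop sf`, so the old container is only stopped once the candidate has passed both checks.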
Environment
| Variable | Default | Purpose |
|---|---|---|
| `SF_IMAGE` | `ghcr.io/singularity-ng/sf-server:latest` | Primary container image |
| `SF_CANDIDATE_IMAGE` | `ghcr.io/singularity-ng/sf-server:candidate` | Blue/green candidate image |
| `SF_HOSTNAME` | `sf.localhost` | Public hostname Traefik routes by |
| `SF_WORKSPACE_DIR` | `./workspace` | Bind-mounted to `/workspace` inside SF |
| `SF_TRAEFIK_HTTP_PORT` | `80` | Host port for Traefik HTTP entrypoint |
| `SF_TRAEFIK_HTTPS_PORT` | `443` | Host port for Traefik HTTPS entrypoint |
| `SF_RPC_SHUTDOWN_GRACE_MS` | `600000` | SF graceful-shutdown drain budget (10 min default). Matches docker-compose.yaml's `stop_grace_period: 610s`. Operator can shorten via env for fast iteration. |
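The two timeouts in the last row are related by simple arithmetic: 600000 ms is 600 s, and a `stop_grace_period` of 610s means Docker waits 10 s longer than SF's own drain budget before sending SIGKILL. If the drain budget is shortened, the Compose value can be derived the same way. The helper below is a hypothetical sketch, and the 10 s headroom is inferred from the two default values rather than stated anywhere in the compose file.

```shell
# Hypothetical helper (not part of the repo): derive a Compose stop_grace_period
# from the in-app drain budget, rounding up to whole seconds and adding headroom.
# Usage: stop_grace_period GRACE_MS [HEADROOM_SECONDS]   (headroom defaults to 10,
# an assumption inferred from the 600000 ms / 610s defaults)
stop_grace_period() {
  local grace_ms="$1" headroom_s="${2:-10}"
  echo "$(( ($grace_ms + 999) / 1000 + $headroom_s ))s"
}
```

For instance, `stop_grace_period "${SF_RPC_SHUTDOWN_GRACE_MS:-600000}"` reproduces the default pairing, and a 30-second drain budget (`stop_grace_period 30000`) would suggest `40s`.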
Why a 10-min stop_grace_period?
If a self-feedback queue drain is in flight when SIGTERM lands, it MUST finish before exit. Losing operator/agent feedback writes across an upgrade silently corrupts the queue invariant. The 10-min ceiling handles pathological lock contention; normal drains finish in <1s.
An operator can bypass the drain with `docker kill sf-server`, which sends SIGKILL and tramples the drain, but that strands `.draining` files on the sf-state volume. The next container's startup will recover them (see `recoverOrphanedFeedbackDrains` in `packages/coding-agent/src/modes/rpc/feedback-queue-recovery.ts`).
TLS / ACME
This compose file intentionally exposes only HTTP, which is enough for localhost demos. For real deployments, add Traefik command flags for the ACME resolver:
command:
- "--certificatesresolvers.letsencrypt.acme.email=ops@example.com"
- "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
- "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"
- "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"
…and add per-router labels traefik.http.routers.sf.tls.certresolver=letsencrypt.
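If the `/letsencrypt` storage path is bind-mounted from the host, the storage file usually has to be prepared first: Traefik is known to reject an `acme.json` whose permissions are more open than 600. The sketch below shows one way to do that. The helper name and the choice of host directory are assumptions, since the compose file described here does not define a volume for ACME storage.

```shell
# Hypothetical helper (not part of the repo): create an empty ACME storage file
# that only its owner can read or write. DIR is whatever host directory you
# choose to mount at /letsencrypt in the Traefik container.
prepare_acme_storage() {
  local dir="$1"
  mkdir -p "$dir" &&
    touch "$dir/acme.json" &&
    chmod 600 "$dir/acme.json"
}
```

A call such as `prepare_acme_storage ./letsencrypt` before the first `docker compose up` would leave a file Traefik can write certificates into. With a named Docker volume instead of a bind mount, this step is typically unnecessary.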
What this does NOT replace
- `Dockerfile.sf-server`: the SF container build is unchanged. This compose consumes the image, not the source.
- `.forgejo/workflows/self-deploy.yml`: CI builds, pushes, and rolls k8s deployments. Forgejo's blue-green path uses `kubectl rollout`, not docker compose. The labels/strategy here are designed to mirror k8s readinessProbe + sessionAffinity for parity.
- The `scripts/upgrade-vega-source-server.mjs` script: that script manages the source-server local-dev variant directly via docker run. This compose is for the production-style deployment.