From d23b9981943b9ce295c77b0fcdfd8043c930c7a3 Mon Sep 17 00:00:00 2001
From: Mikael Hugo <mikkihugo@users.noreply.github.com>
Date: Sun, 17 May 2026 22:38:29 +0200
Subject: [PATCH] feat(infra/srv): Traefik fronting for zero-downtime sf-server
 upgrades
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New /infra/srv/ tree: production-style Docker compose that puts Traefik
in front of sf-server. Closes the orchestration gaps the bare-docker
upgrader (scripts/upgrade-vega-source-server.mjs) couldn't address:

  1. Health-check-driven drain. Traefik polls /api/healthz every 2s.
     The moment SF receives SIGTERM, src/web/shutdown-state.ts flips
     the in-process flag and the route returns 503 (landed in
     f8e53840d). ~4s later Traefik removes the replica from the pool
     — new traffic stops, in-flight requests finish.

  2. Sticky sessions via the `sf-aff` cookie. /api/session/events SSE
     streams (and any other long-lived per-replica state) survive
     client reconnects within the upgrade window because Traefik
     pins the cookie to the same replica until that replica is gone.

  3. Blue/green via the `sf-candidate` service. Guarded by Docker
     compose profile=candidate so production traffic keeps flowing to
     `sf` until the operator promotes. Image swap is then atomic from
     a client perspective — old replica goes 503, new replica picks
     up traffic before old container actually stops.

  4. stop_grace_period: 610s matching SF_RPC_SHUTDOWN_GRACE_MS=600000.
     If a self-feedback queue drain is in flight when SIGTERM lands,
     it MUST finish. Losing writes across an upgrade is worse than the
     wait. Hard-bypass via `docker kill` if the operator chooses; the
     .draining file then gets recovered on the next start via
     feedback-queue-recovery's startup scan.

infra/srv/README.md documents the runbook: bring-up, upgrade flow,
env vars, TLS notes, and what this does NOT replace (the existing
Dockerfile, k8s/Forgejo CI flow, and the source-server upgrader).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
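Note for reviewers (below the `---`, so `git am` drops it): step 4 of the
README upgrade flow mentions Traefik's weighted round robin as an
alternative to the stop/start cutover but does not show it. A minimal
sketch of what that file-provider config could look like follows. It is not
part of this patch. It assumes Traefik is additionally started with
`--providers.file.filename=/etc/traefik/dynamic.yaml`; that path, the
`sf-weighted` names, the priority value, and the 90/10 split are all
illustrative assumptions.

```yaml
# Hypothetical /etc/traefik/dynamic.yaml (not included in this patch).
http:
  routers:
    sf-weighted:
      # Same host rule the docker-label routers use. The explicit priority
      # is meant to let this router win over the label-defined ones.
      rule: "Host(`sf.localhost`)"
      entryPoints:
        - web
      priority: 1000
      service: sf-weighted
  services:
    sf-weighted:
      weighted:
        services:
          # Services discovered from the compose labels, referenced by
          # their provider-qualified names.
          - name: sf@docker
            weight: 90
          - name: sf-candidate@docker
            weight: 10
```

Shifting the two weights would move traffic gradually instead of in one
step. `sf-candidate@docker` only exists while the candidate container is
running, so this config should be treated as upgrade-time only, and the
interaction with the `sf-aff` sticky cookies would need testing before use.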
 infra/srv/README.md           | 125 ++++++++++++++++++++++++++++
 infra/srv/docker-compose.yaml | 150 ++++++++++++++++++++++++++++++++++
 2 files changed, 275 insertions(+)
 create mode 100644 infra/srv/README.md
 create mode 100644 infra/srv/docker-compose.yaml

diff --git a/infra/srv/README.md b/infra/srv/README.md
new file mode 100644
index 000000000..2d48a5347
--- /dev/null
+++ b/infra/srv/README.md
@@ -0,0 +1,125 @@
+# SF server infra: Traefik + sf-server with zero-downtime upgrades
+
+Production-style deployment of `sf-server` behind a Traefik reverse proxy.
+Closes the orchestration gaps in the bare-docker upgrader
+(`scripts/upgrade-vega-source-server.mjs`) by adding:
+
+- **Health-check-driven traffic drain.** Traefik polls `/api/healthz` every
+  2s. The moment SF receives SIGTERM, `src/web/shutdown-state.ts` flips the
+  flag and the route returns 503. After ~4s Traefik removes the container
+  from the load-balancer pool.
+- **Cookie-based sticky sessions.** `/api/session/events` SSE streams survive
+  client reconnects within an upgrade window because Traefik routes the
+  same `sf-aff` cookie to the same replica until that replica is gone.
+- **Blue/green via the `sf-candidate` service.** It runs alongside `sf` as a
+  separate Traefik service, gated by the compose profile `candidate`. The
+  operator verifies it, cuts traffic over, and the old container drains.
+
+## Files
+
+| File | Purpose |
+|------|---------|
+| `docker-compose.yaml` | Traefik + sf + sf-candidate services with full label set |
+| `README.md` | Operator runbook (this file) |
+
+## Quick start (local dev / single-host prod)
+
+```bash
+# 1. Set required env (see `Environment` below)
+export SF_IMAGE=ghcr.io/singularity-ng/sf-server:$(git rev-parse HEAD)
+export SF_HOSTNAME=sf.localhost           # or your real hostname
+export SF_WORKSPACE_DIR=/var/lib/sf/workspace
+
+# 2. Bring everything up
+docker compose -f infra/srv/docker-compose.yaml up -d
+
+# 3. Sanity check
+curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/healthz
+curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/ready
+curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/version
+```
+
+## Zero-downtime upgrade
+
+```bash
+# 1. Pull the new image
+export SF_CANDIDATE_IMAGE=ghcr.io/singularity-ng/sf-server:$(git rev-parse HEAD)
+docker pull "${SF_CANDIDATE_IMAGE}"
+
+# 2. Bring up the candidate (profile=candidate gates it off by default)
+docker compose -f infra/srv/docker-compose.yaml --profile candidate up -d sf-candidate
+
+# 3. Verify candidate health BEFORE flipping traffic
+docker exec sf-server-candidate curl -fsS http://localhost:4000/api/healthz
+docker exec sf-server-candidate curl -fsS http://localhost:4000/api/ready
+
+# 4. Flip Traefik to send traffic to the candidate by promoting it to the
+#    primary service. The cleanest path is to relabel the candidate's
+#    routers to match `sf`'s rule, OR use a Traefik weighted round robin
+#    service (see https://doc.traefik.io/traefik/routing/services/#weighted-round-robin;
+#    it needs the file provider and cannot be set via docker labels alone).
+#    For now: stop the old, start it as the new with candidate's image.
+docker compose -f infra/srv/docker-compose.yaml stop sf
+# Traefik now has only the candidate in its pool → traffic flows there.
+
+# 5. Replace `sf` with the new image and start it
+SF_IMAGE=${SF_CANDIDATE_IMAGE} \
+  docker compose -f infra/srv/docker-compose.yaml up -d sf
+
+# 6. Traefik picks up the new `sf` automatically (via docker label
+#    discovery); both services exist for ~2-4s while health-checks
+#    converge, then `sf-candidate` can be retired.
+docker compose -f infra/srv/docker-compose.yaml --profile candidate down sf-candidate
+```
+
+## Environment
+
+| Variable | Default | Purpose |
+|---|---|---|
+| `SF_IMAGE` | `ghcr.io/singularity-ng/sf-server:latest` | Primary container image |
+| `SF_CANDIDATE_IMAGE` | `ghcr.io/singularity-ng/sf-server:candidate` | Blue/green candidate image |
+| `SF_HOSTNAME` | `sf.localhost` | Public hostname Traefik routes by |
+| `SF_WORKSPACE_DIR` | `./workspace` | Bind-mounted to `/workspace` inside SF |
+| `SF_TRAEFIK_HTTP_PORT` | `80` | Host port for Traefik HTTP entrypoint |
+| `SF_TRAEFIK_HTTPS_PORT` | `443` | Host port for Traefik HTTPS entrypoint |
+| `SF_RPC_SHUTDOWN_GRACE_MS` | `600000` | SF graceful-shutdown drain budget (10 min default). Matches `docker-compose.yaml`'s `stop_grace_period: 610s`. Operator can shorten via env for fast iteration. |
+
+## Why a 10-min stop_grace_period?
+
+If a self-feedback queue drain is in flight when SIGTERM lands, it MUST
+finish before exit. Losing operator/agent feedback writes across an
+upgrade silently corrupts the queue invariant. The 10-min ceiling
+handles pathological lock contention; normal drains finish in <1s.
+
+Operator can bypass via `docker kill sf-server` (sends SIGKILL,
+trampling the drain) — but that strands `.draining` files on the
+`sf-state` volume. The next container's startup will recover them
+(see `recoverOrphanedFeedbackDrains` in
+`packages/coding-agent/src/modes/rpc/feedback-queue-recovery.ts`).
+
+## TLS / ACME
+
+This compose routes plain HTTP only (`websecure` is defined but unused),
+which suits localhost demos. For real deployments, add ACME resolver flags:
+
+```yaml
+command:
+  - "--certificatesresolvers.letsencrypt.acme.email=ops@example.com"
+  - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
+  - "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"
+  - "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"
+```
+
+Then add the router label `traefik.http.routers.sf.tls.certresolver=letsencrypt` and bind that router to `websecure`.
+
+## What this does NOT replace
+
+- `Dockerfile.sf-server` — the SF container build is unchanged. This
+  compose consumes the image, not the source.
+- `.forgejo/workflows/self-deploy.yml` — CI builds, pushes, and rolls
+  k8s deployments. Forgejo's blue-green path uses `kubectl rollout`,
+  not docker compose. The labels/strategy here are designed to mirror
+  k8s readinessProbe + sessionAffinity for parity.
+- The `scripts/upgrade-vega-source-server.mjs` script — that script
+  manages the source-server local-dev variant directly via docker run.
+  This compose is for the production-style deployment.
diff --git a/infra/srv/docker-compose.yaml b/infra/srv/docker-compose.yaml
new file mode 100644
index 000000000..838d9c86e
--- /dev/null
+++ b/infra/srv/docker-compose.yaml
@@ -0,0 +1,150 @@
+name: sf-srv
+
+# SF self-hosted production deployment, fronted by Traefik for:
+#   - health-check-driven traffic draining (consumes /api/healthz 503 during
+#     graceful shutdown so old containers stop receiving new traffic the
+#     instant SIGTERM lands — see src/web/shutdown-state.ts)
+#   - cookie-based sticky sessions so /api/session/events SSE streams survive
+#     re-issued requests within an upgrade
+#   - zero-downtime blue/green via the sf-candidate service (kept out of the
+#     default bring-up by profile=candidate until the operator promotes it)
+#
+# Volumes:
+#   sf-state: persistent .sf/ runtime (queues, DB, drainer recovery files).
+#              Mounted at /workspace/.sf in each SF container. Survives
+#              container swaps so queued sf_feedback writes are durable
+#              across upgrades.
+#   traefik-acme: ACME cert cache (empty until an ACME resolver is configured)
+#
+# Bring up:
+#   docker compose -f infra/srv/docker-compose.yaml up -d
+#
+# Upgrade (manual blue/green; see infra/srv/README.md for the full flow):
+#   1. docker compose -f infra/srv/docker-compose.yaml --profile candidate up -d sf-candidate
+#   2. docker exec sf-server-candidate curl -fsS http://localhost:4000/api/healthz
+#   3. docker compose -f infra/srv/docker-compose.yaml stop sf
+#   4. sf drains (healthz 503 → Traefik removes it from the pool)
+#   5. restart sf on the new image, then retire sf-candidate
+
+services:
+  traefik:
+    image: traefik:v3.3
+    container_name: sf-traefik
+    restart: unless-stopped
+    command:
+      - "--api.dashboard=false"
+      - "--providers.docker=true"
+      - "--providers.docker.exposedbydefault=false"
+      - "--providers.docker.network=sf-srv-net"
+      - "--entrypoints.web.address=:80"
+      - "--entrypoints.websecure.address=:443"
+      # Per-service health-check polling is set via labels (see `sf` below).
+      - "--ping=true" # /ping on the internal :8080 entrypoint, for the healthcheck
+    ports:
+      - "${SF_TRAEFIK_HTTP_PORT:-80}:80"
+      - "${SF_TRAEFIK_HTTPS_PORT:-443}:443"
+    volumes:
+      - "/var/run/docker.sock:/var/run/docker.sock:ro"
+      - "traefik-acme:/letsencrypt"
+    networks:
+      - sf-srv-net
+    healthcheck:
+      test: ["CMD", "traefik", "healthcheck", "--ping"]
+      interval: 10s
+      timeout: 3s
+      retries: 3
+
+  sf:
+    image: "${SF_IMAGE:-ghcr.io/singularity-ng/sf-server:latest}"
+    container_name: sf-server
+    restart: unless-stopped
+    # k8s default terminationGracePeriodSeconds is 30s; we override here to
+    # match rpc-mode's SF_RPC_SHUTDOWN_GRACE_MS default (10 min = 600s).
+    # The graceful-shutdown handler in packages/coding-agent/src/modes/rpc/
+    # rpc-mode.ts must finish its drain before SIGKILL — losing self-feedback
+    # writes across an upgrade is worse than the wait.
+    stop_grace_period: 610s
+    environment:
+      - "SF_RPC_SHUTDOWN_GRACE_MS=${SF_RPC_SHUTDOWN_GRACE_MS:-600000}"
+      - "SF_WEB_HOST=0.0.0.0"
+      - "SF_WEB_PORT=4000"
+    volumes:
+      - "sf-state:/workspace/.sf"
+      - "${SF_WORKSPACE_DIR:-./workspace}:/workspace:rw"
+    networks:
+      - sf-srv-net
+    labels:
+      # Route discovery
+      - "traefik.enable=true"
+      - "traefik.docker.network=sf-srv-net"
+      - "traefik.http.routers.sf.rule=Host(`${SF_HOSTNAME:-sf.localhost}`)"
+      - "traefik.http.routers.sf.entrypoints=web"
+      - "traefik.http.routers.sf.service=sf"
+
+      # Backend port
+      - "traefik.http.services.sf.loadbalancer.server.port=4000"
+
+      # Health-check: drives shutdown-aware draining. The healthz route
+      # returns 503 the moment src/web/shutdown-state.ts.isShuttingDown()
+      # flips true (SIGTERM/SIGINT/SIGHUP received). Traefik polls every
+      # 2s, so within roughly one or two polls (~2-4s after SIGTERM) the
+      # container is removed from the pool and no new requests are sent.
+      # Existing requests finish (subject to the timeout below).
+      - "traefik.http.services.sf.loadbalancer.healthcheck.path=/api/healthz"
+      - "traefik.http.services.sf.loadbalancer.healthcheck.interval=2s"
+      - "traefik.http.services.sf.loadbalancer.healthcheck.timeout=3s"
+
+      # Sticky session: required for /api/session/events SSE streams to
+      # survive client reconnects within the same upgrade window. Cookie
+      # is HttpOnly; Secure stays false while this compose serves plain
+      # HTTP (set it to true once TLS fronts it). Affinity is per-replica:
+      # when one goes away, Traefik routes the next request to a healthy peer.
+      - "traefik.http.services.sf.loadbalancer.sticky.cookie=true"
+      - "traefik.http.services.sf.loadbalancer.sticky.cookie.name=sf-aff"
+      - "traefik.http.services.sf.loadbalancer.sticky.cookie.httpOnly=true"
+      - "traefik.http.services.sf.loadbalancer.sticky.cookie.secure=false"
+      - "traefik.http.services.sf.loadbalancer.sticky.cookie.sameSite=lax"
+
+  # Candidate replica for blue/green upgrades.
+  #
+  # Gated behind profile=candidate, so a plain `up -d` never starts it and
+  # production traffic stays on `sf`. The operator promotes it by hand; see
+  # the README in this dir for the full flow.
+  sf-candidate:
+    image: "${SF_CANDIDATE_IMAGE:-ghcr.io/singularity-ng/sf-server:candidate}"
+    container_name: sf-server-candidate
+    restart: unless-stopped
+    profiles: ["candidate"]
+    stop_grace_period: 610s
+    environment:
+      - "SF_RPC_SHUTDOWN_GRACE_MS=${SF_RPC_SHUTDOWN_GRACE_MS:-600000}"
+      - "SF_WEB_HOST=0.0.0.0"
+      - "SF_WEB_PORT=4000"
+    volumes:
+      - "sf-state:/workspace/.sf"
+      - "${SF_WORKSPACE_DIR:-./workspace}:/workspace:rw"
+    networks:
+      - sf-srv-net
+    labels:
+      - "traefik.enable=true"
+      - "traefik.docker.network=sf-srv-net"
+      - "traefik.http.routers.sf-candidate.rule=Host(`${SF_HOSTNAME:-sf.localhost}`)"
+      - "traefik.http.routers.sf-candidate.entrypoints=web"
+      - "traefik.http.routers.sf-candidate.service=sf-candidate@docker"
+      - "traefik.http.services.sf-candidate.loadbalancer.server.port=4000"
+      - "traefik.http.services.sf-candidate.loadbalancer.healthcheck.path=/api/healthz"
+      - "traefik.http.services.sf-candidate.loadbalancer.healthcheck.interval=2s"
+      - "traefik.http.services.sf-candidate.loadbalancer.healthcheck.timeout=3s"
+      - "traefik.http.services.sf-candidate.loadbalancer.sticky.cookie=true"
+      - "traefik.http.services.sf-candidate.loadbalancer.sticky.cookie.name=sf-aff-candidate"
+
+volumes:
+  sf-state:
+    driver: local
+  traefik-acme:
+    driver: local
+
+networks:
+  sf-srv-net:
+    driver: bridge
+    name: sf-srv-net