revert(infra/srv): remove wrong-primitive Traefik docker-compose

This commit removes infra/srv/ that I created in d23b99819. The
docker-compose-Traefik sketch was architecturally wrong:

- Traefik on this host is a Flux-managed Kubernetes DaemonSet at
  /srv/infra/clusters/default/infrastructure/traefik/helmrelease.yaml
  (hostNetwork: true, ports 80/443/18789/2222)
- Vega's k3s explicitly disables its own bundled Traefik
  (--disable=traefik,servicelb,metrics-server) and relies on the
  Flux-managed one
- So the correct Traefik integration for sf-server is k8s
  IngressRoute + Service + Deployment manifests under
  /srv/infra/apps/ or hosts/vega/, NOT a docker-compose stack in
  the SF source tree

The sf-server Docker image (docker/Dockerfile.sf-server) and the
production-grade graceful-shutdown/recovery work in
packages/coding-agent/src/modes/rpc/ + src/web/shutdown-state.ts
all remain valid and necessary — they just plug into k8s/Traefik
via manifests in the operator's GitOps repo, not via this compose.

Naming: also moved infra/srv -> docker/vega briefly during this
session at the operator's nudging; both locations are gone now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Mikael Hugo 2026-05-17 22:45:31 +02:00
parent d23b998194
commit 40c6148d7e
2 changed files with 0 additions and 275 deletions

View file

@ -1,125 +0,0 @@
# SF server infra: Traefik + sf-server with zero-downtime upgrades
Production deployment of `sf-server` behind a Traefik reverse-proxy. Closes
the orchestration gaps in the bare-docker upgrader (`scripts/upgrade-vega-
source-server.mjs`) by adding:
- **Health-check-driven traffic drain.** Traefik polls `/api/healthz` every
2s. The moment SF receives SIGTERM, `src/web/shutdown-state.ts` flips the
flag and the route returns 503. After ~4s Traefik removes the container
from the load-balancer pool.
- **Cookie-based sticky sessions.** `/api/session/events` SSE streams survive
client reconnects within an upgrade window because Traefik routes the
same `sf-aff` cookie to the same replica until that replica is gone.
- **Blue/green via weighted services.** The `sf-candidate` service runs
alongside `sf` under a separate Traefik service. Operator flips weights
to roll traffic gradually; old container drains; old removed.
## Files
| File | Purpose |
|------|---------|
| `docker-compose.yaml` | Traefik + sf + sf-candidate services with full label set |
| (this README) | Operator runbook |
## Quick start (local dev / single-host prod)
```bash
# 1. Set required env (see `Environment` below)
export SF_IMAGE=ghcr.io/singularity-ng/sf-server:$(git rev-parse HEAD)
export SF_HOSTNAME=sf.localhost # or your real hostname
export SF_WORKSPACE_DIR=/var/lib/sf/workspace
# 2. Bring everything up
docker compose -f infra/srv/docker-compose.yaml up -d
# 3. Sanity check
curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/healthz
curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/ready
curl -H "Host: ${SF_HOSTNAME}" http://localhost/api/version
```
## Zero-downtime upgrade
```bash
# 1. Build the new image
export SF_CANDIDATE_IMAGE=ghcr.io/singularity-ng/sf-server:$(git rev-parse HEAD)
docker pull ${SF_CANDIDATE_IMAGE}
# 2. Bring up the candidate (profile=candidate gates it off by default)
docker compose -f infra/srv/docker-compose.yaml --profile candidate up -d sf-candidate
# 3. Verify candidate health BEFORE flipping traffic
docker exec sf-server-candidate curl -fsS http://localhost:4000/api/healthz
docker exec sf-server-candidate curl -fsS http://localhost:4000/api/ready
# 4. Flip Traefik to send traffic to the candidate by promoting it to the
# primary service. The cleanest path is to relabel the candidate's
# routers to match `sf`'s rule, OR use a Traefik weighted-service
# middleware (see https://doc.traefik.io/traefik/routing/services/#weighted-round-robin
# — requires the dynamic-config provider, NOT the docker-labels-only path).
# For now: stop the old, start it as the new with candidate's image.
docker compose -f infra/srv/docker-compose.yaml stop sf
# Traefik now has only the candidate in its pool → traffic flows there.
# 5. Replace `sf` with the new image and start it
SF_IMAGE=${SF_CANDIDATE_IMAGE} \
docker compose -f infra/srv/docker-compose.yaml up -d sf
# 6. Traefik picks up the new `sf` automatically (via docker label
# discovery); both services exist for ~2-4s while health-checks
# converge, then `sf-candidate` can be retired.
docker compose -f infra/srv/docker-compose.yaml --profile candidate down sf-candidate
```
## Environment
| Variable | Default | Purpose |
|---|---|---|
| `SF_IMAGE` | `ghcr.io/singularity-ng/sf-server:latest` | Primary container image |
| `SF_CANDIDATE_IMAGE` | `ghcr.io/singularity-ng/sf-server:candidate` | Blue/green candidate image |
| `SF_HOSTNAME` | `sf.localhost` | Public hostname Traefik routes by |
| `SF_WORKSPACE_DIR` | `./workspace` | Bind-mounted to `/workspace` inside SF |
| `SF_TRAEFIK_HTTP_PORT` | `80` | Host port for Traefik HTTP entrypoint |
| `SF_TRAEFIK_HTTPS_PORT` | `443` | Host port for Traefik HTTPS entrypoint |
| `SF_RPC_SHUTDOWN_GRACE_MS` | `600000` | SF graceful-shutdown drain budget (10 min default). Matches `docker-compose.yaml`'s `stop_grace_period: 610s`. Operator can shorten via env for fast iteration. |
## Why a 10-min stop_grace_period?
If a self-feedback queue drain is in flight when SIGTERM lands, it MUST
finish before exit. Losing operator/agent feedback writes across an
upgrade silently corrupts the queue invariant. The 10-min ceiling
handles pathological lock contention; normal drains finish in <1s.
Operator can bypass via `docker kill sf-server` (sends SIGKILL,
trampling the drain) — but that strands `.draining` files on the
`sf-state` volume. The next container's startup will recover them
(see `recoverOrphanedFeedbackDrains` in
`packages/coding-agent/src/modes/rpc/feedback-queue-recovery.ts`).
## TLS / ACME
This compose intentionally exposes only HTTP for local-host demos.
For real deployments, add Traefik command flags for the ACME resolver:
```yaml
command:
- "--certificatesresolvers.letsencrypt.acme.email=ops@example.com"
- "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
- "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"
- "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"
```
…and add per-router labels `traefik.http.routers.sf.tls.certresolver=letsencrypt`.
## What this does NOT replace
- `Dockerfile.sf-server` — the SF container build is unchanged. This
compose consumes the image, not the source.
- `.forgejo/workflows/self-deploy.yml` — CI builds, pushes, and rolls
k8s deployments. Forgejo's blue-green path uses `kubectl rollout`,
not docker compose. The labels/strategy here are designed to mirror
k8s readinessProbe + sessionAffinity for parity.
- The `scripts/upgrade-vega-source-server.mjs` script — that script
manages the source-server local-dev variant directly via docker run.
This compose is for the production-style deployment.

View file

@ -1,150 +0,0 @@
name: sf-srv
# SF self-hosted production deployment, fronted by Traefik for:
# - health-check-driven traffic draining (consumes /api/healthz 503 during
# graceful shutdown so old containers stop receiving new traffic the
# instant SIGTERM lands — see src/web/shutdown-state.ts)
# - cookie-based sticky sessions so /api/session/events SSE streams survive
# re-issued requests within an upgrade
# - zero-downtime blue/green via weighted services (candidate gets weight=0
# until probes pass, then weights flip; old container drains; old removed)
#
# Volumes:
# sf-state — persistent .sf/ runtime (queues, DB, drainer recovery files).
# Mounted at /workspace/.sf in each SF container. Survives
# container swaps so queued sf_feedback writes are durable
# across upgrades.
# traefik-acme — ACME cert cache (only used when SF_TRAEFIK_TLS=1)
#
# Bring up:
# docker compose -f infra/srv/docker-compose.yaml up -d
#
# Upgrade (manual blue/green; see infra/srv/README.md for the scripted flow):
# 1. docker compose -f infra/srv/docker-compose.yaml --profile candidate up -d
# 2. curl http://localhost/api/healthz (Traefik health-checks the new svc)
# 3. flip weights: edit sf-candidate label to 100, sf to 0; restart Traefik
# 4. wait for sf-old to drain (healthz 503 → Traefik removes from pool)
# 5. docker compose -f infra/srv/docker-compose.yaml stop sf
services:
traefik:
image: traefik:v3.3
container_name: sf-traefik
restart: unless-stopped
command:
- "--api.dashboard=false"
- "--providers.docker=true"
- "--providers.docker.exposedbydefault=false"
- "--providers.docker.network=sf-srv-net"
- "--entrypoints.web.address=:80"
- "--entrypoints.websecure.address=:443"
# Polling health-check interval is set per-service via labels.
# See traefik.http.services.sf.loadbalancer.healthcheck.* below.
ports:
- "${SF_TRAEFIK_HTTP_PORT:-80}:80"
- "${SF_TRAEFIK_HTTPS_PORT:-443}:443"
volumes:
- "/var/run/docker.sock:/var/run/docker.sock:ro"
- "traefik-acme:/letsencrypt"
networks:
- sf-srv-net
healthcheck:
test: ["CMD", "wget", "--quiet", "--spider", "http://localhost:80/ping"]
interval: 10s
timeout: 3s
retries: 3
sf:
image: "${SF_IMAGE:-ghcr.io/singularity-ng/sf-server:latest}"
container_name: sf-server
restart: unless-stopped
# k8s default terminationGracePeriodSeconds is 30s; we override here to
# match rpc-mode's SF_RPC_SHUTDOWN_GRACE_MS default (10 min = 600s).
# The graceful-shutdown handler in packages/coding-agent/src/modes/rpc/
# rpc-mode.ts must finish its drain before SIGKILL — losing self-feedback
# writes across an upgrade is worse than the wait.
stop_grace_period: 610s
environment:
- "SF_RPC_SHUTDOWN_GRACE_MS=${SF_RPC_SHUTDOWN_GRACE_MS:-600000}"
- "SF_WEB_HOST=0.0.0.0"
- "SF_WEB_PORT=4000"
volumes:
- "sf-state:/workspace/.sf"
- "${SF_WORKSPACE_DIR:-./workspace}:/workspace:rw"
networks:
- sf-srv-net
labels:
# Route discovery
- "traefik.enable=true"
- "traefik.docker.network=sf-srv-net"
- "traefik.http.routers.sf.rule=Host(`${SF_HOSTNAME:-sf.localhost}`)"
- "traefik.http.routers.sf.entrypoints=web"
- "traefik.http.routers.sf.service=sf"
# Backend port
- "traefik.http.services.sf.loadbalancer.server.port=4000"
# Health-check: drives shutdown-aware draining. The healthz route
# returns 503 the moment src/web/shutdown-state.ts.isShuttingDown()
# flips true (SIGTERM/SIGINT/SIGHUP received). Traefik polls every
# 2s; once 2 consecutive failures land (~4s after SIGTERM), the
# container is removed from the pool and no new requests are sent.
# Existing requests finish (subject to the timeout below).
- "traefik.http.services.sf.loadbalancer.healthcheck.path=/api/healthz"
- "traefik.http.services.sf.loadbalancer.healthcheck.interval=2s"
- "traefik.http.services.sf.loadbalancer.healthcheck.timeout=3s"
# Sticky session: required for /api/session/events SSE streams to
# survive client reconnects within the same upgrade window. Cookie
# is HttpOnly + Secure-when-TLS-fronted. Affinity is per-replica;
# when a container goes away, the cookie targets disappear and
# Traefik routes the next request to a healthy peer.
- "traefik.http.services.sf.loadbalancer.sticky.cookie=true"
- "traefik.http.services.sf.loadbalancer.sticky.cookie.name=sf-aff"
- "traefik.http.services.sf.loadbalancer.sticky.cookie.httpOnly=true"
- "traefik.http.services.sf.loadbalancer.sticky.cookie.secure=false"
- "traefik.http.services.sf.loadbalancer.sticky.cookie.sameSite=lax"
# Candidate replica for blue/green upgrades.
#
# Default weight = 0 so production traffic stays on `sf` until probes pass.
# Operator flips weights via the upgrader script (see ../../scripts/upgrade-
# vega-source-server.mjs and the README in this dir for the full flow).
sf-candidate:
image: "${SF_CANDIDATE_IMAGE:-ghcr.io/singularity-ng/sf-server:candidate}"
container_name: sf-server-candidate
restart: unless-stopped
profiles: ["candidate"]
stop_grace_period: 610s
environment:
- "SF_RPC_SHUTDOWN_GRACE_MS=${SF_RPC_SHUTDOWN_GRACE_MS:-600000}"
- "SF_WEB_HOST=0.0.0.0"
- "SF_WEB_PORT=4000"
volumes:
- "sf-state:/workspace/.sf"
- "${SF_WORKSPACE_DIR:-./workspace}:/workspace:rw"
networks:
- sf-srv-net
labels:
- "traefik.enable=true"
- "traefik.docker.network=sf-srv-net"
- "traefik.http.routers.sf-candidate.rule=Host(`${SF_HOSTNAME:-sf.localhost}`)"
- "traefik.http.routers.sf-candidate.entrypoints=web"
- "traefik.http.routers.sf-candidate.service=sf-candidate@docker"
- "traefik.http.services.sf-candidate.loadbalancer.server.port=4000"
- "traefik.http.services.sf-candidate.loadbalancer.healthcheck.path=/api/healthz"
- "traefik.http.services.sf-candidate.loadbalancer.healthcheck.interval=2s"
- "traefik.http.services.sf-candidate.loadbalancer.healthcheck.timeout=3s"
- "traefik.http.services.sf-candidate.loadbalancer.sticky.cookie=true"
- "traefik.http.services.sf-candidate.loadbalancer.sticky.cookie.name=sf-aff-candidate"
volumes:
sf-state:
driver: local
traefik-acme:
driver: local
networks:
sf-srv-net:
driver: bridge
name: sf-srv-net