singularity-forge/docs/specs/sf-self-deploy.md
Mikael Hugo 8c945550fa
feat: operational glue for upgrade-safety chain
Bundles the working-tree state into one coherent commit covering the
upgrade-safety glue that complements today's earlier landings
(orphan-recovery, sf-db single-connection, drain-timer-not-unref'd,
forceShutdown drain, shutdown-state.ts, instrumentation.ts,
shutdown-signal.js, gate-deadlock-classifier).

Modified:
  docker/Dockerfile.source-server — image build tweaks for the source-
    server variant used by the in-container upgrader.
  docker/docker-compose.vega.yaml — env passthroughs for host-side dirs
    (SF_SOURCE_HOST_ROOT, SF_WORKSPACE_HOST_DIR, SF_WORKSPACES_HOST_DIR,
    SF_HOME_HOST_DIR), docker socket mount, group_add for docker GID,
    and SF_RPC_SHUTDOWN_GRACE_MS=600000 matching the 10-min drain.
  scripts/run-vega-source-server.mjs — substantial rework supporting
    the in-container upgrade flow.
  scripts/upgrade-vega-source-server.mjs — buildEnv() + dockerBuildEnv()
    helpers, probeBind via SF_VEGA_PROBE_HOST, containerExists()
    pre-check before drainContainer, stop timeout now matches the
    10-min RPC grace via SF_VEGA_DRAIN_STOP_TIME (default 610s).
  src/web/project-discovery-service.ts — calls
    recoverProjectRuntimeQueues() on each of the 3 discovery paths
    (root monorepo, per-entry, nested SF projects). Closes the
    cloud-volume mtime-lag window codex flagged.
  web/app/api/ready/route.ts — calls recoverProjectRuntimeQueues() on
    every readiness probe, and now also reads shutdown-state so the
    probe returns 503 while draining.
  web/components/sf/projects-view.tsx — UI wiring for the upgrade
    trigger.
  web/pages/api/projects.ts — backend API addition for the project
    enumeration that feeds projects-view.
  docs/specs/sf-self-deploy.md — docs update for the new flow.
  package.json — script alias.

Added:
  scripts/build-web-host.mjs — new build helper for the standalone web
    host artifact consumed by the upgrade flow.
  src/resources/extensions/sf/tests/auto-shutdown-signal.test.mjs —
    unit test for the cooperative-shutdown signal module (registers /
    requests / snapshot).
  src/web/project-runtime-recovery.ts — thin wrapper around
    recoverOrphanedFeedbackDrains for per-project use from web routes.
  web/app/api/drain/route.ts — explicit drain endpoint for operator-
    triggered queue flush.
  web/app/api/server-upgrade/route.ts — auth-gated endpoint that
    spawns the in-container upgrader via docker socket; passes through
    host-dir env so the upgrader knows real bind-mount paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:57:26 +02:00


SF Self-Deploy Contract

SF deploys as a long-running server owned by the deployment platform, not by an interactive TUI session. Forgejo is the build authority: it verifies a source revision, publishes an immutable OCI image, then rolls a test server before prod.

Purpose

The server must be reloadable without humans killing old processes by hand, and the CLI/web surfaces must be able to prove which build they are controlling. The artifact boundary is therefore:

  1. source revision in git
  2. Forgejo build/test result
  3. OCI image tag or digest
  4. dist/sf-release-manifest.json
  5. /api/healthz, /api/ready, and /api/version probes
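One way a CLI or web surface could "prove which build it is controlling" is to compare the release manifest against what the running server reports. The sketch below is illustrative only: the field names `revision` and `imageDigest` are hypothetical, since this document does not specify the manifest or /api/version schema.

```javascript
// Hypothetical field names; the real manifest and /api/version shapes
// are not defined in this document.
const IDENTITY_FIELDS = ["revision", "imageDigest"];

// Compare the release manifest with the /api/version payload and report
// every identity field that is missing or differs.
function sameBuild(manifest, versionResponse) {
  const mismatches = IDENTITY_FIELDS.filter(
    (field) => !manifest[field] || manifest[field] !== versionResponse[field]
  );
  return { ok: mismatches.length === 0, mismatches };
}
```

A caller would treat any non-empty `mismatches` list as "this server is not the build I expected" and refuse to proceed.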

Build Authority

Forgejo runs .forgejo/workflows/self-deploy.yml on pushes to main and on manual dispatch. The required gates are:

  • npm ci
  • npm --prefix web ci
  • npm run build:core
  • npm run build:web-host
  • npm run typecheck:extensions
  • npm run test:unit
  • build docker/Dockerfile.sf-server
  • generate dist/sf-release-manifest.json
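The gates above are meant to run in order and stop at the first failure. A minimal sketch of that fail-fast behaviour follows; it is not the workflow's actual implementation. The `run` parameter is a hypothetical injected function that executes one command and returns its exit code, and the image build and manifest steps are left out because their exact commands are not given here.

```javascript
// Gate commands copied from the list above.
const GATES = [
  "npm ci",
  "npm --prefix web ci",
  "npm run build:core",
  "npm run build:web-host",
  "npm run typecheck:extensions",
  "npm run test:unit",
];

// Run each gate in order; stop at the first non-zero exit code.
// `run` is a hypothetical executor: (command) => exitCode.
function runGates(gates, run) {
  const passed = [];
  for (const command of gates) {
    const exitCode = run(command);
    if (exitCode !== 0) {
      return { ok: false, passed, failed: command, exitCode };
    }
    passed.push(command);
  }
  return { ok: true, passed, failed: null, exitCode: 0 };
}
```

In a real script, `run` could wrap `child_process.spawnSync` with `shell: true` and return its `status`.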

The image builder is Docker/BuildKit. The deployment contract starts at the OCI image plus release manifest.

Server Runtime

The server image starts:

node /opt/sf/dist/loader.js server /workspace --host 0.0.0.0 --port 4000

The web host receives SF_RELEASE_MANIFEST, SF_WEB_PROJECT_CWD, SF_WEB_HOST, and SF_WEB_PORT in its environment. Probes are unauthenticated so Kubernetes, Traefik, and Forgejo can verify rollouts without a browser token.
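As an illustration of how a web host might consume these four variables, the sketch below validates them up front so a misconfigured container fails at start rather than at first request. The function name and returned shape are hypothetical, and whether the real web host validates this way is an assumption.

```javascript
// Variable names come from the paragraph above; everything else is illustrative.
const REQUIRED_ENV = [
  "SF_RELEASE_MANIFEST",
  "SF_WEB_PROJECT_CWD",
  "SF_WEB_HOST",
  "SF_WEB_PORT",
];

// Read and validate the web host environment; throws on the first problem.
function readWebHostEnv(env) {
  const missing = REQUIRED_ENV.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`missing environment variables: ${missing.join(", ")}`);
  }
  const port = Number(env.SF_WEB_PORT);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`SF_WEB_PORT is not a valid port: ${env.SF_WEB_PORT}`);
  }
  return {
    releaseManifest: env.SF_RELEASE_MANIFEST,
    projectCwd: env.SF_WEB_PROJECT_CWD,
    host: env.SF_WEB_HOST,
    port,
  };
}
```

In Node this would be called as `readWebHostEnv(process.env)`.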

On vega, the local production server may run from the live checkout while still being containerised:

npm run docker:vega:up

That profile runs one shared SF webserver. It mounts this SF checkout at /opt/sf, mounts the initial controlled repo at /workspace, mounts the repo parent at /workspaces, also mounts the repo parent at its real host path (/home/mhugo/code on vega), persists ~/.sf, and binds port 4000 to ${SF_VEGA_BIND:-127.0.0.1}. SF_WORKSPACE_DIR selects the initial repo; it defaults to this checkout for dogfooding. SF_WORKSPACES_DIR selects the parent directory available for repo switching and defaults to the parent of this SF checkout:

SF_WORKSPACE_DIR=/home/mhugo/code/other-repo SF_WORKSPACES_DIR=/home/mhugo/code npm run docker:vega:up

Set SF_VEGA_BIND to the vega Tailscale address when the server should be reachable over Tailscale. Do not bind to 0.0.0.0 unless a proxy or firewall owns access control.
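The bind rule can be sketched as a small helper that applies the 127.0.0.1 default and flags wildcard addresses so the caller can warn or refuse. This is a hypothetical helper, not code from scripts/run-vega-source-server.mjs, and the container-side port 4000 in the publish string is an assumption.

```javascript
// Resolve the host-side bind address, mirroring ${SF_VEGA_BIND:-127.0.0.1}:
// an unset or empty variable falls back to loopback.
function resolveVegaBind(env) {
  const host = env.SF_VEGA_BIND || "127.0.0.1";
  // Wildcard addresses expose the port on every interface.
  const wildcard = host === "0.0.0.0" || host === "::";
  // IPv6 literals need brackets in Docker's host:hostPort:containerPort form.
  const bracketed = host.includes(":") ? `[${host}]` : host;
  return { host, wildcard, publish: `${bracketed}:4000:4000` };
}
```

A runner could pass `publish` to `docker run -p` and print a warning whenever `wildcard` is true.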

On hosts without the Docker Compose plugin, npm run docker:vega:up uses scripts/run-vega-source-server.mjs to build docker/Dockerfile.source-server and run the equivalent docker run command directly. This is one SF server implementation, one shared webserver process, and repo-scoped worker/session state underneath it. Restarting the runner replaces the shared vega webserver, not one container per repo.

Use npm run docker:vega:upgrade for the local blue/green path. It builds the web host, writes the release manifest, starts sf-server-vega-candidate on port 4001, and probes health, readiness, version, and projects. Only after the candidate passes does it replace sf-server-vega on port 4000, probe prod, and then remove the candidate. Replacement drains the old container with docker stop --timeout ${SF_VEGA_DRAIN_STOP_TIME:-610} before falling back to forced removal. The 610-second default leaves a 10-second margin over the RPC child's SF_RPC_SHUTDOWN_GRACE_MS=600000 (600-second) queue-drain handler.
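The timing relationship between the stop timeout and the RPC grace period can be checked with simple arithmetic. The helper below is a sketch with a hypothetical name; falling back to 600000 when SF_RPC_SHUTDOWN_GRACE_MS is unset is an assumption made for illustration.

```javascript
// Compute how much headroom the docker stop timeout leaves over the
// RPC child's drain grace period.
function drainTimings(env) {
  const stopSeconds = Number(env.SF_VEGA_DRAIN_STOP_TIME || 610);
  const graceMs = Number(env.SF_RPC_SHUTDOWN_GRACE_MS || 600000); // assumed fallback
  if (!Number.isFinite(stopSeconds) || !Number.isFinite(graceMs)) {
    throw new Error("drain timings must be numeric");
  }
  const marginSeconds = stopSeconds - graceMs / 1000;
  // A non-positive margin means Docker may kill the container mid-drain.
  return { stopSeconds, graceMs, marginSeconds, safe: marginSeconds > 0 };
}
```

With both defaults the margin is 610 - 600 = 10 seconds. Lowering SF_VEGA_DRAIN_STOP_TIME without also lowering the grace period makes `safe` false.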

Promotion

Test must roll before prod:

  1. set test deployment image to the new digest
  2. wait for rollout
  3. call /api/healthz
  4. call /api/ready
  5. call /api/version
  6. promote the same image digest to prod
  7. repeat the same probes
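The sequence above can be expressed as a plan in which test and prod receive the same digest and the same probes. This is a minimal sketch with hypothetical action names; the wait-for-rollout step on prod is an assumption, since the list only states it explicitly for test.

```javascript
// Probe paths from the steps above.
const PROBES = ["/api/healthz", "/api/ready", "/api/version"];

// Build the ordered promotion plan for one image digest.
// Test is always staged before prod, and both use the identical digest.
function promotionPlan(digest) {
  const stage = (environment) => [
    { action: "set-image", environment, digest },
    { action: "wait-for-rollout", environment },
    ...PROBES.map((path) => ({ action: "probe", environment, path })),
  ];
  return [...stage("test"), ...stage("prod")];
}
```

An executor would walk the plan in order and abort on the first failed step, so prod is never touched when a test probe fails.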

Prod must not install latest from npm during rollout. Runtime auto-update means the deployment controller rolls a verified image; it does not mean the running process mutates its own package tree.
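One way to enforce "roll a verified image" is to reject any image reference that is not pinned by digest, because a tag such as latest can move between test and prod. The check below is a sketch with a hypothetical function name and covers sha256 digests only.

```javascript
// True only when the reference ends in an immutable sha256 digest
// (64 lowercase hex characters), for example name@sha256:<hex>.
function isDigestPinned(imageRef) {
  return /@sha256:[0-9a-f]{64}$/.test(imageRef);
}
```

A deployment script could call this before promotion and fail when the reference is a bare tag.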

Reload Model

For a source-mounted vega container, the foreground process is the staged Next standalone server at dist/web/standalone/server.js. Rebuild or restart the container after changing server/web code. In Kubernetes or k3s, rollout replacement is the reload mechanism. Long term, CLI commands should call the server RPC surface by default when a healthy server owns the project, while local sf server remains the bootstrap and recovery path.

Open Work

  • Wire /api/version into the web footer/admin panel.
  • Add an RPC smoke probe once the stable server RPC endpoint is finalized.
  • Move the Forgejo workflow's deployment target names into /srv/infra GitOps values when the cluster manifests exist.