Superseded by the k3s self-deploy flow: build → push → `kubectl set image`
performs a rolling rollout, so the in-band docker-compose-on-vega upgrade
path (docker:vega:* scripts, /api/server-upgrade route, Dockerfile.source-server,
docker-compose.vega.yaml, projects-view "Upgrade Server" button) is now dead
code.
The k3s deploy workflow (.forgejo/workflows/self-deploy.yml) and sf-server
kustomization under /srv/infra/clusters/default/tenants/hugo/apps/sf-server/
are the only deploy path going forward.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cycle 2 (the 13-node coding-agent mega) closed via two changes:
1. scripts/check-circular-deps.mjs — track function-body depth and
skip require()/import() calls inside function bodies. They run on
call, not at module evaluation, and therefore cannot cause
module-graph cycles — same reasoning as the existing dynamic
`await import()` skip. Generic improvement; benefits any pattern
that uses lazy CommonJS require() to break a static cycle.
2. packages/coding-agent/src/core/extensions/loader.ts — removed the
static `import * as _bundledCodingAgent from "../../index.js"`
self-reference, which was the cycle-closer. It only populated
STATIC_BUNDLED_MODULES for the Bun virtualModules path
(`isBunBinary` branch in getJitiOptions), and SF is Node-26-only
per operator policy (no Bun) — so the Bun branch is dead at
runtime and dropping the static self-reference is safe. The two
map entries that referenced it (@singularity-forge/coding-agent
and the @mariozechner alias) are commented out at the same site
with a pointer to the top-of-file note.
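A minimal sketch of the depth-tracking idea from change 1, not the actual check-circular-deps.mjs code. The helper name `findEagerRequires` is hypothetical. It approximates "inside a function body" by brace depth, so it would also skip require() inside a top-level `if {}` block, and it does not parse strings or comments.

```javascript
// Hypothetical helper: return module specifiers that are require()'d at
// brace depth 0, i.e. during module evaluation. Brace depth is a crude
// stand-in for "inside a function body"; strings and comments are not parsed.
function findEagerRequires(source) {
  const eager = [];
  const token = /[{}]|require\(\s*["']([^"']+)["']\s*\)/g;
  let depth = 0;
  let match;
  while ((match = token.exec(source)) !== null) {
    if (match[0] === "{") depth += 1;
    else if (match[0] === "}") depth = Math.max(0, depth - 1);
    else if (depth === 0) eager.push(match[1]); // evaluated at module load
  }
  return eager;
}

const sample = [
  'const a = require("./a.js");',
  "function lazy() {",
  '  return require("./b.js");', // runs on call, so it cannot close a cycle
  "}",
].join("\n");

console.log(findEagerRequires(sample)); // [ './a.js' ]
```

Only the eager edge would be fed into the module graph; the lazy one is dropped for the same reason a dynamic `await import()` is.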
Net effect across the full session:
- start of session: 9 cycles
- walker false-positive cleanups landed: dropped 6 type-only +
  dynamic-import false positives
- tui ↔ overlay-layout: CURSOR_MARKER moved to overlay-types.ts
- SF autonomous-rollback chain (3 targeted cuts):
  experimental → preferences-serializer,
  classifier → lazy rollback import,
  preferences-models → runaway-defaults.js
- this commit: coding-agent loader self-reference dropped
Final state: ✅ zero circular dependencies in 1193 scanned files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles the working-tree state into one coherent commit covering the
upgrade-safety glue that complements today's earlier landings
(orphan-recovery, sf-db single-connection, drain-timer-not-unref'd,
forceShutdown drain, shutdown-state.ts, instrumentation.ts,
shutdown-signal.js, gate-deadlock-classifier).
Modified:
docker/Dockerfile.source-server — image build tweaks for the source-
server variant used by the in-container upgrader.
docker/docker-compose.vega.yaml — env passthroughs for host-side dirs
(SF_SOURCE_HOST_ROOT, SF_WORKSPACE_HOST_DIR, SF_WORKSPACES_HOST_DIR,
SF_HOME_HOST_DIR), docker socket mount, group_add for docker GID,
and SF_RPC_SHUTDOWN_GRACE_MS=600000 matching the 10-min drain.
scripts/run-vega-source-server.mjs — substantial rework supporting
the in-container upgrade flow.
scripts/upgrade-vega-source-server.mjs — buildEnv() + dockerBuildEnv()
helpers, probeBind via SF_VEGA_PROBE_HOST, containerExists()
pre-check before drainContainer, stop timeout now matches the
10-min RPC grace via SF_VEGA_DRAIN_STOP_TIME (default 610s).
src/web/project-discovery-service.ts — calls
recoverProjectRuntimeQueues() on each of the 3 discovery paths
(root monorepo, per-entry, nested SF projects). Closes the
cloud-volume mtime-lag window codex flagged.
web/app/api/ready/route.ts — calls recoverProjectRuntimeQueues() on
every readiness probe, and now also reads shutdown-state so the
probe returns 503 while draining.
web/components/sf/projects-view.tsx — UI wiring for the upgrade
trigger.
web/pages/api/projects.ts — backend API addition for the project
enumeration that feeds projects-view.
docs/specs/sf-self-deploy.md — docs update for the new flow.
package.json — script alias.
Added:
scripts/build-web-host.mjs — new build helper for the standalone web
host artifact consumed by the upgrade flow.
src/resources/extensions/sf/tests/auto-shutdown-signal.test.mjs —
unit test for the cooperative-shutdown signal module (registers /
requests / snapshot).
src/web/project-runtime-recovery.ts — thin wrapper around
recoverOrphanedFeedbackDrains for per-project use from web routes.
web/app/api/drain/route.ts — explicit drain endpoint for operator-
triggered queue flush.
web/app/api/server-upgrade/route.ts — auth-gated endpoint that
spawns the in-container upgrader via docker socket; passes through
host-dir env so the upgrader knows real bind-mount paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes to make blue/green upgrades (per scripts/upgrade-vega-
source-server.mjs) safe for in-flight self-feedback writes.
1. Startup orphan recovery (feedback-queue-recovery.ts, extracted module).
Scans .sf/runtime/ for sf-feedback-queue.jsonl.<pid>(.<sid>)?.draining
files left by previous processes. For each:
- if our own session id: leave alone (live drain)
- if PID is alive: leave alone (foreign drainer)
- else: rename back to queue (only if no active queue file exists)
Crash safety: when both an orphan AND an active queue exist, we DEFER
recovery rather than merge — appending then unlinking would risk
duplicate replay on crash. The next restart's recovery picks it up
once the queue is naturally drained. Supports legacy filenames
(.<pid>.draining, pre-session-id) for backward compat.
Added SF_DRAIN_SESSION_ID (per-process 6-byte hex) stamped into the
.draining filename. PID reuse across container restarts is normally
safe because /proc clears, but the session id is a stronger guarantee
that we don't trample a foreign drainer that happens to land on the
same PID.
2. SIGTERM/SIGINT drain-then-exit handler (installGracefulShutdown).
Drains the queue once on signal, then exits. Bounded by
SF_RPC_SHUTDOWN_GRACE_MS (default 600_000 = 10 min). Rationale: if
a drain is in flight, it MUST finish — losing self-feedback writes
across a server upgrade is worse than a long wait. Normal drains
complete in <1s; the 10-min ceiling is for pathological lock
contention. Operator overrides via env var, or docker kill /
kubectl delete --force for hard bypass.
Upgrader script bumped to docker stop --timeout 610 (10s safety
margin past the grace). k8s deployments must set
terminationGracePeriodSeconds≥610 for the rolling-update path.
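A rough sketch of the drain-then-exit bound and the matching stop timeout, assuming a promise-returning drain function. `drainWithDeadline` and `stopTimeoutSeconds` are illustrative names, not the installGracefulShutdown implementation.

```javascript
// Hypothetical sketch: race one drain attempt against the grace period.
// `drain` is any function returning a promise.
async function drainWithDeadline(drain, graceMs) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), graceMs);
  });
  try {
    return await Promise.race([drain().then(() => "drained"), deadline]);
  } finally {
    clearTimeout(timer); // stop the timer once the race has settled
  }
}

// Supervisor-side stop timeout: the grace period plus a safety margin.
function stopTimeoutSeconds(graceMs, marginSeconds = 10) {
  return Math.ceil(graceMs / 1000) + marginSeconds;
}

console.log(stopTimeoutSeconds(600_000)); // 610
```

The supervisor timeout has to exceed the in-process grace, otherwise the container is killed while a drain is still allowed to run.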
Tests: rpc-mode-orphan-recovery.test.ts — 7 cases covering empty,
no-orphans, dead-PID single recovery, both-files-deferred (codex's
crash-safety fix), live-PID untouched, multiple-dead-PIDs, malformed-
filename ignored.
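The recovery rules exercised by those cases can be sketched as a pure decision function. This is an illustration under assumptions: `classifyDrainingFile` and its option names are hypothetical, the liveness probe is injected, and the session id is assumed to be lowercase hex.

```javascript
// Matches sf-feedback-queue.jsonl.<pid>.draining (legacy) and
// sf-feedback-queue.jsonl.<pid>.<sid>.draining (session-stamped).
const DRAINING_RE = /^sf-feedback-queue\.jsonl\.(\d+)(?:\.([0-9a-f]+))?\.draining$/;

// Hypothetical decision function; no filesystem access happens here.
function classifyDrainingFile(name, { ownSessionId, isPidAlive, activeQueueExists }) {
  const m = DRAINING_RE.exec(name);
  if (!m) return "ignore"; // malformed filename
  const pid = Number(m[1]);
  const sid = m[2];
  if (sid !== undefined && sid === ownSessionId) return "leave"; // our own live drain
  if (isPidAlive(pid)) return "leave"; // foreign drainer still running
  if (activeQueueExists) return "defer"; // never merge; a later restart retries
  return "recover"; // rename back to the queue file
}

const ctx = {
  ownSessionId: "a1b2c3d4e5f6",
  isPidAlive: (pid) => pid === 42,
  activeQueueExists: false,
};
console.log(classifyDrainingFile("sf-feedback-queue.jsonl.7.draining", ctx)); // recover
```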
Refs sf-mpa5kdpu (drainer orphans never recovered), sf-mpa4g46x
(original RPC hang). Codex adversarial-reviewed; the PID-reuse hardening
and crash-safety deferral landed per its feedback. Open follow-ups:
shutdown-aware /api/healthz returning 503 (codex point E), integrate
with existing forceShutdown ordering (codex point C).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: dev-server watched packages/daemon/src + dev scripts + package.json.
SF extension source edits in src/resources/extensions/sf/ AND coding-agent
edits in packages/coding-agent/src/ did NOT trigger restart. Operators had to
restart manually after copy-resources / git pull / coding-agent edits.
Adds three watched paths:
1. packages/coding-agent/src — rpc-mode hosts sf_feedback / start_autonomous
handlers, lives here. Edits must restart the sf child.
2. dist/resources/.sf-resource-build-stamp — atomic stamp updated by
copy-resources. Watching the stamp (not the dist tree) avoids heavy
recursive walk while picking up extension upgrades the moment they land.
Idempotent: ensure-source-resources only updates the stamp when an actual
rebuild ran, so no restart-loop on identical re-runs.
3. .git/HEAD — changes on pull / branch switch / commit. Catches upgrade
flows where source moved outside this process.
Native (packages/native/) intentionally not watched — Rust build is 5–10 min,
auto-trigger would loop. Operator triggers native rebuild manually per the
existing ensure-source-resources policy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without this, edits to packages/coding-agent/src/* (or any other workspace
package src) silently land while the dist stays stale — agents continue
loading the old compiled JS and operators see "why didn't my edit take
effect?" symptoms. Observed 2026-05-17 while wiring in the AST tools: vitest
(reading TS source) passed; runtime smoke test against dist failed because
no auto-rebuild fired.
Extends ensure-source-resources.cjs (which sf-from-source runs on every
launch) to also check workspace packages: agent-core, ai, coding-agent,
daemon, google-gemini-cli-provider, openai-codex-provider, rpc-client, tui.
For each, compare latest src mtime vs latest dist mtime (with a 100ms grace
window). If src is newer, run `npm run build -w @singularity-forge/<pkg>`.
Excludes:
- packages/native (Rust build is 5–10 min; trigger manually via
`node rust-engine/scripts/build.js --dev`).
- Any package in SF_SKIP_WORKSPACE_AUTOBUILD (comma-separated).
- Whole step disabled by SF_SKIP_WORKSPACE_AUTOBUILD=all.
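A minimal sketch of the staleness check and the skip list, assuming the grace window means "src must be newer than dist by more than 100ms". `parseSkipList` and `needsRebuild` are hypothetical names, not functions from ensure-source-resources.cjs.

```javascript
// Parse SF_SKIP_WORKSPACE_AUTOBUILD: comma-separated names, or "all".
function parseSkipList(value) {
  const names = (value ?? "").split(",").map((s) => s.trim()).filter(Boolean);
  return { all: names.includes("all"), names: new Set(names) };
}

// Decide whether one workspace package needs `npm run build -w ...`.
function needsRebuild(pkg, { srcMtimeMs, distMtimeMs }, skip, graceMs = 100) {
  if (skip.all || skip.names.has(pkg)) return false;
  // Assumption: the grace window absorbs timestamp jitter, so src must be
  // newer than dist by more than graceMs to count as stale.
  return srcMtimeMs > distMtimeMs + graceMs;
}

const mtimes = { srcMtimeMs: 2000, distMtimeMs: 1000 };
console.log(needsRebuild("tui", mtimes, parseSkipList(undefined))); // true
console.log(needsRebuild("tui", mtimes, parseSkipList("tui, ai"))); // false
```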
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three concrete fixes from open self-feedback assessment 2026-05-17:
- uok/gate-registry-bootstrap.js: register all 6 R081 detector gates
(same-unit-loop, zero-progress, repeated-feedback-kind, artifact-flap,
stale-lock, periodic-detector-sweep) alongside drift-detection and
iter-completion-reconciler. Closes the gap reported by
sf-mp9udspu-fsf7si — bootstrap previously registered 2 of 8 gates.
- self-feedback.js ALLOWED_KIND_DOMAINS: add `adversarial-finding`.
Closes gap reported by sf-mp9u4i25-fczmcj — R075 (autonomous
adversarial review) challenge unit had no kind to file findings under.
- sf-autonomous-watchdog.sh: delete watchdog-run-*.log files older than
60 minutes at each cycle start. Without rotation .sf/ grew to 1.9 GB
in 24h (today's snapshot). 60 min retention captures last cycle for
post-incident triage; older state is already in DB + iterations.jsonl.
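The retention rule in the last fix can be sketched as a pure selection step. The real rotation lives in the bash watchdog; this JavaScript version with the hypothetical `selectExpiredLogs` only illustrates the 60-minute cutoff on injected file metadata.

```javascript
// Return names of watchdog-run-*.log files older than maxAgeMs.
// `files` is an array of { name, mtimeMs }; nothing is deleted here.
function selectExpiredLogs(files, nowMs, maxAgeMs = 60 * 60 * 1000) {
  return files
    .filter((f) => /^watchdog-run-.*\.log$/.test(f.name))
    .filter((f) => nowMs - f.mtimeMs > maxAgeMs)
    .map((f) => f.name);
}

const logFiles = [
  { name: "watchdog-run-1.log", mtimeMs: 0 },
  { name: "watchdog-run-2.log", mtimeMs: 3_000_000 },
  { name: "watchdog.log", mtimeMs: 0 }, // not a per-run log, never selected
];
console.log(selectExpiredLogs(logFiles, 3_700_000)); // [ 'watchdog-run-1.log' ]
```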
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- scripts/sf-meta-supervisor.mjs: pure-node daemon supervising
scripts/sf-autonomous-watchdog.sh. Tick=60s, restarts watchdog if dead,
emits .sf/meta-status.json, halt via .sf/meta-supervisor.halt. Uses
only node builtins (no SF dist deps) so it survives dist breakage.
- src/headless.ts: R091 — gate the per-cycle handleTriage call on a time
interval (SF_TRIAGE_INTERVAL_MS, default 30 min) and bump batch size
(SF_TRIAGE_MAX, default 25, was 5). Drops the ~8min triage hit from
every cycle while letting daily drain capacity rise.
- .sf/REQUIREMENTS.md: R091 (triage sidecar) + R092 (PDD-completeness
as routing signal) + R093 (pin model per orchestration agent.yaml) +
R094 (swarm-role model tier specialization — 8 roles already exist
in uok/swarm-roles.js; model field per role missing).
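A small sketch of the R091 interval gate, assuming the env values are parsed as positive integers with the stated defaults. `readPositiveInt`, `shouldRunTriage`, and `triageBatchSize` are illustrative names, not the code in src/headless.ts.

```javascript
// Parse a positive integer from an env-style string, else use the fallback.
function readPositiveInt(value, fallback) {
  const n = Number.parseInt(value ?? "", 10);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

// Run triage only when the interval has elapsed since the last run.
function shouldRunTriage(lastTriageMs, nowMs, env = {}) {
  const intervalMs = readPositiveInt(env.SF_TRIAGE_INTERVAL_MS, 30 * 60 * 1000);
  return lastTriageMs === null || nowMs - lastTriageMs >= intervalMs;
}

function triageBatchSize(env = {}) {
  return readPositiveInt(env.SF_TRIAGE_MAX, 25);
}

console.log(shouldRunTriage(0, 29 * 60 * 1000)); // false
console.log(shouldRunTriage(0, 30 * 60 * 1000)); // true
```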
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two guards added after today's 2-hour crash-loop on missing
DEFAULT_STALE_TIMEOUT_MS export:
1. Pre-flight smoke test: `sf --version` must succeed before each
cycle. If dist is broken (missing export, syntax error), pause
5min + log loudly instead of immediately respawning into the same
crash.
2. Crash-loop detection: 3 consecutive <90s failure exits → assume
crash-loop, back off 5min before retry. Prevents the
"100 crashes in 2 hours, 0 useful work" pattern we just hit.
Together: a broken dist causes ONE crash + a 5min pause, not a
2-hour CPU burn. Operator notices the pause in .sf/watchdog.log
and intervenes; in the meantime no resources wasted.
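The crash-loop rule in guard 2 can be sketched as follows. The watchdog itself is a bash script, so this JavaScript version is only an illustration; `crashLoopBackoffMs` and the exit-record shape are assumptions.

```javascript
const FAST_EXIT_MS = 90_000; // exits faster than this count as crash-like
const LOOP_THRESHOLD = 3; // consecutive fast failures that signal a loop
const BACKOFF_MS = 5 * 60_000; // pause before the next attempt

// exits: oldest-first array of { code, durationMs } for finished cycles.
function crashLoopBackoffMs(exits) {
  const recent = exits.slice(-LOOP_THRESHOLD);
  const looping =
    recent.length === LOOP_THRESHOLD &&
    recent.every((e) => e.code !== 0 && e.durationMs < FAST_EXIT_MS);
  return looping ? BACKOFF_MS : 0;
}

const fastFail = { code: 1, durationMs: 5_000 };
console.log(crashLoopBackoffMs([fastFail, fastFail, fastFail])); // 300000
console.log(crashLoopBackoffMs([fastFail, fastFail])); // 0
```

Only the most recent three exits are inspected, so one long or successful cycle resets the streak.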
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that M010/S01+S02+S03 ship the inline-dispatch path (runUnitInline +
DispatchLayer + autoLoop wiring), the watchdog enables it on every
cycle so the autonomous loop actually exercises the inline scope for
INLINE_ELIGIBLE_UNITS (validate-milestone, complete-milestone,
reassess-roadmap). Other unit types continue to use the swarm path
unchanged.
This dogfoods M010/S03 in every watchdog cycle. If the inline path
regresses, the autonomous solver will surface it via self-feedback
(R015 spawn-failure loud-failure + agent-runner instrumentation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
REQUIREMENTS.md: traceability table now has 48/48 R-entries mapped to
owning milestone slices (was 40/48; R041-R048 were unmapped). M031 owns R041-R044
(R-to-Milestone bootstrap with deep research), M032 owns R045 (R-auto-
expansion), M033 owns R046 (autonomous loop parallel dispatch), M034
owns R047 (per-R fulfillment validation), M035 owns R048 (unbroken
purpose chain).
scripts/sf-autonomous-watchdog.sh: also clears .sf/runtime/autonomous-
solver/active.json on cycle restart. Without this, a unit in
status:running from a crashed prior run made the autoLoop spin in
halt-watchdog-break (witnessed in this session: iteration 239+ in 8min
without unit progress).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/sf-autonomous-watchdog.sh — bash daemon that supervises
`sf headless autonomous` across crashes/timeouts. Per-cycle:
1. Cleans stale state (lock + zombie inline-fix dispatch)
2. Kills orphan sf processes from prior runs
3. Launches sf with 30-min hard timeout (longest sf accepts cleanly)
4. On exit (timeout / dispatch-stop / crash), logs and restarts after
15s cooldown (10min cooldown if all milestones complete)
Run: nohup bash scripts/sf-autonomous-watchdog.sh > .sf/watchdog.log 2>&1 &
Stop: pkill -f sf-autonomous-watchdog
This is the operational mode for the 2-4 week delivery horizon — SF
runs continuously, the watchdog catches all exit conditions, and
progress accumulates across many autonomous cycles.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The check-test-imports drift guard was emitting too many false positives
to be safely integrated into npm run lint (per CLAUDE.md: "NOT integrated
into npm run lint by default — too broad"). Two big classes of FP:
1) TypeScript keywords + utility types treated as undeclared (any, type,
ReturnType, Partial, Record, never, unknown, etc.) — added to the
JS_KEYWORDS set since the script doesn't otherwise distinguish JS
from TS.
2) Identifiers declared locally in the file (function declarations,
const/let/var declarations, destructured patterns, function/arrow
parameters, catch params, class names, type/interface/enum names) —
added a new collectLocalDeclarations() pass that regex-scans these
patterns and feeds the results into the filter chain.
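A simplified sketch of a regex-based declaration scan in the spirit of point 2. It covers only a subset (no destructuring or parameters), and `collectSimpleDeclarations` is a hypothetical name rather than the script's collectLocalDeclarations().

```javascript
// Collect names introduced by a few common declaration forms.
// Regex-only, so it can be fooled by strings and comments.
function collectSimpleDeclarations(source) {
  const names = new Set();
  const patterns = [
    /\bfunction\s*\*?\s*([A-Za-z_$][\w$]*)/g, // function declarations
    /\b(?:const|let|var)\s+([A-Za-z_$][\w$]*)/g, // simple bindings
    /\b(?:class|interface|enum|type)\s+([A-Za-z_$][\w$]*)/g, // class and TS names
    /\bcatch\s*\(\s*([A-Za-z_$][\w$]*)/g, // catch parameters
  ];
  for (const re of patterns) {
    for (const m of source.matchAll(re)) names.add(m[1]);
  }
  return names;
}

const testFile = [
  "let loader;",
  "function makeMockTUI() {}",
  "type Opts = {};",
  "try {} catch (err) {}",
].join("\n");
console.log([...collectSimpleDeclarations(testFile)].sort());
// [ 'Opts', 'err', 'loader', 'makeMockTUI' ]
```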
After this patch the script no longer flags makeMockTUI / loader / tui
(local lets), `ReturnType<...>` (TS utility), or `any` (TS keyword) on
the canonical TUI test files. It still flags type-only imports
(`import type { Foo }` lines) and object-literal property names
(`{ recursive: true }`) — those remain as known FP classes documented
in the file's header for a future TS-parser-based pass.
Self-test 5/5 passes. Not yet integrating into npm run lint pending
further FP reduction; see filed self-feedback for the broader
integration plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The phaseWatchdog at 10s fired "STUCK phase=session.prompt" on every
healthy LLM call longer than 10 seconds. Verified via strace on the
running dogfood sf: bytes were actively flowing on the TLS socket
(fd 29) to the LLM provider while STUCK was being logged. The
session.prompt was never actually stuck; the watchdog was
diagnostic-only and oblivious to stream activity.
The noOutputTimeoutMs watchdog (set to 60s for triage in commit
d80060fec) is the actual kill mechanism. It is already event-aware:
every meaningful subagent event resets the timer via armNoOutputTimer
+ isMeaningfulSubagentOutputEvent. The 10s STUCK warning was added
in commit 67e5ac9db as investigation infrastructure for the
sf-mp8e02m1-zpk903 family of bugs, but now it is just noise that
makes legitimate 30-200s LLM responses look broken.
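The difference between a fixed phase timer and an event-aware one can be sketched with an injected clock. `NoOutputWatch` is a hypothetical class and not the armNoOutputTimer implementation.

```javascript
// Hypothetical event-aware inactivity tracker with an injected clock.
class NoOutputWatch {
  constructor(timeoutMs, now) {
    this.timeoutMs = timeoutMs;
    this.lastActivityMs = now;
  }
  // Call for every meaningful event; silence is measured from the last one.
  noteEvent(now) {
    this.lastActivityMs = now;
  }
  isExpired(now) {
    return now - this.lastActivityMs >= this.timeoutMs;
  }
}

const watch = new NoOutputWatch(60_000, 0);
watch.noteEvent(50_000); // a streamed event arrives at t=50s
console.log(watch.isExpired(100_000)); // false: only 50s of silence
console.log(watch.isExpired(110_000)); // true: 60s without any event
```

A fixed 10s timer armed at t=0 would have fired long before either check, even though output was still arriving.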
Keeps the 10s STUCK watchdog for the three setup phases
(resourceLoader.reload, createAgentSession, bindExtensions) where
10s of silence is a real hang signal; those phases normally complete in
under a second.
Also includes:
- biome.json: bump $schema URL from 2.4.14 to 2.4.15 to match the
current biome CLI (clears the deserialize warning)
- scripts/check-test-imports.{,test.}mjs: format + drop a useless
regex escape that biome flagged in landed code
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AC1: Document convention in CLAUDE.md — test files over-importing (>5)
from a SF module should use namespace imports to avoid the anti-pattern
where a new describe() block uses an undeclared function (ReferenceError
at vitest run-time, not caught by biome lint).
AC3/AC4: add check-test-imports.mjs — static analysis script that scans
all *.test.{js,mjs,ts} files for itemized imports (≥6) + camelCase
identifier not in the import list. Exposes the failure mode at lint time.
Includes regression test (check-test-imports.test.mjs, 5/5 passing).
Closes sf-mp8ujgry-aoqcx0.
Pure formatting / lint-fix pass that ran during `npm run build:core`
in the session that landed the agent-runner / quota / coverage /
phase-2 routing work. No logic changes — indentation, trailing
commas, import sort, etc. Captured separately so the actual feature
commits stay scoped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codifies AC4 of sf-mp4w2dij-xm6cwj: the regex-only path is today's
default fast mode. SF_SECURITY_FAST=1 is the explicit opt-in for
callers that want to assert "regex-only, no LLM escalation, sub-100ms"
regardless of any future tiered reviewer landing in the script.
Today the env var changes only the trailing status line so operators
can verify the contract is observable. When the LLM-backed review hook
(AC1) lands, the absence of SF_SECURITY_FAST becomes the trigger for
escalation; setting it to 1 keeps offline / pre-commit callers on the
fast path. Locked in by tests in both the .sh and .mjs scanners.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rf-01: add ECONNREFUSED to isTransientNetworkError in anthropic-shared.ts,
aligning with the NETWORK_RE pattern in error-classifier.js
rf-02: add scripts/validate-model-cost-table.mjs to report coverage gaps
and price divergence between model-cost-table.js and models.generated.ts;
add 'validate-cost-table' script to package.json
rf-11: extract 10 pure resource-display utility functions from
interactive-mode.ts into packages/coding-agent/src/modes/interactive/
resource-display.ts, reducing interactive-mode.ts by ~282 lines
All 4375 tests pass.
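For rf-01, a code-based transient check might look like the sketch below. Only ECONNREFUSED comes from the change itself; the other codes, the `cause` lookup, and the name `isTransientNetworkErrorSketch` are assumptions, not the contents of anthropic-shared.ts.

```javascript
// Codes other than ECONNREFUSED are illustrative assumptions.
const TRANSIENT_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "ECONNREFUSED"]);

// Treat an error as transient when it (or its cause) carries a known code.
function isTransientNetworkErrorSketch(err) {
  const code = err?.code ?? err?.cause?.code;
  return typeof code === "string" && TRANSIENT_CODES.has(code);
}

const refused = Object.assign(new Error("connect failed"), { code: "ECONNREFUSED" });
console.log(isTransientNetworkErrorSketch(refused)); // true
console.log(isTransientNetworkErrorSketch(new Error("bad request"))); // false
```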
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fold sf-usage-bar, sf-notify, sf-inturn-guard, sf-permissions,
slash-commands into sf extension (ui/, notifications/, guards/,
permissions/, commands/legacy/)
- Delete vectordrive extension
- Migrate uok/kernel.js to TypeScript (kernel.ts) with full interfaces
- Add allowJs/checkJs:false to tsconfig.resources.json for incremental TS migration
- Add symlink dedup to extension-discovery.ts (seenRealPaths Set)
- Add before_provider_request delegate back to native-search.js so
session budget tests exercise the middleware end-to-end
- Fix parseSfNativeTools() to return all SF manifest tools (drop sf_ filter)
- Fix test assertions: plan_milestone/complete_task/validate_milestone
- Remove subagent from app-smoke.test.ts (folded into sf/subagent/)
- Remove sf-permissions/sf-inturn-guard/subagent from features-inventory test
- Fix resolveSearchProvider autonomous mode test to pass 'auto' explicitly
- Remove legacy /clear slash command (conflicts with built-in clear_terminal)
- Update web-command-parity-contract.test.ts for clear removal
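The symlink dedup item can be sketched with a seenRealPaths Set and an injected resolver. `dedupeByRealPath` is a hypothetical helper; in real use the resolver would typically be `fs.realpathSync`, and the paths below are made up.

```javascript
// Keep the first path for each real (symlink-resolved) location.
function dedupeByRealPath(paths, realpath) {
  const seenRealPaths = new Set();
  const unique = [];
  for (const p of paths) {
    const real = realpath(p);
    if (seenRealPaths.has(real)) continue; // same directory via another link
    seenRealPaths.add(real);
    unique.push(p);
  }
  return unique;
}

// Fake resolver standing in for fs.realpathSync.
const links = { "/ext/sf-link": "/ext/sf" };
const resolve = (p) => links[p] ?? p;
console.log(dedupeByRealPath(["/ext/sf", "/ext/sf-link", "/ext/other"], resolve));
// [ '/ext/sf', '/ext/other' ]
```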
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove stale .sf/milestones/M001/ and M002/ (not in DB, were blocking dispatch)
- dispatch-guard.js: import findMilestoneIds from milestone-ids.js directly (not
via guided-flow.js, which is in the circular-dep cluster)
- auto.js: normalize 'Cannot dispatch' → prior-slice-blocker, 'SF resources updated'
→ resources-stale, 'Stuck:' → stuck in telemetry (was silently bucketing as 'other')
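The telemetry normalization can be sketched as a small rule table. The three mappings come from the change above; substring matching and the name `normalizeStopReason` are assumptions.

```javascript
// Message fragment → telemetry bucket, checked in order.
const STOP_REASON_RULES = [
  ["Cannot dispatch", "prior-slice-blocker"],
  ["SF resources updated", "resources-stale"],
  ["Stuck:", "stuck"],
];

function normalizeStopReason(message) {
  for (const [needle, bucket] of STOP_REASON_RULES) {
    if (message.includes(needle)) return bucket;
  }
  return "other"; // anything unrecognized still lands here
}

console.log(normalizeStopReason("Stuck: no progress")); // stuck
console.log(normalizeStopReason("unexpected exit")); // other
```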
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add scripts/check-circular-deps.mjs using madge; npm run check:circular
and check:circular:ext scan src/ and the SF extension respectively
- findMilestoneIds() is now DB-first: reads from milestones table when DB is
open so stale/duplicate filesystem dirs (M001/ and M001-6377a4/) are never
returned; falls back to fs scan only during early bootstrap
- milestone-id-utils.js was a stale duplicate; replaced with re-exports from
canonical milestone-ids.js
- metrics-central.js: guard null/undefined counter/gauge/histogram values
with ?? 0 to prevent NOT NULL constraint failure on metrics.value
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dead code removed:
- ops.js: second 'rate' handler block (lines 248-256) — unreachable because
the top-level import block at line 187 fires first and returns true
- autonomous.js: 'stop' handler (trimmed === 'stop') — /stop is in
BASE_RUNTIME_COMMANDS, platform intercepts it before SF extension sees it
- core.js: 'session-rename' handler block — /rename is the canonical command;
alias added zero value and created confusion
Catalog duplicates fixed:
- 'plan' appeared twice (line 85 + 248) with contradictory descriptions;
merged into single entry describing both phase-trigger and artifact-promotion
- 'steer' appeared twice (line 72 + 167); removed the TUI-panel shortcut
entry (Shift+Tab is a keyboard binding, not a slash command)
Discoverability fix:
- 'recover' was handled in ops.js but absent from catalog and manifest;
added to both with accurate description (reconstruct DB hierarchy from
markdown on disk)
- 'session-rename' removed from catalog and manifest; users use /rename
Check script improvements:
- HIDDEN_OR_ALIAS_SUBCOMMANDS now filters both directions of the catalog
↔ handler consistency check (was only filtering 'handled but missing from
catalog', not 'catalog but no SF handler')
- Added 'stop' to HIDDEN_OR_ALIAS_SUBCOMMANDS with comment explaining it is
platform-intercepted; removed 'recover' (now properly in catalog)
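A sketch of a two-directional consistency check that applies the hidden/alias filter on both sides. `checkCatalogConsistency` and the sample sets are illustrative and not taken from the check script.

```javascript
// Compare catalog entries with handled subcommands, ignoring hidden/alias
// names in BOTH directions.
function checkCatalogConsistency(catalog, handled, hidden) {
  const visible = (name) => !hidden.has(name);
  return {
    handledNotInCatalog: [...handled].filter((n) => visible(n) && !catalog.has(n)),
    catalogWithoutHandler: [...catalog].filter((n) => visible(n) && !handled.has(n)),
  };
}

const catalog = new Set(["plan", "recover", "stop"]);
const handled = new Set(["plan", "recover", "footer-config"]);
const hidden = new Set(["stop", "footer-config"]);
console.log(checkCatalogConsistency(catalog, handled, hidden));
// { handledNotInCatalog: [], catalogWithoutHandler: [] }
```

With an empty hidden set the same inputs would report `footer-config` in one direction and `stop` in the other.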
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- packages/native/tsconfig.json: add types:["node"] so Buffer/process/
__dirname resolve correctly (root tsconfig has no lib/types for node)
- scripts/check-sf-extension-inventory.mjs: add footer-config, undo-turn,
review-code to HIDDEN_OR_ALIAS_SUBCOMMANDS (they are aliases for statusline,
rewind, rubber-duck)
- src/resources/extensions/sf/commands/catalog.js: add session-rename entry
(real command handled in core.js, was missing from TOP_LEVEL_SUBCOMMANDS)
- src/resources/extensions/sf/extension-manifest.json: add 19 commands that
exist in catalog but were absent from provides.commands
- src/resources/extensions/sf/guided-flow.js: remove showSmartEntry compat alias
(no live imports — only a comment reference in headless-context.ts)
- src/resources/extensions/sf/graph.js: remove graphFromDefinition compat alias
build:core now passes end-to-end.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>