Commit graph

167 commits

Author SHA1 Message Date
Mikael Hugo
0acb0f9be0 feat: harden sf server build and routing
Some checks failed
sf self-deploy / deploy test and probe (push) Blocked by required conditions
sf self-deploy / promote prod (push) Blocked by required conditions
sf self-deploy / build, test, and publish server image (push) Has been cancelled
2026-05-18 02:33:28 +02:00
Mikael Hugo
743af0e28b remove: vega docker / source-server self-upgrade path
Now superseded by k3s self-deploy: build → push → kubectl set image
performs rolling rollout, so the in-band docker-compose-on-vega upgrade
path (docker:vega:* scripts, /api/server-upgrade route, Dockerfile.source-server,
docker-compose.vega.yaml, projects-view "Upgrade Server" button) is dead
code.

The k3s deploy workflow (.forgejo/workflows/self-deploy.yml) and sf-server
kustomization under /srv/infra/clusters/default/tenants/hugo/apps/sf-server/
are the only deploy path going forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 01:03:58 +02:00
Mikael Hugo
06b1fefd35 fix(circular): break coding-agent core mega-cycle + skip function-body imports
Some checks are pending
sf self-deploy / build, test, and publish server image (push) Waiting to run
sf self-deploy / deploy test and probe (push) Blocked by required conditions
sf self-deploy / promote prod (push) Blocked by required conditions
Cycle 2 (the 13-node coding-agent mega) closed via two changes:

1. scripts/check-circular-deps.mjs — track function-body depth and
   skip require()/import() calls inside function bodies. They run on
   call, not at module evaluation, and therefore cannot cause
   module-graph cycles — same reasoning as the existing dynamic
   `await import()` skip. Generic improvement; benefits any pattern
   that uses lazy CommonJS require() to break a static cycle.

2. packages/coding-agent/src/core/extensions/loader.ts — removed the
   static `import * as _bundledCodingAgent from "../../index.js"`
   self-reference, which was the cycle-closer. It only populated
   STATIC_BUNDLED_MODULES for the Bun virtualModules path
   (`isBunBinary` branch in getJitiOptions), and SF is Node-26-only
   per operator policy (no Bun) — so the Bun branch is dead at
   runtime and dropping the static self-reference is safe. The two
   map entries that referenced it (@singularity-forge/coding-agent
   and the @mariozechner alias) are commented out at the same site
   with a pointer to the top-of-file note.

Net effect across the full session:
  start of session:      9 cycles
  walker false-positive
    cleanups landed:     dropped 6 type-only + dynamic-import false
                         positives
  tui ↔ overlay-layout:  CURSOR_MARKER moved to overlay-types.ts
  SF autonomous-rollback
    chain (3 targeted
    cuts):               experimental → preferences-serializer,
                         classifier → lazy rollback import,
                         preferences-models → runaway-defaults.js
  this commit:           coding-agent loader self-reference dropped

Final state:  zero circular dependencies in 1193 scanned files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 00:42:09 +02:00
Mikael Hugo
4be963fdd1 build: ignore type-only circular edges 2026-05-18 00:26:19 +02:00
Mikael Hugo
c3b17114f3 build: keep playwright out of sf-server image 2026-05-18 00:19:19 +02:00
Mikael Hugo
ead081bfde build: use native circular dependency checker 2026-05-18 00:13:31 +02:00
Mikael Hugo
7c4f204736 fix(build): skip sf inventory git scan outside worktree
Some checks are pending
sf self-deploy / build, test, and publish server image (push) Waiting to run
sf self-deploy / upgrade vega source server (push) Blocked by required conditions
sf self-deploy / deploy test and probe (push) Blocked by required conditions
sf self-deploy / promote prod (push) Blocked by required conditions
2026-05-17 23:24:45 +02:00
Mikael Hugo
7889cfe074 fix(build): skip versioned json git scan outside worktree
Some checks are pending
sf self-deploy / build, test, and publish server image (push) Waiting to run
sf self-deploy / upgrade vega source server (push) Blocked by required conditions
sf self-deploy / deploy test and probe (push) Blocked by required conditions
sf self-deploy / promote prod (push) Blocked by required conditions
2026-05-17 23:21:45 +02:00
Mikael Hugo
565cd1069a fix(build): skip protected deletion check outside git worktree
Some checks are pending
sf self-deploy / build, test, and publish server image (push) Waiting to run
sf self-deploy / upgrade vega source server (push) Blocked by required conditions
sf self-deploy / deploy test and probe (push) Blocked by required conditions
sf self-deploy / promote prod (push) Blocked by required conditions
2026-05-17 23:18:41 +02:00
Mikael Hugo
6618d6594e fix(deploy): use portable docker stop timeout flag
Some checks are pending
sf self-deploy / build, test, and publish server image (push) Waiting to run
sf self-deploy / deploy test and probe (push) Blocked by required conditions
sf self-deploy / promote prod (push) Blocked by required conditions
2026-05-17 23:00:56 +02:00
Mikael Hugo
8c945550fa feat: operational glue for upgrade-safety chain
Some checks are pending
sf self-deploy / build, test, and publish server image (push) Waiting to run
sf self-deploy / deploy test and probe (push) Blocked by required conditions
sf self-deploy / promote prod (push) Blocked by required conditions
Bundles the working-tree state into one coherent commit covering the
upgrade-safety glue that complements today's earlier landings
(orphan-recovery, sf-db single-connection, drain-timer-not-unref'd,
forceShutdown drain, shutdown-state.ts, instrumentation.ts,
shutdown-signal.js, gate-deadlock-classifier).

Modified:
  docker/Dockerfile.source-server — image build tweaks for the source-
    server variant used by the in-container upgrader.
  docker/docker-compose.vega.yaml — env passthroughs for host-side dirs
    (SF_SOURCE_HOST_ROOT, SF_WORKSPACE_HOST_DIR, SF_WORKSPACES_HOST_DIR,
    SF_HOME_HOST_DIR), docker socket mount, group_add for docker GID,
    and SF_RPC_SHUTDOWN_GRACE_MS=600000 matching the 10-min drain.
  scripts/run-vega-source-server.mjs — substantial rework supporting
    the in-container upgrade flow.
  scripts/upgrade-vega-source-server.mjs — buildEnv() + dockerBuildEnv()
    helpers, probeBind via SF_VEGA_PROBE_HOST, containerExists()
    pre-check before drainContainer, stop timeout now matches the
    10-min RPC grace via SF_VEGA_DRAIN_STOP_TIME (default 610s).
  src/web/project-discovery-service.ts — calls
    recoverProjectRuntimeQueues() on each of the 3 discovery paths
    (root monorepo, per-entry, nested SF projects). Closes the
    cloud-volume mtime-lag window codex flagged.
  web/app/api/ready/route.ts — calls recoverProjectRuntimeQueues() on
    every readiness probe, and now also reads shutdown-state so the
    probe returns 503 while draining.
  web/components/sf/projects-view.tsx — UI wiring for the upgrade
    trigger.
  web/pages/api/projects.ts — backend API addition for the project
    enumeration that feeds projects-view.
  docs/specs/sf-self-deploy.md — docs update for the new flow.
  package.json — script alias.

Added:
  scripts/build-web-host.mjs — new build helper for the standalone web
    host artifact consumed by the upgrade flow.
  src/resources/extensions/sf/tests/auto-shutdown-signal.test.mjs —
    unit test for the cooperative-shutdown signal module (registers /
    requests / snapshot).
  src/web/project-runtime-recovery.ts — thin wrapper around
    recoverOrphanedFeedbackDrains for per-project use from web routes.
  web/app/api/drain/route.ts — explicit drain endpoint for operator-
    triggered queue flush.
  web/app/api/server-upgrade/route.ts — auth-gated endpoint that
    spawns the in-container upgrader via docker socket; passes through
    host-dir env so the upgrader knows real bind-mount paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:57:26 +02:00
Mikael Hugo
d54f18c95f feat(rpc): orphan-recovery + 10-min graceful shutdown for safe container swap
Two related changes to make blue/green upgrades (per scripts/upgrade-vega-
source-server.mjs) safe for in-flight self-feedback writes.

1. Startup orphan recovery (feedback-queue-recovery.ts, extracted module).
   Scans .sf/runtime/ for sf-feedback-queue.jsonl.<pid>(.<sid>)?.draining
   files left by previous processes. For each:
     - if our own session id: leave alone (live drain)
     - if PID is alive: leave alone (foreign drainer)
     - else: rename back to queue (only if no active queue file exists)
   Crash safety: when both an orphan AND an active queue exist, we DEFER
   recovery rather than merge — appending then unlinking would risk
   duplicate replay on crash. The next restart's recovery picks it up
   once the queue is naturally drained. Supports legacy filenames
   (.<pid>.draining, pre-session-id) for backward compat.

   Added SF_DRAIN_SESSION_ID (per-process 6-byte hex) stamped into the
   .draining filename. PID reuse across container restarts is normally
   safe because /proc clears, but the session id is a stronger guarantee
   that we don't trample a foreign drainer that happens to land on the
   same PID.

2. SIGTERM/SIGINT drain-then-exit handler (installGracefulShutdown).
   Drains the queue once on signal, then exits. Bounded by
   SF_RPC_SHUTDOWN_GRACE_MS (default 600_000 = 10 min). Rationale: if
   a drain is in flight, it MUST finish — losing self-feedback writes
   across a server upgrade is worse than a long wait. Normal drains
   complete in <1s; the 10-min ceiling is for pathological lock
   contention. Operator overrides via env var, or docker kill /
   kubectl delete --force for hard bypass.

   Upgrader script bumped to docker stop --timeout 610 (10s safety
   margin past the grace). k8s deployments must set
   terminationGracePeriodSeconds≥610 for the rolling-update path.

Tests: rpc-mode-orphan-recovery.test.ts — 7 cases covering empty,
no-orphans, dead-PID single recovery, both-files-deferred (codex's
crash-safety fix), live-PID untouched, multiple-dead-PIDs, malformed-
filename ignored.

Refs sf-mpa5kdpu (drainer orphans never recovered), sf-mpa4g46x
(original RPC hang). Codex adversarial-reviewed; the PID-reuse hardening
and crash-safety deferral landed per its feedback. Open follow-ups:
shutdown-aware /api/healthz returning 503 (codex point E), integrate
with existing forceShutdown ordering (codex point C).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:29:24 +02:00
Mikael Hugo
6d8fc62243 fix: use shared sf webserver project config
Some checks are pending
sf self-deploy / build, test, and publish server image (push) Waiting to run
sf self-deploy / deploy test and probe (push) Blocked by required conditions
sf self-deploy / promote prod (push) Blocked by required conditions
2026-05-17 22:09:28 +02:00
Mikael Hugo
c26de39afa feat: add source-mounted sf server self-deploy
Some checks are pending
sf self-deploy / build, test, and publish server image (push) Waiting to run
sf self-deploy / deploy test and probe (push) Blocked by required conditions
sf self-deploy / promote prod (push) Blocked by required conditions
2026-05-17 22:00:01 +02:00
Mikael Hugo
70d89eebec feat(dev-server): auto-reload on SF extension + coding-agent + git upgrades
Before: dev-server watched packages/daemon/src + dev scripts + package.json.
SF extension source edits in src/resources/extensions/sf/ AND coding-agent
edits in packages/coding-agent/src/ did NOT trigger restart. Operators had to
restart manually after copy-resources / git pull / coding-agent edits.

Adds three watched paths:

1. packages/coding-agent/src — rpc-mode hosts sf_feedback / start_autonomous
   handlers, lives here. Edits must restart the sf child.

2. dist/resources/.sf-resource-build-stamp — atomic stamp updated by
   copy-resources. Watching the stamp (not the dist tree) avoids heavy
   recursive walk while picking up extension upgrades the moment they land.
   Idempotent: ensure-source-resources only updates the stamp when an actual
   rebuild ran, so no restart-loop on identical re-runs.

3. .git/HEAD — changes on pull / branch switch / commit. Catches upgrade
   flows where source moved outside this process.

Native (packages/native/) intentionally not watched — Rust build is 5–10 min,
auto-trigger would loop. Operator triggers native rebuild manually per the
existing ensure-source-resources policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 21:03:49 +02:00
Mikael Hugo
3adcb833ed refactor(sf): separate daemon from server identity 2026-05-17 19:18:33 +02:00
Mikael Hugo
d03758d803 feat: replace launchd with systemd user-unit install path
Operator-direction 2026-05-17 "we will never use mac" — no compat
preservation. Single-cutover replacement.

- new packages/daemon/src/systemd.ts: install/uninstall/status using
  systemctl --user + ~/.config/systemd/user/sf-server.service
- new packages/daemon/src/systemd.test.ts: ports launchd tests, same
  shape, mocked systemctl via RunCommandFn injection + SF_SYSTEMD_USER_DIR
  env override for real filesystem tests
- cli-main.ts: switch import + update help text + status messages
- index.ts: re-export systemd module (installSystemdUnit, uninstallSystemdUnit,
  systemdUnitStatus, generateUnit, getServicePath, SystemdStatus, SystemdUnitOptions)
- DELETED: launchd.ts (253 LOC), launchd.test.ts (379 LOC)
- docs/dev/drafts/M053-per-repo-supervisor.md: remove "launchd" mention
- CHANGELOG.md: document systemd-only install path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 17:33:34 +02:00
Mikael Hugo
57fef5979d feat: make sf server the operator entrypoint 2026-05-17 17:23:46 +02:00
Mikael Hugo
c046ff9a6c fix: auto-rebuild workspace packages when src is newer than dist
Without this, edits to packages/coding-agent/src/* (or any other workspace
package src) silently land while the dist stays stale — agents continue
loading the old compiled JS and operators see "why didn't my edit take
effect?" symptoms. Observed 2026-05-17 wiring in the AST tools: vitest
(reading TS source) passed; runtime smoke test against dist failed because
no auto-rebuild fired.

Extends ensure-source-resources.cjs (which sf-from-source runs on every
launch) to also check workspace packages: agent-core, ai, coding-agent,
daemon, google-gemini-cli-provider, openai-codex-provider, rpc-client, tui.
For each, compare latest src mtime vs latest dist mtime (with a 100ms grace
window). If src is newer, run `npm run build -w @singularity-forge/<pkg>`.

Excludes:
  - packages/native (Rust build is 5–10 min; trigger manually via
    `node rust-engine/scripts/build.js --dev`).
  - Any package in SF_SKIP_WORKSPACE_AUTOBUILD (comma-separated).
  - Whole step disabled by SF_SKIP_WORKSPACE_AUTOBUILD=all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 17:19:25 +02:00
Mikael Hugo
19b10eb67c feat: make sf-server own swarm registry sync 2026-05-17 17:05:16 +02:00
Mikael Hugo
eeb80bbbdd fix: register 6 detector gates + add adversarial-finding kind + watchdog log rotation
Three concrete fixes from open self-feedback assessment 2026-05-17:

- uok/gate-registry-bootstrap.js: register all 6 R081 detector gates
  (same-unit-loop, zero-progress, repeated-feedback-kind, artifact-flap,
  stale-lock, periodic-detector-sweep) alongside drift-detection and
  iter-completion-reconciler. Closes the gap reported by
  sf-mp9udspu-fsf7si — bootstrap previously registered 2 of 8 gates.

- self-feedback.js ALLOWED_KIND_DOMAINS: add `adversarial-finding`.
  Closes gap reported by sf-mp9u4i25-fczmcj — R075 (autonomous
  adversarial review) challenge unit had no kind to file findings under.

- sf-autonomous-watchdog.sh: delete watchdog-run-*.log files older than
  60 minutes at each cycle start. Without rotation .sf/ grew to 1.9 GB
  in 24h (today's snapshot). 60 min retention captures last cycle for
  post-incident triage; older state is already in DB + iterations.jsonl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 16:08:05 +02:00
Mikael Hugo
077fd0a2a7 remove A2A; swarm enrollment + status projection + web swarms view; headless refactor
- A2A removal per M054/R071 cancellation 2026-05-17 (-2294 lines):
  - docs/plans/A2A_ADOPTION_PLAN.md, MISSION-A2A-ADOPTION.md deleted
  - src/resources/extensions/sf/uok/a2a-agent-server.js,
    a2a-transport.js deleted
  - tests/a2a-auth.test.mjs deleted
  - swarm-dispatch.js purged of A2A-conditional code paths
- New: scripts/sf-swarm-enroll.mjs + test (operator-facing swarm
  enrollment, replaces former A2A pairing flow)
- New: src/status-projection.ts + test, web/lib/swarm-status.ts +
  test, web/components/sf/swarms-view.tsx, web/app/api/swarms/
  (web swarms-view surface — direct visibility into running swarm
  state without requiring TUI; aligns with project_tui_deprecating)
- headless-{answers,query,ui,headless}.ts: coordinated tweaks
  consistent with the headless-as-default direction (R124 proposal)
- docs/dev/drafts/M053-per-repo-supervisor.md: design refinement
- .sf/REQUIREMENTS.md: small text fixes (6/6 churn)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 16:04:06 +02:00
Mikael Hugo
d5664f7142 meta-supervisor (node daemon) + R091 triage gate + R091-R094 spec
- scripts/sf-meta-supervisor.mjs: pure-node daemon supervising
  scripts/sf-autonomous-watchdog.sh. Tick=60s, restarts watchdog if dead,
  emits .sf/meta-status.json, halt via .sf/meta-supervisor.halt. Uses
  only node builtins (no SF dist deps) so it survives dist breakage.
- src/headless.ts: R091 — gate the per-cycle handleTriage call on a time
  interval (SF_TRIAGE_INTERVAL_MS, default 30 min) and bump batch size
  (SF_TRIAGE_MAX, default 25, was 5). Drops the ~8min triage hit from
  every cycle while letting daily drain capacity rise.
- .sf/REQUIREMENTS.md: R091 (triage sidecar) + R092 (PDD-completeness
  as routing signal) + R093 (pin model per orchestration agent.yaml) +
  R094 (swarm-role model tier specialization — 8 roles already exist
  in uok/swarm-roles.js; model field per role missing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 14:08:30 +02:00
Mikael Hugo
8122a2b6c7 fix(watchdog): pre-flight smoke + crash-loop backoff
Two guards added after today's 2-hour crash-loop on missing
DEFAULT_STALE_TIMEOUT_MS export:

1. Pre-flight smoke test: \`sf --version\` must succeed before each
   cycle. If dist is broken (missing export, syntax error), pause
   5min + log loudly instead of immediately respawning into the same
   crash.

2. Crash-loop detection: 3 consecutive <90s failure exits → assume
   crash-loop, back off 5min before retry. Prevents the
   "100 crashes in 2 hours, 0 useful work" pattern we just hit.

Together: a broken dist causes ONE crash + a 5min pause, not a
2-hour CPU burn. Operator notices the pause in .sf/watchdog.log
and intervenes; in the meantime no resources wasted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 08:31:07 +02:00
Mikael Hugo
80ede48f06 sf snapshot: uncommitted changes after 246m inactivity 2026-05-17 08:28:04 +02:00
Mikael Hugo
460db52504 chore(watchdog): enable SF_INLINE_DISPATCH=1 by default
Now that M010/S01+S02+S03 ship the inline-dispatch path (runUnitInline +
DispatchLayer + autoLoop wiring), the watchdog enables it on every
cycle so the autonomous loop actually exercises the inline scope for
INLINE_ELIGIBLE_UNITS (validate-milestone, complete-milestone,
reassess-roadmap). Other unit types continue to use the swarm path
unchanged.

This dogfoods M010/S03 in every watchdog cycle. If the inline path
regresses, the autonomous solver will surface it via self-feedback
(R015 spawn-failure loud-failure + agent-runner instrumentation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 03:54:47 +02:00
Mikael Hugo
1b3e09d527 chore: map R041-R048 to M031-M035 + watchdog clears active.json
REQUIREMENTS.md: traceability table now has 48/48 R-entries mapped to
owning milestone slices (was 40/48 unmapped). M031 owns R041-R044
(R-to-Milestone bootstrap with deep research), M032 owns R045 (R-auto-
expansion), M033 owns R046 (autonomous loop parallel dispatch), M034
owns R047 (per-R fulfillment validation), M035 owns R048 (unbroken
purpose chain).

scripts/sf-autonomous-watchdog.sh: also clears .sf/runtime/autonomous-
solver/active.json on cycle restart. Without this, a unit in
status:running from a crashed prior run made the autoLoop spin in
halt-watchdog-break (witnessed in this session: iteration 239+ in 8min
without unit progress).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 03:43:23 +02:00
Mikael Hugo
73a464f574 feat(ops): SF autonomous watchdog for continuous unattended dispatch
scripts/sf-autonomous-watchdog.sh — bash daemon that supervises
`sf headless autonomous` across crashes/timeouts. Per-cycle:
  1. Cleans stale state (lock + zombie inline-fix dispatch)
  2. Kills orphan sf processes from prior runs
  3. Launches sf with 30-min hard timeout (longest sf accepts cleanly)
  4. On exit (timeout / dispatch-stop / crash), logs and restarts after
     15s cooldown (10min cooldown if all milestones complete)

Run: nohup bash scripts/sf-autonomous-watchdog.sh > .sf/watchdog.log 2>&1 &
Stop: pkill -f sf-autonomous-watchdog

This is the operational mode for the 2-4 week delivery horizon — SF
runs continuously, the watchdog catches all exit conditions, and
progress accumulates across many autonomous cycles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 03:39:08 +02:00
Mikael Hugo
e8bbb477e6 fix(scripts/check-test-imports): filter TS keywords + local declarations
The check-test-imports drift guard was emitting too many false positives
to be safely integrated into npm run lint (per CLAUDE.md: "NOT integrated
into npm run lint by default — too broad"). Two big classes of FP:

1) TypeScript keywords + utility types treated as undeclared (any, type,
   ReturnType, Partial, Record, never, unknown, etc.) — added to the
   JS_KEYWORDS set since the script doesn't otherwise distinguish JS
   from TS.

2) Identifiers declared locally in the file (function declarations,
   const/let/var declarations, destructured patterns, function/arrow
   parameters, catch params, class names, type/interface/enum names) —
   added a new collectLocalDeclarations() pass that regex-scans these
   patterns and feeds the results into the filter chain.

After this patch the script no longer flags makeMockTUI / loader / tui
(local lets), `ReturnType<...>` (TS utility), or `any` (TS keyword) on
the canonical TUI test files. It still flags type-only imports
(`import type { Foo }` lines) and object-literal property names
(`{ recursive: true }`) — those remain as known FP classes documented
in the file's header for a future TS-parser-based pass.

Self-test 5/5 passes. Not yet integrating into npm run lint pending
further FP reduction; see filed self-feedback for the broader
integration plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:46:10 +02:00
Mikael Hugo
f55d490e1d fix(subagent-runner): drop spurious 10s STUCK warning on session.prompt
The phaseWatchdog at 10s fired "STUCK phase=session.prompt" on every
healthy LLM call longer than 10 seconds. Verified via strace on the
running dogfood sf: bytes were actively flowing on the TLS socket
(fd 29) to the LLM provider while STUCK was being logged — the
session.prompt was never actually stuck, the watchdog was just
diagnostic-only and oblivious to stream activity.

The noOutputTimeoutMs watchdog (set to 60s for triage in commit
d80060fec) is the actual kill mechanism. It is already event-aware:
every meaningful subagent event resets the timer via armNoOutputTimer
+ isMeaningfulSubagentOutputEvent. The 10s STUCK warning was added
in commit 67e5ac9db as investigation infrastructure for the
sf-mp8e02m1-zpk903 family of bugs, but now it is just noise that
makes legitimate 30-200s LLM responses look broken.

Keeps the 10s STUCK watchdog for the three setup phases
(resourceLoader.reload, createAgentSession, bindExtensions) where
10s of silence is a real hang signal — those phases normally run in
sub-second.

Also includes:
- biome.json: bump $schema URL from 2.4.14 to 2.4.15 to match the
  current biome CLI (clears the deserialize warning)
- scripts/check-test-imports.{,test.}mjs: format + drop a useless
  regex escape that biome flagged in landed code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:49:43 +02:00
Mikael Hugo
2eaea85020 fix(cli): add test-import-drift lint guard
AC1: Document convention in CLAUDE.md — test files over-importing (>5)
from a SF module should use namespace imports to avoid the anti-pattern
where a new describe() block uses an undeclared function (ReferenceError
at vitest run-time, not caught by biome lint).

AC3/AC4: add check-test-imports.mjs — static analysis script that scans
all *.test.{js,mjs,ts} files for itemized imports (≥6) + camelCase
identifier not in the import list. Exposes the failure mode at lint time.
Includes regression test (check-test-imports.test.mjs, 5/5 passing).

Closes sf-mp8ujgry-aoqcx0.
2026-05-16 23:26:38 +02:00
Mikael Hugo
365c6bbc3b chore: formatter / linter touch-up (230 files)
Some checks are pending
CI / detect-changes (push) Waiting to run
CI / docs-check (push) Blocked by required conditions
CI / lint (push) Blocked by required conditions
CI / build (push) Blocked by required conditions
CI / integration-tests (push) Blocked by required conditions
CI / windows-portability (push) Blocked by required conditions
CI / rtk-portability (linux, blacksmith-4vcpu-ubuntu-2404) (push) Blocked by required conditions
CI / rtk-portability (macos, macos-15) (push) Blocked by required conditions
CI / rtk-portability (windows, blacksmith-4vcpu-windows-2025) (push) Blocked by required conditions
Pure formatting / lint-fix pass that ran during `npm run build:core`
in the session that landed the agent-runner / quota / coverage /
phase-2 routing work. No logic changes — indentation, trailing
commas, import sort, etc. Captured separately so the actual feature
commits stay scoped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:19:53 +02:00
Mikael Hugo
091168303c fix(auto): abort swarm checkpoint loops 2026-05-15 10:55:37 +02:00
Mikael Hugo
587b5fa31c refactor: narrow sf slash surface 2026-05-14 20:04:53 +02:00
Mikael Hugo
5ce9df2e37 refactor: make bundled agents internal 2026-05-14 19:54:56 +02:00
Mikael Hugo
18aa257ede refactor: rename review gate agent 2026-05-14 19:43:01 +02:00
Mikael Hugo
fa9baf71d5 feat(secret-scan): SF_SECURITY_FAST contract for the regex-only fast path
Codifies AC4 of sf-mp4w2dij-xm6cwj: the regex-only path is the
today-default fast mode. SF_SECURITY_FAST=1 is the explicit opt-in for
callers that want to assert "regex-only, no LLM escalation, sub-100ms"
regardless of any future tiered reviewer landing in the script.

Today the env var changes only the trailing status line so operators
can verify the contract is observable. When the LLM-backed review hook
(AC1) lands, the absence of SF_SECURITY_FAST becomes the trigger for
escalation; setting it=1 keeps offline / pre-commit callers on the
fast path. Locked in by tests in both the .sh and .mjs scanners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:57:02 +02:00
Mikael Hugo
338c75fc6f refactor: complete rf-01/rf-02/rf-11 blocked todos
rf-01: add ECONNREFUSED to isTransientNetworkError in anthropic-shared.ts,
  aligning with the NETWORK_RE pattern in error-classifier.js

rf-02: add scripts/validate-model-cost-table.mjs to report coverage gaps
  and price divergence between model-cost-table.js and models.generated.ts;
  add 'validate-cost-table' script to package.json

rf-11: extract 10 pure resource-display utility functions from
  interactive-mode.ts into packages/coding-agent/src/modes/interactive/
  resource-display.ts, reducing interactive-mode.ts by ~282 lines

All 4375 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-11 16:45:39 +02:00
Mikael Hugo
0b5fa75c0d fix(lint): fix all pre-existing lint failures
- check-sf-extension-inventory.mjs: expand parseDirectRegisteredCommands()
  scan to include 7 more files (guards/inturn.js, notifications/notify.js,
  permissions/index.js, ui/usage-bar.js, commands/legacy/audit.js,
  commands/legacy/create-extension.js, commands/legacy/create-slash-command.js)
  and filter results by BASE_RUNTIME_COMMAND_NAMES to exclude doc-string false
  positives ("name" in create-slash-command.js template text)

- extension-manifest.json: remove 'clear' (subcommand of logs/notifications,
  never a top-level pi.registerCommand)

- packages/pi-agent-core/src/db/sf-db.ts: fix 23 noVoidTypeReturn errors
  - openDatabase: void → boolean (caller uses return value at line 5625)
  - claimEscalationOverride: void → boolean (caller checks at escalation.js:243)
  - resolveSelfFeedbackEntry: void → boolean (caller checks at self-feedback.js:387)
  - copyWorktreeDb: void → boolean (caller checks at reconcileWorktreeDb)
  - compactUokMessages: void → {before,after} (caller returns value at message-bus.js:238)
  - insertSessionTurn: void → bigint|null (caller uses id at session-recorder.js:104)
  - expireStaleMemories: void → number (caller uses count at auto-start.js:1047)
  - deleteMemorySourceRow: void → boolean (caller returns value at memory-source-store.js:107)
  - deleteMemoryEmbedding: void → boolean (caller returns value at memory-embeddings.js:328)
  - updateBacklogItemStatus: remove dead return expression (callers discard value)
  - removeBacklogItem: remove dead return expression (callers discard value)
  - updateGateCircuitBreaker: remove dead return {total,avgMs,...} (wrong-type
    code accidentally merged from getGateLatencyStats, never reachable)
  - markUokMessageRead: remove dead return true/false (callers discard value)

- Auto-fix formatting and organizeImports in ~30 source files (biome --write)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-11 04:02:31 +02:00
Mikael Hugo
adb449d642 fix: consolidate extensions into sf, migrate kernel.ts, fix test suite
- Fold sf-usage-bar, sf-notify, sf-inturn-guard, sf-permissions,
  slash-commands into sf extension (ui/, notifications/, guards/,
  permissions/, commands/legacy/)
- Delete vectordrive extension
- Migrate uok/kernel.js to TypeScript (kernel.ts) with full interfaces
- Add allowJs/checkJs:false to tsconfig.resources.json for incremental TS migration
- Add symlink dedup to extension-discovery.ts (seenRealPaths Set)
- Add before_provider_request delegate back to native-search.js so
  session budget tests exercise the middleware end-to-end
- Fix parseSfNativeTools() to return all SF manifest tools (drop sf_ filter)
- Fix test assertions: plan_milestone/complete_task/validate_milestone
- Remove subagent from app-smoke.test.ts (folded into sf/subagent/)
- Remove sf-permissions/sf-inturn-guard/subagent from features-inventory test
- Fix resolveSearchProvider autonomous mode test to pass 'auto' explicitly
- Remove legacy /clear slash command (conflicts with built-in clear_terminal)
- Update web-command-parity-contract.test.ts for clear removal

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-11 02:40:52 +02:00
Mikael Hugo
24592507c3 sf snapshot: uncommitted changes after 53m inactivity 2026-05-11 01:54:55 +02:00
Mikael Hugo
924383b6f7 sf snapshot: uncommitted changes after 197m inactivity 2026-05-10 15:59:33 +02:00
Mikael Hugo
d447095bd7 build: switch full build pipeline to TypeScript 7 native (tsgo)
Replace tsc with tsgo in all build scripts — 5.6x faster emit.
tsgo has full emit parity for this codebase (NodeNext, ES2022, strict).

- build:core: tsc → tsgo (root tsconfig.json)
- copy-resources.cjs: typescript/bin/tsc → @typescript/native-preview/bin/tsgo.js
- All workspace packages (agent-core, ai, coding-agent, daemon,
  google-gemini-cli-provider, native, rpc-client, tui): tsc → tsgo

Benchmarks (root project):
  tsc --project tsconfig.json: 7.7s
  tsgo --project tsconfig.json: 1.4s  (5.6x faster)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-10 11:58:58 +02:00
Mikael Hugo
02a4339a51 refactor: rename pi-* packages to forge-native names (Phase 1)
Rename all four packages/pi-* directories to forge-native names,
stripping the 'pi' identity and establishing forge's own:

- packages/pi-coding-agent → packages/coding-agent
- packages/pi-ai → packages/ai
- packages/pi-agent-core → packages/agent-core
- packages/pi-tui → packages/tui

Package names updated:
- @singularity-forge/pi-coding-agent → @singularity-forge/coding-agent
- @singularity-forge/pi-ai → @singularity-forge/ai
- @singularity-forge/pi-agent-core → @singularity-forge/agent-core
- @singularity-forge/pi-tui → @singularity-forge/tui

All import references, bare string references, path references,
internal variable names (_bundledPi*), and dist files updated.
@mariozechner/pi-* third-party compat aliases preserved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-10 11:28:01 +02:00
Mikael Hugo
a3f2479a4c fix: remove stale M001/M002 milestone dirs; fix dispatch-guard circular dep; fix telemetry normalization
- Remove stale .sf/milestones/M001/ and M002/ (not in DB, were blocking dispatch)
- dispatch-guard.js: import findMilestoneIds from milestone-ids.js directly (not
  via guided-flow.js, which is in the circular-dep cluster)
- auto.js: normalize 'Cannot dispatch' → prior-slice-blocker, 'SF resources updated'
  → resources-stale, 'Stuck:' → stuck in telemetry (was silently bucketing as 'other')

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-10 02:12:13 +02:00
Mikael Hugo
ea360f6ad2 feat: add circular dep detection tool + fix duplicate milestone dirs + fix metrics NULL
- Add scripts/check-circular-deps.mjs using madge; npm run check:circular
  and check:circular:ext scan src/ and the SF extension respectively
- findMilestoneIds() is now DB-first: reads from milestones table when DB is
  open so stale/duplicate filesystem dirs (M001/ and M001-6377a4/) are never
  returned; falls back to fs scan only during early bootstrap
- milestone-id-utils.js was a stale duplicate; replaced with re-exports from
  canonical milestone-ids.js
- metrics-central.js: guard null/undefined counter/gauge/histogram values
  with ?? 0 to prevent NOT NULL constraint failure on metrics.value

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-10 01:56:08 +02:00
Mikael Hugo
6fb411df90 refactor(commands): eliminate dead handlers and catalog duplicates
Dead code removed:
- ops.js: second 'rate' handler block (lines 248-256) — unreachable because
  the top-level import block at line 187 fires first and returns true
- autonomous.js: 'stop' handler (trimmed === 'stop') — /stop is in
  BASE_RUNTIME_COMMANDS, platform intercepts it before SF extension sees it
- core.js: 'session-rename' handler block — /rename is the canonical command;
  alias added zero value and created confusion

Catalog duplicates fixed:
- 'plan' appeared twice (line 85 + 248) with contradictory descriptions;
  merged into single entry describing both phase-trigger and artifact-promotion
- 'steer' appeared twice (line 72 + 167); removed the TUI-panel shortcut
  entry (Shift+Tab is a keyboard binding, not a slash command)

Discoverability fix:
- 'recover' was handled in ops.js but absent from catalog and manifest;
  added to both with accurate description (reconstruct DB hierarchy from
  markdown on disk)
- 'session-rename' removed from catalog and manifest; users use /rename

Check script improvements:
- HIDDEN_OR_ALIAS_SUBCOMMANDS now filters both directions of the catalog
  ↔ handler consistency check (was only filtering 'handled but missing from
  catalog', not 'catalog but no SF handler')
- Added 'stop' to HIDDEN_OR_ALIAS_SUBCOMMANDS with comment explaining it is
  platform-intercepted; removed 'recover' (now properly in catalog)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-09 16:36:04 +02:00
Mikael Hugo
aca13d1d9b fix(build): fix build:core — native tsconfig types, inventory sync, compat alias catalog
- packages/native/tsconfig.json: add types:["node"] so Buffer/process/
  __dirname resolve correctly (root tsconfig has no lib/types for node)
- scripts/check-sf-extension-inventory.mjs: add footer-config, undo-turn,
  review-code to HIDDEN_OR_ALIAS_SUBCOMMANDS (they are aliases for statusline,
  rewind, rubber-duck)
- src/resources/extensions/sf/commands/catalog.js: add session-rename entry
  (real command handled in core.js, was missing from TOP_LEVEL_SUBCOMMANDS)
- src/resources/extensions/sf/extension-manifest.json: add 19 commands that
  exist in catalog but were absent from provides.commands
- src/resources/extensions/sf/guided-flow.js: remove showSmartEntry compat alias
  (no live imports — only a comment reference in headless-context.ts)
- src/resources/extensions/sf/graph.js: remove graphFromDefinition compat alias

build:core now passes end-to-end.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-09 16:18:11 +02:00
Mikael Hugo
830a259630 chore: delete superseded esbuild test-compile scripts
compile-tests.mjs and dist-test-resolve.mjs were for an older esbuild+node
--test approach. The project now uses Vitest end-to-end. Dead code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-09 16:04:41 +02:00
Mikael Hugo
bd0c612993 refactor(retire): drop JSONL fallback from judgment-log + delete one-shot migration scripts
- judgment-log.js: DB is always available; strip appendFileSync/readFileSync
  JSONL fallback paths and resolveJudgmentLogPath export. Non-fatal on DB
  failure is preserved — agent loop must never be disrupted.
- Delete scripts/migrate-to-vitest{,-all}.mjs and fix-vitest-api.mjs —
  one-shot migration tools that have already run; no longer needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-09 15:55:10 +02:00