Commit graph

4701 commits

Author SHA1 Message Date
Mikael Hugo
f8e53840da fix(rpc, web): integrate drain into forceShutdown + healthz-503 on shutdown
Three fixes addressing codex's adversarial review of the earlier orphan-
recovery / graceful-shutdown landing:

(1) Codex point B — single shutdown path. Removed the parallel
    installGracefulShutdown() handler in rpc-mode.ts that was adding
    a second SIGTERM listener and racing forceShutdown()'s teardown.
    The drain is now the FIRST step inside forceShutdown() (before
    killTrackedDetachedChildren / extension session_shutdown / etc.)
    so DB writes complete cleanly while child processes are still
    alive to flush. Race-free against the existing shutdown ordering.

(2) Codex point D — recovery-before-each-drain. Cloud-volume mtime
    visibility lags between containers can mean an orphan `.draining`
    file from a previous container isn't visible during the startup
    scan but appears moments later. drainQueuedSfFeedbackCommands()
    now runs recoverOrphanedFeedbackDrains() as its first step, so
    each dispatch's drain sees the latest filesystem state.

(3) Codex point E — healthz returns 503 during shutdown. New module
    src/web/shutdown-state.ts holds a per-process flag, auto-registers
    SIGTERM/SIGINT/SIGHUP handlers on first read, and exposes a
    snapshot (signal, startedAt, elapsedMs) for diagnostics. The
    healthz route imports isShuttingDown() and returns 503 when set,
    so k8s readinessProbe / Forgejo blue-green probes drain traffic
    BEFORE we actually stop responding.
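
A minimal sketch of the flag-plus-snapshot shape described in (3), assuming a factory with an injectable clock so it can be exercised without real signals. Only isShuttingDown() and the snapshot fields (signal, startedAt, elapsedMs) come from this message; createShutdownState, markShuttingDown, and healthzStatus are hypothetical names, and the signal-handler registration is left out.

```typescript
// Hypothetical sketch, not the real src/web/shutdown-state.ts.
type ShutdownSignal = "SIGTERM" | "SIGINT" | "SIGHUP";

interface ShutdownSnapshot {
  shuttingDown: boolean;
  signal: ShutdownSignal | null; // null when shutdown was marked manually
  startedAt: number | null; // epoch ms at which shutdown began
  elapsedMs: number | null;
}

function createShutdownState(now: () => number = Date.now) {
  let startedAt: number | null = null;
  let signal: ShutdownSignal | null = null;

  return {
    // Idempotent: only the first call records the signal and start time.
    markShuttingDown(sig: ShutdownSignal | null = null): void {
      if (startedAt !== null) return;
      startedAt = now();
      signal = sig;
    },
    isShuttingDown(): boolean {
      return startedAt !== null;
    },
    snapshot(): ShutdownSnapshot {
      return {
        shuttingDown: startedAt !== null,
        signal,
        startedAt,
        elapsedMs: startedAt === null ? null : now() - startedAt,
      };
    },
  };
}

// A readiness route would map the flag to an HTTP status code.
function healthzStatus(state: { isShuttingDown(): boolean }): 200 | 503 {
  return state.isShuttingDown() ? 503 : 200;
}
```

In this sketch the probe starts failing as soon as the flag is set, which is what lets a readiness probe drain traffic while the process is still able to respond.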

Tests:
  - rpc-mode-orphan-recovery.test.ts: 8/8 still green
  - web-shutdown-state.test.ts: 5/5 new — default false, mark sets
    flag, idempotent, signal exposed via snapshot, null signal for
    manual mark

Deferred to a follow-up commit (codex didn't flag, but noted for
completeness): a SIGTERM-drain child-process integration test that
spawns rpc-mode + sends a real signal. The 5 unit tests cover the
flag logic; the integration test would cover the full process tree
and is bulkier than the current commit warrants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:35:50 +02:00
Mikael Hugo
68178a9260 fix(rpc-test): use .js extension for recovery module import
tsgo rejects `.ts` extensions in imports without allowImportingTsExtensions.
Updated the test to import from "./feedback-queue-recovery.js" which is
both ESM-compatible and matches the rest of the package convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:30:10 +02:00
Mikael Hugo
d54f18c95f feat(rpc): orphan-recovery + 10-min graceful shutdown for safe container swap
Two related changes to make blue/green upgrades (per
scripts/upgrade-vega-source-server.mjs) safe for in-flight self-feedback writes.

1. Startup orphan recovery (feedback-queue-recovery.ts, extracted module).
   Scans .sf/runtime/ for sf-feedback-queue.jsonl.<pid>(.<sid>)?.draining
   files left by previous processes. For each:
     - if our own session id: leave alone (live drain)
     - if PID is alive: leave alone (foreign drainer)
     - else: rename back to queue (only if no active queue file exists)
   Crash safety: when both an orphan AND an active queue exist, we DEFER
   recovery rather than merge — appending then unlinking would risk
   duplicate replay on crash. The next restart's recovery picks it up
   once the queue is naturally drained. Supports legacy filenames
   (.<pid>.draining, pre-session-id) for backward compat.

   Added SF_DRAIN_SESSION_ID (per-process 6-byte hex) stamped into the
   .draining filename. PID reuse across container restarts is normally
   safe because /proc clears, but the session id is a stronger guarantee
   that we don't trample a foreign drainer that happens to land on the
   same PID.

2. SIGTERM/SIGINT drain-then-exit handler (installGracefulShutdown).
   Drains the queue once on signal, then exits. Bounded by
   SF_RPC_SHUTDOWN_GRACE_MS (default 600_000 = 10 min). Rationale: if
   a drain is in flight, it MUST finish — losing self-feedback writes
   across a server upgrade is worse than a long wait. Normal drains
   complete in <1s; the 10-min ceiling is for pathological lock
   contention. Operator overrides via env var, or docker kill /
   kubectl delete --force for hard bypass.

   Upgrader script bumped to docker stop --timeout 610 (10s safety
   margin past the grace). k8s deployments must set
   terminationGracePeriodSeconds≥610 for the rolling-update path.
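
The per-file recovery rules in item 1 can be sketched as a pure decision function. This is an illustration under stated assumptions, not the code in feedback-queue-recovery.ts: classifyDrainingFile, RecoveryContext, and the action names are hypothetical, and the liveness check and active-queue check are injected instead of touching the filesystem.

```typescript
// Hypothetical sketch of the orphan-recovery decision described above.
type RecoveryAction = "ignore" | "leave-own" | "leave-live" | "defer" | "recover";

interface RecoveryContext {
  ownSessionId: string; // this process's drain session id
  isPidAlive: (pid: number) => boolean; // injected liveness probe
  activeQueueExists: boolean; // true when a live queue file is present
}

// Matches sf-feedback-queue.jsonl.<pid>(.<sid>)?.draining, including the
// legacy form without a session id.
const DRAINING_RE = /^sf-feedback-queue\.jsonl\.(\d+)(?:\.([0-9a-f]+))?\.draining$/;

function classifyDrainingFile(fileName: string, ctx: RecoveryContext): RecoveryAction {
  const m = DRAINING_RE.exec(fileName);
  if (m === null) return "ignore"; // malformed filename
  const pid = Number(m[1]);
  const sessionId = m[2]; // undefined for legacy names
  if (sessionId !== undefined && sessionId === ctx.ownSessionId) return "leave-own";
  if (ctx.isPidAlive(pid)) return "leave-live";
  // Both an orphan and an active queue exist: defer rather than merge.
  if (ctx.activeQueueExists) return "defer";
  return "recover"; // the caller would rename the file back to the queue
}
```

Keeping the decision separate from the rename makes each branch testable without creating files or processes.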

Tests: rpc-mode-orphan-recovery.test.ts — 7 cases covering empty,
no-orphans, dead-PID single recovery, both-files-deferred (codex's
crash-safety fix), live-PID untouched, multiple-dead-PIDs, malformed-
filename ignored.

Refs sf-mpa5kdpu (drainer orphans never recovered), sf-mpa4g46x
(original RPC hang). Codex adversarial-reviewed; the PID-reuse hardening
and crash-safety deferral landed per its feedback. Open follow-ups:
shutdown-aware /api/healthz returning 503 (codex point E), integrate
with existing forceShutdown ordering (codex point C).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:29:24 +02:00
Mikael Hugo
6d8fc62243 fix: use shared sf webserver project config
2026-05-17 22:09:28 +02:00
Mikael Hugo
c26de39afa feat: add source-mounted sf server self-deploy
2026-05-17 22:00:01 +02:00
Mikael Hugo
55a498603f fix(rpc): don't unref the sf-feedback drain timer
The drainer was scheduled via setTimeout(0) with timer.unref(). The unref
made the timer release-eligible — fine in a long-running rpc-mode child
where the process has plenty of other event-loop handles, but fatal in
the packaged-standalone path where the rpc subprocess has nothing else
to keep it alive. The process exited before the timer fired, so the
queue file was renamed to .<pid>.draining and then stranded forever.

Removed timer.unref(). The setTimeout(0) still lets the RPC response go
back to the caller first (no synchronous blocking on the drain), but the
timer now keeps the process alive until the drain handler runs, and the
drain's own async I/O keeps it alive until done.
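
As a rough illustration of the scheduling pattern after this fix, the sketch below defers the drain with a zero-delay timer and never calls unref() on it. scheduleDrain, Scheduler, and the onError callback are hypothetical names; the scheduler is injectable only so the behaviour can be checked without waiting on a real timer.

```typescript
// Hypothetical sketch of "defer the drain, but keep the process alive".
type Scheduler = (fn: () => void, delayMs: number) => void;

const defaultScheduler: Scheduler = (fn, delayMs) => {
  // Deliberately no .unref(): a pending timer keeps the event loop alive
  // until the drain callback has run.
  setTimeout(fn, delayMs);
};

function scheduleDrain(
  drain: () => Promise<void>,
  onError: (err: unknown) => void,
  schedule: Scheduler = defaultScheduler,
): void {
  // Zero delay lets the current RPC response go out before the drain starts.
  schedule(() => {
    drain().catch(onError);
  }, 0);
}
```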

Refs sf-mpa6wuhm-wwddd1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 21:55:23 +02:00
Mikael Hugo
cc67970fa0 fix(sf-db): share open-DB state across module instances via globalThis
Two SQLite connections were being opened in the same Node process when
the same module loaded under two graphs:

  - the autonomous-loop side loads sf-db modules via normal ESM resolution
  - src/headless-feedback.ts re-imports them via jiti.createJiti() so the
    in-server `sf headless feedback ...` drain can call them without
    bringing the agent extension into the rpc-mode bundle

Module-level `let currentDb / currentPath / currentPid` etc. lived on
two independent module instances, so each instance opened its own
SQLite handle to .sf/sf.db. WAL mode lets readers share, but two writer
connections in the same process produced SQLITE_BUSY / writer stalls —
the hang we saw on sf-mpa4g46x and the wedged-drainer recurrence after
the server restart at 19:35.

Fix: hoist the connection slot onto globalThis under a well-known
Symbol so every module instance points at the same record. All five
fields formerly module-level become `_sf.<field>` and live in one
shared object.
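
A minimal sketch of the hoisting technique, assuming the shared slot is keyed by a Symbol.for() registry symbol so that independently loaded copies of the module resolve the same key. The key string, getSharedDbState, and the field list are illustrative assumptions; the message names currentDb, currentPath, and currentPid but does not spell out all five fields.

```typescript
// Hypothetical sketch: one connection record shared across module instances.
interface SharedDbState {
  currentDb: unknown; // the open handle, typed loosely for illustration
  currentPath: string | null;
  currentPid: number | null;
}

// Symbol.for() returns the same symbol for the same string in every module
// graph within a process, unlike Symbol(), which is unique per call.
const STATE_KEY = Symbol.for("sf.db.shared-state"); // assumed key name

function getSharedDbState(): SharedDbState {
  const registry = globalThis as unknown as Record<symbol, SharedDbState | undefined>;
  let state = registry[STATE_KEY];
  if (state === undefined) {
    state = { currentDb: null, currentPath: null, currentPid: null };
    registry[STATE_KEY] = state;
  }
  return state;
}
```

A second copy of the module loaded through a different resolver would call Symbol.for() with the same string and therefore read and write the same record.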

Codex's original diagnosis (split module-graph DB-writer contention)
was right; I dismissed it earlier because I missed that
headless-feedback uses jiti even though rpc-mode itself doesn't import
sf-db directly.

Verification:
  - Syntax check: clean
  - sf-db-migration.test.mjs: 12/13 pass. The one failure
    (openDatabase_migrates_v27_tasks_without_created_at_through_spec_backfill
    expects schema version 72, actual 73) is unrelated — a schema
    migration landed elsewhere without bumping that test's expected
    version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 21:47:01 +02:00
Mikael Hugo
a3469f2334 feat(detectors): wire gate-deadlock-classifier into the autonomous loop
Four changes that close the gap between the gate-deadlock-classifier
landed in ab2c99686 and a working detection signal.

(1) Detector wrapper now returns outcome=manual-attention (not fail) when
    a deadlock fires. The whole point of detecting the deadlock is to
    escape it — returning `fail` would add another refusal and compound
    the lockout. Same precedent as periodicDetectorSweepGate.

(2) New auto/gate-refusal-recorder.js — in-process ring buffer (cap 32,
    TTL 30 min) that records UokGate refusals from the dispatcher.
    Storage is intentionally in-memory; refusals are operational signals,
    not durable state.

(3) auto/run-unit.js — calls recordGateRefusal() at the inline-route-refused
    branch, passing the rationale (already includes `[gate-id]` prefix +
    R-id status fragments the detector parses) plus unitType/unitId.

(4) detectors/periodic-runner.js — adds a `gate-deadlock` entry to the
    default detector list, pulling ctx.gateRefusals from the caller OR
    falling back to recentGateRefusals() from the recorder. ctx can also
    override requirementCoverageByMilestone + resolveMilestoneId for tests.
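
The in-memory recorder from (2) might look roughly like the sketch below: a capped buffer that drops the oldest entry past 32 items and hides entries older than 30 minutes on read. recordGateRefusal and recentGateRefusals are the names used in this message; the factory wrapper, the injectable clock, and the exact field set are assumptions.

```typescript
// Hypothetical sketch of a cap-32, TTL-30-min refusal buffer.
interface GateRefusal {
  rationale: string;
  unitType: string;
  unitId: string;
}

interface StampedRefusal extends GateRefusal {
  recordedAt: number; // epoch ms
}

function createGateRefusalRecorder(
  cap: number = 32,
  ttlMs: number = 30 * 60 * 1000,
  now: () => number = Date.now,
) {
  const entries: StampedRefusal[] = [];
  return {
    recordGateRefusal(refusal: GateRefusal): void {
      entries.push({ ...refusal, recordedAt: now() });
      if (entries.length > cap) entries.shift(); // evict the oldest entry
    },
    recentGateRefusals(): GateRefusal[] {
      const cutoff = now() - ttlMs;
      return entries
        .filter((e) => e.recordedAt >= cutoff)
        .map((e) => ({ rationale: e.rationale, unitType: e.unitType, unitId: e.unitId }));
    },
  };
}
```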

After this change, an inline-route refusal flows:

  inlineRuntimeGate.execute → outcome=fail
    → run-unit.js records the refusal in gate-refusal-recorder
    → periodic-runner sweep picks it up via recentGateRefusals()
    → detectGateDeadlock cross-references against milestone coverage
    → if overlap: detectorsFired includes {name:"gate-deadlock", signature}
    → periodicDetectorSweepGate surfaces as manual-attention

Tests: 16 detector + 10 existing periodic-runner = 26/26 pass. The
existing periodic-runner test exercises the default detector list, so
adding the new entry is implicitly validated.

Follow-up still open: have the periodic sweep file a self_feedback entry
when the gate-deadlock detector fires, so the operator and SF's autonomous
triage both see the signal without polling logs. That belongs in the
sweep handler, not the detector — separate commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 21:19:29 +02:00
Mikael Hugo
ab2c996866 feat(detectors): gate-deadlock-classifier — Wiggums detector for R074 self-deadlock
The R074 inlineRuntimeGate refused inline dispatch for M048/S05 reassess-roadmap
because R020 and R066 are still 'active' — but those slices ARE the work that
validates R066. Autonomous mode stopped with no way to escape. Filed earlier as
sf-mpa4f9k1-jm01rc.

This detector classifies the pattern at runtime:

  parseGateRefusal(rationale)
    extracts gateId + refused requirement ids from gate-refusal text
    matching shape "[gate-id] ... R020=active R066=active ..."

  detectGateDeadlock(ctx, options)
    ctx.gateRefusals: recent gate refusal events ({rationale, unitType, unitId})
    ctx.requirementCoverageByMilestone: milestone -> R-ids in its DoD/coverage
    ctx.resolveMilestoneId: optional unit -> milestone resolver
        (default: strip after '/', require M-prefix)
    Returns { stuck, reason: "gate-deadlock", signature: {
      gateId, deadlockedRequirements, refusedUnits, examples, suggestedAction
    }} when any refused unit's milestone coverage overlaps the gate's refused
    requirements. Per-gateId throttle prevents repeat firings within 60s.

  gateDeadlockClassifierGate
    UokGate (type=verification per ADR-0075) wrapping the detector for
    integration into periodicDetectorSweepGate + post-finalize sweeps.
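
To make the contract above concrete, here is a small sketch of the parsing and overlap steps. It assumes refusal text of the shape "[gate-id] ... R020=active R066=active ..." and returns null when either the gate id or the requirement ids are missing; the return shape, the null convention, and the helper names other than parseGateRefusal are assumptions, and the throttle is omitted.

```typescript
// Hypothetical sketch of gate-refusal parsing and coverage overlap.
interface ParsedRefusal {
  gateId: string;
  requirementIds: string[]; // deduplicated, in first-seen order
}

function parseGateRefusal(rationale: string): ParsedRefusal | null {
  const gate = /^\s*\[([^\]]+)\]/.exec(rationale);
  const gateId = gate === null ? undefined : gate[1];
  if (gateId === undefined) return null;

  const ids: string[] = [];
  const re = /\b(R\d+)=active\b/g;
  let m: RegExpExecArray | null = re.exec(rationale);
  while (m !== null) {
    const id = m[1];
    if (id !== undefined && ids.indexOf(id) === -1) ids.push(id);
    m = re.exec(rationale);
  }
  return ids.length === 0 ? null : { gateId, requirementIds: ids };
}

// Default resolver as described: keep the part before '/', require an M prefix.
function resolveMilestoneId(unitId: string): string | null {
  const head = unitId.split("/")[0] ?? "";
  return /^M/.test(head) ? head : null;
}

// Requirements that the refused unit's own milestone is supposed to cover.
function deadlockedRequirements(
  parsed: ParsedRefusal,
  unitId: string,
  coverageByMilestone: Record<string, string[]>,
): string[] {
  const milestone = resolveMilestoneId(unitId);
  if (milestone === null) return [];
  const covered = coverageByMilestone[milestone] ?? [];
  return parsed.requirementIds.filter((id) => covered.indexOf(id) !== -1);
}
```

A non-empty result from deadlockedRequirements corresponds to the overlap condition under which the detector would report stuck.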

Registered in uok/gate-registry-bootstrap.js between inlineRuntimeGate and the
existing detector chain. Also re-exported from detectors/index.js for the
common detector import surface.

Test coverage:
  - parseGateRefusal: 5 cases (inline shape, dedup, missing reqs, missing gate, empty)
  - detectGateDeadlock: 7 cases (empty input, fire-on-overlap, no-overlap,
                                 empty coverage, throttle, custom resolver,
                                 examples cap)
  - UokGate wrapper: 3 cases (contract shape, pass, fail-with-findings)
  - Threshold export sanity: 1 case
  16/16 tests pass.

The wiring from autonomous-loop output (where gate refusals are emitted) into
the detector's gateRefusals input is a follow-up — this commit lands the
detector with a stable contract and tests it can be wired against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 21:15:21 +02:00
Mikael Hugo
acd907fec2 fix: harden sf server control loop
2026-05-17 21:13:12 +02:00
Mikael Hugo
70d89eebec feat(dev-server): auto-reload on SF extension + coding-agent + git upgrades
Before: dev-server watched packages/daemon/src + dev scripts + package.json.
SF extension source edits in src/resources/extensions/sf/ AND coding-agent
edits in packages/coding-agent/src/ did NOT trigger restart. Operators had to
restart manually after copy-resources / git pull / coding-agent edits.

Adds three watched paths:

1. packages/coding-agent/src — rpc-mode hosts sf_feedback / start_autonomous
   handlers, lives here. Edits must restart the sf child.

2. dist/resources/.sf-resource-build-stamp — atomic stamp updated by
   copy-resources. Watching the stamp (not the dist tree) avoids heavy
   recursive walk while picking up extension upgrades the moment they land.
   Idempotent: ensure-source-resources only updates the stamp when an actual
   rebuild ran, so no restart-loop on identical re-runs.

3. .git/HEAD — changes on pull / branch switch / commit. Catches upgrade
   flows where source moved outside this process.

Native (packages/native/) intentionally not watched — Rust build is 5–10 min,
auto-trigger would loop. Operator triggers native rebuild manually per the
existing ensure-source-resources policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 21:03:49 +02:00
Mikael Hugo
1ac2527b36 chore: auto-commit after challenge
SF-Unit: M048/S05/challenge
2026-05-17 20:36:26 +02:00
Mikael Hugo
dd03d17089 chore: auto-commit after challenge
SF-Unit: M048/S04/challenge
2026-05-17 20:33:12 +02:00
Mikael Hugo
d8fd70e57f fix(sf): keep web autonomy on proven routes
2026-05-17 20:24:51 +02:00
Mikael Hugo
8f097f8dca chore: auto-commit after challenge
SF-Unit: M048/S03/challenge
2026-05-17 20:16:24 +02:00
Mikael Hugo
cf2d1a768e feat(sf): route server control through rpc
2026-05-17 20:07:36 +02:00
Mikael Hugo
1f7fa1222c build(ts): update native TypeScript 7 preview
2026-05-17 19:21:25 +02:00
Mikael Hugo
3adcb833ed refactor(sf): separate daemon from server identity
2026-05-17 19:18:33 +02:00
Mikael Hugo
187d736930 fix(sf): run source server with live web host
2026-05-17 19:13:10 +02:00
Mikael Hugo
f7b262f33a fix(sf): harden server pid lifecycle
2026-05-17 19:00:21 +02:00
Mikael Hugo
3568972059 fix(sf): use fixed server port
2026-05-17 18:55:21 +02:00
Mikael Hugo
425bba7d39 fix: restore full content of R074/R075 swarm files from worktrees
The prior commit (cc32ab79d) accidentally landed truncated versions of the
new R074 + R075 files due to a cherry-pick partial-state. Restored:

- inline-runtime-gate.js: 74→96 LOC
- inline-runtime-gate.test.mjs: 115→273 LOC (15 tests; 2 sonnet-imagined
  bootstrapGateRegistry/BOOTSTRAP_GATES tests rewritten to assert SF's
  actual side-effect-on-import registry pattern)
- adversarial-budget.js: 86→106 LOC
- adversarial-budget.test.mjs: 63→132 LOC (9 tests)
- adversarial-finding-bridge.js: 123→191 LOC
- adversarial-finding-bridge.test.mjs: 98→216 LOC (14 tests)

45/45 tests pass across the four affected files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 18:54:39 +02:00
Mikael Hugo
cc32ab79d9 fix(docs): remove stale hashline-{read,edit}.ts rows post-fold
Hashline read/edit tool wrappers were folded into Edit({match}) and
Read({format}) modes in commit ffdec0fee. The two rows in FILE-SYSTEM-MAP.md
pointed to files that no longer exist. Updated the surviving hashline.ts row
to note its new consumer relationship with Edit/Read.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 18:48:34 +02:00
Mikael Hugo
781a7e7319 chore(safety): narrow autonomous-rollback to flag-flip only (R066 D1)
Remove git-revert authority per operator decision M048-D1. Crash-loop
classifier sees runtime evidence, not commit attribution; reverting on
runtime symptoms risks reverting the wrong commit. On quarantine trigger,
smoke_gate is flipped false to halt ledger writes and a self-feedback entry
(kind: crash-loop-detected, severity: high) is filed with a manual-review
suggestion. Operator retains sole authority to git-revert.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 18:24:00 +02:00
Mikael Hugo
c2f101734f feat: enforce purpose-first adversarial review
2026-05-17 18:15:15 +02:00
Mikael Hugo
acafee06e2 fix: iter-completion-reconciler test uses relative timestamps
Test had fixed literal timestamps (TS_X = "2026-05-17T12:42:05.618Z")
that became stale once the calendar moved past them — the reconciler's
default maxAgeMs (1h, "older drift is operator territory") filtered
them out. By 3h after the original write the test failed: reconciled.length
was 0 because no entry passed the age filter.

Switch to NOW-relative timestamps (5/30/1 min back from Date.now()) so
the fixture always lands inside the default age window regardless of
when the test runs.
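
The fixture change can be illustrated with a short sketch. minutesAgo and withinMaxAge are hypothetical helpers, not the reconciler's API; the one-hour window mirrors the default maxAgeMs mentioned above.

```typescript
// Hypothetical sketch: build fixtures relative to "now" instead of fixed dates.
const MINUTE_MS = 60 * 1000;
const DEFAULT_MAX_AGE_MS = 60 * MINUTE_MS; // 1h, per the reconciler default

function minutesAgo(minutes: number, nowMs: number = Date.now()): string {
  return new Date(nowMs - minutes * MINUTE_MS).toISOString();
}

// Stand-in for the reconciler's age filter.
function withinMaxAge(
  timestamp: string,
  nowMs: number = Date.now(),
  maxAgeMs: number = DEFAULT_MAX_AGE_MS,
): boolean {
  return nowMs - Date.parse(timestamp) <= maxAgeMs;
}

// Fixtures 5, 30, and 1 minutes back always land inside the window,
// whenever the test happens to run.
function buildFixtureTimestamps(nowMs: number = Date.now()): string[] {
  return [5, 30, 1].map((m) => minutesAgo(m, nowMs));
}
```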

Sonnet #13 (tool rename) report flagged this test as failing alongside
the 4 known pre-existing failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 17:49:11 +02:00
Mikael Hugo
623af869b1 remove: SF voice IVR / ElevenLabs paging — migrated to centralcloud
Per operator-direction 2026-05-17 (R089 — Migrate Voice IVR / ElevenLabs
On-Call Paging Infrastructure out of SF). Migration target landed in
centralcloud monorepo:
  - centralcloud_core/lib/centralcloud_core/voice.ex (TwiML + ElevenLabs)
  - centralcloud_staff/lib/.../controllers/voice_controller.ex (Phoenix)
  - centralcloud_staff/lib/.../controllers/voice_prompt_controller.ex
  - centralcloud_staff/lib/.../router.ex (/twilio scope)

SF removal:
  - web/app/api/voice/route.ts
  - web/app/api/voice/prompt/route.ts
  - web/app/api/voice/ directory
  - src/tests/integration/web-voice-ivr-contract.test.ts

Operator-paging infra was historical drift in SF (per-project compiler);
belongs in centralcloud (org-level ops). R088 (Pre-Removal Test-Import
Safety Gate) not yet built — operator manually verified safety scan:
TWILIO_/ELEVENLABS_ env vars only referenced in the deleted files; no
internal SF callers; centralcloud version verified present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 17:42:16 +02:00
Mikael Hugo
ffdec0feee fold: hashline_edit + hashline_read → Edit({match}) + Read({format}) modes
Per operator R-entry sf-mp9wo7e3-sdxqss + no-compat directive.

- Edit gains `match: "substring"|"anchor"` arg; anchor mode routes to the
  existing applyHashlineEdits logic. Substring stays default.
- Read gains `format: "plain"|"tagged"` arg; tagged mode emits LINE#HASH
  prefixes via formatHashLines.
- Delete hashline-edit.ts, hashline-read.ts. KEEP hashline.ts (helpers
  are now Edit/Read internals).
- tools/index.ts: drop the two tools + the createHashlineCodingTools
  preset.
- agent-session.ts: setEditMode no longer swaps tool instances (single
  tool surface; mode preserved for system-prompt context only).
- sdk.ts + index.ts: remove hashline tool re-exports.
- headless-ui.ts + test: remove hashline_edit case.

Net agent-visible tool surface: -2 tools. Capability preserved as modes.
No backward-compat alias for the removed tool names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 17:39:59 +02:00
Mikael Hugo
d03758d803 feat: replace launchd with systemd user-unit install path
Operator-direction 2026-05-17 "we will never use mac" — no compat
preservation. Single-cutover replacement.

- new packages/daemon/src/systemd.ts: install/uninstall/status using
  systemctl --user + ~/.config/systemd/user/sf-server.service
- new packages/daemon/src/systemd.test.ts: ports launchd tests, same
  shape, mocked systemctl via RunCommandFn injection + SF_SYSTEMD_USER_DIR
  env override for real filesystem tests
- cli-main.ts: switch import + update help text + status messages
- index.ts: re-export systemd module (installSystemdUnit, uninstallSystemdUnit,
  systemdUnitStatus, generateUnit, getServicePath, SystemdStatus, SystemdUnitOptions)
- DELETED: launchd.ts (253 LOC), launchd.test.ts (379 LOC)
- docs/dev/drafts/M053-per-repo-supervisor.md: remove "launchd" mention
- CHANGELOG.md: document systemd-only install path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-17 17:33:34 +02:00
Mikael Hugo
44915b73d4 rename: tool names → Claude-Code-aligned (Bash/Read/Write/Edit/Grep/Glob/LS); remove run_command/read_output/hashline duplicates
Per operator-direction 2026-05-17 (sf-mp9w20y1-nld9hc + "DONT KEE COMPAT" stance + adversarial-review override). Cross-vendor frontier LLMs are trained on PascalCase Claude Code tool names; calling them by SF's lowercase + novel names increases tool-call error rates. Single atomic cutover, no aliases. Internal implementations preserved; only the LLM-facing names + registrations change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 17:26:36 +02:00
Mikael Hugo
57fef5979d feat: make sf server the operator entrypoint
2026-05-17 17:23:46 +02:00
Mikael Hugo
c046ff9a6c fix: auto-rebuild workspace packages when src is newer than dist
Without this, edits to packages/coding-agent/src/* (or any other workspace
package src) silently land while the dist stays stale — agents continue
loading the old compiled JS and operators see "why didn't my edit take
effect?" symptoms. Observed 2026-05-17 wiring in the AST tools: vitest
(reading TS source) passed; runtime smoke test against dist failed because
no auto-rebuild fired.

Extends ensure-source-resources.cjs (which sf-from-source runs on every
launch) to also check workspace packages: agent-core, ai, coding-agent,
daemon, google-gemini-cli-provider, openai-codex-provider, rpc-client, tui.
For each, compare latest src mtime vs latest dist mtime (with a 100ms grace
window). If src is newer, run `npm run build -w @singularity-forge/<pkg>`.

Excludes:
  - packages/native (Rust build is 5–10 min; trigger manually via
    `node rust-engine/scripts/build.js --dev`).
  - Any package in SF_SKIP_WORKSPACE_AUTOBUILD (comma-separated).
  - Whole step disabled by SF_SKIP_WORKSPACE_AUTOBUILD=all.
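
The staleness check and the skip rules can be sketched as pure functions. This is an assumption-laden illustration: the direction of the grace window (src must be newer than dist by more than 100ms), and the helper names needsRebuild, parseSkipList, and shouldAutobuild are not taken from ensure-source-resources.cjs.

```typescript
// Hypothetical sketch of the src-vs-dist staleness decision.
function needsRebuild(latestSrcMtimeMs: number, latestDistMtimeMs: number, graceMs: number = 100): boolean {
  // Assumed interpretation: small mtime jitter inside the grace window
  // does not count as "src is newer".
  return latestSrcMtimeMs > latestDistMtimeMs + graceMs;
}

// Parses SF_SKIP_WORKSPACE_AUTOBUILD: comma-separated names, or "all".
function parseSkipList(raw: string | undefined): { all: boolean; packages: string[] } {
  const names = (raw ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  return { all: names.indexOf("all") !== -1, packages: names };
}

function shouldAutobuild(
  pkg: string,
  latestSrcMtimeMs: number,
  latestDistMtimeMs: number,
  rawSkip: string | undefined,
): boolean {
  if (pkg === "native") return false; // Rust build is triggered manually
  const skip = parseSkipList(rawSkip);
  if (skip.all || skip.packages.indexOf(pkg) !== -1) return false;
  return needsRebuild(latestSrcMtimeMs, latestDistMtimeMs);
}
```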

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 17:19:25 +02:00
Mikael Hugo
6e3b3d3c54 feat: add Serena-style AST tools (ReplaceSymbol, InsertAroundSymbol, AstGrep)
Wraps the native AST primitives from @singularity-forge/native/{edit,ast} as
LLM tools so agents can do tree-sitter-anchored code edits instead of
substring-based Edit or line-anchor hashline.

- replace-symbol.ts (+117): wraps replaceSymbol(file, symbolPath, newBody);
  matches function/class/method declarations via tree-sitter, returns
  matched=false sentinel when the symbol isn't located.
- insert-around-symbol.ts (+122): wraps insertAroundSymbol with position
  enum BeforeDecl/AfterDecl/AtBodyStart/AtBodyEnd.
- ast-grep.ts (+152): wraps astGrep for pattern matching across files with
  $VAR/$$$ARGS meta-variables; returns ranked matches with byte/line/column
  + captured meta-variable bindings.

Each tool:
  - typebox schema matching the existing AgentTool pattern (edit.ts)
  - notifyFileChanged() into the LSP layer on write ops
  - resolveToCwd() for path normalization
  - catches native errors + returns isError result with the
    NativeUnavailableError message pointing operators to
    `nix develop` + `node rust-engine/scripts/build.js --dev`

Wire-in:
- tools/index.ts: re-exports + imports + entries in `allTools` map and
  createAllTools() factory.
- extension-manifest.json: ReplaceSymbol / InsertAroundSymbol / AstGrep
  appended to provides.tools so SF extension agents see them.

Higher value than substring/line-anchor for code in tree-sitter-supported
languages (TS/JS/TSX/Python/Rust). Edit + hashline remain for non-code
files. PascalCase names per the Claude-Code-aligned convention from
sf-mp9w20y1-nld9hc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 17:14:12 +02:00
Mikael Hugo
19b10eb67c feat: make sf-server own swarm registry sync
2026-05-17 17:05:16 +02:00
Mikael Hugo
33560c9b09 fix: auto-open configured web project
2026-05-17 16:38:11 +02:00
Mikael Hugo
0f5a606923 fix: native loader — loud banner on fallback + structured load log + helpers
- Stderr banner on fallback now multi-line with concrete fix steps
  (nix develop → node rust-engine/scripts/build.js --dev) so an operator
  scanning a 280MB cycle log can't miss it. The old single-line warning
  was easy to overlook (today's "WHY HAS NOBODY SEEN IF LOUD" check).
- Structured load record per process at .sf/runtime/native-engine-load.jsonl:
  {ts, pid, platformTag, source, binaryPath, sha256, loaded, errors?}.
  Lets operators audit which binary each SF process loaded — and detect
  ABI mismatches across daemon↔worker boundaries when different sha256
  values appear for the same platformTag (the "rare but real" concern
  flagged earlier today).
- Proxy error message now points to the build/install commands instead
  of just saying "not available". NativeUnavailableError is named for
  consumer try/catch chains.
- Fixed _loadedSuccessfully ordering — was set true BEFORE the require,
  leaving stale-true after a failed first attempt.
- New helpers isNativeLoaded(), nativeBinaryPath(), nativeBinarySha256()
  for diagnostic surfaces (sf headless query, doctor checks).
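
As an example of the audit the load log enables, the sketch below reads JSONL text and reports any platformTag for which more than one sha256 was loaded. The field names follow the record listed above; the field types, the decision to look only at records with loaded=true, and the function name findSha256Mismatches are assumptions.

```typescript
// Hypothetical sketch: detect differing binaries for the same platformTag.
interface NativeLoadRecord {
  ts: string;
  pid: number;
  platformTag: string;
  source: string;
  binaryPath: string;
  sha256: string;
  loaded: boolean;
  errors?: string[];
}

function findSha256Mismatches(jsonl: string): Record<string, string[]> {
  const seen: Record<string, string[]> = {};
  for (const line of jsonl.split("\n")) {
    if (line.trim().length === 0) continue;
    let rec: NativeLoadRecord;
    try {
      rec = JSON.parse(line) as NativeLoadRecord;
    } catch {
      continue; // skip malformed lines rather than abort the audit
    }
    if (!rec.loaded) continue;
    const hashes = seen[rec.platformTag] ?? [];
    if (hashes.indexOf(rec.sha256) === -1) hashes.push(rec.sha256);
    seen[rec.platformTag] = hashes;
  }
  // Keep only platform tags with more than one distinct hash.
  const mismatches: Record<string, string[]> = {};
  for (const tag of Object.keys(seen)) {
    const hashes = seen[tag] ?? [];
    if (hashes.length > 1) mismatches[tag] = hashes;
  }
  return mismatches;
}
```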

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 16:34:02 +02:00
Mikael Hugo
f87e9bc0d9 fix: attach web server to project without token
2026-05-17 16:25:47 +02:00
Mikael Hugo
eeb80bbbdd fix: register 6 detector gates + add adversarial-finding kind + watchdog log rotation
Three concrete fixes from open self-feedback assessment 2026-05-17:

- uok/gate-registry-bootstrap.js: register all 6 R081 detector gates
  (same-unit-loop, zero-progress, repeated-feedback-kind, artifact-flap,
  stale-lock, periodic-detector-sweep) alongside drift-detection and
  iter-completion-reconciler. Closes the gap reported by
  sf-mp9udspu-fsf7si — bootstrap previously registered 2 of 8 gates.

- self-feedback.js ALLOWED_KIND_DOMAINS: add `adversarial-finding`.
  Closes gap reported by sf-mp9u4i25-fczmcj — R075 (autonomous
  adversarial review) challenge unit had no kind to file findings under.

- sf-autonomous-watchdog.sh: delete watchdog-run-*.log files older than
  60 minutes at each cycle start. Without rotation .sf/ grew to 1.9 GB
  in 24h (today's snapshot). 60 min retention captures last cycle for
  post-incident triage; older state is already in DB + iterations.jsonl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 16:08:05 +02:00
Mikael Hugo
077fd0a2a7 remove A2A; swarm enrollment + status projection + web swarms view; headless refactor
- A2A removal per M054/R071 cancellation 2026-05-17 (-2294 lines):
  - docs/plans/A2A_ADOPTION_PLAN.md, MISSION-A2A-ADOPTION.md deleted
  - src/resources/extensions/sf/uok/a2a-agent-server.js,
    a2a-transport.js deleted
  - tests/a2a-auth.test.mjs deleted
  - swarm-dispatch.js purged of A2A-conditional code paths
- New: scripts/sf-swarm-enroll.mjs + test (operator-facing swarm
  enrollment, replaces former A2A pairing flow)
- New: src/status-projection.ts + test, web/lib/swarm-status.ts +
  test, web/components/sf/swarms-view.tsx, web/app/api/swarms/
  (web swarms-view surface — direct visibility into running swarm
  state without requiring TUI; aligns with project_tui_deprecating)
- headless-{answers,query,ui,headless}.ts: coordinated tweaks
  consistent with the headless-as-default direction (R124 proposal)
- docs/dev/drafts/M053-per-repo-supervisor.md: design refinement
- .sf/REQUIREMENTS.md: small text fixes (6/6 churn)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 16:04:06 +02:00
Mikael Hugo
ecac4328bd backfill: restore 48 R-entries from REQUIREMENTS.md to DB
DB corruption recovery (2026-05-17) rebuilt the requirements table
from valid btree pages; ROWID-by-ROWID scan found 60 of 68 R-entries.
The other 8 + 40 historical (R006-R009 + R022-R065) were never in the
DB to begin with — they had drifted into REQUIREMENTS.md only. Backfill
script parsed each ### Rxxx — title block and INSERTed into the
requirements table with proper class/status/description/why/notes
fields. Final DB count: 75 → 123, integrity_check ok, MD↔DB parity
restored.

The .gitignore tweak from the meta-supervisor commit landed earlier;
no functional change here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 15:57:45 +02:00
Mikael Hugo
1cd7890d64 fix: auto-version-bump swallowed operator-direction; ptrmap + lock guards
- sf-db-schema.js: auto_vacuum INCREMENTAL → NONE. The "Bad ptr map entry"
  corruption on 2026-05-17 was incremental-autovacuum ptrmap drift under
  concurrent writers. Recovered DB has no ptrmap; future fresh DBs must
  match. incremental_vacuum() callers in sf-db-core.js become no-ops.
- bin/sf-from-source: lock allowlist extended to skip readonly sf headless
  subcommands (--help, query, status, usage, reflect, feedback list,
  triage --list/--json). Previously every sf headless invocation tried
  to acquire the project lock — operator couldn't even inspect SF state
  while autonomous was running.
- self-feedback.js triageBlockedEntries: (1) treat empty/null/undefined
  sfVersion as unknown, not zero; (2) exempt operator-direction kinds
  (improvement-idea, architecture-defect, missing-feature, gap) from
  auto-version-bump close. Both were needed to prevent the R124 incident
  recurring.
- headless-feedback.ts handleAdd: populate sfVersion via getCurrentSfVersion
  + detect repoIdentity via isForgeRepo, not hardcoded "external"/"". An
  empty sfVersion sorts below any real semver, so the resolver retry-closed
  every operator-filed entry within seconds.
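
The two triage guards can be sketched as a single predicate. This is a hedged illustration, not the triageBlockedEntries implementation: the entry shape, the function names, and the simple dotted-number comparison (a stand-in for real semver handling) are all assumptions. The exempt kinds are the four listed above.

```typescript
// Hypothetical sketch of the version-stale auto-close guard.
const OPERATOR_DIRECTION_KINDS = ["improvement-idea", "architecture-defect", "missing-feature", "gap"];

interface FeedbackEntry {
  kind: string;
  sfVersion?: string | null;
}

// Simplified dotted-number comparison; prerelease tags are not handled.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map((p) => Number.parseInt(p, 10) || 0);
  const pb = b.split(".").map((p) => Number.parseInt(p, 10) || 0);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

function mayAutoCloseAsVersionStale(entry: FeedbackEntry, currentVersion: string): boolean {
  // Guard 2: operator-direction kinds are never closed by a version bump.
  if (OPERATOR_DIRECTION_KINDS.indexOf(entry.kind) !== -1) return false;
  // Guard 1: a missing version is "unknown", not "older than everything".
  const v = entry.sfVersion;
  if (v === undefined || v === null || v === "") return false;
  return compareVersions(v, currentVersion) < 0;
}
```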

Net effect: R124 proposal (filed via sf headless feedback add) is no
longer auto-resolved as version-stale. Larger architectural fix (single-
writer SF daemon / RPC for all DB writes — M040 territory) tracked as
follow-up R-entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 15:51:36 +02:00
Mikael Hugo
87e9729c13 fix: shard sift search and project requirements
2026-05-17 15:38:55 +02:00
Mikael Hugo
3e5b6fc511 fix: reconcile iteration completion drift
2026-05-17 15:06:40 +02:00
Mikael Hugo
f643272a91 fix: preserve requirements projection fidelity
2026-05-17 15:02:25 +02:00
Mikael Hugo
4289946e11 fix: clear task verification status on revert
2026-05-17 14:59:20 +02:00
Mikael Hugo
3e002ca698 refactor: consolidate loop signals and gate registry wiring
2026-05-17 14:45:12 +02:00
Mikael Hugo
4d2266e57d fix: consolidate loop supervision gates
2026-05-17 14:35:40 +02:00
Mikael Hugo
625a830d2f wire R053-R056 detectors into auto-runaway-guard + R081 UokGate retrofit
- uok/auto-runaway-guard.js: invoke runDetectorSweep alongside the existing
  zero-progress check (fire-and-forget for sync-tick compatibility; results
  consumed on next tick via sweepState ring buffer). Passes unitId,
  unitMetrics, sessionFingerprint, lockPaths, and a 30-min DB-windowed
  recentFeedback slice.
- detectors/{same-unit-loop, zero-progress, repeated-feedback-kind,
  artifact-flap, stale-lock, periodic-runner}.js: each detector now also
  exports a UokGate wrapper (id/type/execute -> GateResult per ADR-0075).
  Plain detector functions kept for existing consumers.
- detectors/index.js: single import surface for the gate exports.
- detector-stale-lock.test.mjs (9), detector-periodic-runner.test.mjs (10),
  detector-gates-contract.test.mjs: fills the R055/R056 test gap filed
  earlier today + proves UokGate contract conformance.
- 41/41 detector tests green; copy-resources clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 14:18:54 +02:00
Mikael Hugo
527ebfcaa4 gitignore meta-supervisor runtime state
The previous commit accidentally tracked .sf/meta-status.json + .sf/meta-supervisor.pid (transient runtime files written by scripts/sf-meta-supervisor.mjs each tick). Mirror the existing .sf/runtime/ ignore pattern for these top-level meta-* files; the daemon keeps writing them on disk but git no longer tracks the churn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 14:09:24 +02:00
Mikael Hugo
d5664f7142 meta-supervisor (node daemon) + R091 triage gate + R091-R094 spec
- scripts/sf-meta-supervisor.mjs: pure-node daemon supervising
  scripts/sf-autonomous-watchdog.sh. Tick=60s, restarts watchdog if dead,
  emits .sf/meta-status.json, halt via .sf/meta-supervisor.halt. Uses
  only node builtins (no SF dist deps) so it survives dist breakage.
- src/headless.ts: R091 — gate the per-cycle handleTriage call on a time
  interval (SF_TRIAGE_INTERVAL_MS, default 30 min) and bump batch size
  (SF_TRIAGE_MAX, default 25, was 5). Drops the ~8min triage hit from
  every cycle while letting daily drain capacity rise.
- .sf/REQUIREMENTS.md: R091 (triage sidecar) + R092 (PDD-completeness
  as routing signal) + R093 (pin model per orchestration agent.yaml) +
  R094 (swarm-role model tier specialization — 8 roles already exist
  in uok/swarm-roles.js; model field per role missing).
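
The R091 interval gate can be sketched as below, assuming the two environment variables are parsed as positive integers with the stated defaults. readPositiveInt, triageConfig, and shouldRunTriage are hypothetical names; the environment is passed in as a plain record so the sketch does not depend on process.env.

```typescript
// Hypothetical sketch of a time-gated triage step.
function readPositiveInt(raw: string | undefined, fallback: number): number {
  const n = Number(raw);
  return raw !== undefined && Number.isFinite(n) && n > 0 ? Math.floor(n) : fallback;
}

function triageConfig(env: Record<string, string | undefined>): { intervalMs: number; max: number } {
  return {
    intervalMs: readPositiveInt(env["SF_TRIAGE_INTERVAL_MS"], 30 * 60 * 1000), // default 30 min
    max: readPositiveInt(env["SF_TRIAGE_MAX"], 25), // default batch size
  };
}

// Run on the first cycle, then only once the interval has elapsed.
function shouldRunTriage(lastRunMs: number | null, nowMs: number, intervalMs: number): boolean {
  return lastRunMs === null || nowMs - lastRunMs >= intervalMs;
}
```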

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 14:08:30 +02:00