singularity/singularity-forge

Author	SHA1	Message	Date
Mikael Hugo	06b1fefd35	fix(circular): break coding-agent core mega-cycle + skip function-body imports Cycle 2 (the 13-node coding-agent mega) closed via two changes: 1. scripts/check-circular-deps.mjs — track function-body depth and skip require()/import() calls inside function bodies. They run on call, not at module evaluation, and therefore cannot cause module-graph cycles — same reasoning as the existing dynamic `await import()` skip. Generic improvement; benefits any pattern that uses lazy CommonJS require() to break a static cycle. 2. packages/coding-agent/src/core/extensions/loader.ts — removed the static `import * as _bundledCodingAgent from "../../index.js"` self-reference, which was the cycle-closer. It only populated STATIC_BUNDLED_MODULES for the Bun virtualModules path (`isBunBinary` branch in getJitiOptions), and SF is Node-26-only per operator policy (no Bun) — so the Bun branch is dead at runtime and dropping the static self-reference is safe. The two map entries that referenced it (@singularity-forge/coding-agent and the @mariozechner alias) are commented out at the same site with a pointer to the top-of-file note. Net effect across the full session: start of session: 9 cycles walker false-positive cleanups landed: dropped 6 type-only + dynamic-import false positives tui ↔ overlay-layout: CURSOR_MARKER moved to overlay-types.ts SF autonomous-rollback chain (3 targeted cuts): experimental → preferences-serializer, classifier → lazy rollback import, preferences-models → runaway-defaults.js this commit: coding-agent loader self-reference dropped Final state: ✅ zero circular dependencies in 1193 scanned files. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 00:42:09 +02:00
Mikael Hugo	66309b235f	fix(circular): skip type-only imports + break tui ↔ overlay-layout cycle Two changes (one walker, one real code): 1. scripts/check-circular-deps.mjs — skip type-only imports. `import type { X } from "..."` and `export type { X } from "..."` are erased by tsc at compile time and cannot cause runtime cycles. Walker now drops them, matching the precedent set by skipping dynamic `await import(...)`. Net effect on full-repo scan: before: 9 cycles after: 3 cycles (the 6 that disappeared were all `import type` false-positives — none were real runtime cycles). 2. packages/tui — break the last 2-file cycle. tui.ts and overlay-layout.ts had a real RUNTIME cycle: - tui.ts → overlay-layout.ts: applyLineResets, compositeOverlays, extractCursorPosition, isOverlayVisible (4 fns) - overlay-layout.ts → tui.ts: CURSOR_MARKER (1 const) Both files already imported `./overlay-types.ts` (no cycle there). Moved CURSOR_MARKER from tui.ts into overlay-types.ts and re-exported from tui.ts so existing `from "./tui.js"` call sites keep working. No behavior change. Remaining cycles after both fixes (3 real-runtime ones, separate slices): - safety/autonomous-rollback chain (9 files, SF extension) - packages/coding-agent core mega-cycle (12 files) - (one more, see `npm run check:circular`) These are foundational refactors worth their own commits, not bundled into this one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 00:28:53 +02:00
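The type-only skip described in commit 66309b235f can be pictured with the sketch below. It is a simplified, line-based heuristic and not the actual logic in scripts/check-circular-deps.mjs; the names `isTypeOnlyImportLine` and `runtimeImportLines` are hypothetical, and inline `import { type X }` specifiers are deliberately not handled.

```typescript
// Hypothetical helper: true when a line is a type-only import/export,
// which tsc erases, so it cannot participate in a runtime cycle.
function isTypeOnlyImportLine(line: string): boolean {
  // The negative lookahead keeps `import type from "./type.js"` (a default
  // import that happens to be named `type`) classified as a runtime import.
  return /^\s*(?:import|export)\s+type\s+(?!from\s*["'])[{*\w$]/.test(line);
}

// Keeps only the import/export-from lines a cycle walker still has to follow.
function runtimeImportLines(source: string): string[] {
  return source
    .split("\n")
    .filter((l) => /^\s*(?:import|export)\b.*\bfrom\s*["']/.test(l))
    .filter((l) => !isTypeOnlyImportLine(l));
}
```

A real walker would more likely work on a parsed AST than on raw lines; the regex form is only meant to show which statements get dropped.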
Mikael Hugo	f8e53840da	fix(rpc, web): integrate drain into forceShutdown + healthz-503 on shutdown Three fixes addressing codex's adversarial review of the earlier orphan-recovery / graceful-shutdown landing: (1) Codex point B — single shutdown path. Removed the parallel installGracefulShutdown() handler in rpc-mode.ts that was adding a second SIGTERM listener and racing forceShutdown()'s teardown. The drain is now the FIRST step inside forceShutdown() (before killTrackedDetachedChildren / extension session_shutdown / etc.) so DB writes complete cleanly while child processes are still alive to flush. Race-free against the existing shutdown ordering. (2) Codex point D — recovery-before-each-drain. Cloud-volume mtime visibility lags between containers can mean an orphan `.draining` file from a previous container isn't visible during the startup scan but appears moments later. drainQueuedSfFeedbackCommands() now runs recoverOrphanedFeedbackDrains() as its first step, so each dispatch's drain sees the latest filesystem state. (3) Codex point E — healthz returns 503 during shutdown. New module src/web/shutdown-state.ts holds a per-process flag, auto-registers SIGTERM/SIGINT/SIGHUP handlers on first read, and exposes a snapshot (signal, startedAt, elapsedMs) for diagnostics. The healthz route imports isShuttingDown() and returns 503 when set, so k8s readinessProbe / Forgejo blue-green probes drain traffic BEFORE we actually stop responding. Tests: - rpc-mode-orphan-recovery.test.ts: 8/8 still green - web-shutdown-state.test.ts: 5/5 new — default false, mark sets flag, idempotent, signal exposed via snapshot, null signal for manual mark Deferred to a follow-up commit (codex didn't flag, but noted for completeness): a SIGTERM-drain child-process integration test that spawns rpc-mode + sends a real signal. The 5 unit tests cover the flag logic; the integration test would cover the full process tree and is bulkier than the current commit warrants. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 22:35:50 +02:00
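The per-process shutdown flag from commit f8e53840da (point E) might be modelled as in the sketch below. Only `isShuttingDown()` and the snapshot fields (signal, startedAt, elapsedMs) come from the commit message; the class `ShutdownState`, the `mark` method, and `healthzStatus` are hypothetical names, and the signal-handler registration is left out so the flag logic stays testable.

```typescript
// Hypothetical snapshot shape built from the fields the commit names.
interface ShutdownSnapshot {
  shuttingDown: boolean;
  signal: string | null;
  startedAt: number | null;
  elapsedMs: number | null;
}

class ShutdownState {
  private startedAt: number | null = null;
  private signal: string | null = null;

  // Idempotent: the first call wins, later calls change nothing.
  // A manual mark (no signal) leaves `signal` as null.
  mark(signal: string | null = null, now: number = Date.now()): void {
    if (this.startedAt !== null) return;
    this.startedAt = now;
    this.signal = signal;
  }

  isShuttingDown(): boolean {
    return this.startedAt !== null;
  }

  snapshot(now: number = Date.now()): ShutdownSnapshot {
    return {
      shuttingDown: this.isShuttingDown(),
      signal: this.signal,
      startedAt: this.startedAt,
      elapsedMs: this.startedAt === null ? null : now - this.startedAt,
    };
  }
}

// Status a readiness endpoint could return: 503 once shutdown has begun,
// so probes stop routing traffic before the process stops responding.
function healthzStatus(state: ShutdownState): 200 | 503 {
  return state.isShuttingDown() ? 503 : 200;
}
```

In a running process, `mark` would presumably be called from the SIGTERM/SIGINT/SIGHUP handlers the commit mentions; passing `now` explicitly is a testing convenience of this sketch.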
Mikael Hugo	68178a9260	fix(rpc-test): use .js extension for recovery module import tsgo rejects `.ts` extensions in imports without allowImportingTsExtensions. Updated the test to import from "./feedback-queue-recovery.js" which is both ESM-compatible and matches the rest of the package convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 22:30:10 +02:00
Mikael Hugo	d54f18c95f	feat(rpc): orphan-recovery + 10-min graceful shutdown for safe container swap Two related changes to make blue/green upgrades (per scripts/upgrade-vega-source-server.mjs) safe for in-flight self-feedback writes. 1. Startup orphan recovery (feedback-queue-recovery.ts, extracted module). Scans .sf/runtime/ for sf-feedback-queue.jsonl.<pid>(.<sid>)?.draining files left by previous processes. For each: - if our own session id: leave alone (live drain) - if PID is alive: leave alone (foreign drainer) - else: rename back to queue (only if no active queue file exists) Crash safety: when both an orphan AND an active queue exist, we DEFER recovery rather than merge — appending then unlinking would risk duplicate replay on crash. The next restart's recovery picks it up once the queue is naturally drained. Supports legacy filenames (.<pid>.draining, pre-session-id) for backward compat. Added SF_DRAIN_SESSION_ID (per-process 6-byte hex) stamped into the .draining filename. PID reuse across container restarts is normally safe because /proc clears, but the session id is a stronger guarantee that we don't trample a foreign drainer that happens to land on the same PID. 2. SIGTERM/SIGINT drain-then-exit handler (installGracefulShutdown). Drains the queue once on signal, then exits. Bounded by SF_RPC_SHUTDOWN_GRACE_MS (default 600_000 = 10 min). Rationale: if a drain is in flight, it MUST finish — losing self-feedback writes across a server upgrade is worse than a long wait. Normal drains complete in <1s; the 10-min ceiling is for pathological lock contention. Operator overrides via env var, or docker kill / kubectl delete --force for hard bypass. Upgrader script bumped to docker stop --timeout 610 (10s safety margin past the grace). k8s deployments must set terminationGracePeriodSeconds≥610 for the rolling-update path. 
Tests: rpc-mode-orphan-recovery.test.ts — 7 cases covering empty, no-orphans, dead-PID single recovery, both-files-deferred (codex's crash-safety fix), live-PID untouched, multiple-dead-PIDs, malformed-filename ignored. Refs sf-mpa5kdpu (drainer orphans never recovered), sf-mpa4g46x (original RPC hang). Codex adversarial-reviewed; the PID-reuse hardening and crash-safety deferral landed per its feedback. Open follow-ups: shutdown-aware /api/healthz returning 503 (codex point E), integrate with existing forceShutdown ordering (codex point C). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 22:29:24 +02:00
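The per-file decision in the orphan-recovery step of commit d54f18c95f can be sketched as a pure function. This is an illustration of the rules listed in the commit message, not the contents of feedback-queue-recovery.ts: `parseDrainingName`, `decideOrphanAction`, and the action labels are hypothetical, the session id is assumed to be lowercase hex, and the PID liveness check is injected rather than implemented.

```typescript
type OrphanAction = "leave-own" | "leave-live" | "recover" | "defer";

interface DrainingFile {
  pid: number;
  sessionId: string | null; // null for legacy `.<pid>.draining` names
}

// Parses `sf-feedback-queue.jsonl.<pid>(.<sid>)?.draining`.
// Returns null for anything else so malformed names are ignored.
function parseDrainingName(name: string): DrainingFile | null {
  const m = /^sf-feedback-queue\.jsonl\.(\d+)(?:\.([0-9a-f]+))?\.draining$/.exec(name);
  if (!m) return null;
  return { pid: Number(m[1]), sessionId: m[2] ?? null };
}

function decideOrphanAction(
  file: DrainingFile,
  ownSessionId: string,
  isPidAlive: (pid: number) => boolean,
  activeQueueExists: boolean,
): OrphanAction {
  // Our own live drain: never touch it.
  if (file.sessionId === ownSessionId) return "leave-own";
  // Another process is still draining this file.
  if (isPidAlive(file.pid)) return "leave-live";
  // Dead owner: rename back to the queue only when no active queue exists;
  // otherwise defer to a later recovery pass instead of merging.
  return activeQueueExists ? "defer" : "recover";
}
```

Keeping the decision separate from the filesystem calls makes the crash-safety rule (defer rather than merge) easy to cover with table-style tests.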
Mikael Hugo	55a498603f	fix(rpc): don't unref the sf-feedback drain timer The drainer was scheduled via setTimeout(0) with timer.unref(). The unref made the timer release-eligible — fine in a long-running rpc-mode child where the process has plenty of other event-loop handles, but fatal in the packaged-standalone path where the rpc subprocess has nothing else to keep it alive. The process exited before the timer fired, so the queue file was renamed to .<pid>.draining and then stranded forever. Removed timer.unref(). The setTimeout(0) still lets the RPC response go back to the caller first (no synchronous blocking on the drain), but the timer now keeps the process alive until the drain handler runs, and the drain's own async I/O keeps it alive until done. Refs sf-mpa6wuhm-wwddd1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 21:55:23 +02:00
Mikael Hugo	acd907fec2	fix: harden sf server control loop	2026-05-17 21:13:12 +02:00
Mikael Hugo	cf2d1a768e	feat(sf): route server control through rpc	2026-05-17 20:07:36 +02:00
Mikael Hugo	3adcb833ed	refactor(sf): separate daemon from server identity	2026-05-17 19:18:33 +02:00
Mikael Hugo	cc32ab79d9	fix(docs): remove stale hashline-{read,edit}.ts rows post-fold Hashline read/edit tool wrappers were folded into Edit({match}) and Read({format}) modes in commit `ffdec0fee`. The two rows in FILE-SYSTEM-MAP.md pointed to files that no longer exist. Updated the surviving hashline.ts row to note its new consumer relationship with Edit/Read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 18:48:34 +02:00
Mikael Hugo	623af869b1	remove: SF voice IVR / ElevenLabs paging — migrated to centralcloud Per operator-direction 2026-05-17 (R089 — Migrate Voice IVR / ElevenLabs On-Call Paging Infrastructure out of SF). Migration target landed in centralcloud monorepo: - centralcloud_core/lib/centralcloud_core/voice.ex (TwiML + ElevenLabs) - centralcloud_staff/lib/.../controllers/voice_controller.ex (Phoenix) - centralcloud_staff/lib/.../controllers/voice_prompt_controller.ex - centralcloud_staff/lib/.../router.ex (/twilio scope) SF removal: - web/app/api/voice/route.ts - web/app/api/voice/prompt/route.ts - web/app/api/voice/ directory - src/tests/integration/web-voice-ivr-contract.test.ts Operator-paging infra was historical drift in SF (per-project compiler); belongs in centralcloud (org-level ops). R088 (Pre-Removal Test-Import Safety Gate) not yet built — operator manually verified safety scan: TWILIO_/ELEVENLABS_ env vars only referenced in the deleted files; no internal SF callers; centralcloud version verified present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 17:42:16 +02:00
Mikael Hugo	ffdec0feee	fold: hashline_edit + hashline_read → Edit({match}) + Read({format}) modes Per operator R-entry sf-mp9wo7e3-sdxqss + no-compat directive. - Edit gains `match: "substring"|"anchor"` arg; anchor mode routes to the existing applyHashlineEdits logic. Substring stays default. - Read gains `format: "plain"|"tagged"` arg; tagged mode emits LINE#HASH prefixes via formatHashLines. - Delete hashline-edit.ts, hashline-read.ts. KEEP hashline.ts (helpers are now Edit/Read internals). - tools/index.ts: drop the two tools + the createHashlineCodingTools preset. - agent-session.ts: setEditMode no longer swaps tool instances (single tool surface; mode preserved for system-prompt context only). - sdk.ts + index.ts: remove hashline tool re-exports. - headless-ui.ts + test: remove hashline_edit case. Net agent-visible tool surface: -2 tools. Capability preserved as modes. No backward-compat alias for the removed tool names. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:39:59 +02:00
Mikael Hugo	d03758d803	feat: replace launchd with systemd user-unit install path Operator-direction 2026-05-17 "we will never use mac" — no compat preservation. Single-cutover replacement. - new packages/daemon/src/systemd.ts: install/uninstall/status using systemctl --user + ~/.config/systemd/user/sf-server.service - new packages/daemon/src/systemd.test.ts: ports launchd tests, same shape, mocked systemctl via RunCommandFn injection + SF_SYSTEMD_USER_DIR env override for real filesystem tests - cli-main.ts: switch import + update help text + status messages - index.ts: re-export systemd module (installSystemdUnit, uninstallSystemdUnit, systemdUnitStatus, generateUnit, getServicePath, SystemdStatus, SystemdUnitOptions) - DELETED: launchd.ts (253 LOC), launchd.test.ts (379 LOC) - docs/dev/drafts/M053-per-repo-supervisor.md: remove "launchd" mention - CHANGELOG.md: document systemd-only install path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:33:34 +02:00
Mikael Hugo	57fef5979d	feat: make sf server the operator entrypoint	2026-05-17 17:23:46 +02:00
Mikael Hugo	6e3b3d3c54	feat: add Serena-style AST tools (ReplaceSymbol, InsertAroundSymbol, AstGrep) Wraps the native AST primitives from @singularity-forge/native/{edit,ast} as LLM tools so agents can do tree-sitter-anchored code edits instead of substring-based Edit or line-anchor hashline. - replace-symbol.ts (+117): wraps replaceSymbol(file, symbolPath, newBody); matches function/class/method declarations via tree-sitter, returns matched=false sentinel when the symbol isn't located. - insert-around-symbol.ts (+122): wraps insertAroundSymbol with position enum BeforeDecl/AfterDecl/AtBodyStart/AtBodyEnd. - ast-grep.ts (+152): wraps astGrep for pattern matching across files with $VAR/$$$ARGS meta-variables; returns ranked matches with byte/line/column + captured meta-variable bindings. Each tool: - typebox schema matching the existing AgentTool pattern (edit.ts) - notifyFileChanged() into the LSP layer on write ops - resolveToCwd() for path normalization - catches native errors + returns isError result with the NativeUnavailableError message pointing operators to `nix develop` + `node rust-engine/scripts/build.js --dev` Wire-in: - tools/index.ts: re-exports + imports + entries in `allTools` map and createAllTools() factory. - extension-manifest.json: ReplaceSymbol / InsertAroundSymbol / AstGrep appended to provides.tools so SF extension agents see them. Higher value than substring/line-anchor for code in tree-sitter-supported languages (TS/JS/TSX/Python/Rust). Edit + hashline remain for non-code files. PascalCase names per the Claude-Code-aligned convention from sf-mp9w20y1-nld9hc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 17:14:12 +02:00
Mikael Hugo	19b10eb67c	feat: make sf-server own swarm registry sync	2026-05-17 17:05:16 +02:00
Mikael Hugo	0f5a606923	fix: native loader — loud banner on fallback + structured load log + helpers - Stderr banner on fallback now multi-line with concrete fix steps (nix develop → node rust-engine/scripts/build.js --dev) so an operator scanning a 280MB cycle log can't miss it. The old single-line warning was easy to overlook (today's "WHY HAS NOBODY SEEN IF LOUD" check). - Structured load record per process at .sf/runtime/native-engine-load.jsonl: {ts, pid, platformTag, source, binaryPath, sha256, loaded, errors?}. Lets operators audit which binary each SF process loaded — and detect ABI mismatches across daemon↔worker boundaries when different sha256 values appear for the same platformTag (the "rare but real" concern flagged earlier today). - Proxy error message now points to the build/install commands instead of just saying "not available". NativeUnavailableError is named for consumer try/catch chains. - Fixed _loadedSuccessfully ordering — was set true BEFORE the require, leaving stale-true after a failed first attempt. - New helpers isNativeLoaded(), nativeBinaryPath(), nativeBinarySha256() for diagnostic surfaces (sf headless query, doctor checks). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 16:34:02 +02:00
Mikael Hugo	9a84d82cdb	chore(release): 2.75.3 → 2.75.4 + workspace dependency refresh Bumps version across the workspace (root + 10 @singularity-forge/* packages) and lands the pending dependency refresh that had been sitting uncommitted: @anthropic-ai/sdk 0.95.1 → 0.96.0 @anthropic-ai/vertex-sdk 0.14.4 → 0.16.0 @google/genai 2.0 → 2.3 @logtape/{file,logtape,pretty,redaction} 2.0.7 → 2.0.9 @smithy/node-http-handler 4.7.0 → 4.7.3 @clack/prompts 1.3 → 1.4 @types/mime-types 2.1 → 3.0 Inter-package refs in packages/{daemon,ai}/package.json bumped to ^2.75.4 so the workspace stays self-consistent. package-lock.json regenerated via `npm install --package-lock-only --legacy-peer-deps`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 23:59:14 +02:00
Mikael Hugo	f55d490e1d	fix(subagent-runner): drop spurious 10s STUCK warning on session.prompt The phaseWatchdog at 10s fired "STUCK phase=session.prompt" on every healthy LLM call longer than 10 seconds. Verified via strace on the running dogfood sf: bytes were actively flowing on the TLS socket (fd 29) to the LLM provider while STUCK was being logged — the session.prompt was never actually stuck, the watchdog was just diagnostic-only and oblivious to stream activity. The noOutputTimeoutMs watchdog (set to 60s for triage in commit `d80060fec`) is the actual kill mechanism. It is already event-aware: every meaningful subagent event resets the timer via armNoOutputTimer + isMeaningfulSubagentOutputEvent. The 10s STUCK warning was added in commit `67e5ac9db` as investigation infrastructure for the sf-mp8e02m1-zpk903 family of bugs, but now it is just noise that makes legitimate 30-200s LLM responses look broken. Keeps the 10s STUCK watchdog for the three setup phases (resourceLoader.reload, createAgentSession, bindExtensions) where 10s of silence is a real hang signal — those phases normally run in sub-second. Also includes: - biome.json: bump $schema URL from 2.4.14 to 2.4.15 to match the current biome CLI (clears the deserialize warning) - scripts/check-test-imports.{,test.}mjs: format + drop a useless regex escape that biome flagged in landed code Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 23:49:43 +02:00
Mikael Hugo	365c6bbc3b	chore: formatter / linter touch-up (230 files) Pure formatting / lint-fix pass that ran during `npm run build:core` in the session that landed the agent-runner / quota / coverage / phase-2 routing work. No logic changes — indentation, trailing commas, import sort, etc. Captured separately so the actual feature commits stay scoped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 21:19:53 +02:00
Mikael Hugo	67e5ac9db1	diag(subagent-runner): per-phase timing + stuck-watchdog for sf-mp8e02m1-zpk903 Adds visible diagnostics to runSubagent so the next time the "session initialized but no LLM call" bug fires, the log identifies which setup phase hangs. Phases instrumented: - resourceLoader.reload() - createAgentSession() - bindExtensions(runLifecycle=...) - session.prompt() entry → return Output format (stderr, prefixed with [subagent:<name>]): phase=resourceLoader.reload 23ms phase=createAgentSession 142ms phase=bindExtensions 89ms runLifecycle=true phase=session.prompt-entered taskLen=8421 timeoutMs=480000 noOutputMs=180000 phase=session.prompt-returned 16234ms ← normal completion STUCK phase=<X> 10000ms (no completion signal ...) ← when watchdog fires Each phase has a soft 10s watchdog that emits a STUCK line if the await doesn't complete in time. The watchdog never aborts — just surfaces visibility. Existing timeoutMs / noOutputTimeoutMs handle actual termination. This is investigation infrastructure for the third prompt-never-sent seam (coding-agent/subagent-runner). The agent-runner.js seam (sf-mp8g4rcd-w01tkh) was fixed in commit `8ee4d8358` with bounded retries. 
This commit doesn't fix the underlying bug — it makes the bug self-reporting next time it fires so operator and autonomous loop both get actionable signal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:40:17 +02:00
Mikael Hugo	b5764af27b	sf snapshot: uncommitted changes after 33m inactivity	2026-05-16 17:00:13 +02:00
Mikael Hugo	da0c41d375	sf snapshot: uncommitted changes after 56m inactivity	2026-05-16 14:59:40 +02:00
Mikael Hugo	e2e096c5c7	feat(rpc): configurable RPC init timeout via SF_RPC_INIT_TIMEOUT_MS Add resolveRpcInitTimeoutMs() helper and wire it into RpcClient.init(). Default init timeout increased from 30s to 120s. Override via env var. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 20:00:26 +02:00
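A helper along the lines of `resolveRpcInitTimeoutMs()` from commit e2e096c5c7 might look like the sketch below. The env var name and the 120s default come from the commit message; everything else is an assumption, including the function name `resolveRpcInitTimeoutMsSketch`, the `env` parameter (real code would presumably read `process.env`), and the fallback behaviour for invalid values.

```typescript
const DEFAULT_RPC_INIT_TIMEOUT_MS = 120_000;

// `env` is passed in so the sketch stays testable without touching
// the real process environment.
function resolveRpcInitTimeoutMsSketch(
  env: Record<string, string | undefined>,
): number {
  const raw = env["SF_RPC_INIT_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_RPC_INIT_TIMEOUT_MS;
  const parsed = Number(raw);
  // Assumption: non-numeric or non-positive values fall back to the default.
  return Number.isFinite(parsed) && parsed > 0
    ? Math.floor(parsed)
    : DEFAULT_RPC_INIT_TIMEOUT_MS;
}
```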
Mikael Hugo	3a14fe86a7	test(list-models): isolate from developer's discovery-cache Tests were picking up the developer's real ~/.sf/agent/discovery-cache.json and seeing unexpected models in output. Pin tests to a guaranteed-missing path via the new _discoveryCacheFilePath option so the env they observe is solely what the test constructs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 16:37:11 +02:00
Mikael Hugo	7ba469cff1	feat(memory): add debug logging to memory extraction pipeline The memory extraction system has infrastructure (DB tables, LLM prompts, unit closeout wiring, embedding backfill) but zero processed units and only self-feedback-resolution memories. This suggests extraction is failing silently. Add debugLog() calls throughout extractMemoriesFromUnit() so we can observe: - Skip reasons (mutex busy, rate limited, already processed, file too small) - Start/done lifecycle per unit - LLM call and parse outcomes - Error messages on failure and retry This makes the extraction pipeline observable via --debug or the journal/debug log without changing behavior. Tests: 185 files / 1993 tests pass. Type check: clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 16:09:36 +02:00
Mikael Hugo	d57cd84d9a	fix(auto): make halt watchdog observable	2026-05-15 08:09:02 +02:00
Mikael Hugo	f9c147a08b	fix(swarm): ignore heartbeats for silent worker timeout	2026-05-15 08:00:35 +02:00
Mikael Hugo	e464a1bd6e	fix(swarm): bound silent worker responses	2026-05-15 07:35:31 +02:00
Copilot	cf9203aee0	feat(swarm): forward parent permission profile to in-process worker sessions In-process swarm workers get a fresh headless AgentSession whose permission extension defaults to read-only minimal. This blocks normal autonomous edits (e.g., write_file, edit) even when the parent session runs at normal or trusted level. - run-unit.js: add legacyPermissionLevelForProfile mapping and include executorPermissionLevel in the dispatch envelope. - swarm-dispatch.js: forward executorPermissionLevel from envelope to runAgentTurn as permissionLevel. - agent-runner.js: accept permissionLevel option and pass it to runSubagent config. - subagent-runner.ts: add permissionLevel to SubagentConfig; when set, temporarily set SF_PERMISSION_LEVEL env and run extension lifecycle so the permission extension reads the level before tool hooks execute. - Tests for envelope field, dispatch forwarding, and run-unit integration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 06:38:42 +02:00
Mikael Hugo	dbfaca61cf	fix(swarm): surface worker tool call count to bypass parent-ledger guard Round 7 dogfood failed with "0 tool calls — context exhaustion" even though the swarm worker's session DID call tools. Root cause: the phases-unit.js zero-tool-call guard reads from the PARENT session's message ledger via snapshotUnitMetrics. The swarm worker runs in an ISOLATED subagent session — its tool calls never appear in the parent's messages, so the guard always sees 0 and fires a false-positive context-exhaustion retry. Fix: - runUnitViaSwarm now returns swarmToolCallCount on the UnitResult, surfacing the real worker tool call count from the onEvent stream (collectedToolCalls.length, accurate end-to-end). - phases-unit.js zero-tool-call guard checks unitResult._via === "swarm" && swarmToolCallCount > 0 and bypasses the false-positive retry, logging "zero-tool-calls-swarm-bypass". Also adds a debug stderr line in subagent-runner.ts printing the tool count after bindExtensions, confirming the worker session HAS the full tool set (checkpoint + built-ins) — Hypotheses 1 and 2 from the Round 8 brief ruled out by direct observation. Tests: 3 new (swarmToolCallCount = 0 / N / 1-on-checkpoint-only); 2518 tests pass total, 0 regressions. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 05:46:17 +02:00
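The guard change in commit dbfaca61cf reduces to a small predicate, sketched below. The `_via === "swarm" && swarmToolCallCount > 0` condition is quoted from the commit message; the interface `UnitResultSketch`, the function name, and the `parentLedgerToolCalls` parameter are hypothetical stand-ins for whatever phases-unit.js actually uses.

```typescript
// Hypothetical subset of the UnitResult fields named in the commit.
interface UnitResultSketch {
  _via?: string;
  swarmToolCallCount?: number;
}

// Returns true when the zero-tool-call retry should fire.
function shouldRetryForZeroToolCalls(
  parentLedgerToolCalls: number,
  unitResult: UnitResultSketch,
): boolean {
  // The parent ledger saw tool calls: nothing to retry.
  if (parentLedgerToolCalls > 0) return false;
  // Swarm workers run in an isolated session, so the parent ledger is
  // always empty; trust the count surfaced by the worker instead.
  const swarmBypass =
    unitResult._via === "swarm" && (unitResult.swarmToolCallCount ?? 0) > 0;
  return !swarmBypass;
}
```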
Mikael Hugo	46d9d45279	fix(bash): block wrong project python runtime	2026-05-15 05:33:28 +02:00
Mikael Hugo	54ac56d9bd	feat(swarm): honor worker checkpoint outcomes	2026-05-15 04:59:15 +02:00
Mikael Hugo	1115437cec	feat(swarm): event streaming + outcome derivation for runUnitViaSwarm - Forward onEvent through swarm-dispatch → agent-runner → runSubagent - Collect toolcall_end events in runUnitViaSwarm to build real tool-use blocks - Detect checkpoint tool outcome for accurate unit completion signal - Add headless.ts graceful shutdown (async signal handler, 2.5s timeout) - RPC client stop() now awaits flush and propagates stop to child sessions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 04:54:58 +02:00
Mikael Hugo	903cdd4d9d	feat(subagent): event streaming for in-process runSubagent Add RunSubagentOptions.onEvent callback so callers (TUI live update panel for /delegate, /rubber-duck, etc.) get every session event without polling. Errors from the callback are caught so a buggy caller cannot crash the agent. Chain caller-supplied AbortSignal through a local AbortController in runSingleAgent and register it in a new liveSubagentControllers set so stopLiveSubagents aborts in-process subagents alongside the legacy spawn-based processes (cmux split, sift codebase_search). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 04:04:52 +02:00
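The "errors from the callback are caught" behaviour in commit 903cdd4d9d could be implemented with a wrapper like the one below. This is a sketch only: `SessionEventSketch`, `OnEventSketch`, and `emitSafely` are hypothetical names, and the real event type is certainly richer than a single `type` field.

```typescript
type SessionEventSketch = { type: string };
type OnEventSketch = (event: SessionEventSketch) => void;

// Delivers an event to a caller-supplied callback without letting a
// throwing listener propagate into the agent loop.
// Returns true when the callback ran to completion.
function emitSafely(
  onEvent: OnEventSketch | undefined,
  event: SessionEventSketch,
): boolean {
  if (!onEvent) return false;
  try {
    onEvent(event);
    return true;
  } catch {
    // Swallowed on purpose in this sketch; real code might log the error.
    return false;
  }
}
```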
Mikael Hugo	62f886430c	fix: run subagents in process by default	2026-05-15 03:59:34 +02:00
Mikael Hugo	8b0f0bbd65	fix: harden headless dogfood self-healing	2026-05-15 03:53:15 +02:00
Mikael Hugo	f0c3eaf999	refactor(extensions): merge ttsr into guardrails TTSR (Time Traveling Stream Rules) monitored streaming output against regex patterns. Guardrails blocked dangerous actions and redacted secrets. Both are safety/guardrail concerns — merging them into one extension reduces surface area and simplifies the safety model. Changes: - Copied ttsr-rule-loader.js, ttsr-manager.js, ttsr-interrupt.md into guardrails/ - Updated guardrails extension-manifest.json with ttsr hooks (turn_start, message_update, turn_end, agent_end) - Integrated TTSR session_start/turn_start/message_update/turn_end/agent_end handlers into guardrails/index.js - Deleted ttsr/ extension directory Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 02:28:40 +02:00
Mikael Hugo	2d5a05a48b	fix(security): resolve 7 findings from full-repo code review - Create web/middleware.ts to authenticate all API routes via bearer token and origin checks (previously unauthenticated due to missing middleware file) - Fix path traversal in browse-directories: replace startsWith with realpathSync + relative + isAbsolute containment checks - Fix XSS in session HTML export: escape raw HTML blocks via marked renderer - Fix PTY process leak: destroy session on SSE stream cancellation - Fix unhandled exception in terminal sessions POST: wrap getOrCreateSession in try/catch with structured JSON error response - Fix silent child-process failure in headless dispatch: add exit handler to write failed claim when sf headless triage exits non-zero - Fix TypeError on malformed claim JSON: add Array.isArray guard before accessing claim.ids.length All changes type-check cleanly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 02:18:43 +02:00
Mikael Hugo	def1edefa9	sf snapshot: uncommitted changes after 268m inactivity	2026-05-15 02:08:06 +02:00
Mikael Hugo	2e4bdd292c	fix: keep hidden sf commands callable in print mode	2026-05-14 21:25:18 +02:00
Mikael Hugo	f88b48b0aa	fix: show print mode liveness	2026-05-14 20:59:19 +02:00
Mikael Hugo	487237a32c	fix: bound sf print mode and chat routing	2026-05-14 20:55:00 +02:00
Mikael Hugo	47867c1236	feat: route clear sf chat commands	2026-05-14 20:21:37 +02:00
Mikael Hugo	ab1a1edcf9	refactor: tier sf slash commands	2026-05-14 20:14:09 +02:00
Mikael Hugo	7ea41b89ae	feat(ai,coding-agent): wireModelId — provider deployment alias Adds an optional wireModelId field to the Model interface and a resolveWireModelId helper. Forge's canonical model.id stays stable for selection, capability scoring, policy, and history; providers now send model.wireModelId on the wire when set, model.id otherwise. Use cases: Azure deployment names, vendor model slugs that differ from Forge's canonical identity, A/B routing where the operator wants canonical history but a specific deployment. Wired through every provider in @singularity-forge/ai (anthropic, amazon-bedrock, azure-openai-responses, google, google-vertex, google-gemini-cli, mistral, openai-codex-responses, openai-completions, openai-responses) plus @singularity-forge/coding-agent's ModelRegistry (model definitions + per-model overrides). Tests: openai-completions wireModelId payload coverage + model-registry-auth-mode coverage for the override + definition fields. Full pi-ai + coding-agent suite: 956/956 ✓ (7 unrelated skipped). This realizes the model-registry contract drafted in `1d753af6b`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 09:25:21 +02:00
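The fallback rule stated in this commit (send `model.wireModelId` when set, `model.id` otherwise) can be sketched in a few lines. The `Model` interface below is reduced to the two fields the message mentions, and the body of `resolveWireModelId` is an assumption about how the helper behaves, not a copy of it; in particular, treating only `undefined` as "not set" is a guess.

```typescript
// Reduced to the two fields named in the commit message; the real interface has more.
interface Model {
  id: string; // canonical identity used for selection, policy, and history
  wireModelId?: string; // optional deployment alias sent to the provider
}

// Assumed behavior: prefer the alias when present, else the canonical id.
function resolveWireModelId(model: Model): string {
  return model.wireModelId ?? model.id;
}

// Hypothetical example values, not taken from the repository.
const canonical: Model = { id: "example-model" };
const aliased: Model = { id: "example-model", wireModelId: "my-azure-deployment" };

console.log(resolveWireModelId(canonical)); // prints "example-model"
console.log(resolveWireModelId(aliased)); // prints "my-azure-deployment"
```

Keeping `id` untouched means history and capability scoring stay keyed on one stable name even when the deployment alias changes.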
Mikael Hugo	a342868068	feat(packages): extract @singularity-forge/openai-codex-provider Mirrors the @singularity-forge/google-gemini-cli-provider package layout for the codex CLI integration boundary. The new package owns: - CodexAppServerClient (the JSON-RPC subprocess client; previously packages/ai/src/providers/codex-app-server-client.ts, no pi-ai internal coupling) - snapshotCodexCliAccount / discoverCodexCliModels (reads ~/.codex/models_cache.json with visibility=list ∧ supported_in_api filter; previously inline in src/resources/extensions/sf/openai-codex-catalog.js) openai-codex-responses.ts (the stream-shaping provider) intentionally stays in @singularity-forge/ai because it depends on pi-ai stream-event internals and is not reusable outside the provider — same scope as google-gemini-cli.ts vs google-gemini-cli-provider. The SF extension's openai-codex-catalog.js is now a thin SF-side cache writer that delegates to discoverCodexCliModels, mirroring how gemini-catalog.js delegates to discoverGeminiCliModels. readCodexAvailableModels became async to match the dynamic-import path; tests updated. Closes sf-mp4u5fcz-wh6ac9 (with documented AC2 narrowing — see resolution). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 06:48:19 +02:00
Mikael Hugo	f68ab20953	fix(ai): backfill MiniMax M2/M2.1 cacheRead pricing	2026-05-14 04:55:46 +02:00
Mikael Hugo	383e495085	feat(headless,gemini-cli): add sf headless usage + unify gemini quota path Adds a machine-readable headless surface for live LLM-provider usage and unifies the gemini-cli quota fetch through one helper, removing the duplication that existed between usage-bar.js and the new package. 1. snapshotGeminiCliAccount in @singularity-forge/google-gemini-cli-provider - Single source of truth for { projectId, userTierId, userTierName, paidTier, models[] } via setupUser + retrieveUserQuota. - Dedups buckets per modelId, keeping the worst (lowest remainingFraction) so consumers always see the most-restrictive window. Code Assist sometimes returns multiple buckets per model; the pessimistic choice is what every consumer needs. - discoverGeminiCliModels(cwd?) wraps it for catalog-cache callers that only need the IDs. 2. sf headless usage subcommand - New src/headless-usage.ts handler. text (default) and --json output. Uses the package's snapshot directly — no RPC child, no jiti gymnastics — matching the shape of headless-uok-status / headless-doctor. - Wired into src/headless.ts after the doctor block. - Help text adds the command line. 3. usage-bar.js refactored to delegate - fetchGeminiUsage no longer imports gemini-cli-core directly. It calls snapshotGeminiCliAccount and reshapes the result into the existing { provider, displayName, windows[] } UI contract. - Eliminates the duplicate setupUser + retrieveUserQuota code path. - The fast existsSync(~/.gemini/oauth_creds.json) pre-flight stays so unauth'd users get a friendly message without paying for OAuth bootstrap. 4. Model registry refactor (separate track committed alongside) - src/resources/extensions/sf/model-registry.ts (new) consolidates canonical model identity, capability tier, and generation tags into one source of truth that auto-model-selection, benchmark-selector, and model-router now consume instead of maintaining parallel maps. 
All 1487 tests pass (151 files); typecheck clean for both the package and the SF extensions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 03:42:53 +02:00
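The per-model deduplication described in item 1 of this commit (keep the bucket with the lowest `remainingFraction`) can be sketched as below. `QuotaBucket` and `keepWorstBuckets` are hypothetical names; only `modelId`, `remainingFraction`, and `resetTime` come from the commit messages, and the real bucket shape returned by Code Assist is not shown in the log.

```typescript
// Hypothetical bucket shape built from the field names in the commit messages.
interface QuotaBucket {
  modelId: string;
  remainingFraction: number; // 0 = exhausted, 1 = untouched
  resetTime?: string;
}

// Keep one bucket per model: the one with the least quota remaining,
// so consumers always see the most restrictive window.
function keepWorstBuckets(buckets: QuotaBucket[]): QuotaBucket[] {
  const worst = new Map<string, QuotaBucket>();
  for (const bucket of buckets) {
    const current = worst.get(bucket.modelId);
    if (!current || bucket.remainingFraction < current.remainingFraction) {
      worst.set(bucket.modelId, bucket);
    }
  }
  return Array.from(worst.values());
}
```

On a tie the first bucket seen is kept; the log does not say how the real helper breaks ties.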
Mikael Hugo	c6a3fa6a6a	feat(gemini-cli): discover account models via gemini-cli-core + retry on capacity errors Two related fixes for the google-gemini-cli provider, both motivated by today's dogfood diagnosis: SF was pinned to a single model (gemini-3-flash-preview) even though the AI Ultra account has access to seven (verified via the live gemini-cli-core probe), and a transient "No capacity available for model X on the server" was classified as `unknown` so SF gave up instead of retrying. 1. Account snapshot + model discovery in @singularity-forge/google-gemini-cli-provider - Add `snapshotGeminiCliAccount(cwd?)` returning { projectId, userTierId, userTierName, paidTier, models } where `models[]` carries each modelId with usedFraction, remainingFraction, and resetTime. Built on the same setupUser + CodeAssistServer.retrieveUserQuota path usage-bar.js already uses, but extracted to the dedicated package so any consumer (model picker, capacity diagnostics, catalog cache) can call one helper. - Add `discoverGeminiCliModels(cwd?)` as a thin "just the IDs" wrapper. - Both are best-effort: any failure (OAuth expired, no project, network) returns null silently — never throws. 2. SF-side cache writer at src/resources/extensions/sf/gemini-catalog.js - Delegates discovery to the package; only handles cache file path, 6-hour TTL, and the session_start lifecycle hook. - Cache lands at .sf/runtime/model-catalog/google-gemini-cli.json with the same shape as the generic model-catalog-cache, so getKnownModelIds and the model picker pick it up transparently. - Wired into bootstrap/register-hooks.js session_start in parallel with the existing scheduleModelCatalogRefresh (the generic REST + API-key path can't reach gemini-cli's OAuth-only Code Assist endpoint). 3. Capacity error classification fix - error-classifier.js SERVER_RE now matches "no capacity (available\|left)", "capacity (unavailable\|exhausted)", and "no capacity ... on the server". 
Previously these fell through to kind=unknown, which is not transient, so agent-end-recovery never retried — even though the same handler already caps gemini-cli rate-limit backoff at 30s for exactly this class of transient. With the pattern matched as `server`, the existing retry-with-backoff path covers it. The full extension test suite (1386 tests) passes. Typecheck clean for both the package and the SF extensions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 03:32:35 +02:00
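The classification change in item 3 of this commit can be illustrated with a small pattern that covers the three phrasings the message lists. This is a sketch under stated assumptions: `CAPACITY_RE`, `classifyError`, and `isTransient` are hypothetical names, the real `SERVER_RE` in error-classifier.js contains other alternatives not shown in the log, and the exact regex syntax used there is not known.

```typescript
type ErrorKind = "server" | "unknown";

// Covers the three phrasings named in the commit message; case-insensitive.
const CAPACITY_RE =
  /no capacity (?:available|left)|capacity (?:unavailable|exhausted)|no capacity\b.*\bon the server/i;

// Capacity messages are classified as "server" so a retry path can handle them.
function classifyError(message: string): ErrorKind {
  return CAPACITY_RE.test(message) ? "server" : "unknown";
}

// In this sketch only "server" errors are considered retryable.
function isTransient(kind: ErrorKind): boolean {
  return kind === "server";
}

const kind = classifyError("No capacity available for model X on the server");
console.log(kind, isTransient(kind)); // prints "server true"
```

The pattern has no `g` flag, so repeated `test` calls carry no `lastIndex` state between messages.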

840 commits