singularity/singularity-forge

Author	SHA1	Message	Date
Mikael Hugo	f8e53840da	fix(rpc, web): integrate drain into forceShutdown + healthz-503 on shutdown Three fixes addressing codex's adversarial review of the earlier orphan-recovery / graceful-shutdown landing: (1) Codex point B: single shutdown path. Removed the parallel installGracefulShutdown() handler in rpc-mode.ts that was adding a second SIGTERM listener and racing forceShutdown()'s teardown. The drain is now the FIRST step inside forceShutdown() (before killTrackedDetachedChildren / extension session_shutdown / etc.) so DB writes complete cleanly while child processes are still alive to flush. Race-free against the existing shutdown ordering. (2) Codex point D: recovery-before-each-drain. Cloud-volume mtime visibility lags between containers can mean an orphan `.draining` file from a previous container isn't visible during the startup scan but appears moments later. drainQueuedSfFeedbackCommands() now runs recoverOrphanedFeedbackDrains() as its first step, so each dispatch's drain sees the latest filesystem state. (3) Codex point E: healthz returns 503 during shutdown. New module src/web/shutdown-state.ts holds a per-process flag, auto-registers SIGTERM/SIGINT/SIGHUP handlers on first read, and exposes a snapshot (signal, startedAt, elapsedMs) for diagnostics. The healthz route imports isShuttingDown() and returns 503 when set, so k8s readinessProbe / Forgejo blue-green probes drain traffic BEFORE we actually stop responding. Tests: - rpc-mode-orphan-recovery.test.ts: 8/8 still green - web-shutdown-state.test.ts: 5/5 new (default false, mark sets flag, idempotent, signal exposed via snapshot, null signal for manual mark) Deferred to a follow-up commit (codex didn't flag, but noted for completeness): a SIGTERM-drain child-process integration test that spawns rpc-mode + sends a real signal. The 5 unit tests cover the flag logic; the integration test would cover the full process tree and is bulkier than the current commit warrants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 22:35:50 +02:00
Mikael Hugo	68178a9260	fix(rpc-test): use .js extension for recovery module import tsgo rejects `.ts` extensions in imports without allowImportingTsExtensions. Updated the test to import from "./feedback-queue-recovery.js" which is both ESM-compatible and matches the rest of the package convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 22:30:10 +02:00
Mikael Hugo	d54f18c95f	feat(rpc): orphan-recovery + 10-min graceful shutdown for safe container swap Two related changes to make blue/green upgrades (per scripts/upgrade-vega-source-server.mjs) safe for in-flight self-feedback writes. 1. Startup orphan recovery (feedback-queue-recovery.ts, extracted module). Scans .sf/runtime/ for sf-feedback-queue.jsonl.<pid>(.<sid>)?.draining files left by previous processes. For each: - if our own session id: leave alone (live drain) - if PID is alive: leave alone (foreign drainer) - else: rename back to queue (only if no active queue file exists) Crash safety: when both an orphan AND an active queue exist, we DEFER recovery rather than merge, because appending then unlinking would risk duplicate replay on crash. The next restart's recovery picks it up once the queue is naturally drained. Supports legacy filenames (.<pid>.draining, pre-session-id) for backward compat. Added SF_DRAIN_SESSION_ID (per-process 6-byte hex) stamped into the .draining filename. PID reuse across container restarts is normally safe because /proc clears, but the session id is a stronger guarantee that we don't trample a foreign drainer that happens to land on the same PID. 2. SIGTERM/SIGINT drain-then-exit handler (installGracefulShutdown). Drains the queue once on signal, then exits. Bounded by SF_RPC_SHUTDOWN_GRACE_MS (default 600_000 = 10 min). Rationale: if a drain is in flight, it MUST finish; losing self-feedback writes across a server upgrade is worse than a long wait. Normal drains complete in <1s; the 10-min ceiling is for pathological lock contention. Operator overrides via env var, or docker kill / kubectl delete --force for hard bypass. Upgrader script bumped to docker stop --timeout 610 (10s safety margin past the grace). k8s deployments must set terminationGracePeriodSeconds≥610 for the rolling-update path.
Tests: rpc-mode-orphan-recovery.test.ts: 7 cases covering empty, no-orphans, dead-PID single recovery, both-files-deferred (codex's crash-safety fix), live-PID untouched, multiple-dead-PIDs, malformed-filename ignored. Refs sf-mpa5kdpu (drainer orphans never recovered), sf-mpa4g46x (original RPC hang). Codex adversarial-reviewed; the PID-reuse hardening and crash-safety deferral landed per its feedback. Open follow-ups: shutdown-aware /api/healthz returning 503 (codex point E), integrate with existing forceShutdown ordering (codex point C). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 22:29:24 +02:00
Mikael Hugo	55a498603f	fix(rpc): don't unref the sf-feedback drain timer The drainer was scheduled via setTimeout(0) with timer.unref(). The unref made the timer release-eligible — fine in a long-running rpc-mode child where the process has plenty of other event-loop handles, but fatal in the packaged-standalone path where the rpc subprocess has nothing else to keep it alive. The process exited before the timer fired, so the queue file was renamed to .<pid>.draining and then stranded forever. Removed timer.unref(). The setTimeout(0) still lets the RPC response go back to the caller first (no synchronous blocking on the drain), but the timer now keeps the process alive until the drain handler runs, and the drain's own async I/O keeps it alive until done. Refs sf-mpa6wuhm-wwddd1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 21:55:23 +02:00
Mikael Hugo	acd907fec2	fix: harden sf server control loop	2026-05-17 21:13:12 +02:00
Mikael Hugo	cf2d1a768e	feat(sf): route server control through rpc	2026-05-17 20:07:36 +02:00
Mikael Hugo	cc32ab79d9	fix(docs): remove stale hashline-{read,edit}.ts rows post-fold Hashline read/edit tool wrappers were folded into Edit({match}) and Read({format}) modes in commit `ffdec0fee`. The two rows in FILE-SYSTEM-MAP.md pointed to files that no longer exist. Updated the surviving hashline.ts row to note its new consumer relationship with Edit/Read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 18:48:34 +02:00
Mikael Hugo	623af869b1	remove: SF voice IVR / ElevenLabs paging — migrated to centralcloud Per operator-direction 2026-05-17 (R089 — Migrate Voice IVR / ElevenLabs On-Call Paging Infrastructure out of SF). Migration target landed in centralcloud monorepo: - centralcloud_core/lib/centralcloud_core/voice.ex (TwiML + ElevenLabs) - centralcloud_staff/lib/.../controllers/voice_controller.ex (Phoenix) - centralcloud_staff/lib/.../controllers/voice_prompt_controller.ex - centralcloud_staff/lib/.../router.ex (/twilio scope) SF removal: - web/app/api/voice/route.ts - web/app/api/voice/prompt/route.ts - web/app/api/voice/ directory - src/tests/integration/web-voice-ivr-contract.test.ts Operator-paging infra was historical drift in SF (per-project compiler); belongs in centralcloud (org-level ops). R088 (Pre-Removal Test-Import Safety Gate) not yet built — operator manually verified safety scan: TWILIO_/ELEVENLABS_ env vars only referenced in the deleted files; no internal SF callers; centralcloud version verified present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 17:42:16 +02:00
Mikael Hugo	ffdec0feee	fold: hashline_edit + hashline_read → Edit({match}) + Read({format}) modes Per operator R-entry sf-mp9wo7e3-sdxqss + no-compat directive. - Edit gains `match: "substring"|"anchor"` arg; anchor mode routes to the existing applyHashlineEdits logic. Substring stays default. - Read gains `format: "plain"|"tagged"` arg; tagged mode emits LINE#HASH prefixes via formatHashLines. - Delete hashline-edit.ts, hashline-read.ts. KEEP hashline.ts (helpers are now Edit/Read internals). - tools/index.ts: drop the two tools + the createHashlineCodingTools preset. - agent-session.ts: setEditMode no longer swaps tool instances (single tool surface; mode preserved for system-prompt context only). - sdk.ts + index.ts: remove hashline tool re-exports. - headless-ui.ts + test: remove hashline_edit case. Net agent-visible tool surface: -2 tools. Capability preserved as modes. No backward-compat alias for the removed tool names. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:39:59 +02:00
Mikael Hugo	d03758d803	feat: replace launchd with systemd user-unit install path Operator-direction 2026-05-17 "we will never use mac" — no compat preservation. Single-cutover replacement. - new packages/daemon/src/systemd.ts: install/uninstall/status using systemctl --user + ~/.config/systemd/user/sf-server.service - new packages/daemon/src/systemd.test.ts: ports launchd tests, same shape, mocked systemctl via RunCommandFn injection + SF_SYSTEMD_USER_DIR env override for real filesystem tests - cli-main.ts: switch import + update help text + status messages - index.ts: re-export systemd module (installSystemdUnit, uninstallSystemdUnit, systemdUnitStatus, generateUnit, getServicePath, SystemdStatus, SystemdUnitOptions) - DELETED: launchd.ts (253 LOC), launchd.test.ts (379 LOC) - docs/dev/drafts/M053-per-repo-supervisor.md: remove "launchd" mention - CHANGELOG.md: document systemd-only install path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-17 17:33:34 +02:00
Mikael Hugo	57fef5979d	feat: make sf server the operator entrypoint	2026-05-17 17:23:46 +02:00
Mikael Hugo	6e3b3d3c54	feat: add Serena-style AST tools (ReplaceSymbol, InsertAroundSymbol, AstGrep) Wraps the native AST primitives from @singularity-forge/native/{edit,ast} as LLM tools so agents can do tree-sitter-anchored code edits instead of substring-based Edit or line-anchor hashline. - replace-symbol.ts (+117): wraps replaceSymbol(file, symbolPath, newBody); matches function/class/method declarations via tree-sitter, returns matched=false sentinel when the symbol isn't located. - insert-around-symbol.ts (+122): wraps insertAroundSymbol with position enum BeforeDecl/AfterDecl/AtBodyStart/AtBodyEnd. - ast-grep.ts (+152): wraps astGrep for pattern matching across files with $VAR/$$$ARGS meta-variables; returns ranked matches with byte/line/column + captured meta-variable bindings. Each tool: - typebox schema matching the existing AgentTool pattern (edit.ts) - notifyFileChanged() into the LSP layer on write ops - resolveToCwd() for path normalization - catches native errors + returns isError result with the NativeUnavailableError message pointing operators to `nix develop` + `node rust-engine/scripts/build.js --dev` Wire-in: - tools/index.ts: re-exports + imports + entries in `allTools` map and createAllTools() factory. - extension-manifest.json: ReplaceSymbol / InsertAroundSymbol / AstGrep appended to provides.tools so SF extension agents see them. Higher value than substring/line-anchor for code in tree-sitter-supported languages (TS/JS/TSX/Python/Rust). Edit + hashline remain for non-code files. PascalCase names per the Claude-Code-aligned convention from sf-mp9w20y1-nld9hc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 17:14:12 +02:00
Mikael Hugo	19b10eb67c	feat: make sf-server own swarm registry sync	2026-05-17 17:05:16 +02:00
Mikael Hugo	9a84d82cdb	chore(release): 2.75.3 → 2.75.4 + workspace dependency refresh Bumps version across the workspace (root + 10 @singularity-forge/* packages) and lands the pending dependency refresh that had been sitting uncommitted: @anthropic-ai/sdk 0.95.1 → 0.96.0 @anthropic-ai/vertex-sdk 0.14.4 → 0.16.0 @google/genai 2.0 → 2.3 @logtape/{file,logtape,pretty,redaction} 2.0.7 → 2.0.9 @smithy/node-http-handler 4.7.0 → 4.7.3 @clack/prompts 1.3 → 1.4 @types/mime-types 2.1 → 3.0 Inter-package refs in packages/{daemon,ai}/package.json bumped to ^2.75.4 so the workspace stays self-consistent. package-lock.json regenerated via `npm install --package-lock-only --legacy-peer-deps`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 23:59:14 +02:00
Mikael Hugo	f55d490e1d	fix(subagent-runner): drop spurious 10s STUCK warning on session.prompt The phaseWatchdog at 10s fired "STUCK phase=session.prompt" on every healthy LLM call longer than 10 seconds. Verified via strace on the running dogfood sf: bytes were actively flowing on the TLS socket (fd 29) to the LLM provider while STUCK was being logged — the session.prompt was never actually stuck, the watchdog was just diagnostic-only and oblivious to stream activity. The noOutputTimeoutMs watchdog (set to 60s for triage in commit `d80060fec`) is the actual kill mechanism. It is already event-aware: every meaningful subagent event resets the timer via armNoOutputTimer + isMeaningfulSubagentOutputEvent. The 10s STUCK warning was added in commit `67e5ac9db` as investigation infrastructure for the sf-mp8e02m1-zpk903 family of bugs, but now it is just noise that makes legitimate 30-200s LLM responses look broken. Keeps the 10s STUCK watchdog for the three setup phases (resourceLoader.reload, createAgentSession, bindExtensions) where 10s of silence is a real hang signal — those phases normally run in sub-second. Also includes: - biome.json: bump $schema URL from 2.4.14 to 2.4.15 to match the current biome CLI (clears the deserialize warning) - scripts/check-test-imports.{,test.}mjs: format + drop a useless regex escape that biome flagged in landed code Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 23:49:43 +02:00
Mikael Hugo	365c6bbc3b	chore: formatter / linter touch-up (230 files) Pure formatting / lint-fix pass that ran during `npm run build:core` in the session that landed the agent-runner / quota / coverage / phase-2 routing work. No logic changes: indentation, trailing commas, import sort, etc. Captured separately so the actual feature commits stay scoped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 21:19:53 +02:00
Mikael Hugo	67e5ac9db1	diag(subagent-runner): per-phase timing + stuck-watchdog for sf-mp8e02m1-zpk903 Adds visible diagnostics to runSubagent so the next time the "session initialized but no LLM call" bug fires, the log identifies which setup phase hangs. Phases instrumented: - resourceLoader.reload() - createAgentSession() - bindExtensions(runLifecycle=...) - session.prompt() entry → return Output format (stderr, prefixed with [subagent:<name>]): phase=resourceLoader.reload 23ms phase=createAgentSession 142ms phase=bindExtensions 89ms runLifecycle=true phase=session.prompt-entered taskLen=8421 timeoutMs=480000 noOutputMs=180000 phase=session.prompt-returned 16234ms ← normal completion STUCK phase=<X> 10000ms (no completion signal ...) ← when watchdog fires Each phase has a soft 10s watchdog that emits a STUCK line if the await doesn't complete in time. The watchdog never aborts; it only surfaces visibility. Existing timeoutMs / noOutputTimeoutMs handle actual termination. This is investigation infrastructure for the third prompt-never-sent seam (coding-agent/subagent-runner). The agent-runner.js seam (sf-mp8g4rcd-w01tkh) was fixed in commit `8ee4d8358` with bounded retries.
This commit doesn't fix the underlying bug — it makes the bug self-reporting next time it fires so operator and autonomous loop both get actionable signal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:40:17 +02:00
Mikael Hugo	da0c41d375	sf snapshot: uncommitted changes after 56m inactivity	2026-05-16 14:59:40 +02:00
Mikael Hugo	e2e096c5c7	feat(rpc): configurable RPC init timeout via SF_RPC_INIT_TIMEOUT_MS Add resolveRpcInitTimeoutMs() helper and wire it into RpcClient.init(). Default init timeout increased from 30s to 120s. Override via env var. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 20:00:26 +02:00
Mikael Hugo	3a14fe86a7	test(list-models): isolate from developer's discovery-cache Tests were picking up the developer's real ~/.sf/agent/discovery-cache.json and seeing unexpected models in output. Pin tests to a guaranteed-missing path via the new _discoveryCacheFilePath option so the env they observe is solely what the test constructs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 16:37:11 +02:00
Mikael Hugo	7ba469cff1	feat(memory): add debug logging to memory extraction pipeline The memory extraction system has infrastructure (DB tables, LLM prompts, unit closeout wiring, embedding backfill) but zero processed units and only self-feedback-resolution memories. This suggests extraction is failing silently. Add debugLog() calls throughout extractMemoriesFromUnit() so we can observe: - Skip reasons (mutex busy, rate limited, already processed, file too small) - Start/done lifecycle per unit - LLM call and parse outcomes - Error messages on failure and retry This makes the extraction pipeline observable via --debug or the journal/debug log without changing behavior. Tests: 185 files / 1993 tests pass. Type check: clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 16:09:36 +02:00
Mikael Hugo	d57cd84d9a	fix(auto): make halt watchdog observable	2026-05-15 08:09:02 +02:00
Mikael Hugo	f9c147a08b	fix(swarm): ignore heartbeats for silent worker timeout	2026-05-15 08:00:35 +02:00
Mikael Hugo	e464a1bd6e	fix(swarm): bound silent worker responses	2026-05-15 07:35:31 +02:00
Copilot	cf9203aee0	feat(swarm): forward parent permission profile to in-process worker sessions In-process swarm workers get a fresh headless AgentSession whose permission extension defaults to read-only minimal. This blocks normal autonomous edits (e.g., write_file, edit) even when the parent session runs at normal or trusted level. - run-unit.js: add legacyPermissionLevelForProfile mapping and include executorPermissionLevel in the dispatch envelope. - swarm-dispatch.js: forward executorPermissionLevel from envelope to runAgentTurn as permissionLevel. - agent-runner.js: accept permissionLevel option and pass it to runSubagent config. - subagent-runner.ts: add permissionLevel to SubagentConfig; when set, temporarily set SF_PERMISSION_LEVEL env and run extension lifecycle so the permission extension reads the level before tool hooks execute. - Tests for envelope field, dispatch forwarding, and run-unit integration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 06:38:42 +02:00
Mikael Hugo	dbfaca61cf	fix(swarm): surface worker tool call count to bypass parent-ledger guard Round 7 dogfood failed with "0 tool calls — context exhaustion" even though the swarm worker's session DID call tools. Root cause: the phases-unit.js zero-tool-call guard reads from the PARENT session's message ledger via snapshotUnitMetrics. The swarm worker runs in an ISOLATED subagent session, so its tool calls never appear in the parent's messages; the guard always sees 0 and fires a false-positive context-exhaustion retry. Fix: - runUnitViaSwarm now returns swarmToolCallCount on the UnitResult, surfacing the real worker tool call count from the onEvent stream (collectedToolCalls.length, accurate end-to-end). - phases-unit.js zero-tool-call guard checks unitResult._via === "swarm" && swarmToolCallCount > 0 and bypasses the false-positive retry, logging "zero-tool-calls-swarm-bypass". Also adds a debug stderr line in subagent-runner.ts printing the tool count after bindExtensions, confirming the worker session HAS the full tool set (checkpoint + built-ins); Hypotheses 1 and 2 from the Round 8 brief ruled out by direct observation. Tests: 3 new (swarmToolCallCount = 0 / N / 1-on-checkpoint-only); 2518 tests pass total, 0 regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 05:46:17 +02:00
Mikael Hugo	46d9d45279	fix(bash): block wrong project python runtime	2026-05-15 05:33:28 +02:00
Mikael Hugo	1115437cec	feat(swarm): event streaming + outcome derivation for runUnitViaSwarm - Forward onEvent through swarm-dispatch → agent-runner → runSubagent - Collect toolcall_end events in runUnitViaSwarm to build real tool-use blocks - Detect checkpoint tool outcome for accurate unit completion signal - Add headless.ts graceful shutdown (async signal handler, 2.5s timeout) - RPC client stop() now awaits flush and propagates stop to child sessions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 04:54:58 +02:00
Mikael Hugo	903cdd4d9d	feat(subagent): event streaming for in-process runSubagent Add RunSubagentOptions.onEvent callback so callers (TUI live update panel for /delegate, /rubber-duck, etc.) get every session event without polling. Errors from the callback are caught so a buggy caller cannot crash the agent. Chain caller-supplied AbortSignal through a local AbortController in runSingleAgent and register it in a new liveSubagentControllers set so stopLiveSubagents aborts in-process subagents alongside the legacy spawn-based processes (cmux split, sift codebase_search). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 04:04:52 +02:00
Mikael Hugo	62f886430c	fix: run subagents in process by default	2026-05-15 03:59:34 +02:00
Mikael Hugo	8b0f0bbd65	fix: harden headless dogfood self-healing	2026-05-15 03:53:15 +02:00
Mikael Hugo	2d5a05a48b	fix(security): resolve 7 findings from full-repo code review - Create web/middleware.ts to authenticate all API routes via bearer token and origin checks (previously unauthenticated due to missing middleware file) - Fix path traversal in browse-directories: replace startsWith with realpathSync + relative + isAbsolute containment checks - Fix XSS in session HTML export: escape raw HTML blocks via marked renderer - Fix PTY process leak: destroy session on SSE stream cancellation - Fix unhandled exception in terminal sessions POST: wrap getOrCreateSession in try/catch with structured JSON error response - Fix silent child-process failure in headless dispatch: add exit handler to write failed claim when sf headless triage exits non-zero - Fix TypeError on malformed claim JSON: add Array.isArray guard before accessing claim.ids.length All changes type-check cleanly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 02:18:43 +02:00
Mikael Hugo	def1edefa9	sf snapshot: uncommitted changes after 268m inactivity	2026-05-15 02:08:06 +02:00
Mikael Hugo	2e4bdd292c	fix: keep hidden sf commands callable in print mode	2026-05-14 21:25:18 +02:00
Mikael Hugo	f88b48b0aa	fix: show print mode liveness	2026-05-14 20:59:19 +02:00
Mikael Hugo	487237a32c	fix: bound sf print mode and chat routing	2026-05-14 20:55:00 +02:00
Mikael Hugo	47867c1236	feat: route clear sf chat commands	2026-05-14 20:21:37 +02:00
Mikael Hugo	ab1a1edcf9	refactor: tier sf slash commands	2026-05-14 20:14:09 +02:00
Mikael Hugo	7ea41b89ae	feat(ai,coding-agent): wireModelId — provider deployment alias Adds an optional wireModelId field to the Model interface and a resolveWireModelId helper. Forge's canonical model.id stays stable for selection, capability scoring, policy, and history; providers now send model.wireModelId on the wire when set, model.id otherwise. Use cases: Azure deployment names, vendor model slugs that differ from Forge's canonical identity, A/B routing where the operator wants canonical history but a specific deployment. Wired through every provider in @singularity-forge/ai (anthropic, amazon-bedrock, azure-openai-responses, google, google-vertex, google-gemini-cli, mistral, openai-codex-responses, openai-completions, openai-responses) plus @singularity-forge/coding-agent's ModelRegistry (model definitions + per-model overrides). Tests: openai-completions wireModelId payload coverage + model-registry-auth-mode coverage for the override + definition fields. Full pi-ai + coding-agent suite: 956/956 ✓ (7 unrelated skipped). This realizes the model-registry contract drafted in `1d753af6b`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 09:25:21 +02:00
Mikael Hugo	ca7368e5f1	fix(bash): add 120s default timeout to prevent autonomous mode hangs - Add BUILT_IN_DEFAULT_TIMEOUT_SECS = 120 constant to bash tool - Compute effectiveTimeout = timeout ?? resolvedDefaultTimeout so LLM calls without a timeout get the 120s guard automatically - Add defaultTimeoutSeconds? to BashToolOptions for override at creation - Dynamic bashSchemaWithDefault describes the actual default in the LLM tool description, improving model awareness - Add BashSettings interface + getBashDefaultTimeoutSeconds() to SettingsManager so users can override or disable via settings.json - Wire defaultTimeoutSeconds into agent-session.ts _buildRuntime() Root cause: npx sf --help triggered npm package download, hanging for 4+ minutes without timeout, consuming entire autonomous run budget. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-11 19:12:33 +02:00
Mikael Hugo	338c75fc6f	refactor: complete rf-01/rf-02/rf-11 blocked todos rf-01: add ECONNREFUSED to isTransientNetworkError in anthropic-shared.ts, aligning with the NETWORK_RE pattern in error-classifier.js rf-02: add scripts/validate-model-cost-table.mjs to report coverage gaps and price divergence between model-cost-table.js and models.generated.ts; add 'validate-cost-table' script to package.json rf-11: extract 10 pure resource-display utility functions from interactive-mode.ts into packages/coding-agent/src/modes/interactive/resource-display.ts, reducing interactive-mode.ts by ~282 lines All 4375 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-11 16:45:39 +02:00
Mikael Hugo	1adc7f119c	refactor(rf-06): split auto/phases.js into per-phase modules 3538-line monolith → 6 focused modules + thin barrel: - phases-helpers.js (223 lines): shared helpers (generateMilestoneReport, closeoutAndStop, emitCancelledUnitEnd, maybeFireProductAudit, _resolveReportBasePath, recordLearningOutcomeForUnit) - phases-dispatch.js (486 lines): runDispatch + assessUokDiagnosticsDispatchGate - phases-guards.js (497 lines): runGuards + guard helpers - phases-pre-dispatch.js (760 lines): runPreDispatch - phases-unit.js (1477 lines): runUnitPhase + session timeout state - phases-finalize.js (542 lines): runFinalize - phases.js (13 lines): barrel re-export preserving original import surface Removed dead runPhaseReview export (zero callers confirmed). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-11 15:14:49 +02:00
Mikael Hugo	0b5fa75c0d	fix(lint): fix all pre-existing lint failures - check-sf-extension-inventory.mjs: expand parseDirectRegisteredCommands() scan to include 7 more files (guards/inturn.js, notifications/notify.js, permissions/index.js, ui/usage-bar.js, commands/legacy/audit.js, commands/legacy/create-extension.js, commands/legacy/create-slash-command.js) and filter results by BASE_RUNTIME_COMMAND_NAMES to exclude doc-string false positives ("name" in create-slash-command.js template text) - extension-manifest.json: remove 'clear' (subcommand of logs/notifications, never a top-level pi.registerCommand) - packages/pi-agent-core/src/db/sf-db.ts: fix 23 noVoidTypeReturn errors - openDatabase: void → boolean (caller uses return value at line 5625) - claimEscalationOverride: void → boolean (caller checks at escalation.js:243) - resolveSelfFeedbackEntry: void → boolean (caller checks at self-feedback.js:387) - copyWorktreeDb: void → boolean (caller checks at reconcileWorktreeDb) - compactUokMessages: void → {before,after} (caller returns value at message-bus.js:238) - insertSessionTurn: void → bigint|null (caller uses id at session-recorder.js:104) - expireStaleMemories: void → number (caller uses count at auto-start.js:1047) - deleteMemorySourceRow: void → boolean (caller returns value at memory-source-store.js:107) - deleteMemoryEmbedding: void → boolean (caller returns value at memory-embeddings.js:328) - updateBacklogItemStatus: remove dead return expression (callers discard value) - removeBacklogItem: remove dead return expression (callers discard value) - updateGateCircuitBreaker: remove dead return {total,avgMs,...} (wrong-type code accidentally merged from getGateLatencyStats, never reachable) - markUokMessageRead: remove dead return true/false (callers discard value) - Auto-fix formatting and organizeImports in ~30 source files (biome --write) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-11 04:02:31 +02:00
Mikael Hugo	e50321b62b	feat(selection): thread unitType + failure_mode into fallback outcome records - FallbackResolver.setUnitContext() stores {unitType,unitId} from autonomous dispatch - run-unit.js calls pi.setFallbackUnitContext() before/after each unit - _findAnyAvailableFallback uses real unitType/unitId from context, not sentinel - Schema v59: failure_mode column in llm_task_outcomes - insertLlmTaskOutcome accepts failure_mode (rate_limit, quota_exhausted, auth_error) - register-hooks.js passes event.classification.reason as failure_mode - register-hooks.js uses real event.unitId when available - ExtensionRuntimeActions.setFallbackUnitContext added to pi API surface Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-10 23:14:22 +02:00
Mikael Hugo	009651e86f	feat(selection): wire before_model_select into FallbackResolver for outcome-aware fallback When a model fails and FallbackResolver picks a replacement, it now: 1. Fires the before_model_select hook with reason='fallback' and the failing model's ID — the learning system records the failure outcome and returns the best Bayesian-blended replacement from llm_task_outcomes 2. Falls back to the existing heuristic sort (reasoning + context window) if the hook is unavailable or returns no override Changes: - BeforeModelSelectEvent: add optional currentModelId and reason fields - FallbackResolver: accept emitBeforeModelSelect in constructor; make _findAnyAvailableFallback async; fire hook before heuristic fallback - agent-session.ts: inject lazy emitBeforeModelSelect closure into resolver - register-hooks.js: record failure outcome when reason='fallback' before returning selectLearnedModel result Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-10 23:05:33 +02:00
Mikael Hugo	fb1bd3e5fa	refactor(shared): deduplicate shared/ utilities against coding-agent package exports - Add packages/coding-agent/src/utils/format.ts as the canonical source for formatDuration, formatTokenCount, truncateWithEllipsis, sparkline, formatDateShort, fileLink, stripAnsi, normalizeStringArray — all already exported from @singularity-forge/coding-agent via index.ts. - Convert shared/format-utils.js to a compatibility shim that re-exports the 8 functions from @singularity-forge/coding-agent. All 13 importers continue to work with no import changes required. - Convert shared/path-display.js to a compatibility shim that re-exports toPosixPath from @singularity-forge/coding-agent. Implementation in packages/coding-agent/src/utils/path-display.ts was already canonical. - shared/frontmatter.js is intentionally NOT shimmed: splitFrontmatter/ parseFrontmatterMap have a different API from the package's parseFrontmatter/ stripFrontmatter (flat-map vs {frontmatter, body} object). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-10 22:41:03 +02:00
Mikael Hugo	7227912a29	perf(search): move web-search provider injection from extension hook to native middleware - Create packages/coding-agent/src/core/providers/web-search-middleware.ts with WebSearchMiddleware class: injects web_search tool, enforces session budget (#1309), strips thinking blocks from history, and respects PREFERENCES.md search_provider. - Wire webSearchMiddleware.applyToPayload into sdk.ts onPayload callback (before extension hook dispatch) so injection runs as compiled TypeScript with zero jiti-dispatch overhead. - Export WebSearchMiddleware, webSearchMiddleware singleton, setPreferBraveResolver, CUSTOM_SEARCH_TOOL_NAMES, MAX_NATIVE_SEARCHES_PER_SESSION, and stripThinkingFromHistory from @singularity-forge/coding-agent so the extension can delegate to the same instance. - Refactor search-the-web/native-search.js: remove self-contained injection logic; import and delegate before_provider_request to webSearchMiddleware singleton. Use tri-state isAnthropicProvider (null/false/true) to synthesize a provider hint when event.model is absent but model_select has already fired — prevents the model-name heuristic from wrongly injecting into Copilot claude-* requests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-10 22:37:42 +02:00
Mikael Hugo	3fba4bcb03	refactor(mcp): move MCP connection manager to packages/coding-agent/src/core/mcp/ - Create config.ts with McpServerConfig types and readMcpConfigs/getServerConfig - Create auth.ts with buildHttpTransportOpts and createCliOAuthProvider - Create connection-manager.ts with McpConnectionManager class - Create index.ts re-exporting the public API - Export McpConnectionManager and helpers from @singularity-forge/coding-agent - Rewrite mcp-client extension as thin wrapper using McpConnectionManager - Rewrite auth.js as re-export shim from @singularity-forge/coding-agent - Update test to import buildHttpTransportOpts from @singularity-forge/coding-agent Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-10 22:19:46 +02:00
Mikael Hugo	a77e1551d2	refactor(memory): consolidate memory system, remove dead code - Delete memory-backfill.js — not imported anywhere, dead code - Rename memory-sleeper.js → tool-watchdog.js — misnamed; it is a tool-output watchdog with no relation to the memory store - Collapse memory-embeddings-llm-gateway.js into memory-embeddings.js — removes the lazy-import split; loadGatewayConfigFromEnv, createGatewayEmbedFn, and rerankCandidates are now direct exports - Remove buildEmbeddingFn() dead stub (always returned null) - Enable packages/coding-agent memory extraction extension by default (memory.enabled ?? true) so session-level extraction is active - Update all import sites and tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-10 18:17:49 +02:00
Mikael Hugo	3ffd882c8c	sf snapshot: uncommitted changes after 56m inactivity	2026-05-10 17:16:30 +02:00

55 commits