singularity-forge/docs/RELIABILITY.md
2026-05-05 15:42:10 +02:00

3.5 KiB
Raw Blame History

Reliability

Exit Codes (headless mode)

Code Meaning
0 Success — unit or session completed cleanly
1 Error or timeout
10 Blocked — LLM called an interactive tool that requires user input; parent must respond or abort
11 Cancelled — SIGINT or SIGTERM received
12 Reload — agent requested restart-with-resume on the same session

Failure Modes and Recovery

Process crash mid-unit

Detection: Lock file in .sf/ is present on next launch; RPC child process is gone.

Recovery path (src/resources/extensions/sf/auto-recovery.ts):

  1. Read the surviving session JSONL from ~/.sf/sessions/<session-id>/
  2. Synthesize a recovery briefing from every tool call recorded on disk
  3. Resume the LLM mid-unit with the briefing as context — no state is lost
  4. If the session JSONL is unreadable, fall back to starting the unit fresh

Timeout

Detection: Headless parent receives no heartbeat within HEADLESS_HEARTBEAT_INTERVAL_MS (60 000 ms), or the unit wall-clock exceeds the configured timeout.

Recovery path: auto-timeout-recovery.ts writes a timeout summary, marks the unit needs_fix, and advances the loop. The parent exits with code 1 unless --max-restarts allows a retry.

Stuck detection (repeating-pattern loops)

Detection (src/resources/extensions/sf/auto-stuck-detection.ts): Sliding-window analysis over the last ~10 unit results. If the same A→B→A→B pattern repeats, the loop is classified as stuck.

Recovery path: Retry once with a deep diagnostic prompt that shows the pattern. If still stuck, stop and surface the exact expected file for human inspection. Stuck state persists across session restarts.

Provider API errors (transient)

Detection: bootstrap/provider-error-resume.ts intercepts 429, 500, 503 responses.

Recovery path: Exponential backoff; re-queue the unit. If a provider is consistently unavailable, route to the configured fallback model.

Verification gate failures

Detection: auto-verification.ts runs lint/test after each task; non-zero exit = failure.

Recovery path: Auto-retry the task up to 2× with the agent receiving full command output as context. After 2 failures the task is marked needs_fix and the loop advances with a warning.

Budget ceiling hit

Detection: auto-budget.ts tracks cumulative dollar cost; emits warnings at 75%, 80%, 90%, and halts at 100%.

Recovery path: Auto-mode pauses; user must explicitly approve resumption. The current unit is not retried.

Restart Loop (headless daemon mode)

sf headless autonomous --max-restarts 3 applies exponential backoff: 5 s → 10 s → 30 s (cap). After exhausting restarts the parent exits with code 1. Each restart resumes via crash recovery above.

Observability

Signal Location
Structured trace .sf/traces/trace-<timestamp>.json — full session span tree with tokens, cost, duration
Event audit log .sf/event-log.jsonl — every unit completion, tool call, decision save (v2 format)
Desktop notifications OS-native; configurable via preferences (notifications.*)
Stderr progress All headless output goes to stderr; stdout carries JSON result when --output-format json
Heartbeat Emitted every 60 s to detect hung parent/child communication

Release Checks

Before shipping a build:

just test          # full unit test suite
just smoke-test    # sf --version, sf --help, sf --print
just typecheck     # tsc extensions, no emit
just lint          # eslint