# Reliability ## Exit Codes (machine surface) `sf headless` is the current machine-surface command. These codes describe the non-interactive runner and are independent from output format: text, one JSON result, and streaming JSONL use the same completion semantics. | Code | Meaning | |------|---------| | 0 | Success — unit or session completed cleanly | | 1 | Error or timeout | | 10 | Blocked — LLM called an interactive tool that requires user input; parent must respond or abort | | 11 | Cancelled — SIGINT or SIGTERM received | | 12 | Reload — agent requested restart-with-resume on the same session | ## Failure Modes and Recovery ### Process crash mid-unit **Detection:** Lock file in `.sf/` is present on next launch; RPC child process is gone. **Recovery path (`src/resources/extensions/sf/auto-recovery.ts`):** 1. Read the surviving session JSONL from `~/.sf/sessions//` 2. Synthesize a recovery briefing from every tool call recorded on disk 3. Resume the LLM mid-unit with the briefing as context — no state is lost 4. If the session JSONL is unreadable, fall back to starting the unit fresh ### Timeout **Detection:** Machine-surface parent receives no heartbeat within `HEADLESS_HEARTBEAT_INTERVAL_MS` (60 000 ms), or the unit wall-clock exceeds the configured timeout. **Recovery path:** `auto-timeout-recovery.ts` writes a timeout summary, marks the unit `needs_fix`, and advances the loop. The parent exits with code 1 unless `--max-restarts` allows a retry. ### Stuck detection (repeating-pattern loops) **Detection (`src/resources/extensions/sf/auto-stuck-detection.ts`):** Sliding-window analysis over the last ~10 unit results. If the same A→B→A→B pattern repeats, the loop is classified as stuck. **Recovery path:** Retry once with a deep diagnostic prompt that shows the pattern. If still stuck, stop and surface the exact expected file for human inspection. Stuck state persists across session restarts. ### Provider API errors (transient) **Detection:** `bootstrap/provider-error-resume.ts` intercepts 429, 500, 503 responses. **Recovery path:** Exponential backoff; re-queue the unit. If a provider is consistently unavailable, route to the configured fallback model. ### Verification gate failures **Detection:** `auto-verification.ts` runs lint/test after each task; non-zero exit = failure. **Recovery path:** Auto-retry the task up to 2× with the agent receiving full command output as context. After 2 failures the task is marked `needs_fix` and the loop advances with a warning. ### Budget ceiling hit **Detection:** `auto-budget.ts` tracks cumulative dollar cost; emits warnings at 75%, 80%, 90%, and halts at 100%. **Recovery path:** Auto-mode pauses; user must explicitly approve resumption. The current unit is not retried. ## Restart Loop (machine surface) `sf headless autonomous --max-restarts 3` applies exponential backoff: 5 s → 10 s → 30 s (cap). After exhausting restarts the parent exits with code 1. Each restart resumes via crash recovery above. ## Observability | Signal | Location | |--------|----------| | Structured trace | `.sf/traces/trace-.json` — full session span tree with tokens, cost, duration | | Event audit log | `.sf/event-log.jsonl` — every unit completion, tool call, decision save (v2 format) | | Desktop notifications | OS-native; configurable via preferences (`notifications.*`) | | Stderr progress | Human-readable machine-surface progress goes to stderr; stdout carries the batch JSON result for `--output-format json` or JSONL events for `--output-format stream-json` | | Heartbeat | Emitted every 60 s to detect hung parent/child communication | ## Release Checks Before shipping a build: ```bash just test # full unit test suite just smoke-test # sf --version, sf --help, sf --print just typecheck # tsc extensions, no emit just lint # eslint ```