# Reliability

## Exit Codes (headless mode)

| Code | Meaning |
|------|---------|
| 0 | Success: unit or session completed cleanly |
| 1 | Error or timeout |
| 10 | Blocked: the LLM called an interactive tool that requires user input; the parent must respond or abort |
| 11 | Cancelled: SIGINT or SIGTERM received |
| 12 | Reload: the agent requested restart-with-resume on the same session |

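
A supervising parent process could branch on these codes roughly as follows. This is only a sketch: the `ParentAction` names and the `actionForExitCode` helper are hypothetical and not part of sf; only the numeric codes come from the table above.

```typescript
// Exit codes as listed in the table above.
const EXIT = { success: 0, error: 1, blocked: 10, cancelled: 11, reload: 12 } as const;

// Hypothetical parent-side actions; these names are illustrative only.
type ParentAction =
  | "done"
  | "retry-or-fail"
  | "answer-or-abort"
  | "stop"
  | "restart-with-resume";

function actionForExitCode(code: number): ParentAction {
  switch (code) {
    case EXIT.success:
      return "done";
    case EXIT.blocked:
      return "answer-or-abort";
    case EXIT.cancelled:
      return "stop";
    case EXIT.reload:
      return "restart-with-resume";
    default:
      // Code 1 and any unlisted code are treated as errors (an assumption).
      return "retry-or-fail";
  }
}
```

Treating unlisted codes as errors keeps the parent conservative if new codes appear later.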
## Failure Modes and Recovery

### Process crash mid-unit

**Detection:** Lock file in `.sf/` is present on next launch; RPC child process is gone.

**Recovery path (`src/resources/extensions/sf/auto-recovery.ts`):**

1. Read the surviving session JSONL from `~/.sf/sessions/<session-id>/`
2. Synthesize a recovery briefing from every tool call recorded on disk
3. Resume the LLM mid-unit with the briefing as context, so no state is lost
4. If the session JSONL is unreadable, fall back to starting the unit fresh

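
The recovery steps above can be sketched as a pure function over the log contents. The record shape (`type`, `tool`, `summary`) and the `buildRecoveryBriefing` name are assumptions made for illustration; the real JSONL schema and the logic in `auto-recovery.ts` may differ.

```typescript
// Hypothetical record shape: the real session JSONL schema is not documented here.
interface SessionRecord {
  type: string;
  tool?: string;
  summary?: string;
}

// Returns a briefing string, or null when the log is unreadable,
// in which case the caller would start the unit fresh.
function buildRecoveryBriefing(jsonl: string): string | null {
  const lines = jsonl.split("\n").filter((line) => line.trim() !== "");
  if (lines.length === 0) return null;

  const calls: string[] = [];
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      return null; // corrupt line: treat the whole log as unreadable
    }
    if (typeof parsed !== "object" || parsed === null) return null;
    const record = parsed as SessionRecord;
    if (record.type === "tool_call") {
      const detail = record.summary ? `: ${record.summary}` : "";
      calls.push(`${calls.length + 1}. ${record.tool ?? "unknown tool"}${detail}`);
    }
  }
  return `Recovered ${calls.length} tool call(s) from the previous run:\n${calls.join("\n")}`;
}
```

This sketch is deliberately strict: a single unparsable line discards the whole log. A more forgiving variant could skip a truncated final line instead.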
### Timeout

**Detection:** The headless parent receives no heartbeat within `HEADLESS_HEARTBEAT_INTERVAL_MS` (60 000 ms), or the unit's wall-clock time exceeds the configured timeout.

**Recovery path:** `auto-timeout-recovery.ts` writes a timeout summary, marks the unit `needs_fix`, and advances the loop. The parent exits with code 1 unless `--max-restarts` allows a retry.

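
The two detection conditions can be expressed as one pure check, which keeps the logic testable without real timers. This is a sketch under assumptions: `timeoutReason` and its parameters are hypothetical names, and `unitTimeoutMs` stands in for whatever the configured timeout is.

```typescript
// Value stated in the text above.
const HEADLESS_HEARTBEAT_INTERVAL_MS = 60000;

type TimeoutReason = "heartbeat" | "wall-clock" | null;

// All arguments are millisecond timestamps or durations supplied by the caller.
function timeoutReason(
  nowMs: number,
  lastHeartbeatMs: number,
  unitStartMs: number,
  unitTimeoutMs: number,
): TimeoutReason {
  // No heartbeat seen within the interval: the child is considered hung.
  if (nowMs - lastHeartbeatMs > HEADLESS_HEARTBEAT_INTERVAL_MS) return "heartbeat";
  // The unit has run longer than its configured budget.
  if (nowMs - unitStartMs > unitTimeoutMs) return "wall-clock";
  return null;
}
```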
### Stuck detection (repeating-pattern loops)

**Detection (`src/resources/extensions/sf/auto-stuck-detection.ts`):** Sliding-window analysis over the last ~10 unit results. If the same A→B→A→B pattern repeats, the loop is classified as stuck.

**Recovery path:** Retry once with a deep diagnostic prompt that shows the pattern. If still stuck, stop and surface the exact expected file for human inspection. Stuck state persists across session restarts.

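
One way to detect an A→B→A→B loop inside a sliding window is shown below. It is a sketch only: `isAlternatingLoop`, the string encoding of unit results, and the `minCycles` threshold are assumptions, and the real detector may recognize other patterns as well.

```typescript
// Returns true when the most recent results alternate between two distinct values
// for at least `minCycles` full cycles (A, B, A, B for minCycles = 2).
function isAlternatingLoop(results: string[], windowSize = 10, minCycles = 2): boolean {
  const window = results.slice(-windowSize); // only the last ~10 results matter
  const needed = 2 * minCycles;
  if (window.length < needed) return false;

  const tail = window.slice(-needed);
  const a = tail[0];
  const b = tail[1];
  if (a === b) return false; // a constant run is not an alternating pattern

  // Even positions must equal A and odd positions must equal B.
  return tail.every((result, i) => result === (i % 2 === 0 ? a : b));
}
```

The check looks only at the tail of the window, so earlier unrelated results do not mask a loop that has just formed.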
### Provider API errors (transient)

**Detection:** `bootstrap/provider-error-resume.ts` intercepts HTTP 429, 500, and 503 responses.

**Recovery path:** Exponential backoff; re-queue the unit. If a provider is consistently unavailable, route to the configured fallback model.

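
The backoff and fallback decisions might look like the following. Only the three status codes come from the text; the base delay, cap, failure threshold, and all function names are illustrative assumptions rather than values taken from `provider-error-resume.ts`.

```typescript
// Status codes named in the text above.
const TRANSIENT_STATUSES = new Set<number>([429, 500, 503]);

function isTransient(status: number): boolean {
  return TRANSIENT_STATUSES.has(status);
}

// Delay before retry number `attempt` (0-based), doubling each time up to a cap.
// baseMs and capMs are assumed values, not documented ones.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// After `threshold` consecutive failures, route to the fallback model.
function pickModel(
  consecutiveFailures: number,
  primary: string,
  fallback: string,
  threshold = 3,
): string {
  return consecutiveFailures >= threshold ? fallback : primary;
}
```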
### Verification gate failures

**Detection:** `auto-verification.ts` runs lint/test after each task; a non-zero exit code counts as a failure.

**Recovery path:** Auto-retry the task up to 2× with the agent receiving the full command output as context. After 2 failures the task is marked `needs_fix` and the loop advances with a warning.

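
A retry loop of this kind could be structured as below. The sketch reads "up to 2×" as two retries after the initial run, which is an assumption; if the real gate counts the initial run among the two attempts, `maxRetries` would be 1. The `runCheck` callback and `verifyWithRetries` name are hypothetical.

```typescript
interface CheckResult {
  exitCode: number;
  output: string;
}

type TaskStatus = "passed" | "needs_fix";

// `runCheck` receives the previous failure output (or null on the first run)
// so the agent can use it as context when fixing the task.
function verifyWithRetries(
  runCheck: (previousOutput: string | null) => CheckResult,
  maxRetries = 2,
): { status: TaskStatus; attempts: number } {
  let previousOutput: string | null = null;
  const maxAttempts = 1 + maxRetries;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = runCheck(previousOutput);
    if (result.exitCode === 0) return { status: "passed", attempts: attempt };
    previousOutput = result.output; // feed the full output into the next attempt
  }
  return { status: "needs_fix", attempts: maxAttempts };
}
```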
### Budget ceiling hit

**Detection:** `auto-budget.ts` tracks cumulative dollar cost; it emits warnings at 75%, 80%, and 90% of the ceiling and halts at 100%.

**Recovery path:** Auto-mode pauses; the user must explicitly approve resumption. The current unit is not retried.

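
Threshold crossing can be computed from the previous and new cumulative cost, so each warning fires once. This is a minimal sketch: `budgetEvents` and the `BudgetEvent` type are hypothetical, and only the percentages come from the text.

```typescript
// Warning levels from the text above, as whole percentages of the ceiling.
const WARN_PERCENTS = [75, 80, 90];

type BudgetEvent = { kind: "warn"; percent: number } | { kind: "halt" };

// Returns the events triggered when cumulative spend moves from prevCost to newCost.
function budgetEvents(prevCost: number, newCost: number, ceiling: number): BudgetEvent[] {
  const events: BudgetEvent[] = [];
  for (const percent of WARN_PERCENTS) {
    const limit = percent * ceiling;
    // Fire only when this update crosses the threshold for the first time.
    if (prevCost * 100 < limit && newCost * 100 >= limit) {
      events.push({ kind: "warn", percent });
    }
  }
  if (newCost >= ceiling) events.push({ kind: "halt" });
  return events;
}
```

Comparing scaled integers rather than fractions such as `0.8 * ceiling` sidesteps floating-point rounding at the boundaries.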
## Restart Loop (headless daemon mode)

`sf headless autonomous --max-restarts 3` applies exponential backoff: 5 s → 10 s → 30 s (cap). After exhausting restarts, the parent exits with code 1. Each restart resumes through the crash recovery path described above.

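
The restart decision can be sketched as a small lookup. The delay schedule is copied from the text rather than computed, since the stated steps are not a strict doubling; `nextRestartStep` and its return shape are hypothetical.

```typescript
// Schedule as stated in the text above; the last value acts as the cap.
const RESTART_DELAYS_S = [5, 10, 30];
const RESTART_CAP_S = 30;

type RestartStep =
  | { action: "restart"; delaySeconds: number }
  | { action: "exit"; code: number };

// `restartsUsed` counts restarts already performed (0 before the first one).
function nextRestartStep(restartsUsed: number, maxRestarts: number): RestartStep {
  if (restartsUsed >= maxRestarts) {
    return { action: "exit", code: 1 }; // restarts exhausted
  }
  const index = Math.min(Math.max(restartsUsed, 0), RESTART_DELAYS_S.length - 1);
  return { action: "restart", delaySeconds: RESTART_DELAYS_S[index] ?? RESTART_CAP_S };
}
```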
## Observability

| Signal | Location |
|--------|----------|
| Structured trace | `.sf/traces/trace-<timestamp>.json`: full session span tree with tokens, cost, and duration |
| Event audit log | `.sf/event-log.jsonl`: every unit completion, tool call, and decision save (v2 format) |
| Desktop notifications | OS-native; configurable via preferences (`notifications.*`) |
| Stderr progress | All headless output goes to stderr; stdout carries the JSON result when `--output-format json` is set |
| Heartbeat | Emitted every 60 s to detect hung parent/child communication |

## Release Checks

Before shipping a build:

```bash
just test        # full unit test suite
just smoke-test  # sf --version, sf --help, sf --print
just typecheck   # tsc extensions, no emit
just lint        # eslint
```