# Reliability
## Exit Codes (machine surface)
`sf headless` is the current machine-surface command. These codes describe the
non-interactive runner and are independent of the output format: text, a single JSON
result, and streaming JSONL all use the same completion semantics.
| Code | Meaning |
|------|---------|
| 0 | Success — unit or session completed cleanly |
| 1 | Error or timeout |
| 10 | Blocked — LLM called an interactive tool that requires user input; parent must respond or abort |
| 11 | Cancelled — SIGINT or SIGTERM received |
| 12 | Reload — agent requested restart-with-resume on the same session |
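
A supervising process might branch on these codes roughly as follows. This is a minimal sketch: `ParentAction` and `actionForExitCode` are hypothetical names, and treating unlisted codes as errors is an assumption rather than documented behavior.

```typescript
// Hypothetical set of reactions a parent process could take.
type ParentAction =
  | "done"
  | "fail"
  | "respond-or-abort"
  | "stop"
  | "restart-with-resume";

// Map an `sf headless` exit code (from the table above) to a parent action.
function actionForExitCode(code: number): ParentAction {
  switch (code) {
    case 0:
      return "done"; // success
    case 10:
      return "respond-or-abort"; // blocked on user input
    case 11:
      return "stop"; // cancelled by SIGINT or SIGTERM
    case 12:
      return "restart-with-resume"; // agent asked for a reload
    default:
      return "fail"; // code 1, plus any unlisted code (assumption)
  }
}
```

Keeping the mapping in one pure function makes it easy to test separately from process spawning.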
## Failure Modes and Recovery
### Process crash mid-unit
**Detection:** Lock file in `.sf/` is present on next launch; RPC child process is gone.
**Recovery path (`src/resources/extensions/sf/auto-recovery.ts`):**
1. Read the surviving session JSONL from `~/.sf/sessions/<session-id>/`
2. Synthesize a recovery briefing from every tool call recorded on disk
3. Resume the LLM mid-unit with the briefing as context — no state is lost
4. If the session JSONL is unreadable, fall back to starting the unit fresh
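
The steps above can be sketched as a small parser. The record shape (`type: "tool_call"` with a `name` field) and the function `buildRecoveryBriefing` are assumptions for illustration; the actual session schema used by `auto-recovery.ts` is not shown in this document.

```typescript
// Build a short briefing from session JSONL text.
// Returns null when no line parses, which stands in for the
// "start the unit fresh" fallback in step 4.
function buildRecoveryBriefing(jsonl: string): string | null {
  const calls: string[] = [];
  let parsedAny = false;

  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const rec = JSON.parse(line);
      parsedAny = true;
      // Assumed record shape: { type: "tool_call", name: string, ... }
      if (rec && rec.type === "tool_call" && typeof rec.name === "string") {
        calls.push(rec.name);
      }
    } catch {
      // A torn final line is plausible after a crash; skip it.
    }
  }

  if (!parsedAny) return null;
  return `Recovered ${calls.length} tool call(s): ${calls.join(", ")}`;
}
```

Skipping unparseable lines individually means a single truncated trailing record does not discard the rest of the session.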
### Timeout
**Detection:** Machine-surface parent receives no heartbeat within `HEADLESS_HEARTBEAT_INTERVAL_MS` (60 000 ms), or the unit wall-clock exceeds the configured timeout.
**Recovery path:** `auto-timeout-recovery.ts` writes a timeout summary, marks the unit `needs_fix`, and advances the loop. The parent exits with code 1 unless `--max-restarts` allows a retry.
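
The two detection conditions can be expressed as one pure check. This is a sketch under assumptions: `UnitClock` and `checkTimeout` are hypothetical names, and only the 60 000 ms heartbeat interval comes from the text above.

```typescript
// Heartbeat interval quoted in the detection rule above.
const HEADLESS_HEARTBEAT_INTERVAL_MS = 60000;

// Hypothetical bookkeeping the parent keeps per unit.
interface UnitClock {
  startedAtMs: number; // when the unit began
  lastHeartbeatMs: number; // when the last heartbeat arrived
  unitTimeoutMs: number; // configured wall-clock limit
}

type TimeoutReason = "heartbeat" | "wall-clock" | null;

// Report which timeout condition (if any) holds at time nowMs.
function checkTimeout(clock: UnitClock, nowMs: number): TimeoutReason {
  if (nowMs - clock.lastHeartbeatMs > HEADLESS_HEARTBEAT_INTERVAL_MS) {
    return "heartbeat";
  }
  if (nowMs - clock.startedAtMs > clock.unitTimeoutMs) {
    return "wall-clock";
  }
  return null;
}
```

Passing `nowMs` in explicitly, rather than reading the system clock inside the function, keeps the check deterministic in tests.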
### Stuck detection (repeating-pattern loops)
**Detection (`src/resources/extensions/sf/auto-stuck-detection.ts`):** Sliding-window analysis over the last ~10 unit results. If the same A→B→A→B pattern repeats, the loop is classified as stuck.
**Recovery path:** Retry once with a deep diagnostic prompt that shows the pattern. If still stuck, stop and surface the exact expected file for human inspection. Stuck state persists across session restarts.
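
One way to detect an A→B→A→B loop is to measure how long the tail of the window has been strictly alternating between two distinct results. The sketch below assumes each unit result can be reduced to a string key; `trailingAlternationLength`, `looksStuck`, and the threshold of four entries are illustrative choices, not the logic in `auto-stuck-detection.ts`.

```typescript
// Length of the alternating A,B,A,B,... run at the end of the window.
function trailingAlternationLength(results: string[], windowSize = 10): number {
  const w = results.slice(-windowSize);
  const n = w.length;
  if (n < 2) return n;

  const a = w[n - 2];
  const b = w[n - 1];
  if (a === b) return 1; // a plain repeat is not an A/B alternation

  let len = 2;
  for (let i = n - 3; i >= 0; i--) {
    // Walking backwards, entries must alternate b, a, b, a, ...
    const expected = (n - 1 - i) % 2 === 0 ? b : a;
    if (w[i] !== expected) break;
    len++;
  }
  return len;
}

// Stuck once the same two results have alternated for two full cycles.
function looksStuck(results: string[], windowSize = 10): boolean {
  return trailingAlternationLength(results, windowSize) >= 4;
}
```

Checking only the tail means a loop that has already been broken by a different result is no longer reported.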
### Provider API errors (transient)
**Detection:** `bootstrap/provider-error-resume.ts` intercepts 429, 500, 503 responses.
**Recovery path:** Exponential backoff; re-queue the unit. If a provider is consistently unavailable, route to the configured fallback model.
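
A capped exponential backoff for the retryable statuses could look like this. Only the status codes 429, 500, and 503 come from the text above; the 1 s base, the doubling factor, the 60 s cap, and the function names are assumptions made for the sketch.

```typescript
// Status codes the text above lists as transient.
const RETRYABLE_STATUS = new Set<number>([429, 500, 503]);

function isRetryable(status: number): boolean {
  return RETRYABLE_STATUS.has(status);
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
// baseMs and capMs are assumed values, not taken from the source.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}
```

A production version would often add random jitter so that several re-queued units do not retry in lockstep.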
### Verification gate failures
**Detection:** `auto-verification.ts` runs the lint and test commands after each task; a non-zero exit code counts as a failure.
**Recovery path:** Auto-retry the task up to 2× with the agent receiving full command output as context. After 2 failures the task is marked `needs_fix` and the loop advances with a warning.
### Budget ceiling hit
**Detection:** `auto-budget.ts` tracks cumulative dollar cost; it emits warnings at 75%, 80%, and 90% of the budget, and halts at 100%.
**Recovery path:** Auto-mode pauses; user must explicitly approve resumption. The current unit is not retried.
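
The threshold behavior can be illustrated by reporting which thresholds a cost update newly crosses. `BudgetEvent` and `budgetEvents` are hypothetical names; the real `auto-budget.ts` may track state differently.

```typescript
// Warning thresholds quoted above, as fractions of the ceiling.
const WARN_THRESHOLDS = [0.75, 0.8, 0.9];

type BudgetEvent = { kind: "warn"; threshold: number } | { kind: "halt" };

// Events triggered when cumulative cost moves from prevCost to nextCost.
// Each threshold fires once: only when it is crossed by this update.
function budgetEvents(prevCost: number, nextCost: number, ceiling: number): BudgetEvent[] {
  const events: BudgetEvent[] = [];
  for (const t of WARN_THRESHOLDS) {
    const limit = t * ceiling;
    if (prevCost < limit && nextCost >= limit) {
      events.push({ kind: "warn", threshold: t });
    }
  }
  if (prevCost < ceiling && nextCost >= ceiling) {
    events.push({ kind: "halt" });
  }
  return events;
}
```

Comparing the previous and next cost, rather than only the current cost, avoids repeating the same warning on every update.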
## Restart Loop (machine surface)
`sf headless autonomous --max-restarts 3` applies exponential backoff: 5 s → 10 s → 30 s (cap). After exhausting restarts the parent exits with code 1. Each restart resumes via crash recovery above.
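
The delay schedule quoted above can be written as a lookup that saturates at the cap. `restartDelaySeconds` and `planRestarts` are illustrative names, and holding any restart beyond the third at 30 s is an assumption.

```typescript
// Delay before restart number `restartIndex` (0-based), following the
// 5 s, 10 s, 30 s schedule quoted above; later restarts stay at the cap.
function restartDelaySeconds(restartIndex: number): number {
  if (restartIndex <= 0) return 5;
  if (restartIndex === 1) return 10;
  return 30; // cap
}

// Full list of delays a parent would wait through for a given --max-restarts.
function planRestarts(maxRestarts: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < maxRestarts; i++) {
    delays.push(restartDelaySeconds(i));
  }
  return delays;
}
```

With `--max-restarts 3`, `planRestarts(3)` yields the three delays from the text; once they are exhausted the parent exits with code 1.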
## Observability
| Signal | Location |
|--------|----------|
| Structured trace | `.sf/traces/trace-<timestamp>.json` — full session span tree with tokens, cost, duration |
| Event audit log | `.sf/event-log.jsonl` — every unit completion, tool call, decision save (v2 format) |
| Desktop notifications | OS-native; configurable via preferences (`notifications.*`) |
| Stderr progress | Human-readable machine-surface progress goes to stderr; stdout carries the batch JSON result for `--output-format json` or JSONL events for `--output-format stream-json` |
| Heartbeat | Emitted every 60 s to detect hung parent/child communication |
## Release Checks
Before shipping a build:
```bash
just test # full unit test suite
just smoke-test # sf --version, sf --help, sf --print
just typecheck # tsc extensions, no emit
just lint # eslint
```