singularity-forge/docs/RELIABILITY.md
2026-05-05 15:42:10 +02:00

72 lines
3.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Reliability
## Exit Codes (headless mode)
| Code | Meaning |
|------|---------|
| 0 | Success — unit or session completed cleanly |
| 1 | Error or timeout |
| 10 | Blocked — LLM called an interactive tool that requires user input; parent must respond or abort |
| 11 | Cancelled — SIGINT or SIGTERM received |
| 12 | Reload — agent requested restart-with-resume on the same session |
## Failure Modes and Recovery
### Process crash mid-unit
**Detection:** Lock file in `.sf/` is present on next launch; RPC child process is gone.
**Recovery path (`src/resources/extensions/sf/auto-recovery.ts`):**
1. Read the surviving session JSONL from `~/.sf/sessions/<session-id>/`
2. Synthesize a recovery briefing from every tool call recorded on disk
3. Resume the LLM mid-unit with the briefing as context — no state is lost
4. If the session JSONL is unreadable, fall back to starting the unit fresh
### Timeout
**Detection:** Headless parent receives no heartbeat within `HEADLESS_HEARTBEAT_INTERVAL_MS` (60 000 ms), or the unit wall-clock exceeds the configured timeout.
**Recovery path:** `auto-timeout-recovery.ts` writes a timeout summary, marks the unit `needs_fix`, and advances the loop. The parent exits with code 1 unless `--max-restarts` allows a retry.
### Stuck detection (repeating-pattern loops)
**Detection (`src/resources/extensions/sf/auto-stuck-detection.ts`):** Sliding-window analysis over the last ~10 unit results. If the same A→B→A→B pattern repeats, the loop is classified as stuck.
**Recovery path:** Retry once with a deep diagnostic prompt that shows the pattern. If still stuck, stop and surface the exact expected file for human inspection. Stuck state persists across session restarts.
### Provider API errors (transient)
**Detection:** `bootstrap/provider-error-resume.ts` intercepts 429, 500, 503 responses.
**Recovery path:** Exponential backoff; re-queue the unit. If a provider is consistently unavailable, route to the configured fallback model.
### Verification gate failures
**Detection:** `auto-verification.ts` runs lint/test after each task; non-zero exit = failure.
**Recovery path:** Auto-retry the task up to 2× with the agent receiving full command output as context. After 2 failures the task is marked `needs_fix` and the loop advances with a warning.
### Budget ceiling hit
**Detection:** `auto-budget.ts` tracks cumulative dollar cost; emits warnings at 75%, 80%, 90%, and halts at 100%.
**Recovery path:** Auto-mode pauses; user must explicitly approve resumption. The current unit is not retried.
## Restart Loop (headless daemon mode)
`sf headless autonomous --max-restarts 3` applies exponential backoff: 5 s → 10 s → 30 s (cap). After exhausting restarts the parent exits with code 1. Each restart resumes via crash recovery above.
## Observability
| Signal | Location |
|--------|----------|
| Structured trace | `.sf/traces/trace-<timestamp>.json` — full session span tree with tokens, cost, duration |
| Event audit log | `.sf/event-log.jsonl` — every unit completion, tool call, decision save (v2 format) |
| Desktop notifications | OS-native; configurable via preferences (`notifications.*`) |
| Stderr progress | All headless output goes to stderr; stdout carries JSON result when `--output-format json` |
| Heartbeat | Emitted every 60 s to detect hung parent/child communication |
## Release Checks
Before shipping a build:
```bash
just test # full unit test suite
just smoke-test # sf --version, sf --help, sf --print
just typecheck # tsc extensions, no emit
just lint # eslint
```