Reliability
Exit Codes (headless mode)
| Code | Meaning |
|---|---|
| 0 | Success — unit or session completed cleanly |
| 1 | Error or timeout |
| 10 | Blocked — LLM called an interactive tool that requires user input; parent must respond or abort |
| 11 | Cancelled — SIGINT or SIGTERM received |
| 12 | Reload — agent requested restart-with-resume on the same session |
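A parent process can branch on these codes. The following is a minimal sketch, assuming a supervisor that maps each code to one follow-up action; the `ParentAction` names and the `actionForExit` helper are hypothetical and not part of SF.

```typescript
// Exit codes as listed in the table above.
const ExitCode = {
  Success: 0,
  Error: 1,
  Blocked: 10,
  Cancelled: 11,
  Reload: 12,
} as const;

// Hypothetical action names, used only for illustration.
type ParentAction = "done" | "fail" | "ask-user" | "stop" | "restart";

function actionForExit(code: number): ParentAction {
  switch (code) {
    case ExitCode.Success:
      return "done";
    case ExitCode.Blocked:
      return "ask-user"; // parent must respond or abort
    case ExitCode.Cancelled:
      return "stop";
    case ExitCode.Reload:
      return "restart"; // restart-with-resume on the same session
    default:
      // Code 1 and any unlisted code are treated as failures (an assumption).
      return "fail";
  }
}
```

Treating unknown codes as failures keeps the supervisor conservative if new codes are added later.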
Failure Modes and Recovery
Process crash mid-unit
Detection: Lock file in .sf/ is present on next launch; RPC child process is gone.
Recovery path (src/resources/extensions/sf/auto-recovery.ts):
- Read the surviving session JSONL from ~/.sf/sessions/<session-id>/
- Synthesize a recovery briefing from every tool call recorded on disk
- Resume the LLM mid-unit with the briefing as context — no state is lost
- If the session JSONL is unreadable, fall back to starting the unit fresh
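The recovery steps above can be sketched as a pure function over the JSONL text. This is an illustration only: the `SessionEntry` shape, the `tool_call` type tag, and the briefing wording are assumptions, and the real logic in auto-recovery.ts may differ.

```typescript
// Hypothetical record shape; the real session JSONL schema may differ.
interface SessionEntry {
  type: string;
  tool?: string;
  summary?: string;
}

// Returns a briefing, or null when the file is unreadable and the
// caller should fall back to starting the unit fresh.
function buildRecoveryBriefing(jsonl: string): string | null {
  const steps: string[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      return null; // assumption: one corrupt line makes the file unreadable
    }
    if (typeof parsed !== "object" || parsed === null) return null;
    const entry = parsed as Partial<SessionEntry>;
    if (entry.type !== "tool_call") continue; // only tool calls feed the briefing
    const tool = entry.tool ?? "unknown tool";
    const summary = entry.summary ?? "(no summary)";
    steps.push(`${steps.length + 1}. ${tool}: ${summary}`);
  }
  return ["Resuming after a crash. Tool calls already on disk:", ...steps].join("\n");
}
```

Returning `null` instead of throwing lets the caller choose the fresh-start fallback without a separate error path.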
Timeout
Detection: Headless parent receives no heartbeat within HEADLESS_HEARTBEAT_INTERVAL_MS (60 000 ms), or the unit wall-clock exceeds the configured timeout.
Recovery path: auto-timeout-recovery.ts writes a timeout summary, marks the unit needs_fix, and advances the loop. The parent exits with code 1 unless --max-restarts allows a retry.
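The two detection conditions can be expressed as a single check. This is a rough sketch, assuming timestamps in epoch milliseconds; `UnitClock` and `checkTimeout` are hypothetical names, and only the 60 000 ms constant comes from the text above.

```typescript
const HEADLESS_HEARTBEAT_INTERVAL_MS = 60_000; // value stated in the text above

// Hypothetical bookkeeping for one running unit.
interface UnitClock {
  startedAt: number; // epoch ms when the unit started
  lastHeartbeatAt: number; // epoch ms of the most recent heartbeat
}

type TimeoutReason = "heartbeat" | "wall-clock" | null;

// Returns why the unit timed out, or null if it is still healthy.
function checkTimeout(
  clock: UnitClock,
  now: number,
  unitTimeoutMs: number,
): TimeoutReason {
  if (now - clock.lastHeartbeatAt > HEADLESS_HEARTBEAT_INTERVAL_MS) {
    return "heartbeat"; // no heartbeat within the interval
  }
  if (now - clock.startedAt > unitTimeoutMs) {
    return "wall-clock"; // unit exceeded its configured timeout
  }
  return null;
}
```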
Stuck detection (repeating-pattern loops)
Detection (src/resources/extensions/sf/auto-stuck-detection.ts): Sliding-window analysis over the last ~10 unit results. If the same A→B→A→B pattern repeats, the loop is classified as stuck.
Recovery path: Retry once with a deep diagnostic prompt that shows the pattern. If still stuck, stop and surface the exact expected file for human inspection. Stuck state persists across session restarts.
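One way to detect a repeating pattern in a sliding window is to test whether the tail of the window is periodic. The sketch below is not the algorithm in auto-stuck-detection.ts; `isStuck`, the string result keys, and the `minRepeats` parameter are assumptions made for illustration.

```typescript
// Returns true when the most recent results end in a cycle that repeats
// at least `minRepeats` times, for example A, B, A, B.
// Results are represented as opaque string keys (an assumption).
function isStuck(
  results: string[],
  windowSize = 10,
  minRepeats = 2,
): boolean {
  const window = results.slice(-windowSize);
  const maxPeriod = Math.floor(window.length / minRepeats);
  for (let period = 2; period <= maxPeriod; period++) {
    const tail = window.slice(-period * minRepeats);
    // The tail is periodic if every element equals the one `period` earlier.
    const periodic = tail.every(
      (value, i) => i < period || value === tail[i - period],
    );
    if (periodic) return true;
  }
  return false;
}
```

Note that a constant run such as A, A, A, A also matches with period 2, which seems reasonable for a unit that keeps producing the same result.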
Provider API errors (transient)
Detection: bootstrap/provider-error-resume.ts intercepts 429, 500, 503 responses.
Recovery path: Exponential backoff; re-queue the unit. If a provider is consistently unavailable, route to the configured fallback model.
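The retry decision can be sketched as follows. The status codes come from the text above; the `RetryPolicy` fields, their values, and the `decideRetry` helper are hypothetical, and the actual backoff parameters in provider-error-resume.ts are not documented here.

```typescript
const TRANSIENT_STATUS = new Set([429, 500, 503]); // codes listed above

// Hypothetical policy knobs, for illustration only.
interface RetryPolicy {
  baseDelayMs: number;
  maxDelayMs: number;
  maxAttempts: number;
}

type RetryDecision =
  | { kind: "retry"; delayMs: number }
  | { kind: "fallback" } // route to the configured fallback model
  | { kind: "fail" }; // not a transient error

// `attempt` is the number of failed attempts so far, starting at 0.
function decideRetry(
  status: number,
  attempt: number,
  policy: RetryPolicy,
): RetryDecision {
  if (!TRANSIENT_STATUS.has(status)) return { kind: "fail" };
  if (attempt >= policy.maxAttempts) return { kind: "fallback" };
  // Double the delay on each attempt, up to the cap.
  const delayMs = Math.min(policy.baseDelayMs * 2 ** attempt, policy.maxDelayMs);
  return { kind: "retry", delayMs };
}
```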
Verification gate failures
Detection: auto-verification.ts runs lint/test after each task; non-zero exit = failure.
Recovery path: Auto-retry the task up to 2× with the agent receiving full command output as context. After 2 failures the task is marked needs_fix and the loop advances with a warning.
Budget ceiling hit
Detection: auto-budget.ts tracks cumulative dollar cost; emits warnings at 75%, 80%, 90%, and halts at 100%.
Recovery path: Auto-mode pauses; user must explicitly approve resumption. The current unit is not retried.
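The threshold behaviour can be illustrated with a function that reports which thresholds a cost update crosses. This is a sketch under assumptions: `budgetEvents` and the `BudgetEvent` shape are hypothetical, and only the 75/80/90/100 percentages come from the text above.

```typescript
const WARNING_PERCENTS = [75, 80, 90]; // warning levels stated above

type BudgetEvent =
  | { kind: "warning"; percent: number }
  | { kind: "halt" };

// Compares spend before and after an update so each warning fires once,
// at the moment its threshold is crossed.
function budgetEvents(
  previousSpend: number,
  currentSpend: number,
  ceiling: number,
): BudgetEvent[] {
  const events: BudgetEvent[] = [];
  for (const percent of WARNING_PERCENTS) {
    // Cross-multiply instead of dividing to avoid fractional thresholds.
    const wasBelow = previousSpend * 100 < percent * ceiling;
    const isAtOrAbove = currentSpend * 100 >= percent * ceiling;
    if (wasBelow && isAtOrAbove) events.push({ kind: "warning", percent });
  }
  if (currentSpend >= ceiling) events.push({ kind: "halt" });
  return events;
}
```

For example, with a ceiling of 10, moving from 7 to 8.5 crosses both the 75% and 80% levels in one update, so two warnings are returned.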
Restart Loop (headless daemon mode)
sf headless auto --max-restarts 3 applies exponential backoff: 5 s → 10 s → 30 s (cap). After exhausting restarts the parent exits with code 1. Each restart resumes via crash recovery above.
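The delay schedule can be written as a small lookup. This is a sketch only: `restartDelayMs` is a hypothetical helper, and it assumes that any restart past the second one waits for the 30 s cap.

```typescript
// Returns the wait before restart number `restartIndex` (0-based),
// or null once `maxRestarts` is exhausted and the parent should exit with code 1.
function restartDelayMs(
  restartIndex: number,
  maxRestarts: number,
): number | null {
  if (restartIndex >= maxRestarts) return null;
  if (restartIndex === 0) return 5_000; // first restart: 5 s
  if (restartIndex === 1) return 10_000; // second restart: 10 s
  return 30_000; // third and later: 30 s cap
}
```

With `--max-restarts 3`, the indices 0, 1, and 2 produce the 5 s, 10 s, and 30 s waits described above, and index 3 returns `null`.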
Observability
| Signal | Location |
|---|---|
| Structured trace | .sf/traces/trace-<timestamp>.json — full session span tree with tokens, cost, duration |
| Event audit log | .sf/event-log.jsonl — every unit completion, tool call, decision save (v2 format) |
| Desktop notifications | OS-native; configurable via preferences (notifications.*) |
| Stderr progress | All headless output goes to stderr; stdout carries JSON result when --output-format json |
| Heartbeat | Emitted every 60 s to detect hung parent/child communication |
Release Checks
Before shipping a build:
just test # full unit test suite
just smoke-test # sf --version, sf --help, sf --print
just typecheck # tsc extensions, no emit
just lint # eslint