# Reliability
## Exit Codes (machine surface)
`sf headless` is the current machine-surface command. These codes describe the
non-interactive runner and are independent of the output format: text, a single
JSON result, and streaming JSONL all share the same completion semantics.
| Code | Meaning |
|------|---------|
| 0 | Success — unit or session completed cleanly |
| 1 | Error or timeout |
| 10 | Blocked — LLM called an interactive tool that requires user input; parent must respond or abort |
| 11 | Cancelled — SIGINT or SIGTERM received |
| 12 | Reload — agent requested restart-with-resume on the same session |
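
A parent process can branch on these codes. The sketch below is only an illustration: the `ParentAction` names and the `actionForExitCode` helper are hypothetical and not part of SF; only the numeric codes come from the table above.

```typescript
// Hypothetical parent-side actions; the names are illustrative only.
type ParentAction =
  | "done"
  | "fail"
  | "answer-or-abort"
  | "stop"
  | "restart-with-resume";

// Maps an `sf headless` exit code to what a parent might do next.
function actionForExitCode(code: number): ParentAction {
  switch (code) {
    case 0:
      return "done"; // success
    case 10:
      return "answer-or-abort"; // blocked on an interactive tool
    case 11:
      return "stop"; // cancelled by SIGINT or SIGTERM
    case 12:
      return "restart-with-resume"; // reload on the same session
    default:
      return "fail"; // 1 (error or timeout) and any undocumented code
  }
}
```

Treating unknown codes as failures is a conservative choice for this sketch, so that a new code never reads as success by accident.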
## Failure Modes and Recovery
### Process crash mid-unit
**Detection:** Lock file in `.sf/` is present on next launch; RPC child process is gone.
**Recovery path (`src/resources/extensions/sf/auto-recovery.ts`):**
1. Read the surviving session JSONL from `~/.sf/sessions/<session-id>/`
2. Synthesize a recovery briefing from every tool call recorded on disk
3. Resume the LLM mid-unit with the briefing as context — no state is lost
4. If the session JSONL is unreadable, fall back to starting the unit fresh
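
The steps above can be sketched roughly as follows. The record shape (`type`, `tool`, `summary`) and the `buildRecoveryBriefing` helper are assumptions made for illustration; the real session JSONL schema and the logic in `auto-recovery.ts` may differ.

```typescript
// Assumed record shape; the actual session JSONL schema is not documented here.
interface SessionRecord {
  type?: string;
  tool?: string;
  summary?: string;
}

// Builds a briefing from the tool calls found in a JSONL log.
// Returns null when the log is empty or unreadable, in which case the
// caller would fall back to starting the unit fresh (step 4).
function buildRecoveryBriefing(jsonl: string): string | null {
  const lines = jsonl.split("\n").filter((line) => line.trim() !== "");
  if (lines.length === 0) return null;

  const steps: string[] = [];
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      return null; // a corrupt line makes the whole log unreadable
    }
    if (typeof parsed !== "object" || parsed === null) return null;

    const record = parsed as SessionRecord;
    if (record.type === "tool_call") {
      const tool = record.tool ?? "unknown tool";
      const summary = record.summary ?? "";
      steps.push(`${steps.length + 1}. ${tool}: ${summary}`);
    }
  }
  return ["Recovered tool calls so far:", ...steps].join("\n");
}
```

This sketch rejects the whole log on the first corrupt line. A more forgiving variant could skip bad lines instead, at the cost of a possibly incomplete briefing.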
### Timeout
**Detection:** Machine-surface parent receives no heartbeat within `HEADLESS_HEARTBEAT_INTERVAL_MS` (60 000 ms), or the unit wall-clock exceeds the configured timeout.
**Recovery path:** `auto-timeout-recovery.ts` writes a timeout summary, marks the unit `needs_fix`, and advances the loop. The parent exits with code 1 unless `--max-restarts` allows a retry.
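
As a minimal sketch of the two detection conditions, assuming the parent tracks the time of the last heartbeat and the unit start time itself: `detectTimeout` and `TimeoutReason` are hypothetical names, and only the 60 000 ms interval comes from the text.

```typescript
// Interval value taken from the Detection paragraph.
const HEADLESS_HEARTBEAT_INTERVAL_MS = 60000;

type TimeoutReason = "heartbeat-missed" | "unit-timeout" | null;

// All arguments are in milliseconds; timestamps are epoch values such as Date.now().
function detectTimeout(
  nowMs: number,
  lastHeartbeatMs: number,
  unitStartedMs: number,
  unitTimeoutMs: number,
): TimeoutReason {
  // Condition 1: no heartbeat within the heartbeat interval.
  if (nowMs - lastHeartbeatMs > HEADLESS_HEARTBEAT_INTERVAL_MS) {
    return "heartbeat-missed";
  }
  // Condition 2: the unit has run longer than its configured wall-clock limit.
  if (nowMs - unitStartedMs > unitTimeoutMs) {
    return "unit-timeout";
  }
  return null;
}
```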
### Stuck detection (repeating-pattern loops)
**Detection (`src/resources/extensions/sf/auto-stuck-detection.ts`):** Sliding-window analysis over the last ~10 unit results. If the same A→B→A→B pattern repeats, the loop is classified as stuck.
**Recovery path:** Retry once with a deep diagnostic prompt that shows the pattern. If still stuck, stop and surface the exact expected file for human inspection. Stuck state persists across session restarts.
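
One way to picture the sliding-window check is the sketch below. It is not the code in `auto-stuck-detection.ts`: the `isStuck` helper, the use of plain string identifiers for unit results, and the "tail repeats at least twice" rule are assumptions chosen to illustrate the idea.

```typescript
// Returns true when the most recent results end in a cycle of length >= 2
// that occurs at least `minRepeats` times back to back (e.g. A, B, A, B).
function isStuck(
  unitIds: string[],
  windowSize: number = 10,
  minRepeats: number = 2,
): boolean {
  const recent = unitIds.slice(-windowSize);
  for (let period = 2; period * minRepeats <= recent.length; period++) {
    const tail = recent.slice(-period * minRepeats);
    let repeats = true;
    // Every element must equal the one a full period earlier.
    for (let i = period; i < tail.length; i++) {
      if (tail[i] !== tail[i - period]) {
        repeats = false;
        break;
      }
    }
    if (repeats) return true;
  }
  return false;
}
```

Note that a unit repeating itself (A, A, A, A) also matches, since it is a degenerate two-step cycle.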
### Provider API errors (transient)
**Detection:** `bootstrap/provider-error-resume.ts` intercepts 429, 500, 503 responses.
**Recovery path:** Exponential backoff; re-queue the unit. If a provider is consistently unavailable, route to the configured fallback model.
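
A minimal sketch of the transient-error check and backoff delay follows. The 1 s base and 60 s cap are placeholder values, not SF's actual settings, and both helper names are hypothetical; only the status codes come from the text above.

```typescript
// Status codes listed in the Detection line above.
const TRANSIENT_STATUSES: number[] = [429, 500, 503];

function isTransientProviderError(status: number): boolean {
  return TRANSIENT_STATUSES.indexOf(status) !== -1;
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
// baseMs and capMs are assumed defaults for illustration.
function providerBackoffMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60000,
): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}
```

Production backoff usually adds random jitter so that parallel workers do not retry in lockstep; it is left out here to keep the sketch deterministic.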
### Verification gate failures
**Detection:** `auto-verification.ts` runs lint/test after each task; non-zero exit = failure.
**Recovery path:** Auto-retry the task up to 2× with the agent receiving full command output as context. After 2 failures the task is marked `needs_fix` and the loop advances with a warning.
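
The retry behaviour might look like the sketch below. It reads "after 2 failures" as a total budget of two failed verification runs, which is an interpretation of the text. `runWithVerification`, `VerifyResult`, and the callback signature are hypothetical names for illustration.

```typescript
type TaskStatus = "done" | "needs_fix";

interface VerifyResult {
  exitCode: number; // exit code of the lint/test command
  output: string; // full command output, fed back to the agent on retry
}

// `attempt` runs the task plus verification; it receives the previous
// failure output (or null on the first run) as extra context.
function runWithVerification(
  attempt: (previousOutput: string | null) => VerifyResult,
  maxFailures: number = 2,
): TaskStatus {
  let previousOutput: string | null = null;
  let failures = 0;
  while (failures < maxFailures) {
    const result = attempt(previousOutput);
    if (result.exitCode === 0) return "done";
    failures += 1;
    previousOutput = result.output;
  }
  return "needs_fix"; // the loop would advance with a warning
}
```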
### Budget ceiling hit
**Detection:** `auto-budget.ts` tracks cumulative dollar cost; emits warnings at 75%, 80%, 90%, and halts at 100%.
**Recovery path:** Auto-mode pauses; user must explicitly approve resumption. The current unit is not retried.
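
The thresholds can be expressed as a small check. This is a sketch under assumptions: `checkBudget` and `BudgetCheck` are hypothetical names, and the real `auto-budget.ts` may track state differently (for example, to avoid repeating a warning already shown).

```typescript
// Warning thresholds from the Detection line, in percent of the ceiling.
const BUDGET_WARNING_PCTS: number[] = [75, 80, 90];

interface BudgetCheck {
  halt: boolean; // true once spend reaches 100% of the ceiling
  highestWarningPct: number | null; // highest warning threshold crossed so far
}

function checkBudget(spentUsd: number, ceilingUsd: number): BudgetCheck {
  let highestWarningPct: number | null = null;
  for (const pct of BUDGET_WARNING_PCTS) {
    // Compare without dividing to avoid floating-point surprises at the boundary.
    if (spentUsd * 100 >= pct * ceilingUsd) highestWarningPct = pct;
  }
  return { halt: spentUsd >= ceilingUsd, highestWarningPct };
}
```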
## Restart Loop (machine surface)
`sf headless autonomous --max-restarts 3` applies exponential backoff between restarts: 5 s → 10 s → 30 s (cap). Once the restarts are exhausted, the parent exits with code 1. Each restart resumes through the crash-recovery path described under "Process crash mid-unit".
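
The restart schedule can be sketched as a lookup over the delays listed above. `planRestart` and `RestartPlan` are hypothetical names, and using a fixed table rather than a formula is a choice made for this illustration.

```typescript
// Delays from the text above, in milliseconds; the last entry acts as the cap.
const RESTART_DELAYS_MS: number[] = [5000, 10000, 30000];

type RestartPlan =
  | { restart: true; delayMs: number }
  | { restart: false; exitCode: 1 };

// `restartsUsed` counts restarts already performed (0 before the first one).
function planRestart(restartsUsed: number, maxRestarts: number): RestartPlan {
  if (restartsUsed >= maxRestarts) {
    return { restart: false, exitCode: 1 }; // restarts exhausted
  }
  const lastIndex = RESTART_DELAYS_MS.length - 1;
  const index = Math.max(0, Math.min(restartsUsed, lastIndex));
  return { restart: true, delayMs: RESTART_DELAYS_MS[index] ?? 30000 };
}
```

With `--max-restarts 3` this yields waits of 5 s, 10 s, and 30 s before giving up. A larger limit would keep waiting 30 s between later restarts.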
## Observability
| Signal | Location |
|--------|----------|
| Structured trace | `.sf/traces/trace-<timestamp>.json` — full session span tree with tokens, cost, duration |
| Event audit log | `.sf/event-log.jsonl` — every unit completion, tool call, decision save (v2 format) |
| Desktop notifications | OS-native; configurable via preferences (`notifications.*`) |
| Stderr progress | Human-readable machine-surface progress goes to stderr; stdout carries the batch JSON result for `--output-format json` or JSONL events for `--output-format stream-json` |
| Heartbeat | Emitted every 60 s to detect hung parent/child communication |
## Release Checks
Before shipping a build:
```bash
just test # full unit test suite
just smoke-test # sf --version, sf --help, sf --print
just typecheck # tsc extensions, no emit
just lint # eslint
```