Closes sf-mp8g4rcd-w01tkh (FINAL prompt-never-sent root cause): the
silent early return at agent-runner.js:182 that has caused 59+
runaway-loop:idle-halt feedback entries and the recurring "Autonomous
loop stuck — no heartbeat" cascade.
Root cause: when swarm-dispatch's bus delivers a message and the SF
kernel marks the unit as dispatched, the consumer agent's inbox
sometimes does not see the message immediately (different MessageBus
instance, SQLite read-cache lag). The previous code silently returned
{turnsProcessed:0, response:null}. The caller (swarm-dispatch
dispatchAndWait) treated that as "no work", so the LLM never ran and
the unit appeared cancelled with no diagnostic.
Fix: bounded retry on a missing message with exponential backoff:
50, 100, 200, 400, 800 ms (1.55s total max). If the target message
appears during a retry, log a recovery event and proceed normally. If
it is still missing after the last retry, throw a loud error with the
full inbox state in the message. The caller wraps the call in
try/catch and surfaces the failure as turnResult.error, so the
autonomous loop sees a real failure instead of phantom forward
progress.
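A minimal sketch of the retry shape described above, not the actual
agent-runner.js code. The names waitForMessage, readInbox, and
RETRY_DELAYS_MS are hypothetical, and the inbox is assumed to be an
array of objects with an `id` field:

```javascript
// Hypothetical sketch: names and inbox shape are assumptions, not the real code.
const RETRY_DELAYS_MS = [50, 100, 200, 400, 800]; // sums to 1550 ms

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// readInbox: assumed async function returning the current inbox as an array.
// wait/log are injectable so the retry schedule can be tested without real delays.
async function waitForMessage(
  readInbox,
  messageId,
  { delays = RETRY_DELAYS_MS, wait = sleep, log = console.warn } = {}
) {
  let inbox = await readInbox();
  let found = inbox.find((m) => m.id === messageId);
  if (found) return found; // common case: message is already visible

  for (let attempt = 0; attempt < delays.length; attempt++) {
    await wait(delays[attempt]);
    inbox = await readInbox();
    found = inbox.find((m) => m.id === messageId);
    if (found) {
      log(`recovered: message ${messageId} appeared after ${attempt + 1} retries`);
      return found;
    }
  }

  // Loud failure: include the inbox state so the error is diagnosable.
  const ids = JSON.stringify(inbox.map((m) => m.id));
  throw new Error(
    `message ${messageId} missing after ${delays.length} retries; inbox ids: ${ids}`
  );
}
```

The key design point is that the function never returns an empty
result: it either returns the message or throws.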
What this resolves:
- Earlier today, `sf headless triage --apply` timed out at 480000ms
because the triage-decider subagent hit this bug. With retries,
triage-decider has 1.55s of latency tolerance to receive its prompt.
- The 59 backlogged runaway-loop:idle-halt entries are symptoms of
the same root cause. Future occurrences will surface as loud errors
rather than phantom "stuck" units, so the operator or auto-supervisor
can react.
Validated:
- 578 tests pass (49 files), including the agent-runner,
swarm-dispatch, and inbox tests.
- runAgentTurn callers (auto/loop.js, agent-swarm.js, swarm-dispatch
dispatchAndWait) all already handle thrown errors via try/catch
with explicit error surfacing, so the contract change is safe.
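For illustration, the caller-side contract might look like the sketch
below. runTurnSafely is a hypothetical wrapper, not a function in the
repo. The result shape is assumed from the {turnsProcessed, response}
fields and the turnResult.error field mentioned above:

```javascript
// Hypothetical caller-side sketch: converts a thrown error into turnResult.error.
async function runTurnSafely(runAgentTurn, unit) {
  try {
    const result = await runAgentTurn(unit);
    return { ...result, error: null };
  } catch (err) {
    // Surface the failure explicitly instead of reporting "no work".
    const message = err instanceof Error ? err.message : String(err);
    return { turnsProcessed: 0, response: null, error: message };
  }
}
```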
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>