fix(headless-triage): 60s no-output watchdog to cap session.prompt hang
Some checks are pending
CI / detect-changes (push) Waiting to run
CI / docs-check (push) Blocked by required conditions
CI / lint (push) Blocked by required conditions
CI / build (push) Blocked by required conditions
CI / integration-tests (push) Blocked by required conditions
CI / windows-portability (push) Blocked by required conditions
CI / rtk-portability (linux, blacksmith-4vcpu-ubuntu-2404) (push) Blocked by required conditions
CI / rtk-portability (macos, macos-15) (push) Blocked by required conditions
CI / rtk-portability (windows, blacksmith-4vcpu-windows-2025) (push) Blocked by required conditions
When session.prompt() hits the deadlock tracked in sf-mp8e02m1-zpk903
(the Promise never resolves before LLM dispatch, with 0 syscall
activity, and blocks until the outer abort), the triage call had no
fast-fail path because it ran with noOutputTimeoutMs=0. The full
8-minute timeoutMs had to elapse before the parent abort fired, wasting
8 minutes of subscription window per stuck triage attempt.

This adds a 60s no-output watchdog: if no meaningful subagent event
fires for 60s, the prompt is aborted. Combined with the diagnostic logs
in subagent-runner.ts (commit 67e5ac9db), the operator gets:
    [subagent:triage-decider] phase=session.prompt-entered ...
    [subagent:triage-decider] STUCK phase=session.prompt 10001ms ...
    [forge] triage] apply blocked: triage-decider produced no output for 60000ms
                                                                         ↑ 60s, not 480s
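
For readers unfamiliar with the pattern, a no-output watchdog of this
kind can be sketched roughly as below. This is an illustrative sketch
only, not the code in subagent-runner.ts: runWithNoOutputWatchdog,
NoOutputTimeoutError, Work and onActivity are hypothetical names, and
the assumption that 0 disables the watchdog is inferred from the
noOutputTimeoutMs=0 behaviour described above. The key point is that
the caller races the work against an idle timer, so it unblocks even
when the underlying Promise never settles and ignores the abort signal.

```typescript
// Hypothetical sketch: these names do not come from subagent-runner.ts.
type Work<T> = (signal: AbortSignal, onActivity: () => void) => Promise<T>;

class NoOutputTimeoutError extends Error {
  readonly idleMs: number;
  constructor(idleMs: number) {
    super(`produced no output for ${idleMs}ms`);
    this.name = "NoOutputTimeoutError";
    this.idleMs = idleMs;
  }
}

async function runWithNoOutputWatchdog<T>(
  work: Work<T>,
  noOutputTimeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects only when the idle timer fires; otherwise stays pending.
  let rejectIdle: (err: Error) => void = () => {};
  const idle = new Promise<never>((_resolve, reject) => {
    rejectIdle = reject;
  });

  // (Re)start the idle timer. Called once up front and on every event.
  const arm = (): void => {
    if (timer !== undefined) clearTimeout(timer);
    if (noOutputTimeoutMs <= 0) return; // assumed: 0 means "no watchdog"
    timer = setTimeout(() => {
      controller.abort(); // cooperative cancel, in case the work listens
      rejectIdle(new NoOutputTimeoutError(noOutputTimeoutMs));
    }, noOutputTimeoutMs);
  };

  arm();
  try {
    // The race is what unblocks the caller when `work` never settles.
    return await Promise.race([work(controller.signal, arm), idle]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A caller would pass the meaningful-event callback as onActivity, so
each event pushes the deadline out by another noOutputTimeoutMs, while
the overall timeoutMs remains a separate outer bound.
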

Triage failure stays non-fatal (per the existing handleTriage error
catch in the headless.ts auto-triage path), so the autonomous loop
continues to its main milestone dispatch. Net effect: when the triage
deadlock fires, SF moves on after 60s instead of 480s (8× sooner).

This does not fix the underlying Promise deadlock (still tracked in
sf-mp8e02m1-zpk903 and the new sf-mpmpXXX-... follow-up). It is an
"unblock the autonomous loop now, fix the deadlock later" patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 67e5ac9db1
commit d80060fec5
1 changed file with 6 additions and 1 deletion

@@ -443,6 +443,11 @@ async function defaultAgentRunner(
       tools: options.tools ?? agent.tools,
     }) ?? agent.systemPrompt;
   const appendedPrompt = `${composed}\n\n## Task Input\n\n${task}`;
+  // noOutputTimeoutMs = 60000 (60s): without this, a hung session.prompt
+  // (sf-mp8e02m1-zpk903 — Promise deadlock pre-LLM-dispatch with 0 syscall
+  // activity) blocks for the full timeoutMs (8min) before the parent
+  // abort fires. 60s no-output captures the hang fast enough to keep the
+  // autonomous loop moving while the root-cause investigation continues.
   const result = await runSubagent(
     {
       systemPrompt: appendedPrompt,
@@ -452,7 +457,7 @@ async function defaultAgentRunner(
       name: agent.name,
     },
     task,
-    { timeoutMs: DEFAULT_AGENT_TIMEOUT_MS },
+    { timeoutMs: DEFAULT_AGENT_TIMEOUT_MS, noOutputTimeoutMs: 60_000 },
   );
   return {
     ok: result.ok,