auto-prompts.js called `join(base, ...)` in 11 places but imported only
`basename` from node:path. Autonomous mode crashed on every iteration with
ReferenceError: join is not defined. Observed in the dr repo, where 3
consecutive iteration failures triggered the hard stop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema now accepts the same five levels used elsewhere in the codebase
(minimal/low/medium/high/bypassed) instead of the stale full/restricted/
sandbox triple. Docs and env test updated to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observed 2026-05-14: a triage --apply run hung for 33+ minutes because
the spawned subagent process stalled (provider SDK call without its own
timeout) and defaultAgentRunner had no watchdog — it waited indefinitely
on proc.on("close").
Adds a per-dispatch watchdog (default 8 min, override via
SF_TRIAGE_AGENT_TIMEOUT_MS env). On expiry: SIGTERM → 5s grace →
SIGKILL. Resolves immediately with ok=false / exitCode=124 (POSIX
timeout convention) so the trust / review / mutation gates surface
the failure as a real outcome instead of a silent stall.
Provider-agnostic: the timeout protects the orchestrator regardless of
which model the router picks. Operators running long-context provider
calls can bump the env var; default 8min matches runTriage /
runReflection's existing completeSimple timeout.
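A minimal sketch of the watchdog shape described above, not the shipped
defaultAgentRunner code. `runWithWatchdogSketch` and its option names are
hypothetical; `proc` is assumed to be anything with `kill(signal)` and
`on("close", cb)`, which a `child_process.spawn` handle provides.

```javascript
// Hypothetical sketch: race a child process against a watchdog timer.
// On expiry: SIGTERM, then SIGKILL after a grace period. The promise
// resolves immediately with exitCode 124 instead of waiting for "close".
function runWithWatchdogSketch(proc, { timeoutMs = 8 * 60 * 1000, graceMs = 5000 } = {}) {
  return new Promise((resolve) => {
    let settled = false;
    let killTimer = null;

    const settle = (result) => {
      if (settled) return; // first outcome wins
      settled = true;
      clearTimeout(watchdog);
      resolve(result);
    };

    const watchdog = setTimeout(() => {
      proc.kill("SIGTERM");
      // Escalate if the process ignores SIGTERM.
      killTimer = setTimeout(() => proc.kill("SIGKILL"), graceMs);
      settle({ ok: false, exitCode: 124, timedOut: true });
    }, timeoutMs);

    proc.on("close", (code) => {
      if (killTimer) clearTimeout(killTimer); // process is gone, no SIGKILL needed
      settle({ ok: code === 0, exitCode: code, timedOut: false });
    });
  });
}
```

Resolving before the process actually exits is the point: the caller gets
a real ok=false outcome even if the child never emits "close".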
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex audit follow-up (fix A). manual-attention outcomes were counted
by getGateRunStats but dropped from the user-facing surface — they
inflated `total` invisibly with no distinct column or key, so an
operator couldn't tell a gate with 5 pass / 3 manual-attention apart
from a gate with 5 pass / 3 fail.
Adds `manualAttention: number` to GateHealthEntry and renders it as
its own column between Fail and Retry in the human table. JSON
consumers get the new key alongside pass/fail/retry.
Test count for headless-uok-status.test.mjs: 30/30 (+2 new — column
present in header, distinguishable from fail in row).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds focused unit tests for the slice-3b wiring:
- UokGateRunner.run emits surface/runControl/permissionProfile/
parentTrace on all three trace paths (normal, unknown-gate,
circuit-breaker-blocked) and omits them when absent.
- buildAutonomousUokContext pins surface=autonomous + runControl=
autonomous and derives permissionProfile from session/prefs
(YOLO → low, prefs.permissionLevel honored, "high" default).
- emitAutonomousGate forwards the schema-v2 ctx into UokGateRunner
(covers the phases-pre-dispatch / phases-guards call sites via
the new shared helper).
- handlePlanSlice options.uokContext lands on every seeded Q3-Q8
quality_gates row; without it, rows stay in the legacy null shape.
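The permissionProfile derivation these tests pin down (YOLO or a
minimal/low permissionLevel gives "low", medium gives "medium", anything
else gives "high") could look roughly like this. The function name and
the `yolo` / `permissionLevel` input shape are assumptions for
illustration; the real helper lives in uok/auto-uok-ctx.js.

```javascript
// Hypothetical sketch of the session/prefs -> permissionProfile mapping.
function derivePermissionProfileSketch({ yolo = false, permissionLevel } = {}) {
  if (yolo) return "low"; // YOLO always wins
  if (permissionLevel === "minimal" || permissionLevel === "low") return "low";
  if (permissionLevel === "medium") return "medium";
  return "high"; // default: every other value, including an unset level
}
```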
Refactors phases-pre-dispatch and phases-guards to call the new
emitAutonomousGate helper so the three sites stay in sync going
forward. phases-finalize keeps its inline UokGateRunner because the
verification gate's execute callback isn't a static verdict.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Slice 3b of "Make UOK the SF Control Plane". handlePlanSlice now
accepts an optional uokContext option and threads it into every
insertGateRow call (Q3, Q4 slice gates; Q5, Q6, Q7 per task; Q8
slice closeout).
executePlanSlice derives the ctx from the singleton autonomous session
when one is active — currentTraceId becomes the v2 traceId/parentTrace,
surface and runControl are pinned to "autonomous", permissionProfile
follows session/prefs. Tools invoked outside an autonomous loop
(interactive REPL, headless one-shot) pass uokContext=null and the
seeded rows fall through to the legacy NULL-column shape, classified
as "legacy" by status uok.
Lazy import of auto/session.js keeps headless/test code paths from
paying the session-singleton load cost when they don't need it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Slice 3b of "Make UOK the SF Control Plane". The autonomous loop's
three high-traffic gate sites (phases-pre-dispatch: resource-version-guard,
pre-dispatch-health-gate, planning-flow-gate; phases-guards: plan-gate;
phases-finalize: unit-verification-gate)
now build a schema-v2 UOK run-context per iteration and pass
surface/runControl/permissionProfile/parentTrace into the gate runner.
The gate-runner emits these onto every gate_run trace event, so the
classifier in `sf headless status uok --json` reads them as
coverageStatus: "ok" instead of "legacy".
New helper uok/auto-uok-ctx.js pins surface="autonomous" and
runControl="autonomous" for these phases and derives permissionProfile
from session/prefs: "low" under YOLO or a minimal/low permissionLevel,
"medium" for medium, "high" otherwise (the default).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex audit (Q4) flagged that the mutation gate landed in slice 3a but
the test suite only verified the three earlier gates. Add coverage:
- agree-path: mutation-gate fires with outcome=fail, rejectedCount=1,
resolvedCount=0 (the test fixture has no real ledger entry for the
decision id, so markResolved rejects it — the gate correctly surfaces
the partial failure)
- disagree-path: mutation-gate does NOT fire (apply phase skipped)
Pins the 4-gate contract end-to-end. Suite: 4/4 in this file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Slice 3b of "Make UOK the SF Control Plane". UokGateRunner.run now reads
the schema-v2 run-context fields off ctx and propagates them into every
gate_run trace event (unknown-gate path, circuit-breaker-blocked path,
normal execution path). Fields are omitted when absent so legacy callers
keep the pre-v2 shape and status-uok continues to classify them as
"legacy" rather than "incomplete".
Helper buildGateRunEvent centralizes the trace shape so the three sites
stay in sync.
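A rough sketch of the "omit when absent" rule, assuming a flat event
object. The function name, the `type` field, and the base-event shape are
hypothetical; only the four v2 field names come from this change.

```javascript
// Hypothetical sketch: copy schema-v2 fields onto a gate_run event only
// when the caller supplied them, so legacy callers keep the pre-v2 shape.
const V2_FIELDS_SKETCH = ["surface", "runControl", "permissionProfile", "parentTrace"];

function buildGateRunEventSketch(base, ctx) {
  const source = ctx ?? {}; // tolerate null/undefined ctx from legacy callers
  const event = { type: "gate_run", ...base };
  for (const key of V2_FIELDS_SKETCH) {
    if (source[key] !== undefined && source[key] !== null) {
      event[key] = source[key];
    }
  }
  return event;
}
```

Because absent keys are never written, a reader can tell "legacy" (no v2
keys at all) apart from "incomplete" (some v2 keys, others missing).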
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the missing test case that confirms the fail-closed semantics
the parallel worker shipped in slice 3a: when the trace writer
cannot persist a UOK gate record (e.g. .sf/traces is unwritable),
runTriageApply MUST abort before any subagent runs and surface the
emission failure as the run error.
This pins down the contract codex Q5 noted as soft: enrichment
failures are debug-only, but PRIMARY gate emission for the apply
flow is hard-required. Without observable gates, an apply that
mutates the ledger has no audit trail — refusing is the right call.
Test asserts: trace-dir write failure → ok=false, error contains
"UOK gate emission failed for trusted-agent-source-gate", and the
mocked agentRunner was never invoked.
Suite: 1682/1682.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First production caller of the schema-v2 writer chain. Every
`sf headless triage --apply` invocation now emits four gate_run trace
events with surface=headless, runControl=supervised, permissionProfile=
high, traceId=flowId — making the gates visible in `status uok --json`
with coverageStatus: "ok" (or fail/manual-attention on reject paths).
Gates emitted, in order:
1. trusted-agent-source-gate — fires on the trust precondition:
pass: both triage-decider and rubber-duck are SF-shipped built-ins
fail: missing-agent OR non-builtin source OR untrusted custom runner
(covers all three pre-dispatch refusal paths so operators see the
failure in status uok, not just in the journal)
2. triage-plan-validation-gate — fires on the strict-parse contract:
pass: parseTriagePlanStrict returns a valid plan covering expectedIds
fail: missing marker / bad yaml / unknown id / outcome-required field missing
3. triage-apply-review-gate — fires on the rubber-duck verdict:
pass: rubber-duck: agree → apply phase proceeds
fail: rubber-duck disagreed → clean pause, no mutations
manual-attention: rubber-duck subagent failed to complete
4. triage-apply-mutation-gate — fires after applyTriagePlan:
pass: every approved mutation landed
fail: any rejected mutation
manual-attention: zero approved mutations (all decisions were "fix")
Includes counts in extra: resolvedCount, rejectedCount, pendingFixCount.
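The mutation-gate outcome rules in item 4 can be sketched as a small
pure function. The function name and return shape are hypothetical; the
three counts and the three outcomes are the ones listed above.

```javascript
// Hypothetical sketch of the triage-apply-mutation-gate verdict.
function classifyMutationGateSketch({ resolvedCount, rejectedCount, pendingFixCount }) {
  let outcome;
  if (rejectedCount > 0) {
    outcome = "fail"; // any rejected mutation fails the gate
  } else if (resolvedCount === 0) {
    outcome = "manual-attention"; // nothing was approved for mutation
  } else {
    outcome = "pass"; // every approved mutation landed
  }
  return { outcome, extra: { resolvedCount, rejectedCount, pendingFixCount } };
}
```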
Reader-side fixes (codex review follow-up on slice 3a):
- getDistinctGateIds (sf-db-gates.js) now UNIONs trace-event IDs with
quality_gates DB IDs instead of returning trace IDs early when any
exist. The old behavior silently hid slice-scoped DB-only gates the
moment a flow-scoped trace landed.
- getGateMeta (headless-uok-status.ts) now reads BOTH trace events and
DB row, then picks whichever has the later evaluatedAt. Tie-break
prefers trace (flow-scoped gates with no quality_gates FK row are
trace-only). Old behavior preferred trace whenever surface was set,
regardless of timestamp.
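A minimal sketch of the getGateMeta selection rule, assuming both
sources expose an ISO-8601 `evaluatedAt` string. The function name and
the record shape are hypothetical.

```javascript
// Hypothetical sketch: choose between trace-derived and DB-derived gate
// metadata by the later evaluatedAt; ties go to the trace record.
function pickLatestGateMetaSketch(traceMeta, dbMeta) {
  if (!traceMeta) return dbMeta ?? null;
  if (!dbMeta) return traceMeta;
  const toMs = (value) => {
    const ms = Date.parse(value);
    return Number.isNaN(ms) ? -Infinity : ms; // unparsable counts as oldest
  };
  return toMs(dbMeta.evaluatedAt) > toMs(traceMeta.evaluatedAt) ? dbMeta : traceMeta;
}
```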
Live verification: ran `sf headless triage --apply` 4 times against the
operator's environment (rubber-duck is a project-level override).
trusted-agent-source-gate now shows in `sf headless status uok --json`
with total: 4, fail: 4, coverageStatus: "ok" — proving the schema-v2
metadata round-trips through the trace events and reaches the
classifier.
Tests:
- headless-triage-uok-gates.test.ts (3 new tests): agree path emits
3 pass gates with v2 metadata; disagree path emits review fail;
unknown-id path emits validation fail with no review gate.
- Existing test suites adjusted for the GateMetadataRow →
GateRunContextRow rename (classifier helpers renamed consistently
across .ts source and the .mjs test mirror).
- Full SF + headless apply: 1681/1681.
Still legacy in production (slice 3b targets these next):
- phases-pre-dispatch.js gates: resource-version-guard, pre-dispatch-
health-gate, planning-flow-gate. None of these pass uokContext yet.
- phases-unit.js gates: unit-verification-gate, plan-gate.
- plan-slice.js: Q3/Q4/Q5/Q6/Q7/Q8 seed gates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second slice of "Make UOK the SF Control Plane". Wires the DB-level
capability for schema-v2 gate metadata so future callers can flip
quality_gates rows from "legacy" to "ok"/"stale"/"incomplete" by
passing a canonical uokContext. No production caller passes ctx yet —
slice 3 wires producers (headless triage --apply, phases-pre-dispatch,
phases-unit).
Schema migration v66 (SCHEMA_VERSION bumped 65 → 66):
- quality_gates gains 5 nullable columns: surface, run_control,
permission_profile, trace_id, parent_trace.
- Idempotent ALTERs via PRAGMA table_info probes — fresh-DB CREATE
path already includes the columns; migration only ALTERs older DBs.
- Existing rows keep NULL across the new columns, so classifyCoverage
in headless-uok-status reads them as "legacy" — no day-one warning
flood.
New adapter src/resources/extensions/sf/uok/run-context.js:
- buildUokRunContext(opts) validates and normalizes the canonical
camelCase shape: surface, runControl, permissionProfile, traceId
(required), plus parentTrace, unitType, unitId, milestoneId,
sliceId, taskId (optional). Frozen on success, null on any invalid
or missing required field.
- VALID_SURFACES / VALID_RUN_CONTROLS / VALID_PERMISSION_PROFILES
enums reject typos at build time so we don't get silent schema-v2
rows with garbage in the enum columns.
- uokRunContextToGateColumns(ctx) translates camelCase → snake_case
column shape used by sf-db-gates writers.
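A sketch of the adapter contract described above, under loud
assumptions: the enum contents below are partial guesses built only from
values mentioned elsewhere in this log, and both function names are
hypothetical stand-ins for buildUokRunContext and
uokRunContextToGateColumns.

```javascript
// Assumed, partial enum contents (not the real VALID_* sets).
const SURFACES_SKETCH = new Set(["autonomous", "headless"]);
const RUN_CONTROLS_SKETCH = new Set(["autonomous", "supervised"]);
const PROFILES_SKETCH = new Set(["low", "medium", "high"]);
const OPTIONAL_SKETCH = ["parentTrace", "unitType", "unitId", "milestoneId", "sliceId", "taskId"];

// Validate + normalize; frozen object on success, null on any problem.
function buildRunContextSketch(opts) {
  if (!opts || typeof opts !== "object") return null;
  const { surface, runControl, permissionProfile, traceId } = opts;
  if (!SURFACES_SKETCH.has(surface)) return null;
  if (!RUN_CONTROLS_SKETCH.has(runControl)) return null;
  if (!PROFILES_SKETCH.has(permissionProfile)) return null;
  if (typeof traceId !== "string" || traceId.length === 0) return null;
  const ctx = { surface, runControl, permissionProfile, traceId };
  for (const key of OPTIONAL_SKETCH) {
    if (opts[key] !== undefined && opts[key] !== null) ctx[key] = opts[key];
  }
  return Object.freeze(ctx);
}

// camelCase ctx -> snake_case column values; all NULL for a missing ctx.
function toGateColumnsSketch(ctx) {
  return {
    surface: ctx?.surface ?? null,
    run_control: ctx?.runControl ?? null,
    permission_profile: ctx?.permissionProfile ?? null,
    trace_id: ctx?.traceId ?? null,
    parent_trace: ctx?.parentTrace ?? null,
  };
}
```

Returning null instead of throwing keeps a malformed ctx from breaking
the write; the row simply lands in the legacy NULL shape.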
Writer chain (sf-db-gates.js):
- insertGateRow now imports uokRunContextToGateColumns and translates
g.uokContext (canonical camelCase) to the SQL column shape. Callers
pass canonical ctx, the DB writer owns translation. NULL on legacy
callers, NULL on malformed ctx.
- saveGateResult mirrors the same translation; uses COALESCE(:col,
col) so a missing ctx on a follow-up update preserves the row's
existing schema-v2 metadata instead of nulling it.
Reader chain (headless-uok-status.ts):
- getGateMeta SELECTs surface, run_control, permission_profile,
trace_id alongside scope and evaluated_at. ORDER BY uses
"evaluated_at IS NULL, evaluated_at DESC" for cross-SQLite safety
(NULLS LAST is not portable).
- classifyCoverage signature changed from (entry, metadataPresent:
bool) to (entry, meta: GateMetadataRow). Returns "incomplete" when
surface is set but runControl/permissionProfile/traceId missing —
surfaces buggy writers instead of silently classifying as "ok".
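A rough sketch of the classification order, combining the slice-1
freshness rules with the new "incomplete" case. The function name, the
`entry.total` / `entry.lastEvaluatedAt` fields, and the exact precedence
between "incomplete" and "stale" are assumptions; only the status names
and the 24h threshold come from these commits.

```javascript
const STALE_THRESHOLD_MS_SKETCH = 24 * 60 * 60 * 1000;

// Hypothetical sketch of classifyCoverage(entry, meta).
function classifyCoverageSketch(entry, meta, now = Date.now()) {
  // No surface at all: a pre-v2 row. Legacy wins over freshness.
  if (!meta || !meta.surface) return "legacy";
  // Surface set but the rest of the v2 metadata missing: buggy writer.
  if (!meta.runControl || !meta.permissionProfile || !meta.traceId) return "incomplete";
  // Full metadata: decide on freshness.
  if (!entry || !entry.total || entry.lastEvaluatedAt == null) return "stale";
  const ageMs = now - Date.parse(entry.lastEvaluatedAt);
  return ageMs <= STALE_THRESHOLD_MS_SKETCH ? "ok" : "stale"; // NaN age falls to stale
}
```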
Tests:
- uok-run-context.test.mjs (12 tests): adapter validation, enum
rejection, optional-field handling, frozen output, column
translation.
- uok-quality-gates-writer.test.mjs (5 tests): real DB round-trip
proving insertGateRow + saveGateResult populate schema-v2 columns
from canonical camelCase ctx, leave NULL on legacy/malformed,
and preserve existing metadata via COALESCE on no-ctx updates.
- headless-uok-status.test.mjs adjusted: classifier now takes
GateMetadataRow; added test for "incomplete" classification.
- sf-db-migration.test.mjs bumped expected version 65 → 66 and
asserts the 5 new quality_gates columns exist.
Full SF suite: 1678/1678 ✓ (+17 from slice 2, +9 from slice 1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First slice of "Make UOK the SF Control Plane". Ships the operator-
facing visibility primitive that subsequent slices fill in. No
enforcement yet, no new gates yet — just the contract.
Changes to sf headless status uok:
- Bumps JSON output to schemaVersion: 2.
- Adds coverageStatus per gate (ok | stale | incomplete | missing
| legacy). Slice 1 only populates ok / stale / legacy:
- legacy row predates schema-v2 metadata (every existing row
today). NOT a warning — operators are not paged for
the rich history of pre-v2 records.
- stale schema-v2 row with no runs in window, OR last run
older than the 24h stale threshold. Surfaces gates
that stopped being exercised.
- ok schema-v2 row with recent runs in window.
incomplete / missing wait for the schema-v2 writer adapter
(slice 2) and the configured-gate registry (later).
- Adds the Coverage column to the human table output.
- Removes the stale "missing getDistinctGateIds import" workaround
comment from headless-uok-status.ts:104. The import exists today
(gate-runner.js:5); the comment was lying. Bypassing
UokGateRunner.getHealthSummary is still appropriate but for a
different reason — documented inline.
Tests (28 total, +9 new):
- classifyCoverage: legacy wins over freshness; ok requires
metadata + recent runs; stale fires on no-runs-in-window or
last-run > 24h.
- empty-DB does not false-positive coverage warnings (the bug
codex called out in the plan review).
- formatTable includes the Coverage column and renders each status
distinctly.
hasSchemaV2Metadata is a placeholder that returns false today; it
will read row.surface / row.run_control / row.permission_profile
when those columns ship in slice 2.
Next slice: adapter foundation — start writing schema-v2 metadata
into new gate rows from headless and autonomous paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three coupled changes that together complete the operator-facing
--apply surface for sf headless triage:
1. headless.ts: parse --apply from commandArgs and forward to
handleTriage. The triage option flow now distinguishes inspect
(--list, --json), one-shot (--run), and orchestrated apply
(--apply) cleanly.
2. help-text.ts: triage subcommand line + examples block now document
the --apply mode (triage-decider → rubber-duck pipeline).
3. bootstrap/db-tools.js: resolve_issue tool now accepts the full
canonical evidence-kind set instead of hardcoding "agent-fix":
- agent-fix (default; commit-based fix evidence)
- human-clear (stale, superseded, false positive, intentional close)
- promoted-to-requirement (with required requirement_id)
The tool surfaces a clear error when promoted-to-requirement is
used without requirement_id. The promptGuidelines are updated to
walk callers through choosing the right kind.
self-feedback-db.test.mjs extended with coverage for all three
evidence kinds + the missing-requirement_id rejection path.
Together these make sf headless triage --apply genuinely useful: the
agent can produce a plan with any outcome, rubber-duck reviews it,
and the runner applies via resolve_issue with the right evidence
kind per decision.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New module: src/resources/extensions/sf/subagent/prompt-parts.js.
Replaces the copilot-shaped boolean include* matrix with a canonical
SF-native form:
promptParts: [aiSafety, toolInstructions, parallelToolCalling,
customAgentInstructions, environmentContext,
agentBody, ...]
Each part is a registered renderer (PROMPT_PARTS) that emits a
specific section text given context. composeAgentPrompt orders parts
deterministically, deduplicates, and concatenates with consistent
separators. validatePromptParts rejects unknown keys at agent-load
time so typos surface immediately instead of silently producing an
empty section.
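A toy sketch of the registry idea, not the real prompt-parts.js. The
three renderers, their output text, the registry-order rule for
"deterministic ordering", and the blank-line separator are all
assumptions; only the part key names come from this change.

```javascript
// Hypothetical mini-registry: key -> renderer(context) => section text.
const PROMPT_PARTS_SKETCH = {
  aiSafety: () => "## Safety",
  toolInstructions: (c) => `## Tools: ${c.tools.join(", ")}`,
  agentBody: (c) => c.body,
};

// Reject unknown keys up front so a typo fails loudly at load time.
function validatePromptPartsSketch(parts) {
  if (!Array.isArray(parts)) throw new Error("promptParts must be an array");
  const unknown = parts.filter(
    (p) => !Object.prototype.hasOwnProperty.call(PROMPT_PARTS_SKETCH, p),
  );
  if (unknown.length > 0) throw new Error(`unknown promptParts: ${unknown.join(", ")}`);
}

// Dedupe, order by registry position, render, join with one blank line.
function composeAgentPromptSketch(parts, context) {
  validatePromptPartsSketch(parts);
  const requested = new Set(parts);
  return Object.keys(PROMPT_PARTS_SKETCH)
    .filter((key) => requested.has(key))
    .map((key) => PROMPT_PARTS_SKETCH[key](context))
    .join("\n\n");
}
```

With this shape, `["agentBody", "aiSafety", "aiSafety"]` and
`["aiSafety", "agentBody"]` compose to the same prompt, and
`["aiSaftey"]` throws instead of rendering nothing.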
Integrated into:
- subagent/agents.js: validateAgentDefinition runs the new
validator at agent discovery; built-in agents must validate
(project/user agents with invalid promptParts get skipped).
- subagent/index.js: dispatch path uses composeAgentPrompt to
assemble the runtime system prompt.
- unit-context-manifest.js: unit-type manifests declare their
promptParts allowlist; validation runs against the same registry
so unit dispatch and agent dispatch share one canonical schema.
- agents/rubber-duck.agent.yaml: converted from the boolean
include* form to the canonical array form.
Tests:
- subagent-agent-yaml.test.mjs: validates the array shape, rejects
unknown part keys, asserts built-in agents validate cleanly,
project overrides win.
- unit-context-manifest-prompt-parts.test.mjs (new): asserts every
unit-type manifest's promptParts is valid per the registry.
The copilot boolean-include shape is intentionally NOT supported:
this is the SF-native canonical form, simpler to read and harder to
typo (no silent no-op for misspelled keys).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Memory enrichment failed for gate test: DB error" warning in test
output was a real API mismatch, not a benign degradation. The previous
code called getRelevantMemoriesRanked(embedding, "gotcha", 2) but the
canonical signature is getRelevantMemoriesRanked(query, limit).
Replace the embedding-based call with a query-string built from
gateId + failureClass + rationale, and pass limit=2. The embedding
helper (computeGateEmbedding) is removed entirely since the memory
store does its own embedding internally.
Also switch the enrichment-failure log from logWarning to debugLog —
gate enrichment is best-effort and must not affect gates, so the
failure path should not surface as a warning to operators.
Test fixture updated to assert against the new API call shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache-split signal {before, after} was named promptParts in the
autonomous-unit dispatch path, overloading the same term that
.agent.yaml uses for declarative prompt-section composition. With the
prompt-parts runtime landing as canonical (`aiSafety`,
`toolInstructions`, ...), the overload becomes confusing —
promptParts now means "list of declarative section keys", not
"before/after cache-split tuple".
Renames in run-unit.js, phases-unit.js (call site), and
run-unit.test.mjs. No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review follow-up (2026-05-14) addressed all three remaining
issues from the earlier rescue pass:
1. Strict plan validation. parseTriagePlanStrict refuses the WHOLE
plan on any malformed item instead of silently dropping. Enforces:
- completion marker "Self-feedback triage complete" present
- exactly one fenced ```yaml block
- every decision has non-empty id + outcome ∈ {fix, promote, close}
- outcome-specific required fields (close → reason; promote →
reason + requirement_id; fix → proposed_approach)
- duplicate ids rejected
- when expectedIds is supplied, decisions must cover the candidate
set exactly — no extras (hallucinated ids), no missing
Returns ParseTriagePlanResult with {plan, error} so the caller can
surface the specific failure reason.
2. Custom-runner trust guard. runTriageApply refuses an injected
options.agentRunner unless allowUntrustedRunner is also explicitly
set. Production callers cannot inject a runner. Without this guard
a custom runner could side-channel-mutate the ledger despite the
read-only tool override (codex Q2).
3. Per-decision failure surfacing. applyTriagePlan now returns
{resolvedIds, rejectedIds, pendingFixIds} instead of just
resolvedIds. runTriageApply reports ok=false if rejectedIds is
non-empty, with the count + ids in the error message. Mutations
still happen one-by-one (no SQL transaction wrapping) but the
failure is no longer silent (codex Q3).
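The per-decision rules from item 1 can be sketched as a validator over
already-parsed decisions. This deliberately skips the completion-marker
and single-yaml-fence checks and does no YAML parsing; the function name
and error strings are hypothetical.

```javascript
// Required fields per outcome, as listed in item 1.
const REQUIRED_BY_OUTCOME_SKETCH = {
  close: ["reason"],
  promote: ["reason", "requirement_id"],
  fix: ["proposed_approach"],
};

const nonEmpty = (v) => typeof v === "string" && v.trim() !== "";

// Hypothetical sketch: refuse the WHOLE plan on the first malformed item.
function validateDecisionsSketch(decisions, expectedIds) {
  const fail = (error) => ({ plan: null, error });
  if (!Array.isArray(decisions)) return fail("decisions must be a list");
  const seen = new Set();
  for (const d of decisions) {
    if (!d || !nonEmpty(d.id)) return fail("decision with empty id");
    const required = Object.prototype.hasOwnProperty.call(REQUIRED_BY_OUTCOME_SKETCH, d.outcome)
      ? REQUIRED_BY_OUTCOME_SKETCH[d.outcome]
      : null;
    if (!required) return fail(`${d.id}: unknown outcome`);
    if (seen.has(d.id)) return fail(`${d.id}: duplicate id`);
    seen.add(d.id);
    for (const field of required) {
      if (!nonEmpty(d[field])) return fail(`${d.id}: ${d.outcome} requires ${field}`);
    }
  }
  if (expectedIds) {
    const expected = new Set(expectedIds);
    for (const id of seen) if (!expected.has(id)) return fail(`${id}: not in candidate set`);
    for (const id of expected) if (!seen.has(id)) return fail(`${id}: missing from plan`);
  }
  return { plan: { decisions }, error: null };
}
```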
Tests: src/tests/headless-triage-apply.test.ts now covers:
- agree-path runs both agents in order; apply fails on missing
ledger entry → ok=false, rejectedIds populated (the realistic
contract for a test fixture without a seeded DB)
- custom runner without allowUntrustedRunner refuses, agentRunner
never invoked
- rubber-duck disagrees → clean pause, ok=false, agreed=false
- decider fails → skip rubber-duck
- unknown id in plan rejected before review
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review (2026-05-14) flagged the original runTriageApply design as
unsafe: triage-decider was invoked with resolve_issue in its tool list,
so it could (and would) close ledger entries during its own turn —
BEFORE rubber-duck saw the decisions. If rubber-duck disagreed, the
mutations from phase 1 had already landed with no rollback path.
Restructured to a 3-phase plan-and-review pipeline:
Phase 1 — Plan: triage-decider runs READ-ONLY (resolve_issue removed
from both the YAML and the runner's tool override) and emits a
structured YAML plan as a fenced block. The plan is the contract;
parseTriagePlan extracts it.
Phase 2 — Review: rubber-duck reads the parsed plan + the original
ledger entries and votes "rubber-duck: agree" or names concerning
decisions. Read-only tools.
Phase 3 — Apply: ONLY on agreement, this runner (not an agent) calls
markResolved for each close/promote decision. Fix decisions are
surfaced to the operator and never auto-mutate.
Other codex-flagged gaps addressed:
- Trusted-source guard: --apply refuses to run when either agent has
source != "builtin". Project/user overrides shadow built-ins (the
documented precedence), but they don't get to silently disable
rubber-duck's independence. Operators can still customize via
--review mode.
- Plan-not-emitted is a hard refuse: if the decider's output has no
parseable ```yaml decisions: block, the apply runner returns
ok=false with a clear error. We can't audit what we can't read.
- Disagreement is a clean pause, not an error: returns ok=false with
agreed=false and both outputs preserved for operator review.
- The triage-decider YAML's prompt now codifies the plan-only contract
explicitly: "You do not call resolve_issue. You produce a structured
decision plan."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First slice of putting the triage/rubber-duck flow into SF itself
(sf-mp5lnlbc-ty5fec). Two built-in agent definitions ship with SF and
get auto-discovered alongside operator-defined ones — no setup needed.
agents/rubber-duck.agent.yaml
Devil's-advocate critic. Tools: "*". Reviews any artifact (default
consumer: triage --apply pipeline) and surfaces ONLY confidently-real
concerns. High-signal output: "rubber-duck: agree" or `## Concern N:`
sections with evidence citations. Never proposes fixes.
agents/triage-decider.agent.yaml
Self-feedback queue decider. Tools: [resolve_issue, view, grep, glob,
git_log] — read-only investigation plus the one mutating tool needed
to close/promote entries. No edit/write/bash — code fixes go to the
operator. Implements the existing buildInlineFixPrompt protocol
(Fix/Promote/Close per entry).
Both YAMLs include the copilot-style promptParts block as intent
documentation. SF's prompt-composition runtime doesn't honor those
flags yet; the day it lands, the agents pick it up without a YAML edit.
discoverAgents now loads from a built-in directory (sibling agents/
to subagent/) with source: "builtin". User and project definitions
override built-ins by name, preserving the existing precedence model.
Tests assert: (1) both built-ins discovered with source=builtin in
scope=both, (2) project override wins over built-in. Full SF suite:
1637/1637.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator's settings.json defaultModel is for general dispatch (typically
a cheap/flash pick — gemini-3-flash-preview in current config). Mixing
it into the triage candidate pool gave it a chance to win on cost
tie-break against agentic-better but pricier options from the explicit
enabledModels allowlist.
Triage is agentic-heavy; restrict its candidate pool to the operator's
enabledModels (kimi-coding/* + minimax/* + zai/* + …) and let the
agentic-weighted router pick. Also fixes the wildcard expansion path
which was calling a non-existent ai.getModelsByProvider — now correctly
uses ai.getModels(provider).
Dogfood confirms: router now picks kimi-coding/kimi-for-coding
(agentic 90) instead of gemini-3-flash-preview (operator default).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the hardcoded "google-gemini-cli/gemini-3-pro-preview" default and
routes through SF's own model-router using a new
BASE_REQUIREMENTS["self-feedback-triage"] (agentic-heavy: coding 0.4,
instruction 0.8, reasoning 0.8, agentic 0.9).
Candidate selection priority:
1. Explicit options.model override (operator --model)
2. options.candidates (test injection)
3. ~/.sf/agent/settings.json enabledModels (expanded against pi-ai
MODELS catalog) + defaultProvider/defaultModel
4. TRIAGE_FALLBACK_CANDIDATES — Chinese-provider set
(kimi + minimax + zai). Gemini intentionally NOT in the fallback
so operators who removed it from settings don't silently re-default.
Dispatch walks the router-ranked list with retry-on-credential-error so
the top pick failing on missing API keys falls through to the next
candidate (caught the openai-no-key case in dogfood today).
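A sketch of the priority order and the fall-through walk, under stated
assumptions: both function names, the `provider/model` string form, and
the `isCredentialError` predicate are hypothetical, wildcard expansion
and router ranking are left out, and the fallback list is passed in
rather than reproduced.

```javascript
// Hypothetical sketch of the four-step candidate priority.
function resolveTriageCandidatesSketch(options = {}, settings = {}, fallback = []) {
  if (options.model) return [options.model]; // 1. explicit --model
  if (Array.isArray(options.candidates) && options.candidates.length > 0) {
    return [...options.candidates]; // 2. test injection
  }
  const fromSettings = [...(settings.enabledModels ?? [])]; // 3. settings.json
  if (settings.defaultProvider && settings.defaultModel) {
    fromSettings.push(`${settings.defaultProvider}/${settings.defaultModel}`);
  }
  if (fromSettings.length > 0) return [...new Set(fromSettings)];
  return [...fallback]; // 4. built-in fallback set
}

// Walk the ranked list; only credential errors fall through to the next pick.
async function dispatchWithFallthroughSketch(ranked, dispatch, isCredentialError) {
  let lastError = null;
  for (const candidate of ranked) {
    try {
      return { model: candidate, result: await dispatch(candidate) };
    } catch (err) {
      if (!isCredentialError(err)) throw err;
      lastError = err;
    }
  }
  throw lastError ?? new Error("no triage candidates available");
}
```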
Closes part 1 of sf-mp5khix3-9beona AC1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled product changes from the working tree, validated together:
1. Agent YAML loader (subagent/agents.js + subagent-agent-yaml.test.mjs)
.sf/agents/*.agent.yaml files now load as first-class agent
definitions alongside the existing .agent.md frontmatter format.
Adds `*` wildcard support for the tools field (unrestricted) and a
parseAgentModel helper for the YAML-only model selector. Mirrors
the copilot-style YAML format so SF can consume agent definitions
shared across tools without forcing the markdown wrapping.
2. Solver-pass tool scoping (run-unit.js + phases-unit.js +
run-unit.test.mjs)
New scopeActiveToolsForRunUnit honors an explicit
activeToolsAllowlist so callers can restrict a unit dispatch to a
tighter tool set than the unit-type's default SF allowlist. The
autonomous solver pass uses this to constrain the solver to just
`checkpoint` — solver should reason and persist checkpoints, not
edit files or dispatch tools. Keeps the solver inside its
authority boundary.
Tests: 7/7 in the two affected files; full SF suite stays green.
Not in this commit: the sidekick-trigger event emission in
autonomous-solver.js and the external scripts/sidekick-runner.js +
.agents/policies/proactive-sidekick.yaml — that's an experiment
that stays in the working tree pending operator direction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an optional wireModelId field to the Model interface and a
resolveWireModelId helper. Forge's canonical model.id stays stable for
selection, capability scoring, policy, and history; providers now send
model.wireModelId on the wire when set, model.id otherwise.
Use cases: Azure deployment names, vendor model slugs that differ
from Forge's canonical identity, A/B routing where the operator wants
canonical history but a specific deployment.
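A minimal sketch of the resolution rule, assuming "set" means a
non-empty string. The function name is a stand-in for the real
resolveWireModelId helper, and the model values in the usage note are
made up.

```javascript
// Hypothetical sketch: send wireModelId on the wire when set, else the
// canonical id. The canonical id itself is never changed.
function resolveWireModelIdSketch(model) {
  return typeof model.wireModelId === "string" && model.wireModelId.length > 0
    ? model.wireModelId
    : model.id;
}
```

For example, a model with id "canonical-model" and wireModelId
"my-azure-deployment" would be selected, scored, and logged as
"canonical-model" while requests go out as "my-azure-deployment".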
Wired through every provider in @singularity-forge/ai (anthropic,
amazon-bedrock, azure-openai-responses, google, google-vertex,
google-gemini-cli, mistral, openai-codex-responses, openai-completions,
openai-responses) plus @singularity-forge/coding-agent's
ModelRegistry (model definitions + per-model overrides).
Tests: openai-completions wireModelId payload coverage +
model-registry-auth-mode coverage for the override + definition fields.
Full pi-ai + coding-agent suite: 956/956 ✓ (7 unrelated skipped).
This realizes the model-registry contract drafted in 1d753af6b.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Discovered via dogfood: `sf headless triage --run --json`
short-circuited to the candidate-list JSON before reaching the dispatch
path, so the run never happened.
--run is the action; --json/--list describe output format. Restructure
so --run always dispatches; --json then controls whether the run
result is JSON vs human text. Without --run, --json/--list still emit
the candidate digest as before.
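The flag precedence can be sketched as a small resolver. The function
name and the returned `{ action, output }` shape are hypothetical; the
precedence itself (--run is the action, --json/--list only shape the
output) is what this change describes.

```javascript
// Hypothetical sketch of the restructured triage flag handling.
function resolveTriageModeSketch(args) {
  const has = (flag) => args.includes(flag);
  if (has("--run")) {
    // --run always dispatches; --json only picks the result format.
    return { action: "run", output: has("--json") ? "json" : "text" };
  }
  if (has("--json")) return { action: "digest", output: "json" };
  if (has("--list")) return { action: "digest", output: "list" };
  return { action: "prompt", output: "text" }; // default: emit the triage prompt
}
```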
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five unit tests covering the bail-time queue notifier landed in
001740680: notify-with-pointer when candidates exist, plural/singular
noun agreement, silent on empty queue, silent on non-forge basePath,
no-throw when downstream notify itself crashes (bail-path safety).
Locks in the contract for the partial-AC1 slice of sf-mp4rxkwb-l4baga
(autonomous loop surfaces the queue at idle) without yet touching the
larger remaining work (real self-feedback-triage unit type with
begin/dispatch/checkpoint/complete).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codifies AC4 of sf-mp4w2dij-xm6cwj: the regex-only path is today's
default fast mode. SF_SECURITY_FAST=1 is the explicit opt-in for
callers that want to assert "regex-only, no LLM escalation, sub-100ms"
regardless of any future tiered reviewer landing in the script.
Today the env var changes only the trailing status line so operators
can verify the contract is observable. When the LLM-backed review hook
(AC1) lands, the absence of SF_SECURITY_FAST becomes the trigger for
escalation; setting SF_SECURITY_FAST=1 keeps offline / pre-commit callers on the
fast path. Locked in by tests in both the .sh and .mjs scanners.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two thin slices toward sf-mp4rxkwb-l4baga:
1. Help text. The triage and reflect commands have shipped over the
last few commits but neither was discoverable via `sf headless help`.
Add both to the command list + add five usage examples covering the
piping and --run patterns.
2. Bail-time queue notifier. When the autonomous loop is about to break
for "no-active-milestone" or "milestone-complete" while open
self-feedback entries still exist, surface the queue with a clear
pointer to `sf headless triage --list` / `--run`. Best-effort wrapper
that never throws — the proper fix (triage as a real unit type with
begin/dispatch/checkpoint/complete lifecycle) is the larger remaining
slice of the parent entry; this just makes the queue VISIBLE at the
exact moment operators historically lost track of it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds runTriage to self-feedback-drain.js, mirroring runReflection in
reflection.js: provider-agnostic dispatch via @singularity-forge/ai's
completeSimple, dependency-injectable for tests, 8-minute timeout race,
clean-finish detection on the canonical "Self-feedback triage complete"
terminator.
`sf headless triage --run [--model provider/modelId]` now dispatches the
canonical triage prompt and writes the model's decision text to
.sf/triage/decisions/<ts>.md. Operators apply the decisions (resolve_issue
calls, code edits) — a tool-enabled variant that lets the model close
entries directly is follow-up work.
Default model: google-gemini-cli/gemini-3-pro-preview (matches
DEFAULT_REFLECTION_MODEL).
Continues chipping away at sf-mp4rxkwb-l4baga in bounded slices: triage now has
both an operator-pipe path (default) and a one-shot dispatch path (--run).
The full unit-type registration that wires this into the autonomous
dispatcher's idle path is the remaining slice of that entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a deterministic, turn-independent path to drain the self-feedback
queue. Modes:
- default: emits the canonical buildInlineFixPrompt() output for
piping into any model (sf headless triage | sf headless -p -)
- --list: human-readable digest sorted by impact↓ effort↑ ts↑
- --json: structured candidate list for tooling
- --max N: cap candidates
Why this matters (partial step toward sf-mp4rxkwb-l4baga): the existing
session_start drain queues triage as `triggerTurn:true,
deliverAs:"followUp"`. When autonomous mode bails at milestone
validation before any turn runs, the followUp gets dropped and the
queue stays unprocessed. This command sidesteps that by rendering the
prompt synchronously to stdout — operators can pipe it into any model
without depending on autonomous-loop turn semantics. The full
unit-type registration that fixes the underlying dispatcher gap is
larger work tracked in the parent entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the @singularity-forge/google-gemini-cli-provider package layout
for the codex CLI integration boundary. The new package owns:
- CodexAppServerClient (the JSON-RPC subprocess client; previously
packages/ai/src/providers/codex-app-server-client.ts, no pi-ai
internal coupling)
- snapshotCodexCliAccount / discoverCodexCliModels (reads
~/.codex/models_cache.json with visibility=list ∧ supported_in_api
filter; previously inline in src/resources/extensions/sf/openai-codex-catalog.js)
openai-codex-responses.ts (the stream-shaping provider) intentionally
stays in @singularity-forge/ai because it depends on pi-ai stream-event
internals and is not reusable outside the provider — same scope as
google-gemini-cli.ts vs google-gemini-cli-provider.
The SF extension's openai-codex-catalog.js is now a thin SF-side cache
writer that delegates to discoverCodexCliModels, mirroring how
gemini-catalog.js delegates to discoverGeminiCliModels. readCodexAvailableModels
became async to match the dynamic-import path; tests updated.
Closes sf-mp4u5fcz-wh6ac9 (with documented AC2 narrowing — see
resolution).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep MODEL_CAPABILITY_PROFILES so all 82 entries declare an explicit
agentic score; the agentic=50 fallback in scoreModel was silently
giving untouched profiles a generous default and letting weak agentic
models slip through execute-task routing. Anchors per the entry's
suggestedFix: coding-only ~25-40, very small/older ~30-40, older
generations ~55-70, frontier agentic ~85-95.
Adds an invariant test that asserts no profile relies on the default.
Closes sf-mp37p9u2-80f2gz.
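The invariant can be sketched roughly as below. The profile shape (an object
keyed by model id with a numeric `agentic` field) and the helper name are
assumptions for illustration.

```javascript
// Returns the ids of profiles that would fall back to the implicit default.
function profilesMissingAgentic(profiles) {
  return Object.entries(profiles)
    .filter(([, profile]) => typeof profile.agentic !== "number")
    .map(([id]) => id);
}
```

The test then asserts that the returned list is empty for
MODEL_CAPABILITY_PROFILES, so a newly added profile without a score fails
fast instead of inheriting agentic=50.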
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the loadPrompt("reflection-pass") call site from headless-reflect.ts
into a new renderReflectionPrompt helper in reflection.js. gap-audit
greps EXTENSION_SRC for loadPrompt call sites; without a hit there it
flagged the prompt as orphan even though the headless surface was using
it (sf-mp4warqc-y1u0b3).
Side benefits: fragment composition + variable validation now run via
the canonical path instead of the prior raw fs.readFile + string
substitution.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses sf-mp4vxusa-pn2tnd. Completes the outcomes-verification chain
filed as AC2 of the original sf-mp4rxkwn-jmp039 (AC1 was commit-exists,
shipped 4af10ac1b).
When an agent-fix resolution cites a commit_sha AND the entry has
acceptanceCriteria mentioning specific file paths, verify the cited
commit actually modifies at least one of those files. Without this
check, an agent could stamp ANY existing commit (e.g. the most recent
unrelated commit on main) as the fix evidence — the SHA exists but the
commit has nothing to do with the entry.
Implementation:
extractFilesFromAcceptanceCriteria(acText)
Two extraction strategies:
1. Backticked code spans (most reliable): `src/foo.js`
2. Bare path-like tokens (only when slash + dotted extension
present, no whitespace, no http:// prefix, no leading digit)
Returns [] when AC has no extractable paths — prose-only AC skips
the check rather than rejecting (the silent-skip is the right
failure mode here; we don't want to fabricate rejections when
there's nothing to verify against).
getCommitTouchedFiles(commitSha, basePath)
Shell to git diff-tree --no-commit-id --name-only -r <sha>.
5-second timeout. Returns null on git failure or out-of-repo.
Matching strategy: exact-path-set OR basename-set. The basename
fallback tolerates the common operator informality where AC says
"src/types.ts" but the actual change was at
"packages/ai/src/types.ts". Exact match wins; basename match catches
the typical case without over-trusting (still requires a file with
that exact basename to be touched).
Carve-out: skip the check when getCommitTouchedFiles returns null
(git unavailable / not-a-repo) — same shape as AC1's "ungrokable"
carve-out. The agent-fix-unverified evidence kind remains the
explicit escape hatch for "I want agent-fix attribution but can't
cite a verifiable commit."
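The extraction and matching rules above can be sketched as follows. This is
an illustration only: the regexes, the `baseName` helper, and
`commitTouchesAcFiles` are assumptions, and paths are treated as POSIX-style
strings the way git prints them.

```javascript
// Last segment of a POSIX-style path.
function baseName(p) {
  return p.split("/").pop();
}

function extractFilesFromAcceptanceCriteria(acText) {
  const found = new Set();
  // Strategy 1: backticked code spans that end in a dotted extension.
  for (const m of acText.matchAll(/`([^`\s]+)`/g)) {
    if (/^[\w./@-]+\.[A-Za-z0-9]+$/.test(m[1])) found.add(m[1]);
  }
  // Strategy 2: bare tokens with a slash and a dotted extension,
  // no URL scheme, no leading digit.
  for (const raw of acText.split(/\s+/)) {
    const token = raw.replace(/^[("'`]+|[)"'`,.;:]+$/g, "");
    if (/^https?:\/\//.test(token) || /^\d/.test(token)) continue;
    if (/^[\w.@-]+(\/[\w.@-]+)+\.[A-Za-z0-9]+$/.test(token)) found.add(token);
  }
  return [...found];
}

// touchedFiles is what getCommitTouchedFiles would return: an array of
// paths, or null when git is unavailable. true means "accept".
function commitTouchesAcFiles(acFiles, touchedFiles) {
  if (acFiles.length === 0 || touchedFiles === null) return true; // carve-outs
  const exact = new Set(touchedFiles);
  const bases = new Set(touchedFiles.map(baseName));
  return acFiles.some((f) => exact.has(f) || bases.has(baseName(f)));
}
```

Both carve-outs accept rather than reject, matching the stated intent that a
missing signal should skip the check instead of fabricating a rejection.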
Tests (3 new, 19 total):
- rejects_agent_fix_when_commit_does_not_touch_AC_files: real git
init, commit touches src/unrelated.js, AC mentions src/expected.js
→ markResolved returns false. Then commit that DOES touch expected
→ markResolved returns true.
- skips_AC_file_check_when_AC_has_no_extractable_paths: prose-only
AC accepts any commit.
- AC_file_check_tolerates_basename_match: AC says src/types.ts but
commit touches packages/ai/src/types.ts — accepted via basename.
1619/1619 SF extension tests pass; typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses sf-mp4rxkx0-fkt3e2 (gap:no-prioritization-signal-on-open-queue)
AND closes the consolidating reflection entry sf-mp4w89mv-3ulqp4 (all
four data-plane-isolation siblings now resolved: kind taxonomy,
causal-link relations, memory mirror, prioritization).
Schema v65 adds two columns to self_feedback:
impact_score INTEGER (0-100; default by severity)
effort_estimate INTEGER (1-5; default null → treated as 3 in selector)
Severity-derived default for impact_score, set by insertSelfFeedbackEntry
when no explicit value supplied:
critical → 95
high → 80
medium → 50
low → 20
selectInlineFixCandidates now sorts by:
1. impact desc — high-impact work first
2. effort asc — quick wins ahead of multi-day work at same impact
3. ts asc — older entries break ties (FIFO within priority)
Replaces the pure-FIFO ordering. Operators can override per-entry by
setting impact_score/effort_estimate explicitly at file time, so e.g.
a "low" severity entry with a critical real-world impact gets bumped
above routine "medium" entries.
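A small sketch of the three-key ordering. Field names mirror the new columns;
the row shape and `sortCandidates` are assumptions, and the severity default
is applied at sort time here for brevity, whereas the commit applies it at
insert time.

```javascript
// Severity-derived defaults for impact_score (from the commit message).
const IMPACT_BY_SEVERITY = { critical: 95, high: 80, medium: 50, low: 20 };
const DEFAULT_EFFORT = 3; // a null effort_estimate is treated as 3

function sortCandidates(entries) {
  const impact = (e) => e.impact_score ?? IMPACT_BY_SEVERITY[e.severity];
  const effort = (e) => e.effort_estimate ?? DEFAULT_EFFORT;
  return [...entries].sort(
    (a, b) =>
      impact(b) - impact(a) || // 1. impact desc
      effort(a) - effort(b) || // 2. effort asc
      (a.ts < b.ts ? -1 : a.ts > b.ts ? 1 : 0), // 3. ts asc (ISO strings)
  );
}
```

With this comparator, a "low" severity entry carrying an explicit
impact_score of 99 sorts ahead of every default "high" entry.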
Migration is idempotent: ensureSelfFeedbackTables (the fresh-DB CREATE
path) already includes both columns, so the v65 ALTER probes via
PRAGMA table_info before adding to avoid "duplicate column" errors on
fresh DBs. Older fixtures still get the ALTER. Two ALTER guards needed
because the columns are added independently and the second probe must
see post-first-ALTER state.
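The guarded ALTER can be sketched like this. `addColumnIfMissing` is a
hypothetical helper, and `db` is assumed to expose `prepare(sql).all()` and
`exec(sql)` in the style of common synchronous SQLite bindings; the table and
column names come from trusted migration code, not user input.

```javascript
// Probe PRAGMA table_info, then ALTER only when the column is absent.
// Returns true when a column was added, false when it already existed.
function addColumnIfMissing(db, table, column, ddl) {
  const cols = db.prepare(`PRAGMA table_info(${table})`).all();
  if (cols.some((c) => c.name === column)) return false;
  db.exec(`ALTER TABLE ${table} ADD COLUMN ${column} ${ddl}`);
  return true;
}
```

Calling it once per column re-reads table_info each time, which is what lets
the second probe observe the state left by the first ALTER.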
Tests:
sf-db-migration: assertion 64 → 65 + new impact_score/effort_estimate
column-exists checks
self-feedback-drain: prioritization order test (5 entries spanning
all severities + explicit-effort overrides) +
explicit-impact-overrides-default test
1616/1616 SF extension tests pass; typecheck clean.
Note: the consolidating reflection entry sf-mp4w89mv-3ulqp4 (filed by
the reflection layer's deepest-architectural-concern finding) is now
fully addressed across 4 commits today: 2f8ee5725 (memory mirror),
83c28b756 (kind taxonomy), d40a3d21d (causal links), this commit
(prioritization). Resolves both entries in one go.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses sf-mp4rxkwx-jz0soh (gap:no-causal-links-between-self-feedback-
entries). Third sibling of the consolidating reflection entry
sf-mp4w89mv-3ulqp4 (data-plane-isolation cluster).
Schema v64 adds self_feedback_relations:
from_id TEXT NOT NULL (FK → self_feedback.id)
to_id TEXT NOT NULL (FK → self_feedback.id)
relation_kind TEXT NOT NULL (CHECK: closed enum of 5 kinds)
created_at TEXT NOT NULL
PRIMARY KEY (from_id, to_id, relation_kind)
CHECK (from_id != to_id)
INDEX on (to_id, relation_kind) for inbound queries
Allowed kinds: supersedes, duplicate_of, blocks, root_cause_of,
partial_fix_of. The composite PK allows multiple kinds between the
same pair (e.g. "A supersedes B AND blocks B") but prevents exact
triple duplicates.
Helpers in sf-db-self-feedback.js:
SELF_FEEDBACK_RELATION_KINDS frozen array of allowed kinds
linkEntries(from, to, kind) inserts; returns true on new row,
false on PK collision (idempotent),
throws on FK / CHECK / unknown-kind
getRelatedEntries(id) returns [{id, relationKind,
direction: 'outbound'|'inbound'}]
— inbound + outbound in one call
Implementation note: linkEntries uses plain INSERT (NOT INSERT OR IGNORE)
so CHECK and FK violations surface as thrown errors. Idempotency for
PK collisions is implemented by catching the specific error message.
INSERT OR IGNORE would have silently swallowed self-loops and broken FKs
— exactly the kind of writer-layer bug we just fixed in 83c28b756 and
the upsertRequirement repair in f92022730.
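A sketch of the plain-INSERT approach, under assumptions: `db` is injected
here for illustration (the real helper takes only from/to/kind), it is
assumed to expose `prepare(sql).run(...)`, and the PK collision is recognized
by SQLite's "UNIQUE constraint failed" message text.

```javascript
const SELF_FEEDBACK_RELATION_KINDS = Object.freeze([
  "supersedes",
  "duplicate_of",
  "blocks",
  "root_cause_of",
  "partial_fix_of",
]);

function linkEntries(db, from, to, kind) {
  // Writer-layer check, so unknown kinds fail before touching the DB.
  if (!SELF_FEEDBACK_RELATION_KINDS.includes(kind)) {
    throw new Error(`unknown relation kind: ${kind}`);
  }
  try {
    db.prepare(
      "INSERT INTO self_feedback_relations " +
        "(from_id, to_id, relation_kind, created_at) VALUES (?, ?, ?, ?)",
    ).run(from, to, kind, new Date().toISOString());
    return true; // new row
  } catch (err) {
    // Only the duplicate-triple case is idempotent; CHECK and FK errors rethrow.
    if (/UNIQUE constraint failed/.test(String(err && err.message))) return false;
    throw err;
  }
}
```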
Tests:
sf-db-migration.test.mjs — 2 assertion bumps (63 → 64) + new
self_feedback_relations table-exists check
self-feedback-relations.test.mjs (new, 9 tests) —
SELF_FEEDBACK_RELATION_KINDS enum shape
linkEntries inserts new triple
linkEntries idempotent on duplicate
linkEntries allows multiple kinds same pair
linkEntries throws on unknown kind (writer-layer)
linkEntries throws on self-loop (CHECK)
linkEntries throws on missing FK
getRelatedEntries returns outbound + inbound
getRelatedEntries empty for unlinked entries
1610/1610 SF extension tests pass; typecheck clean.
Note on dispatch: this work was first attempted via "sf headless -p"
to dogfood per memory rule. The dispatch ran 99s with 19 tool calls
but went off-script — modified 10+ files in packages/ai/providers/
(adding wireModelId field across all providers, separate refactor)
and never touched sf-db-schema.js or the relations table. Hand-coded
fallback applied; off-script-dispatch pattern logged as another
data point in sf-mp4rxkwb-l4baga (triage-not-a-first-class-unit-type).
The wireModelId provider changes remain uncommitted in the working
tree for operator review — they may be valuable but were not the
requested work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related fixes that complete AC4 of sf-mp4rxkwt-sfthez (kind taxonomy,
commit 83c28b756):
1. Cluster by domain:family prefix instead of exact kind string.
The promoter was clustering on the full `kind` value, which after the
taxonomy enforcement means every entry like
gap:routing:tiebreak-cost-only and gap:routing:agentic-axis-partial-
coverage stayed in cluster size 1. Empirical confirmation: live ledger
2026-05-14 had 10 open entries, max cluster size 1 under exact-string
matching — promoter could never fire on real diverse data.
New behavior: extract first two segments as the cluster key. Entries
sharing domain:family group together; legacy single-segment kinds
cluster as themselves. With this change, the live ledger's gap:routing
family would include 3 entries today.
2. Repair the silently-broken upsertRequirement call (LATENT BUG).
The promoter was calling upsertRequirement with only {id, title,
description, status, class, source} — but the schema binds every
column positionally including {why, primary_owner, supporting_slices,
validation, notes, full_content, superseded_by}. SQLite cannot bind
`undefined`, so EVERY upsert attempt threw — caught silently by the
surrounding try/catch ("non-fatal") with no log line. Result: the
promoter has never successfully created a requirement row in this
project's history, regardless of clustering threshold.
Fix: pass all schema columns explicitly with null defaults for unused
ones. Also encode the human-readable cluster title into description's
first line since the requirements table has no title column (separate
schema-evolution concern, out of scope here).
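Both fixes can be sketched in a few lines. `clusterKey`,
`toRequirementRow`, and the column order are illustrative assumptions; the
column names are the ones listed in this message.

```javascript
// Fix 1: cluster on the first two segments; single-segment kinds map to themselves.
function clusterKey(kind) {
  return kind.split(":").slice(0, 2).join(":");
}

// Fix 2: every bound column gets an explicit value, null when unused,
// because SQLite bindings cannot accept undefined.
const REQUIREMENT_COLUMNS = [
  "id", "class", "status", "description", "why", "source",
  "primary_owner", "supporting_slices", "validation", "notes",
  "full_content", "superseded_by",
];

function toRequirementRow(partial) {
  const row = {};
  for (const col of REQUIREMENT_COLUMNS) row[col] = partial[col] ?? null;
  return row;
}
```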
Tests: new tests/requirement-promoter.test.mjs (5 tests) covers
domain:family clustering when count>=5, no cross-family clustering,
legacy single-segment kinds, below-threshold returns 0, non-forge bail.
The first test would have caught both the prefix clustering miss AND
the upsertRequirement field-binding bug — runs end-to-end through
upsertRequirement → getActiveRequirements.
1601/1601 SF extension tests pass; typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses sf-mp4rxkwt-sfthez (gap:self-feedback-kind-vocabulary-unbounded).
The reflection report identified this as part of the deepest architectural
concern (4 entries clustered under data-plane isolation), and the
threshold-promoter was structurally unable to fire because every entry's
kind was a unique string (clusters by exact match).
Add a `domain:family[:specific]` taxonomy validated at recordSelfFeedback
write time:
ALLOWED_KIND_DOMAINS enum of allowed top-level domains (gap,
architecture-defect, architectural-risk,
inconsistency, runaway-loop, schema-drift,
janitor-gap, upstream-rollup, reflection,
copilot-parity-gaps, gap-audit-orphan-prompt,
gap-audit-orphan-command, flow-audit,
executor-refused, solver-missing-checkpoint,
runaway-guard-hard-pause,
self-feedback-resolution)
KIND_SEGMENT_RE /^[a-z][a-z0-9]*(?:-[a-z0-9]+)*$/ — kebab-case
per segment
validateKind(kind) accepts:
domain (1-segment legacy)
domain:family (2-segment canonical)
domain:family:specific (3-segment specific)
rejects: empty, non-string, >3 segments,
unknown domain, non-kebab segments
recordSelfFeedback now returns null when validateKind fails, with a
warning logged via workflow-logger. Existing rows in the ledger are
grandfathered (validation only fires on NEW writes through this entry
point) so the migration is non-destructive.
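A compact sketch of the validator, using the domain list and segment regex
quoted above. Returning a boolean is an assumption; the shipped function may
return a richer result.

```javascript
const ALLOWED_KIND_DOMAINS = new Set([
  "gap", "architecture-defect", "architectural-risk", "inconsistency",
  "runaway-loop", "schema-drift", "janitor-gap", "upstream-rollup",
  "reflection", "copilot-parity-gaps", "gap-audit-orphan-prompt",
  "gap-audit-orphan-command", "flow-audit", "executor-refused",
  "solver-missing-checkpoint", "runaway-guard-hard-pause",
  "self-feedback-resolution",
]);

// Kebab-case per segment.
const KIND_SEGMENT_RE = /^[a-z][a-z0-9]*(?:-[a-z0-9]+)*$/;

function validateKind(kind) {
  if (typeof kind !== "string" || kind.length === 0) return false;
  const segments = kind.split(":");
  if (segments.length > 3) return false;
  if (!ALLOWED_KIND_DOMAINS.has(segments[0])) return false;
  return segments.every((s) => KIND_SEGMENT_RE.test(s));
}
```

An empty trailing segment such as "gap:" fails the regex, so it is rejected
without a dedicated check.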
This unblocks the threshold-promoter to cluster by domain:family
prefix once the requirement-promoter is updated to do so (separate
follow-up). Detectors and reflection passes can now reason about
domains rather than handfuls of unique strings.
Tests: 3 new (canonical-shapes / malformed-rejected / non-string-rejected).
8 existing test fixtures updated to use canonical kinds (gap:test-feedback
etc.) — they were using bare slugs that the new validation correctly
rejects.
1596/1596 SF extension tests pass; typecheck clean.
Note on prior dispatch: this work was first attempted via "sf headless -p"
to dogfood the new memory rule (drive SF work through sf headless, not
parallel Claude Code agents). The dispatch ran 49s with 8 tool calls but
landed nothing — the same fragility documented in sf-mp4rxkwb-l4baga
(triage-not-a-first-class-unit-type). Hand-coding fallback applied;
fragility data point added to the open entry's evidence trail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema head moved to v63 in commit 21d905461 (parallel agent's
"rem-agent-inspired memory discipline + always-in-context invariants
board" track) but the migration tests still asserted v62 — flagged in
the last 2 iterations as "pre-existing migration failures, not mine."
Update both schema-version assertions to 63 + add a context_board
table-exists check after the v63 migration so future schema bumps
explicitly require updating both the version assertion AND the
matching table-presence check (catches naked-version-bump skews).
11/11 migration tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses sf-mp4rp6y2-31jfau (architecture-defect:self-feedback-not-
wired-to-memory-subsystem). The reflection layer surfaced this as part
of the deepest architectural concern in the 2026-05-14T02-49-45Z report:
"resolutions are hidden from the memory graph, SF will continue to
forget its own triaged solutions and fail to cluster identical root
causes."
When markResolved succeeds against the DB, also call memory-store's
createMemory to mirror the closure as a memory entry that detectors
and reflection passes can consult later via getRelevantMemoriesRanked.
Memory entry shape:
category: "self-feedback-resolution"
content: "[<entry.kind>] <entry.summary>\n→ <evidence.kind>: <reason>"
confidence: 0.9
source_unit_type: "self-feedback"
source_unit_id: <entryId>
tags: [
<entry.kind>,
"evidence:<evidence.kind>",
"commit:<sha-12-prefix>" // when commitSha present
"requirement:<reqId>" // when requirementId present
]
Best-effort: any memory-write failure is silently swallowed. The
resolution itself already landed via DB UPDATE + JSONL audit append +
markdown regen — the memory mirror is observability + future detector
consumption, not a correctness requirement. The try/catch ensures a
broken memory subsystem cannot roll back a valid resolution.
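A sketch of the mirror step. `buildResolutionMemory` and
`mirrorResolutionToMemory` are hypothetical names, `createMemory` is injected
rather than imported, and reading the reason from `evidence.reason` is an
assumption about the evidence shape.

```javascript
// Build the memory entry in the shape listed above.
function buildResolutionMemory(entry, evidence) {
  const tags = [entry.kind, `evidence:${evidence.kind}`];
  if (evidence.commitSha) tags.push(`commit:${evidence.commitSha.slice(0, 12)}`);
  if (evidence.requirementId) tags.push(`requirement:${evidence.requirementId}`);
  return {
    category: "self-feedback-resolution",
    content: `[${entry.kind}] ${entry.summary}\n→ ${evidence.kind}: ${evidence.reason}`,
    confidence: 0.9,
    source_unit_type: "self-feedback",
    source_unit_id: entry.id,
    tags,
  };
}

// Best-effort: a failing memory write never propagates to the caller.
function mirrorResolutionToMemory(createMemory, entry, evidence) {
  try {
    createMemory(buildResolutionMemory(entry, evidence));
    return true;
  } catch {
    return false;
  }
}
```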
Tests (2 new, 13 total in self-feedback-db):
- agent-fix with commitSha → memory entry has [kind, evidence:agent-fix,
commit:<sha-prefix>] tags + sourceUnitId pointing at the resolved entry
- human-clear without commit → memory entry has [kind, evidence:human-
clear] tags only, no commit tag
Pre-existing migration failures in sf-db-migration.test.mjs (2 tests:
v27 spec backfill, v52 routing-history heal) are unrelated to this
commit; same failure mode as last iteration. Logged here so the
1591/1593 pass rate is auditable.
The other three siblings of the consolidating reflection entry
(sf-mp4w89mv-3ulqp4) remain open and need schema migration:
- sf-mp4rxkwt-sfthez kind vocabulary (domain:family[:specific])
- sf-mp4rxkwx-jz0soh causal links (self_feedback_relations table)
- sf-mp4rxkx0-fkt3e2 prioritization (impact_score + effort_estimate cols)
This commit lands the writer-layer-only piece (#4 in the rollup's
suggested fix), unlocking detector + reflection consumption immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>