Phase 1B of the reflection layer: complete the operator-driven loop by
adding actual LLM dispatch. Phase 1A (commit e161a59e2) shipped the
corpus assembler + prompt template + the prompt-emit operator surface.
This commit wires the dispatch end so `sf headless reflect --run`
produces a real report on disk without manual model piping.
Why shell out to the gemini CLI and not SF's provider abstraction:
reflection is a single-prompt one-shot inference. Going through SF's
full agent dispatch would require a session, model registry, tool
registration, recovery shell — overkill for "render this prompt,
capture text." The gemini CLI handles auth (~/.gemini/oauth_creds.json),
Code Assist project discovery, and protocol drift on SF's behalf.
Subprocess cost is paid once per reflection (rare).
Implementation:
- reflection.js: runGeminiReflection(prompt, options) spawns
`gemini --yolo --model <model> -p "<directive>"` and pipes the giant
rendered template via stdin (gemini -p reads stdin and appends it to the
directive).
Returns { ok, content, cleanFinish, exitCode, error, stderr }; never
throws. Defaults to gemini-3-pro-preview (0% used on AI Ultra,
strongest agentic model with quota). 8-minute timeout.
cleanFinish detected by REFLECTION_COMPLETE terminator (emitted by
the prompt template's output contract) — operator gets a warning when
the report is truncated.
- headless-reflect.ts: --run flag triggers dispatch + report write
via writeReflectionReport. --model overrides the default. Errors
surface as JSON or text per --json. Successful runs emit the report
path on stdout; failures emit error + truncated stderr.
- help-text.ts: documents --run and --model flags.
- Tests (4 new, 13 total): use a fake `gemini` binary on PATH to
exercise the spawn path without real OAuth/network — covers
ok+cleanFinish, non-zero exit, hang/timeout, missing-terminator.
All 1538 SF extension tests pass; typecheck clean.
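The terminator contract reduces to a pure classification over the captured stdout. A minimal sketch (function name and result shape simplified — the real runGeminiReflection wraps this around the spawn and also carries error/stderr):

```javascript
// Hypothetical sketch of the clean-finish classification described above.
// The real runGeminiReflection captures stdout from the spawned gemini CLI;
// here we model only the terminator check on the captured text.
const TERMINATOR = "REFLECTION_COMPLETE";

function classifyReflectionOutput(stdout, exitCode) {
  if (exitCode !== 0) {
    return { ok: false, content: stdout, cleanFinish: false, exitCode };
  }
  // cleanFinish means the model reached the output contract's terminator;
  // its absence signals a truncated report worth warning the operator about.
  const cleanFinish = stdout.includes(TERMINATOR);
  const content = cleanFinish
    ? stdout.slice(0, stdout.lastIndexOf(TERMINATOR)).trimEnd()
    : stdout;
  return { ok: true, content, cleanFinish, exitCode };
}
```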
Phase 2 follow-up (still gated on sf-mp4rxkwb-l4baga
triage-not-a-first-class-unit-type landing): reflection-pass becomes a
real autonomous-loop unit type, milestone-close auto-triggers it, the
report's `Recommended new self-feedback entries` section gets parsed
and the entries auto-filed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses self-feedback entry sf-mp4uzvcd-pazg6v
(architecture-defect:no-reflection-layer-over-self-feedback-corpus): SF
detected symptoms and triaged individual entries but had no layer that
reasoned about the corpus to recognize recurring structural patterns.
The same architectural pressure expressed itself across multiple entries
with different exact-kind strings; nothing escalated the pattern to a
class. The cognitive work fell on the operator.
This commit ships Phase 1A — the data-assembly + prompt half of the
reflection layer + an operator-driven entry point. Phase 1B (LLM dispatch
via the autonomous loop as a real unit type) lands once
sf-mp4rxkwb-l4baga (triage-not-a-first-class-unit-type) is in.
Files:
- src/resources/extensions/sf/reflection.js (new)
- assembleReflectionCorpus(basePath): bundles open + recent-resolved
self-feedback (full json), last 50 commits via git log, milestone +
slice + task state, all milestone validation verdicts, and prior
reflection report into one struct. Returns null on prerequisite
failure (DB closed) so callers downgrade gracefully.
- renderReflectionCorpusBrief(corpus): renders the corpus into a
markdown brief the LLM consumes in one turn.
- writeReflectionReport(basePath, content): persists to
.sf/reflection/<timestamp>-report.md so next pass detects "what
changed since last reflection."
- src/resources/extensions/sf/prompts/reflection-pass.md (new)
- {{include:working-directory}} prefix.
- Reasoning order: cluster by structural shape (not exact kind),
identify recurring patterns, identify commit/ledger gaps, identify
stale validation drift, identify the deepest architectural concern,
compare against prior report.
- Output contract: structured markdown report with named sections,
terminator REFLECTION_COMPLETE for clean-finish detection.
- Constraints: don't fix anything (reflection layer not executor),
don't resolve entries without commit-SHA evidence, don't invent IDs.
- src/headless-reflect.ts (new) — sf headless reflect [--json]
- Pre-opens the project DB via auto-start.openProjectDbIfPresent
(one-shot bypass path doesn't run the full SF agent bootstrap).
- Default: emits the rendered prompt brief (template + corpus) for
operators to pipe into any model. Lets the corpus-assembly layer
ship and validate before the LLM-dispatch layer is wired.
- --json: emits raw corpus snapshot for tooling.
- src/headless.ts: registers the new "reflect" command after the
existing usage block.
- src/help-text.ts: documents it in the headless command list.
- src/resources/extensions/sf/tests/reflection.test.mjs (new, 9 tests):
null-when-DB-closed; collects open + recent-resolved; excludes >30d
resolutions; captures milestone/slice/task tree; captures validation
verdicts; commits returned as an array (best-effort: empty in a
git-less tmpdir is ok); brief
renders all major sections; entry IDs/severity/kind appear in brief;
writeReflectionReport round-trips through assembleReflectionCorpus's
previousReport read.
Live smoke verified: sf headless reflect against the real .sf/sf.db
returns 15 open + 23 recent-resolved entries, 50 commits, 2 milestones,
1 validation file (correctly surfacing M001's stale needs-attention
verdict against actual 5/5 slices done — exactly the case that
motivated this layer).
Total: +848 LOC, full SF extension suite (1534 tests) passes,
typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two parallel refactors building on the model-registry consolidation:
1. Generation-aware failover (model-route-failure.js, agent-end-recovery.js)
- resolveNextModelRoute now takes unitType so it knows whether the
caller is solver-pinned per ADR-0079 (autonomous-solver). When pinned,
rejects candidates whose canonicalIdFor() differs from the failed
route's canonical id — closes the latent solver-invariant violation
where kimi-coding/kimi-k2.6 could silently fail over to
ollama-cloud/kimi-k2.5:cloud (different generation).
- Cross-generation failover in non-pinned units now emits a structured
logWarning so generation downgrades are visible in traces instead of
looking like an equivalent route switch.
2. Canonical-keyed performance metrics (model-learner.js)
- .sf/model-performance.json now keys by canonical_id with an
{aggregate, by_route} sub-shape instead of fused provider/wire-model
strings. Cross-route history per model is now coherent — kimi-k2.6
reached via kimi-coding accumulates into the same aggregate as
reached via openrouter.
- Migration runs at boot: detects old shape (no 'aggregate' key in
unit-type blob values), distributes each entry into by_route,
recomputes aggregate, writes a backup to
.sf/model-performance.json.pre-canonical-backup. Unmappable route
keys land in _unmapped so nothing is dropped.
- getRouteStats(taskType, routeKey) added for per-route failover
ordering; existing getRankedModels emits canonical IDs for
cross-route strength queries.
3. Tests
- model-registry.test.ts: bundled in this commit (Swarm A's test file
was left untracked when the registry module was committed).
- model-route-failure.test.ts: 12 tests covering solver-pin guard,
same-canonical multi-route failover, generation-downgrade log emit.
- model-learner-canonical.test.ts: 17 tests covering migration
round-trip, aggregate invariant, _unmapped bucket, and zero-default
reads.
- model-learner.test.ts: one existing test updated for the new
_unmapped.by_route shape on bare model IDs.
4. Results
- Targeted tests: 147/147 across registry, route-failure, learner,
learner-canonical.
- Full npm run test:unit: 4707 pass, 0 fail, 83 skipped (no new
regressions vs pre-edit baseline of 4669).
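The solver-pin guard amounts to filtering failover candidates by canonical id. A sketch under the assumption that canonicalIdFor maps any route to its generation-stable id (route strings and map are illustrative only):

```javascript
// Hypothetical sketch of the pin guard in resolveNextModelRoute.
// canonicalIdFor here is a stand-in for the registry's real mapping.
const CANONICAL = {
  "kimi-coding/kimi-k2.6": "kimi-k2.6",
  "openrouter/kimi-k2.6": "kimi-k2.6",
  "ollama-cloud/kimi-k2.5:cloud": "kimi-k2.5",
};
const canonicalIdFor = (route) => CANONICAL[route] ?? route;

function nextRouteCandidates(failedRoute, candidates, { solverPinned }) {
  if (!solverPinned) return candidates; // non-pinned units may cross generations
  const pinned = canonicalIdFor(failedRoute);
  // Solver-pinned units may only fail over to routes reaching the SAME
  // canonical model — a different generation violates the solver invariant.
  return candidates.filter((r) => canonicalIdFor(r) === pinned);
}
```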
Work parallelized across two Sonnet 4.6 sub-agents in isolated git
worktrees. Contract authored in docs/dev/drafts/model-registry-contract.md
(committed earlier in 1d753af6b) and consumed by both agents.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The static catalog in models.generated.ts carries phantom slugs like
gpt-5-codex / gpt-5.1-codex / gpt-5.1-codex-max / gpt-5.2-codex that the
ChatGPT-account API rejects with HTTP 400 ("model is not supported when
using Codex with a ChatGPT account"). Verified live on this machine:
ERROR: "The 'gpt-5-codex' model is not supported when using Codex with
a ChatGPT account."
Meanwhile the actually-supported slugs for a ChatGPT subscription
(gpt-5.5 default, gpt-5.4, gpt-5.4-mini, gpt-5.3-codex, gpt-5.2) are
not in SF's view at all — so the router scores phantoms, picks one,
dispatch fails, no successful route is ever recorded, and routing
silently drifts.
The codex CLI itself maintains ~/.codex/models_cache.json with the
authoritative "what THIS account can actually serve" list (visibility +
supported_in_api flags). SF reads that file directly — no duplicate
discovery, no separate API call, single source of truth.
Changes:
- src/resources/extensions/sf/openai-codex-catalog.js (new) — pure file
reader. Resolves CODEX_HOME (or ~/.codex), parses models_cache.json,
filters by visibility==="list" AND supported_in_api===true, mirrors the
result into .sf/runtime/model-catalog/openai-codex.json. Same cache
shape as the generic model-catalog-cache and gemini-catalog modules
so getKnownModelIds picks it up transparently.
- bootstrap/register-hooks.js — wire scheduleOpenaiCodexCatalogRefresh
into session_start, parallel to the existing gemini and generic
catalog refreshes.
- Tests (9): cache-missing, malformed, filter correctness against the
real shape, no-pass-through, slug validation, refresh-writes-cache,
cache-fresh-skips-refresh, and live discovery via the smoke probe
returns exactly ["gpt-5.5", "gpt-5.4", "gpt-5.4-mini", "gpt-5.3-codex",
"gpt-5.2"] on this machine.
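The filter itself is small. A sketch using the field names stated above (the surrounding shape of models_cache.json beyond those fields is an assumption):

```javascript
// Sketch of the models_cache.json filter: keep only slugs this account can
// actually serve — listed AND supported over the API.
function selectServableModels(cache) {
  const models = Array.isArray(cache?.models) ? cache.models : [];
  return models
    .filter((m) => m.visibility === "list" && m.supported_in_api === true)
    .map((m) => m.slug);
}
```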
Asymmetry vs gemini-cli is appropriate: codex CLI caches locally so SF
just reads the file; gemini-cli does not, so SF's gemini path calls
setupUser + retrieveUserQuota over the wire. Each provider gets the
cheapest reliable discovery path.
Follow-up filed separately: extract codex transport
(codex-app-server-client.ts, openai-codex-responses.ts, this catalog
reader) into a dedicated @singularity-forge/openai-codex-provider
package mirroring the gemini-cli-provider structure for symmetry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a machine-readable headless surface for live LLM-provider usage and
unifies the gemini-cli quota fetch through one helper, removing the
duplication that existed between usage-bar.js and the new package.
1. snapshotGeminiCliAccount in @singularity-forge/google-gemini-cli-provider
- Single source of truth for { projectId, userTierId, userTierName,
paidTier, models[] } via setupUser + retrieveUserQuota.
- Dedups buckets per modelId, keeping the worst (lowest remainingFraction)
so consumers always see the most-restrictive window. Code Assist
sometimes returns multiple buckets per model; the pessimistic choice
is what every consumer needs.
- discoverGeminiCliModels(cwd?) wraps it for catalog-cache callers that
only need the IDs.
2. sf headless usage subcommand
- New src/headless-usage.ts handler. text (default) and --json output.
Uses the package's snapshot directly — no RPC child, no jiti
gymnastics — matching the shape of headless-uok-status / headless-doctor.
- Wired into src/headless.ts after the doctor block.
- Help text adds the command line.
3. usage-bar.js refactored to delegate
- fetchGeminiUsage no longer imports gemini-cli-core directly. It calls
snapshotGeminiCliAccount and reshapes the result into the existing
{ provider, displayName, windows[] } UI contract.
- Eliminates the duplicate setupUser + retrieveUserQuota code path.
- The fast existsSync(~/.gemini/oauth_creds.json) pre-flight stays
so unauth'd users get a friendly message without paying for OAuth
bootstrap.
4. Model registry refactor (separate track committed alongside)
- src/resources/extensions/sf/model-registry.ts (new) consolidates
canonical model identity, capability tier, and generation tags into
one source of truth that auto-model-selection, benchmark-selector,
and model-router now consume instead of maintaining parallel maps.
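The bucket dedup in item 1 can be sketched as a pure reduction (bucket shape is an assumption based on the fields named above):

```javascript
// Per-model bucket dedup: when Code Assist returns multiple quota buckets
// for one modelId, keep the most restrictive (lowest remainingFraction) so
// every consumer sees the pessimistic window.
function dedupeBuckets(buckets) {
  const worst = new Map();
  for (const b of buckets) {
    const prev = worst.get(b.modelId);
    if (!prev || b.remainingFraction < prev.remainingFraction) {
      worst.set(b.modelId, b);
    }
  }
  return [...worst.values()];
}
```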
All 1487 tests pass (151 files); typecheck clean for both the package
and the SF extensions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related fixes for the google-gemini-cli provider, both motivated by today's
dogfood diagnosis: SF was pinned to a single model (gemini-3-flash-preview)
even though the AI Ultra account has access to seven (verified via the live
gemini-cli-core probe), and a transient "No capacity available for model X
on the server" was classified as `unknown` so SF gave up instead of retrying.
1. Account snapshot + model discovery in @singularity-forge/google-gemini-cli-provider
- Add `snapshotGeminiCliAccount(cwd?)` returning { projectId, userTierId,
userTierName, paidTier, models } where `models[]` carries each modelId
with usedFraction, remainingFraction, and resetTime. Built on the same
setupUser + CodeAssistServer.retrieveUserQuota path usage-bar.js
already uses, but extracted to the dedicated package so any consumer
(model picker, capacity diagnostics, catalog cache) can call one helper.
- Add `discoverGeminiCliModels(cwd?)` as a thin "just the IDs" wrapper.
- Both are best-effort: any failure (OAuth expired, no project, network)
returns null silently — never throws.
2. SF-side cache writer at src/resources/extensions/sf/gemini-catalog.js
- Delegates discovery to the package; only handles cache file path,
6-hour TTL, and the session_start lifecycle hook.
- Cache lands at .sf/runtime/model-catalog/google-gemini-cli.json with
the same shape as the generic model-catalog-cache, so getKnownModelIds
and the model picker pick it up transparently.
- Wired into bootstrap/register-hooks.js session_start in parallel with
the existing scheduleModelCatalogRefresh (the generic REST + API-key
path can't reach gemini-cli's OAuth-only Code Assist endpoint).
3. Capacity error classification fix
- error-classifier.js SERVER_RE now matches "no capacity (available|left)",
"capacity (unavailable|exhausted)", and "no capacity ... on the server".
Previously these fell through to kind=unknown, which is not treated as
transient, so agent-end-recovery never retried — even though the same handler
already caps gemini-cli rate-limit backoff at 30s for exactly this
class of transient. With the pattern matched as `server`, the existing
retry-with-backoff path covers it.
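An illustrative version of the classifier change (this regex just demonstrates the three phrasings covered; the production SERVER_RE matches more than capacity errors):

```javascript
// Illustration of the capacity additions to error-classifier.js SERVER_RE.
const SERVER_RE =
  /no capacity (available|left)|capacity (unavailable|exhausted)|no capacity .* on the server/i;

function classifyErrorKind(message) {
  // kind=server is treated as transient, so the existing
  // retry-with-backoff path in agent-end-recovery covers it.
  return SERVER_RE.test(message) ? "server" : "unknown";
}
```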
The full extension test suite (1386 tests) passes. Typecheck clean for both
the package and the SF extensions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec for consolidating the three alias tables (benchmark-selector,
auto-model-selection, model-router) into a single SF-extension registry
that reads from @singularity-forge/ai's MODELS and enriches it with
canonical_id, generation, and tier. Shared interface for parallel
Swarm A/B/C work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the prompt extraction the triage worker performed in dogfood
round 5 on entry sf-mp37p9u6-eyobzb
(inconsistency:prompts-monolithic-not-modular).
Changes:
- prompts/autonomous-solver-contract.md (new): solver loop block, with
{{include:working-directory}} for the shared prefix.
- prompts/autonomous-executor-contract.md (new): executor loop block,
same fragment include.
- prompts/autonomous-solver-pass.md (new): solver-pass classifier.
- autonomous-solver.js: _buildAutonomousLoopPromptPrefix renamed to
buildAutonomousLoopVars; it now returns the variables for the new
templates instead of a pre-rendered string. Net -120/+60 lines.
The {{include:fragment}} syntax is already supported by prompt-loader.js
and the working-directory fragment already exists at
prompts/fragments/working-directory.md.
All 1386 tests pass; typecheck clean.
Resolves: sf-mp37p9u6-eyobzb (inconsistency:prompts-monolithic-not-modular)
Co-resolved: sf-mp37p9u0-hebruv
(architectural-risk:single-transaction-migration) — already
verified-and-closed by the triage worker via
resolve_issue with kind=agent-fix, evidence "migrateSchema already
uses per-migration BEGIN/COMMIT via runMigrationStep". JSONL audit log
captured the resolution event end-to-end through the new
appendResolutionToJsonl path (commit ce58d3223).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dogfood of the triage worker revealed that the agent can bypass the
resolve_issue tool (which hardcodes kind=agent-fix) and write DB rows
directly with non-canonical evidence shapes (null, or {file, line}).
The earlier credibility check trusted any resolution that had a prose
resolvedReason — a "legacy narrative" carve-out meant to preserve
operator clears predating structured evidence. Brand-new sloppy agent
resolutions slipped through that carve-out: 5/5 of today's triage
resolutions had non-canonical evidence and would have been treated as
authoritative under the old check.
Replace the denylist/legacy-carve-out with an allowlist:
- isSuspectlyResolved returns true unless resolvedEvidence.kind is
in {agent-fix, human-clear, promoted-to-requirement}.
- SUSPECT_RESOLUTION_KINDS is kept as documentation of the
auto-version-bump case but the allowlist makes it redundant for
the actual policy decision.
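A sketch of the allowlist policy (entry field names are assumptions based on the description above):

```javascript
// Allowlist, not denylist: a resolution is credible only when its evidence
// kind is explicitly recognized; everything else is suspect.
const CREDIBLE_RESOLUTION_KINDS = new Set([
  "agent-fix", "human-clear", "promoted-to-requirement",
]);

function isSuspectlyResolved(entry) {
  if (!entry.resolvedAt) return false; // still open — nothing to distrust
  const kind = entry.resolvedEvidence?.kind;
  // Prose-only resolutions and non-canonical evidence shapes
  // ({file, line}, null, ...) all fail the allowlist by default.
  return !CREDIBLE_RESOLUTION_KINDS.has(kind);
}
```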
Tests now cover both failure modes: prose-only resolution (no kind)
and non-canonical evidence shape ({file, line}) both re-include the
entry as a candidate. Legacy entries that genuinely lack an evidence
kind are backfilled to kind=human-clear separately so they keep their
resolution under the stricter check.
A self-feedback entry (sf-mp4qoby4-meiir7, severity=high) was filed
about the underlying bypass — markResolved should ALSO reject or
auto-tag non-canonical writes at the writer layer, since the reader
is currently the only gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The session_start hook only invoked dispatchSelfFeedbackInlineFixIfNeeded
when triage.stillBlocked contained at least one high/critical entry.
After the previous commit rewired the worker as a triage queue that
returns every open forge-local entry (not just high/critical), this
gate stranded medium/low backlog forever at startup — the unit was
never given a chance to triage them.
The dispatcher's own selectInlineFixCandidates is now the source of
truth for eligibility; the call site now calls it unconditionally.
Keep the high/critical-specific notify (still useful operator signal
when the loud ones are present) but stop using it to gate the dispatch.
The turn_end hook at the bottom of register-hooks.js already calls
the dispatcher unconditionally, so this change aligns the two paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. Self-feedback JSONL is now a real append-only audit log. Previously
markResolved updated the DB row in place but never echoed the
resolution to JSONL, so a DB rebuild via importLegacyJsonlToDb would
re-import all entries with their original pre-resolution state and
silently lose every resolution that had ever landed. The JSONL was a
half event log — creations yes, resolutions no.
- Introduce a `recordType: "resolution"` JSONL record shape. Append
one of these to the project JSONL whenever markResolved succeeds
against the DB. Best-effort: failure to append never blocks the
resolution itself.
- Extend importLegacyJsonlToDb to handle both record types. Entry
creations go through insertSelfFeedbackEntry (ON CONFLICT DO
NOTHING — idempotent). Resolution events go through
resolveSelfFeedbackEntry, which is already a no-op on missing or
already-resolved rows, so replay is idempotent.
- Tests cover: the appended record shape; a DB rebuild correctly
reconstructing resolved_at/resolved_evidence_json from a JSONL
audit trail; orphan resolution events (entry never existed) are a
silent no-op.
Closes self-feedback entry sf-mp4ikbta-2zcbhh.
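The two-record-type replay can be sketched with in-memory stand-ins for the real insertSelfFeedbackEntry / resolveSelfFeedbackEntry DB helpers:

```javascript
// Sketch of importLegacyJsonlToDb's idempotent replay over both record types.
function replayJsonl(records) {
  const db = new Map();
  for (const r of records) {
    if (r.recordType === "resolution") {
      const row = db.get(r.id);
      // No-op on missing or already-resolved rows — replay stays idempotent,
      // and orphan resolution events are silently skipped.
      if (row && !row.resolvedAt) {
        row.resolvedAt = r.resolvedAt;
        row.resolvedEvidence = r.resolvedEvidence;
      }
    } else if (!db.has(r.id)) {
      db.set(r.id, { ...r }); // ON CONFLICT DO NOTHING analogue
    }
  }
  return db;
}
```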
2. The reconcile path at state-db.js:reconcileSliceTasks warns when an
on-disk SUMMARY.md exists for a task whose DB row is still pending
and refuses to silently import — a safety check so autonomous runs
can't promote themselves to complete by writing a SUMMARY without a
real DB transition. But operators had no remediation path when the
drift was real (lost DB write, hand edit). They had to mutate the
DB by hand.
- New `state-reconcile.js` with `reconcileTaskFromSummary` exposes
the remediation explicitly. Parses the SUMMARY via the existing
parseSummary helper, validates via isValidTaskSummary, and writes
status / completed_at / verification_result / blocker /
key_files / full_summary_md into the DB row through a new
`setTaskSummaryFields` helper in sf-db-tasks.
- Returns structured { ok, reason, applied } outcomes — never
throws — so operator tooling can branch on `db-unavailable`,
`summary-missing`, `summary-invalid`, `task-not-in-db`,
`already-done`.
- The reconcile warning text now points at the helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The inline-fix worker was a partial repair queue — it picked only
high/critical+blocking entries plus my recent gap/architecture-defect
override and left everything else (medium inconsistencies, janitor gaps,
architectural-risks, low-severity gaps) sitting open forever. The
requirement-promoter clusters by exact `kind` string and never fires on
diverse forge-local entries (every open entry currently has a unique
kind), so there is no other sweep that ever touches these. They just
accumulate.
The point of the worker is triage, not just repair: every open entry
should get an eyes-on per session and reach one of three outcomes —
fix, promote to requirement, or close as not-of-value with reason.
Closing deliberately is a valid, expected outcome.
Changes:
- `selectInlineFixCandidates` now returns every open forge-local entry,
modulo the existing credibility check that re-includes suspect
resolutions. Severity and blocking filters are gone; the kind-based
override is no longer needed because everything qualifies.
- The dispatch prompt is rewritten as a three-way triage protocol
(Fix / Promote / Close) with explicit guidance per outcome and
explicit prohibition on the `auto-version-bump` evidence kind (which
would re-open under the credibility check).
- Tests collapse the three filter-coverage tests into a single
"selects every open forge-local entry" assertion that exercises the
full severity × blocking × kind matrix.
Upstream feedback is still excluded — those entries describe behavior
in other repos that the inline-fix unit cannot directly repair.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SF's S05/T02 executor moved the doc back to docs/dev/sf-ace-patterns.md
while completing the slice (correctly: that was the task's stated
deliverable location). The doc is parked under docs/dev/drafts/ because
ACE Coder has no active consumer for it; re-park it.
Keep the ADR-019 / ADR-020 cross-references the executor added —
they are real content improvements over the previous version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The inline-fix dispatcher had three blind spots that left forge-local
architectural debt rotting in the ledger:
1. Filter required `severity ∈ {high, critical} AND blocking`. Medium
`gap:*` and `architecture-defect:*` entries — describing the exact
class of debt the inline-fix unit was built to repair — were dropped
on the floor. The forge-local queue currently has 0 high+blocking
open entries and 3 architectural gaps, so the old filter would
dispatch on nothing local and fall back to upstream.
2. Resolutions were trusted unconditionally. `auto-version-bump` fires
on any sf-version bump without verifying the bump contained a fix,
silently burying defects.
3. Upstream feedback was merged into the candidate set. Upstream entries
describe behavior observed in OTHER repos (e.g.
`flow-audit:repeated-milestone-failure` from
/srv/infra/apps/centralcloud_ops) — the
inline-fix unit edits forge source and cannot repair issues in those
other repos. Including them dispatches work the unit cannot perform.
Changes to `selectInlineFixCandidates`:
- Add kind-based override: entries with `kind` starting with `gap:` or
`architecture-defect:` qualify regardless of severity/blocking.
- Add resolution credibility check: re-include entries resolved with
evidence kind `auto-version-bump`, or with no evidence kind AND no
`resolvedReason` narrative at all. Legacy resolutions with a meaningful
operator narrative (the historical format) are still trusted.
- Drop `readUpstreamSelfFeedback()` from the candidate merge. Upstream
stays readable for SELF-FEEDBACK.md rollups and operator review, just
not auto-dispatched to inline-fix.
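The revised filter rules combine into a single predicate. A sketch (entry field shapes hypothetical; the real selectInlineFixCandidates operates over the merged candidate set):

```javascript
// Sketch of the three rules above: upstream exclusion, resolution
// credibility check, and the kind-based severity/blocking override.
function qualifiesForInlineFix(entry) {
  if (entry.source === "upstream") return false; // not repairable from forge
  if (entry.resolvedAt) {
    const kind = entry.resolvedEvidence?.kind;
    // Re-include suspect resolutions: auto-version-bump, or no evidence
    // kind AND no narrative. Legacy narrated clears stay trusted.
    const suspect =
      kind === "auto-version-bump" || (!kind && !entry.resolvedReason);
    if (!suspect) return false;
  }
  if (/^(gap|architecture-defect):/.test(entry.kind)) return true; // override
  return ["high", "critical"].includes(entry.severity) && entry.blocking === true;
}
```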
Also relax the schedule-e2e readEntries timing assertion from a 100ms
threshold to 500ms — the test is a catastrophic-regression guard, not
a microbenchmark, and parallel-suite jitter on dev machines routinely
adds >100ms even when the underlying read is fast (≤ a few hundred ms).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autonomous solver was designed precisely to handle executor refusals
(per its own docstring: "the solver role MUST stay on a stable, agentic,
refusal-resistant model independent of any per-unit routing choices"),
but the refusal handler short-circuited past it and emitted a `blocked`
checkpoint, which assessAutonomousSolverTurn unconditionally turns into
a `pause` — defeating autonomous mode every time the router selects a
capability-mismatched executor.
The 1h model-block added in 3f2babb5d was the right primitive but had no
consumer: nothing actually re-dispatched the unit after the model was
blocked, so the block only mattered if the operator manually unpaused
and retried.
This change wires the missing consumer:
- Add per-unit `executorRefusalEscalations` counter to solver state plus
a `recordExecutorRefusalEscalation` helper. Counter persists across
iterations of the same unit and resets on unit change.
- On `executor-refused`: block the refusing model and slice-routing entry
(unchanged), file self-feedback (unchanged), then synthesize a
`continue` checkpoint and return `{ action: "continue" }` directly so
the auto loop re-dispatches the unit. selectAndApplyModel will skip
the now-blocked model and pick a higher-tier fallback.
- Bounded by `MAX_EXECUTOR_REFUSAL_ESCALATIONS=3`. When the budget is
exhausted (an entire fallback chain refused on the same unit), fall
back to the legacy blocked-and-pause path so the operator can review.
- Bypass `assessAutonomousSolverTurn` on the refusal-continue path
because its no-op detector would (correctly) reject a continue over a
refusal transcript — but here the "no-op" is the whole point: we are
explicitly swapping the routed model.
Tests cover the new state field's init/persistence/reset semantics and
the constant's invariants. Full SF extension suite (1369 tests) passes.
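The bounded re-dispatch decision can be sketched as follows (state shape and helper name hypothetical; the real path also blocks the model and files self-feedback before deciding):

```javascript
// Sketch of the escalation budget: continue (re-dispatch) while budget
// remains, fall back to the legacy pause path once the chain is exhausted.
const MAX_EXECUTOR_REFUSAL_ESCALATIONS = 3;

function handleExecutorRefusal(state, unitId) {
  // Counter persists across iterations of the same unit, resets on unit change.
  if (state.unitId !== unitId) {
    state.unitId = unitId;
    state.executorRefusalEscalations = 0;
  }
  state.executorRefusalEscalations += 1;
  if (state.executorRefusalEscalations > MAX_EXECUTOR_REFUSAL_ESCALATIONS) {
    // An entire fallback chain refused on the same unit: operator review.
    return { action: "pause", reason: "refusal-budget-exhausted" };
  }
  // Re-dispatch; selectAndApplyModel will skip the now-blocked model.
  return { action: "continue" };
}
```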
Refs: sf-mp3bm6u0-2fskt8 (now fully addressed, not just AC1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promotes the .draft stub into a fuller 183-line reference covering six
SF patterns (Preferences, PDD, UOK Gates, Notifications,
Skills-as-Contracts, Idempotency) with SF source paths and ACE adoption
notes.
Filed under docs/dev/drafts/ with a STATUS: Draft header — no active
consumer yet. SF's own priorities take precedence until ACE Coder
maintainers pull on convergence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Three .test.mjs files now import describe/it from vitest, matching the
harness CLAUDE.md mandates for the SF extension suite.
- schedule-e2e local readEntries threshold raised 50ms → 100ms with a
comment noting full-suite parallelism adds scheduler/filesystem jitter
on dev machines (CI threshold unchanged at 200ms).
- e2e-smoke "headless new-milestone without --context" timeout raised
10s → 30s so the exit-1 assertion isn't flaky under load.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When classifyExecutorRefusal detects an executor refusal, the model is
now temporarily blocked (1-hour TTL) via the existing blocked-models
mechanism. This ensures that on retry — whether automatic or manual —
the router skips the refusing model and the tier-escalation path in
selectAndApplyModel picks a higher-tier alternative.
This satisfies AC1 of self-feedback entry sf-mp3bm6u0-2fskt8.
AC2 (refusal pattern detection) was already satisfied by the existing
apology-no-tools pattern in classifyExecutorRefusal.
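A TTL-based block list like the one described reduces to a small structure (a sketch only — the real blocked-models mechanism's storage and API may differ):

```javascript
// Minimal sketch of a 1-hour-TTL model block list. `now` is injectable
// so the expiry behavior is testable without waiting.
const BLOCK_TTL_MS = 60 * 60 * 1000; // 1 hour

function makeBlockList(now = Date.now) {
  const blockedUntil = new Map();
  return {
    block: (modelId) => blockedUntil.set(modelId, now() + BLOCK_TTL_MS),
    isBlocked: (modelId) => (blockedUntil.get(modelId) ?? 0) > now(),
  };
}
```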
Refs: sf-mp3bm6u0-2fskt8
The flow-audit repeated-milestone-failure rollup now includes:
- Active milestone/unit and session pointer (AC1)
- Stale dispatched units (AC2)
- Runaway history (AC3)
- Over-budget child processes (AC3)
This satisfies the acceptance criteria of self-feedback entry
sf-mp3ati7u-qqxcyi so operators can use the rollup evidence to
repair stale dispatch, missing summary, runaway, or child-process
handling without needing to re-run the flow audit manually.
Refs: sf-mp3ati7u-qqxcyi
- sf-db-schema.js: per-migration transaction boundaries (runMigrationStep)
so a late migration failure does not roll back earlier successful ones.
Post-migration assertion recreates routing_history if missing.
- routing-history.js: catch missing routing_history table at init and latch
_dbTableAvailable=false so auto-start does not crash.
- autonomous-solver.js: sticky identity guard in appendAutonomousSolverCheckpoint
pins to orchestrator's unitType/unitId instead of trusting agent's claim.
Emit journal event on identity mismatch. Record mismatchedIdentity diagnostic.
Hard cap MAX_CHECKPOINTS_PER_ITERATION=5 in assessAutonomousSolverTurn.
- Tests: add v52 DB smoke test with auto-start path; add sticky identity
tests (4 cases); add excessive-checkpoint pause test.
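The per-migration boundary pattern can be sketched like this (db is any object with an exec(sql) method; names other than runMigrationStep are hypothetical):

```javascript
// Sketch of per-migration transactions: each step commits on its own, so a
// late failure cannot roll back earlier successful migrations.
function runMigrationStep(db, migration) {
  db.exec("BEGIN");
  try {
    migration.up(db);
    db.exec("COMMIT");
    return true;
  } catch (err) {
    db.exec("ROLLBACK"); // only this step is undone
    return false;
  }
}

function migrateSchema(db, migrations) {
  return migrations.map((m) => runMigrationStep(db, m));
}
```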
Fixes: sf-mp36kfqm-rjrzju, sf-mp37kjmo-1mfuru
Split reorderForCaching into a structured reorderAndSplitForCaching that
returns {before, after} at the semi-static→dynamic section boundary.
- prompt-ordering.js: export reorderAndSplitForCaching — returns null if no
dynamic sections, otherwise {before: static+semi-static, after: dynamic}
- auto.js: import and wire reorderAndSplitForCaching into deps
- phases-unit.js: use split function; pass promptParts to runUnit when split
succeeds; fall back to flat reorderForCaching when null
- run-unit.js: when promptParts is present, send a two-block content array
[{type:text, text:before, cache_control:{type:ephemeral}}, {type:text, text:after}]
so Anthropic-compatible providers cache the stable prefix
- openai-completions.ts: preserve cache_control on text parts in convertMessages;
skip maybeAddOpenRouterAnthropicCacheControl if any part already has cache_control
Tests: 5 new contract tests for reorderAndSplitForCaching; all 4502 unit tests pass.
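The split and the two-block content array can be sketched together (sections here carry an explicit stability tag for illustration; the real module derives stability from section identity):

```javascript
// Sketch of reorderAndSplitForCaching's {before, after} contract and of the
// two-block content array run-unit builds from it.
function reorderAndSplitForCaching(sections) {
  const stable = sections.filter((s) => s.stability !== "dynamic");
  const dynamic = sections.filter((s) => s.stability === "dynamic");
  if (dynamic.length === 0) return null; // caller falls back to flat ordering
  return {
    before: stable.map((s) => s.text).join("\n"), // cacheable stable prefix
    after: dynamic.map((s) => s.text).join("\n"),
  };
}

function toContentBlocks(split) {
  // cache_control on the first block lets Anthropic-compatible providers
  // cache the stable prefix across turns.
  return [
    { type: "text", text: split.before, cache_control: { type: "ephemeral" } },
    { type: "text", text: split.after },
  ];
}
```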
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Migrate buildPlanMilestonePrompt, buildValidateMilestonePrompt,
buildCompleteMilestonePrompt, buildReplanSlicePrompt,
buildResearchSlicePrompt, and renderSlicePrompt (plan-slice +
refine-slice) from imperative inlined[] push loops to the v2
composeUnitContext API (manifest-driven, prepend/computed support).
Changes:
- unit-context-manifest.js: add 7 new ARTIFACT_KEYS (slice-summaries,
blocker-summaries, queue, verification-classes, outstanding-items,
previous-validation, prior-milestone-summary); update 7 manifests
with correct prepend/inline/computed declarations
- auto-prompts.js: import composeUnitContext; migrate all 6 builders;
remove orphaned old buildValidateMilestonePrompt tail left by
partial prior edit
- tests: add auto-prompts-phase3.test.mjs with 7 contract tests
covering plan-milestone, replan-slice, validate-milestone, and
research-slice prompt generation
Pre-computation pattern: complex async logic (blocker scan, slice
aggregation, verification classes, prior validation) is computed
imperatively before composeUnitContext, then returned from
resolveArtifact. This preserves parallel execution of other artifacts.
buildPlanMilestonePrompt keeps framingBlock imperative: the framing
check wraps the composed inlinedContext rather than going inside the
composer boundary.
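The pre-computation pattern can be sketched like this; the deps shape and composeUnitContext signature are simplified assumptions, not the real API:

```javascript
// Hypothetical sketch of the pre-computation pattern: expensive async work
// (e.g. the blocker scan) runs once before composeUnitContext, and
// resolveArtifact hands back the precomputed value, so the composer can
// still resolve every other artifact in parallel.
async function buildValidatePromptSketch(deps) {
  const blockerSummaries = await deps.scanBlockers(); // computed imperatively, once
  return deps.composeUnitContext('validate-milestone', {
    resolveArtifact(key) {
      if (key === 'blocker-summaries') return blockerSummaries; // precomputed
      return deps.loadArtifact(key); // others resolve normally
    },
  });
}
```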
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 1 — Fragment infrastructure:
- Add {{include:fragment-name}} support to prompt-loader.js
- fragmentsDir registered alongside promptsDir/templatesDir
- warmCache() now reads prompts/fragments/*.md with 'frg:' prefix
- Pre-resolution pass in loadPrompt() resolves {{include:}} before
the {{var}} validator (colon is outside validator regex [a-zA-Z0-9_],
so unresolved includes are caught as parse errors)
- Lazy-load fallback for fragments mirrors existing prompt lazy-load
- Create prompts/fragments/working-directory.md (Variant A: full
contract including 'Do NOT cd to any other directory')
- Create prompts/fragments/working-directory-ops.md (Variant B:
ops prompts, no cd restriction)
- Replace duplicated 3-line Working Directory boilerplate in 17 prompts
with {{include:working-directory}} (12 files) or
{{include:working-directory-ops}} (5 ops files)
- One fix to Working Directory wording now propagates to all 17 prompts
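The pre-resolution pass can be reduced to a sketch like this; the regex and cache shape are illustrative, not prompt-loader.js verbatim:

```javascript
// Minimal sketch of include pre-resolution: {{include:name}} is expanded
// before {{var}} validation, so a leftover include (whose colon the var
// validator's [a-zA-Z0-9_] charset rejects) surfaces as a parse error.
function resolveIncludes(template, fragments) {
  return template.replace(/\{\{include:([a-z0-9-]+)\}\}/g, (match, name) => {
    const body = fragments.get(`frg:${name}`); // 'frg:' prefix mirrors warmCache()
    if (body === undefined) throw new Error(`unknown fragment: ${name}`);
    return body;
  });
}
```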
Phase 2 — RFC #4782 stub manifests:
- Add deploy, smoke-production, release, rollback, challenge to
KNOWN_UNIT_TYPES and UNIT_MANIFESTS in unit-context-manifest.js
- All 5 builders already called composeInlinedContext() but returned ""
because resolveManifest() found no entry; now they return live content
- All 26 unit types now have manifests (resolveManifest returns non-null
for every type in KNOWN_UNIT_TYPES)
Tests:
- 5 new tests in prompt-loader-fragments.test.mjs (include resolution,
lazy-load fallback, unknown fragment error, nested var inheritance,
variant-B fragment)
- Full unit suite: 427 files passed, 4476 tests passed, 0 regressions
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In headless mode the showConfirm dialog blocks forever since there is
no TUI to answer it. The user already consented by calling /next or
/autonomous explicitly — the gate adds no value and hangs the run.
Add process.env.SF_HEADLESS !== '1' to the gate condition so headless
runs bypass it and proceed directly to autonomous execution.
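The amended gate reduces to a condition like this; the function name and surrounding control flow are paraphrased, not the source:

```javascript
// Sketch of the amended gate: the confirm dialog only fires when there is
// a TUI to answer it; SF_HEADLESS=1 bypasses it entirely.
function shouldShowAutonomyConfirm(env, gateWasActive) {
  return gateWasActive && env.SF_HEADLESS !== '1';
}
```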
Verified: `sf headless --command next` now completes slice S03
(719,526 tokens, 10 tool calls, $0.027) without hanging.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The log message said '/sf ${command}' but the actual command sent is
'/${command}' (without the sf namespace). Fix to match actual dispatch.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
headless.ts was sending `/sf {subcommand} {args}` to the RPC session, but
commands are registered without the sf namespace (e.g. 'todo', 'autonomous').
_tryExecuteExtensionCommand parsed commandName='sf', found no match, and the
LLM handled the request instead of the typed backend.
Fix: send `/{subcommand} {args}` directly — matches what registerSFCommands
registers and what the TUI already uses.
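The before/after can be reduced to the following; the subcommand and args values are examples, not the actual dispatch code:

```javascript
// Reduced illustration of the routing bug (values are examples):
const subcommand = 'todo';
const args = 'triage';
const before = `/sf ${subcommand} ${args}`; // parsed as commandName 'sf' -> no match, LLM fallback
const after = `/${subcommand} ${args}`;     // matches the registered '/todo' handler
```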
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add profile-aware scaffold system so SF does not lay down irrelevant
templates in infra/ops/docs repos.
## What ships
Phase 1 — data model
- scaffold-versioning.js: add 'disabled' to VALID_STATES; readScaffoldManifest
returns profile field; recordScaffoldApply preserves manifest.profile (fixes
roundtrip bug where profile was stripped on every write).
- scaffold-constants.js: PROFILES (app/library/infra/docs/minimal as Set<string>)
and PROFILE_NAMES exports.
Phase 2 — profile-aware drift detection
- scaffold-drift.js: disabled bucket in emptyCounts, resolveActiveProfileSet
integration, profile param on detectScaffoldDrift/migrateLegacyScaffold.
- doc-checker.js: filter to active profile, skip disabled-state files.
Phase 3 — auto-detection on first run
- scaffold-profiles.js: detectRepoProfile() heuristics (nix→infra,
terraform→infra, react→app, node-no-ui→library, docs-only→docs, else→app).
- agentic-docs-scaffold.js: reads profile from manifest, auto-detects on first
run, persists to manifest, filters SCAFFOLD_FILES to active profile.
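An illustrative reduction of the heuristics; the real detectRepoProfile() inspects the working tree, and the react proxy used here (presence of .jsx/.tsx files) is an assumption:

```javascript
// Illustrative reduction of detectRepoProfile() operating on a file-name
// list instead of the real working tree. Heuristic order: nix/terraform
// -> infra, react -> app, node-no-ui -> library, docs-only -> docs, else app.
function detectRepoProfileSketch(files) {
  const has = (name) => files.some((f) => f === name || f.endsWith(`/${name}`));
  if (has('flake.nix') || has('default.nix') || has('main.tf')) return 'infra';
  if (has('package.json')) {
    // .jsx/.tsx presence as a react proxy is this sketch's assumption
    return files.some((f) => /\.(jsx|tsx)$/.test(f)) ? 'app' : 'library';
  }
  if (files.length > 0 && files.every((f) => f.endsWith('.md'))) return 'docs';
  return 'app';
}
```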
Phase 4 — migrate command
- commands-scaffold-migrate.js: sf scaffold migrate --profile <name>
Re-enables pending files entering the new profile; stamps files leaving it
state=disabled (or prunes them with --prune); warns on files in the
editing/completed states.
- commands/handlers/ops.js, commands/catalog.js: registered and tab-completed.
Phase 5 — custom profiles + PREFERENCES.md frontmatter
- scaffold-profiles.js: readPreferencesProfile(), loadCustomProfileSet()
(~/.sf/profiles/<name>.yaml with extends/add/remove), resolveActiveProfileSet()
implementing full ADR-022 §6 precedence.
- All callers updated to use resolveActiveProfileSet as the single source of truth.
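The extends/add/remove merge can be sketched as follows; YAML parsing is elided and the shapes are illustrative, not the ADR-022 schema verbatim:

```javascript
// Hypothetical sketch of a custom-profile merge: start from the extended
// base profile's file set, then apply add and remove in that order.
function resolveCustomProfile(spec, baseProfiles) {
  const set = spec.extends ? new Set(baseProfiles[spec.extends]) : new Set();
  for (const file of spec.add ?? []) set.add(file);
  for (const file of spec.remove ?? []) set.delete(file);
  return set;
}
```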
Tests: 28 new tests in adr-022-scaffold-profiles.test.mjs — all passing.
Pre-existing node:test stubs (3 files) unaffected.
ADR: docs/dev/ADR-022-scaffold-profiles.md
Misc: triage TODO.md dump into BACKLOG.md (phases-helpers export error T1,
/todo triage typed-handler gap T1, structured triage tiers T2, sha-track
markdown files T2, cross-repo triage T3). Reset TODO.md to empty template.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Documents every folder under .agents/, what it contains, and the
override-by-same-name pattern. Explains YOLO as a flag, not a mode.
Notes that state.yaml is globally ignored but the spec file under
.agents/ must be tracked.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
.agents/ is an override layer. Default modes (ask/build/autonomous)
and default skills come from SF's built-in config. Project files only
exist when overriding or adding something project-specific.
- Remove modes/ask.md, modes/build.md, modes/autonomous.md (defaults)
- Remove enabled.modes from manifest (nothing project-defined)
- Policies and skills stay: they are project-specific overrides
To override a mode or skill, add a file with the same name.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add modes/autonomous.md — third SF mode (ask/build/autonomous).
Describes UOK dispatch loop, bash 120s timeout, fresh-context-per-unit,
recovery/runaway-guard, and when to use vs Build.
- Add autonomous to enabled.modes in manifest.yaml.
- Update policies/yolo.yaml description: YOLO is a flag on Build or
Autonomous, not a mode, not a Shift+Tab stop.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
sf-wiki, forge-autonomous-runtime, forge-command-surface, nix-build,
and smoke-test are all present in .agents/skills/ and must be declared
in enabled.skills per the AGENTS-1 spec.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
.agents/skills/ is the documented standard for project-level skill overrides
(docs/user-docs/skills.md). .sf/skills/ is also searched but .agents/skills/
is the ecosystem-standard path used across all compatible agents.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the fragmented (AGENTS.md + CLAUDE.md + .github/copilot-instructions.md
+ .sf/STYLE.md + .sf/PRINCIPLES.md + .sf/NON-GOALS.md) surface with a
single canonical .agents/ tree per https://github.com/agentsfolder/spec.
Structure:
  .agents/manifest.yaml                      spec metadata + defaults + project info
  .agents/prompts/
    base.md                                  project-agnostic base prompt
    project.md                               SF-specific: purpose-first, DB-first,
                                             build pipeline, Ask/Build/YOLO model
    snippets/{style,principles,non-goals}.md short pointers into
                                             .sf/{STYLE,PRINCIPLES,NON-GOALS}.md
                                             for composition
  .agents/modes/{ask,build}.md               YAML front matter + human-readable body
  .agents/policies/{default-safe,yolo}.yaml  conservative default + YOLO override
  .agents/skills/.gitkeep                    empty per spec — SF's own skills not yet
                                             migrated to agentskills.io format
  .agents/scopes/.gitkeep                    single-tree, no scopes yet
  .agents/profiles/.gitkeep                  no overlays yet
  .agents/schemas/.gitkeep                   generated by validators
  .agents/state/.gitignore                   excludes state.yaml from VCS per spec
Status: spec is pre-1.0 (specVersion 0.1.0 pinned). No agent runtime
currently reads .agents/ — this is structural adoption ahead of
ecosystem support. Legacy files (AGENTS.md, CLAUDE.md, etc.) kept
during the transition; .agents/ is now the canonical surface and they
will eventually point here.
This is the reference template; centralcloud/infra, operations-memory,
oncall-mobile-android to follow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
.sf/skills/ is the project-local skill override directory. This override
inherits all sf-wiki defaults and adds one project-specific rule: wiki
pages use UPPERCASE filenames (INDEX.md, ARCHITECTURE.md, etc.) to match
the .sf/ operational file convention (DECISIONS.md, KNOWLEDGE.md, etc.).
The built-in src/resources/skills/sf-wiki/SKILL.md stays generic (lowercase).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
sf-wiki is a built-in read-only skill — its page name defaults must
stay generic (lowercase). The uppercase convention is this repo's
project-level choice, documented in system.md and the wiki itself.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All .sf/ operational files use UPPERCASE (DECISIONS.md, KNOWLEDGE.md, etc.).
Wiki pages now follow the same convention: INDEX.md, ARCHITECTURE.md,
WORKFLOWS.md, SUBSYSTEMS.md, GLOSSARY.md.
Also updates sf-wiki SKILL.md and system.md prompt references.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Final settled design: sha + git ref only, no DB content snapshots at
all. The mid-edit case (file observed dirty) loses the ability to
reconstruct the intermediate working-tree state, but the change-detection
signal is preserved and the operator can commit first if
intermediate fidelity matters.
Trades a corner-case fidelity loss for a much simpler schema and
no DB-vs-disk content duplication. Git remains the only version
store; the DB row is a pure "where I left off" pointer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without storing snapshots we lose the ability to diff against
"what SF last saw". The fix is hybrid: store the git commit SHA1
that contained the observed content (cheap, no DB blob), and only
fall back to a gzipped snapshot when the file was observed with
uncommitted changes (no git ref exists for that exact content).
For files that are .sf/-generated, untracked, and in .gitignore, the
right answer is to not track them in this table at all.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per follow-up: SF generates many of these .md files itself (.sf/wiki/*,
.sf/milestones/**/*.md, docs/plans/**), so storing gzipped snapshots in
the DB would duplicate disk + git for no benefit.
Simpler design: store only the sha + meta in sf.db; compute diffs
on demand against `git show HEAD:<path>`. Naturally handles both
"working-tree edit not yet committed" and "another agent committed
while SF wasn't running".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per follow-up: not just .sf/milestones/**/*.md but the broader set of
markdown files that SF (or humans) treat as authoritative — AGENTS.md,
.github/copilot-instructions.md, .sf/wiki/**, docs/adr/**,
docs/plans/**, and root-level meta files.
Explicit out-of-scope list: TODO.md (reset every cycle by triage),
CHANGELOG.md / BUILD_PLAN.md (append-only by design), vendored or
generated content. Tracking those would just be noise.
Spec includes a tracked_md_files schema, the walk/diff/surface flow,
and an honest accounting of storage cost (~40 bytes per file + optional
gzipped snapshot).
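A hypothetical DDL rendering of that schema; column names and types are illustrative, and the spec, not this snippet, is authoritative:

```javascript
// Illustrative tracked_md_files DDL as a literal (the spec's schema may
// differ). Covers the ~40-byte row plus the optional dirty-file snapshot.
const TRACKED_MD_FILES_DDL = `
CREATE TABLE IF NOT EXISTS tracked_md_files (
  path        TEXT PRIMARY KEY,  -- repo-relative path of the tracked .md file
  sha         TEXT NOT NULL,     -- content sha recorded at last observation
  git_ref     TEXT,              -- commit containing the observed content, if any
  snapshot    BLOB,              -- optional gzipped snapshot (dirty-file fallback)
  observed_at TEXT NOT NULL      -- ISO-8601 timestamp of the observation
);`;
```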
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures a real bug class observed during today's session: nothing
notices when a milestone file (CONTEXT.md, ROADMAP.md, slice PLAN.md,
etc.) is edited out of band — by a human, another agent, or a git pull.
SF keeps using the cached state and drifts.
Wanted: per-file sha tracking in sf.db, a diff surface on change, and
hooks for accept/reject/import/archive. Storage cost negligible.
Useful in concert with the cross-repo triage and slash-command routing
gaps already in this TODO.md — together they close most of the
"unattended SF actually works" surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit (1fb4b9882) captured only the reset and lost my intended
additions due to a Read/Write race. Re-applying the four feature
requests from today's dogfooding session:
- Cross-repo `triage-all-repos` (real fix for the "many TODO.md files"
surface area — single tool, per-repo SF dbs, unified read-only
aggregation view).
- Slash-command routing fix (`/todo triage` is currently re-implemented
by the agent's LLM, bypassing the typed backend; patches to
commands-todo.js were silently inert).
- Structured tier/priority per triage item (today tiers exist only in
LLM-prose appended to BUILD_PLAN.md; no parser-friendly field for
"promote Tier 1 items").
- Phases-helpers stale-export error that fires on every SF run; needs
either the missing export restored or a test that catches it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four feature requests captured from today's dogfooding session:
- Cross-repo `triage-all-repos` (real fix for the "many TODO.md files"
surface area — single tool, per-repo SF dbs, unified read-only
aggregation view).
- Slash-command routing fix (`/todo triage` is currently re-implemented
by the agent's LLM, bypassing the typed backend; patches to
commands-todo.js were silently inert).
- Structured tier/priority per triage item (today tiers exist only in
LLM-prose appended to BUILD_PLAN.md; no parser-friendly field for
"promote Tier 1 items").
- Phases-helpers stale-export error that fires on every SF run; needs
either the missing export restored or a test that catches it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Complete the standard wiki page set from sf-wiki SKILL.md:
- subsystems.md: table of all subsystems with path, purpose, tests
- glossary.md: project-specific terms (ADR, UOK, PDD, YOLO, wiki, etc.)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>