Commit graph

4430 commits

Mikael Hugo
454e051aed feat(uok): slice 3a — triage --apply emits 4 schema-v2 UOK gates
First production caller of the schema-v2 writer chain. Every
`sf headless triage --apply` invocation now emits four gate_run trace
events with surface=headless, runControl=supervised,
permissionProfile=high, traceId=flowId — making the gates visible in
`status uok --json`
with coverageStatus: "ok" (or fail/manual-attention on reject paths).

Gates emitted, in order:

  1. trusted-agent-source-gate — fires on the trust precondition:
       pass: both triage-decider and rubber-duck are SF-shipped built-ins
       fail: missing-agent OR non-builtin source OR untrusted custom runner
       (covers all three pre-dispatch refusal paths so operators see the
       failure in status uok, not just in the journal)
  2. triage-plan-validation-gate — fires on the strict-parse contract:
       pass: parseTriagePlanStrict returns a valid plan covering expectedIds
       fail: missing marker / bad yaml / unknown id / outcome-required field missing
  3. triage-apply-review-gate — fires on the rubber-duck verdict:
       pass: rubber-duck: agree → apply phase proceeds
       fail: rubber-duck disagreed → clean pause, no mutations
       manual-attention: rubber-duck subagent failed to complete
  4. triage-apply-mutation-gate — fires after applyTriagePlan:
       pass: every approved mutation landed
       fail: any rejected mutation
       manual-attention: zero approved mutations (all decisions were "fix")
     Includes counts in extra: resolvedCount, rejectedCount, pendingFixCount.
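
The mutation gate's outcome rule can be sketched as a small predicate over the apply counts (a minimal sketch; the function name and shape are assumptions, not SF internals):

```javascript
// Sketch of the triage-apply-mutation-gate outcome rule described above.
// resolvedCount/rejectedCount are the counts reported in extra.
function mutationGateOutcome({ resolvedCount, rejectedCount }) {
  if (rejectedCount > 0) return "fail";               // any rejected mutation
  if (resolvedCount === 0) return "manual-attention"; // zero approved mutations
  return "pass";                                      // every approved mutation landed
}
```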

Reader-side fixes (codex review follow-up on slice 3a):

  - getDistinctGateIds (sf-db-gates.js) now UNIONs trace-event IDs with
    quality_gates DB IDs instead of returning trace IDs early when any
    exist. The old behavior silently hid slice-scoped DB-only gates the
    moment a flow-scoped trace landed.
  - getGateMeta (headless-uok-status.ts) now reads BOTH trace events and
    DB row, then picks whichever has the later evaluatedAt. Tie-break
    prefers trace (flow-scoped gates with no quality_gates FK row are
    trace-only). Old behavior preferred trace whenever surface was set,
    regardless of timestamp.
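
The later-evaluatedAt pick can be sketched as follows (record shapes are assumed; the real logic lives in getGateMeta):

```javascript
// Sketch of the reader-side merge: take whichever record has the later
// evaluatedAt; ties prefer the trace record, because flow-scoped gates
// may have no quality_gates row at all.
function pickGateMeta(traceMeta, dbMeta) {
  if (!traceMeta) return dbMeta ?? null;
  if (!dbMeta) return traceMeta;
  return dbMeta.evaluatedAt > traceMeta.evaluatedAt ? dbMeta : traceMeta;
}
```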

Live verification: ran `sf headless triage --apply` 4 times against the
operator's environment (rubber-duck is a project-level override).
trusted-agent-source-gate now shows in `sf headless status uok --json`
with total: 4, fail: 4, coverageStatus: "ok" — proving the schema-v2
metadata round-trips through the trace events and reaches the
classifier.

Tests:
  - headless-triage-uok-gates.test.ts (3 new tests): agree path emits
    3 pass gates with v2 metadata; disagree path emits review fail;
    unknown-id path emits validation fail with no review gate.
  - Existing test suites adjusted for the GateMetadataRow →
    GateRunContextRow rename (classifier helpers renamed consistently
    across .ts source and the .mjs test mirror).
  - Full SF + headless apply: 1681/1681.

Still legacy in production (slice 3b targets these next):
  - phases-pre-dispatch.js gates: resource-version-guard,
    pre-dispatch-health-gate, planning-flow-gate. None of these pass
    uokContext yet.
  - phases-unit.js gates: unit-verification-gate, plan-gate.
  - plan-slice.js: Q3/Q4/Q5/Q6/Q7/Q8 seed gates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 18:04:50 +02:00
Mikael Hugo
f0c57b58c6 feat(uok): slice 2 — schema-v2 metadata adapter + writer chain
Second slice of "Make UOK the SF Control Plane". Wires the DB-level
capability for schema-v2 gate metadata so future callers can flip
quality_gates rows from "legacy" to "ok"/"stale"/"incomplete" by
passing a canonical uokContext. No production caller passes ctx yet —
slice 3 wires producers (headless triage --apply, phases-pre-dispatch,
phases-unit).

Schema migration v66 (SCHEMA_VERSION bumped 65 → 66):
  - quality_gates gains 5 nullable columns: surface, run_control,
    permission_profile, trace_id, parent_trace.
  - Idempotent ALTERs via PRAGMA table_info probes — fresh-DB CREATE
    path already includes the columns; migration only ALTERs older DBs.
  - Existing rows keep NULL across the new columns, so classifyCoverage
    in headless-uok-status reads them as "legacy" — no day-one warning
    flood.
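
The idempotent-ALTER pattern can be sketched against a minimal db stub (the db API shape here is an assumption; the column names are from the migration):

```javascript
// Sketch: probe existing columns via PRAGMA table_info, then ALTER only
// the ones that are missing, so re-running the migration is a no-op.
const NEW_COLUMNS = ["surface", "run_control", "permission_profile", "trace_id", "parent_trace"];

function migrateQualityGates(db) {
  const existing = new Set(db.pragmaTableInfo("quality_gates").map((c) => c.name));
  for (const col of NEW_COLUMNS) {
    if (!existing.has(col)) {
      db.exec(`ALTER TABLE quality_gates ADD COLUMN ${col} TEXT`); // nullable by default
    }
  }
}
```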

New adapter src/resources/extensions/sf/uok/run-context.js:
  - buildUokRunContext(opts) validates and normalizes the canonical
    camelCase shape: surface, runControl, permissionProfile, traceId
    (required), plus parentTrace, unitType, unitId, milestoneId,
    sliceId, taskId (optional). Frozen on success, null on any invalid
    or missing required field.
  - VALID_SURFACES / VALID_RUN_CONTROLS / VALID_PERMISSION_PROFILES
    enums reject typos at build time so we don't get silent schema-v2
    rows with garbage in the enum columns.
  - uokRunContextToGateColumns(ctx) translates camelCase → snake_case
    column shape used by sf-db-gates writers.
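
A minimal sketch of the adapter contract (the enum contents beyond `headless` are unknown here, so VALID_SURFACES is a placeholder subset):

```javascript
// Sketch of buildUokRunContext: required fields must be non-empty strings,
// enum typos are rejected, and the result is frozen; null on any failure.
const VALID_SURFACES = new Set(["headless"]); // assumed subset of the real enum
const REQUIRED_FIELDS = ["surface", "runControl", "permissionProfile", "traceId"];

function buildUokRunContext(opts) {
  if (!opts) return null;
  for (const field of REQUIRED_FIELDS) {
    if (typeof opts[field] !== "string" || opts[field] === "") return null;
  }
  if (!VALID_SURFACES.has(opts.surface)) return null; // reject typos at build time
  return Object.freeze({ ...opts });                  // frozen on success
}
```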

Writer chain (sf-db-gates.js):
  - insertGateRow now imports uokRunContextToGateColumns and translates
    g.uokContext (canonical camelCase) to the SQL column shape. Callers
    pass canonical ctx, the DB writer owns translation. NULL on legacy
    callers, NULL on malformed ctx.
  - saveGateResult mirrors the same translation; uses COALESCE(:col,
    col) so a missing ctx on a follow-up update preserves the row's
    existing schema-v2 metadata instead of nulling it.
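
The COALESCE semantics can be modeled in plain JS (illustrative only; the real writer does this in SQL):

```javascript
// JS model of COALESCE(:col, col): a null/undefined incoming value
// preserves the row's existing column value instead of nulling it.
const SCHEMA_V2_COLUMNS = ["surface", "run_control", "permission_profile", "trace_id"];

function applyGateUpdate(row, cols) {
  const next = { ...row };
  for (const col of SCHEMA_V2_COLUMNS) {
    next[col] = cols?.[col] ?? row[col];
  }
  return next;
}
```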

Reader chain (headless-uok-status.ts):
  - getGateMeta SELECTs surface, run_control, permission_profile,
    trace_id alongside scope and evaluated_at. ORDER BY uses
    "evaluated_at IS NULL, evaluated_at DESC" for cross-SQLite safety
    (NULLS LAST is not portable).
  - classifyCoverage signature changed from (entry, metadataPresent:
    bool) to (entry, meta: GateMetadataRow). Returns "incomplete" when
    surface is set but runControl/permissionProfile/traceId missing —
    surfaces buggy writers instead of silently classifying as "ok".
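
The portable NULLS-LAST ordering can be modeled as a JS comparator (illustrative; the real ordering happens in the SQL ORDER BY):

```javascript
// JS model of "evaluated_at IS NULL, evaluated_at DESC": rows without
// a timestamp sort last, then newest first.
function compareGateRows(a, b) {
  const aNull = a.evaluated_at == null;
  const bNull = b.evaluated_at == null;
  if (aNull !== bNull) return aNull ? 1 : -1; // NULLs last
  if (aNull) return 0;
  return b.evaluated_at - a.evaluated_at;     // newest first
}
```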

Tests:
  - uok-run-context.test.mjs (12 tests): adapter validation, enum
    rejection, optional-field handling, frozen output, column
    translation.
  - uok-quality-gates-writer.test.mjs (5 tests): real DB round-trip
    proving insertGateRow + saveGateResult populate schema-v2 columns
    from canonical camelCase ctx, leave NULL on legacy/malformed,
    and preserve existing metadata via COALESCE on no-ctx updates.
  - headless-uok-status.test.mjs adjusted: classifier now takes
    GateMetadataRow; added test for "incomplete" classification.
  - sf-db-migration.test.mjs bumped expected version 65 → 66 and
    asserts the 5 new quality_gates columns exist.

Full SF suite: 1678/1678 ✓ (+17 tests from slice 2, on top of +9 from
slice 1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 17:48:05 +02:00
Mikael Hugo
c058bef26d feat(uok-status): slice 1 — schema v2 + coverage classification + legacy tagging
First slice of "Make UOK the SF Control Plane". Ships the operator-
facing visibility primitive that subsequent slices fill in. No
enforcement yet, no new gates yet — just the contract.

Changes to sf headless status uok:

  - Bumps JSON output to schemaVersion: 2.
  - Adds coverageStatus per gate (ok | stale | incomplete | missing
    | legacy). Slice 1 only populates ok / stale / legacy:
      - legacy   row predates schema-v2 metadata (every existing row
                 today). NOT a warning — operators are not paged for
                 the rich history of pre-v2 records.
      - stale    schema-v2 row with no runs in window, OR last run
                 older than the 24h stale threshold. Surfaces gates
                 that stopped being exercised.
      - ok       schema-v2 row with recent runs in window.
    incomplete / missing wait for the schema-v2 writer adapter
    (slice 2) and the configured-gate registry (later).
  - Adds the Coverage column to the human table output.
  - Removes the stale "missing getDistinctGateIds import" workaround
    comment from headless-uok-status.ts:104. The import exists today
    (gate-runner.js:5); the comment was lying. Bypassing
    UokGateRunner.getHealthSummary is still appropriate but for a
    different reason — documented inline.
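
The slice-1 classification rules above can be sketched as follows (field names and the exact threshold handling are assumptions):

```javascript
// Sketch of slice-1 classifyCoverage: legacy wins over freshness; stale
// fires on no-runs-in-window or last run older than the 24h threshold.
const STALE_MS = 24 * 60 * 60 * 1000;

function classifyCoverage(entry, now) {
  if (!entry.hasSchemaV2Metadata) return "legacy"; // predates schema v2
  if (entry.runsInWindow === 0) return "stale";    // gate stopped being exercised
  if (now - entry.lastRunAt > STALE_MS) return "stale";
  return "ok";
}
```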

Tests (28 total, +9 new):
  - classifyCoverage: legacy wins over freshness; ok requires
    metadata + recent runs; stale fires on no-runs-in-window or
    last-run > 24h.
  - empty-DB does not false-positive coverage warnings (the bug
    codex called out in the plan review).
  - formatTable includes the Coverage column and renders each status
    distinctly.

hasSchemaV2Metadata is a placeholder that returns false today; it
will read row.surface / row.run_control / row.permission_profile
when those columns ship in slice 2.

Next slice: adapter foundation — start writing schema-v2 metadata
into new gate rows from headless and autonomous paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 17:35:52 +02:00
Mikael Hugo
12f5eb2279 feat(triage): wire --apply CLI + canonical resolve_issue evidence kinds
Three coupled changes that together complete the operator-facing
--apply surface for sf headless triage:

1. headless.ts: parse --apply from commandArgs and forward to
   handleTriage. The triage option flow now distinguishes inspect
   (--list, --json), one-shot (--run), and orchestrated apply
   (--apply) cleanly.

2. help-text.ts: triage subcommand line + examples block now document
   the --apply mode (triage-decider → rubber-duck pipeline).

3. bootstrap/db-tools.js: resolve_issue tool now accepts the full
   canonical evidence-kind set instead of hardcoding "agent-fix":
   - agent-fix (default; commit-based fix evidence)
   - human-clear (stale, superseded, false positive, intentional close)
   - promoted-to-requirement (with required requirement_id)
   The tool surfaces a clear error when promoted-to-requirement is
   used without requirement_id. The promptGuidelines were updated to
   walk callers through choosing the right kind.

   self-feedback-db.test.mjs extended with coverage for all three
   evidence kinds + the missing-requirement_id rejection path.

Together these make sf headless triage --apply genuinely useful: the
agent can produce a plan with any outcome, rubber-duck reviews it,
and the runner applies via resolve_issue with the right evidence
kind per decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 17:23:10 +02:00
Mikael Hugo
1881918ab8 feat(subagent): prompt-parts runtime — canonical named-parts composition
New module: src/resources/extensions/sf/subagent/prompt-parts.js.
Replaces the copilot-shaped boolean include* matrix with a canonical
SF-native form:

  promptParts: [aiSafety, toolInstructions, parallelToolCalling,
                customAgentInstructions, environmentContext,
                agentBody, ...]

Each part is a registered renderer (PROMPT_PARTS) that emits a
specific section text given context. composeAgentPrompt orders parts
deterministically, deduplicates, and concatenates with consistent
separators. validatePromptParts rejects unknown keys at agent-load
time so typos surface immediately instead of silently producing an
empty section.
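
A minimal sketch of the named-parts runtime (renderer bodies and the registry subset are placeholders; the part keys come from the commit):

```javascript
// Sketch: each part is a registered renderer; composition filters the
// registry's key order, which both orders deterministically and dedupes.
const PROMPT_PARTS = {
  aiSafety: () => "## Safety\n(placeholder)",
  toolInstructions: (ctx) => `## Tools\n${ctx.tools.join(", ")}`,
  agentBody: (ctx) => ctx.body,
};

function validatePromptParts(parts) {
  const unknown = parts.filter((p) => !(p in PROMPT_PARTS));
  if (unknown.length > 0) throw new Error(`unknown promptParts: ${unknown.join(", ")}`);
}

function composeAgentPrompt(parts, ctx) {
  validatePromptParts(parts); // typos surface at load time, not as empty sections
  return Object.keys(PROMPT_PARTS)
    .filter((key) => parts.includes(key))
    .map((key) => PROMPT_PARTS[key](ctx))
    .join("\n\n");
}
```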

Integrated into:
  - subagent/agents.js: validateAgentDefinition runs the new
    validator at agent discovery; built-in agents must validate
    (project/user agents with invalid promptParts get skipped).
  - subagent/index.js: dispatch path uses composeAgentPrompt to
    assemble the runtime system prompt.
  - unit-context-manifest.js: unit-type manifests declare their
    promptParts allowlist; validation runs against the same registry
    so unit dispatch and agent dispatch share one canonical schema.
  - agents/rubber-duck.agent.yaml: converted from the boolean
    include* form to the canonical array form.

Tests:
  - subagent-agent-yaml.test.mjs: validates the array shape, rejects
    unknown part keys, asserts built-in agents validate cleanly,
    project overrides win.
  - unit-context-manifest-prompt-parts.test.mjs (new): asserts every
    unit-type manifest's promptParts is valid per the registry.

The copilot boolean-include shape is intentionally NOT supported:
this is the SF-native canonical form, simpler to read and harder to
typo (no silent no-op for misspelled keys).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 17:22:26 +02:00
Mikael Hugo
f038f2a072 fix(uok-gate-runner): use correct getRelevantMemoriesRanked API
The "Memory enrichment failed for gate test: DB error" warning in test
output was a real API mismatch, not a benign degradation. The previous
code called getRelevantMemoriesRanked(embedding, "gotcha", 2) but the
canonical signature is getRelevantMemoriesRanked(query, limit).

Replace the embedding-based call with a query-string built from
gateId + failureClass + rationale, and pass limit=2. The embedding
helper (computeGateEmbedding) is removed entirely since the memory
store does its own embedding internally.

Also switch the enrichment-failure log from logWarning to debugLog —
gate enrichment is best-effort and must not affect gates, so the
failure path should not surface as a warning to operators.

Test fixture updated to assert against the new API call shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 17:21:18 +02:00
Mikael Hugo
6851869c00 refactor(auto): rename promptParts → promptCacheSplit in run-unit path
The cache-split signal {before, after} was named promptParts in the
autonomous-unit dispatch path, overloading the same term that
.agent.yaml uses for declarative prompt-section composition. With the
prompt-parts runtime landing as canonical (`aiSafety`,
`toolInstructions`, ...), the overload becomes confusing —
promptParts now means "list of declarative section keys", not
"before/after cache-split tuple".

Renames in run-unit.js, phases-unit.js (call site), and
run-unit.test.mjs. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 17:20:59 +02:00
Mikael Hugo
289bf9e264 fix(triage-apply): strict plan validation + custom-runner guard + per-decision failures
Codex review follow-up (2026-05-14) addressed all three remaining
issues from the earlier rescue pass:

1. Strict plan validation. parseTriagePlanStrict refuses the WHOLE
   plan on any malformed item instead of silently dropping. Enforces:
   - completion marker "Self-feedback triage complete" present
   - exactly one fenced ```yaml block
   - every decision has non-empty id + outcome ∈ {fix, promote, close}
   - outcome-specific required fields (close → reason; promote →
     reason + requirement_id; fix → proposed_approach)
   - duplicate ids rejected
   - when expectedIds is supplied, decisions must cover the candidate
     set exactly — no extras (hallucinated ids), no missing
   Returns ParseTriagePlanResult with {plan, error} so the caller can
   surface the specific failure reason.

2. Custom-runner trust guard. runTriageApply refuses an injected
   options.agentRunner unless allowUntrustedRunner is also explicitly
   set. Production callers cannot inject a runner. Without this guard
   a custom runner could side-channel-mutate the ledger despite the
   read-only tool override (codex Q2).

3. Per-decision failure surfacing. applyTriagePlan now returns
   {resolvedIds, rejectedIds, pendingFixIds} instead of just
   resolvedIds. runTriageApply reports ok=false if rejectedIds is
   non-empty, with the count + ids in the error message. Mutations
   still happen one-by-one (no SQL transaction wrapping) but the
   failure is no longer silent (codex Q3).
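
The per-decision rules in point 1 can be sketched as a validator (validation step only; the completion-marker and single-YAML-block checks from the real parseTriagePlanStrict are omitted, and names beyond those in the commit are assumptions):

```javascript
// Sketch of strict per-decision validation: refuse the WHOLE plan on any
// malformed item, and require exact coverage of expectedIds when supplied.
const REQUIRED_BY_OUTCOME = {
  close: ["reason"],
  promote: ["reason", "requirement_id"],
  fix: ["proposed_approach"],
};

function validateDecisions(decisions, expectedIds) {
  const seen = new Set();
  for (const d of decisions) {
    if (!d.id) return { plan: null, error: "decision missing id" };
    if (!(d.outcome in REQUIRED_BY_OUTCOME)) return { plan: null, error: `${d.id}: bad outcome` };
    if (seen.has(d.id)) return { plan: null, error: `duplicate id ${d.id}` };
    const missing = REQUIRED_BY_OUTCOME[d.outcome].filter((f) => !d[f]);
    if (missing.length > 0) return { plan: null, error: `${d.id}: missing ${missing.join(", ")}` };
    seen.add(d.id);
  }
  if (expectedIds) {
    const expected = new Set(expectedIds);
    const extras = [...seen].filter((id) => !expected.has(id));   // hallucinated ids
    const absent = [...expected].filter((id) => !seen.has(id));   // uncovered candidates
    if (extras.length > 0 || absent.length > 0) {
      return { plan: null, error: "decisions must cover the candidate set exactly" };
    }
  }
  return { plan: { decisions }, error: null };
}
```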

Tests: src/tests/headless-triage-apply.test.ts now covers:
   - agree-path runs both agents in order; apply fails on missing
     ledger entry → ok=false, rejectedIds populated (the realistic
     contract for a test fixture without a seeded DB)
   - custom runner without allowUntrustedRunner refuses, agentRunner
     never invoked
   - rubber-duck disagrees → clean pause, ok=false, agreed=false
   - decider fails → skip rubber-duck
   - unknown id in plan rejected before review

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 17:19:12 +02:00
Mikael Hugo
d8ce433c7a fix(triage-apply): plan-and-review pipeline, no mutations before agreement
Codex review (2026-05-14) flagged the original runTriageApply design as
unsafe: triage-decider was invoked with resolve_issue in its tool list,
so it could (and would) close ledger entries during its own turn —
BEFORE rubber-duck saw the decisions. If rubber-duck disagreed, the
mutations from phase 1 had already landed with no rollback path.

Restructured to a 3-phase plan-and-review pipeline:

  Phase 1 — Plan: triage-decider runs READ-ONLY (resolve_issue removed
    from both the YAML and the runner's tool override) and emits a
    structured YAML plan as a fenced block. The plan is the contract;
    parseTriagePlan extracts it.

  Phase 2 — Review: rubber-duck reads the parsed plan + the original
    ledger entries and votes "rubber-duck: agree" or names concerning
    decisions. Read-only tools.

  Phase 3 — Apply: ONLY on agreement, this runner (not an agent) calls
    markResolved for each close/promote decision. Fix decisions are
    surfaced to the operator and never auto-mutate.

Other codex-flagged gaps addressed:

  - Trusted-source guard: --apply refuses to run when either agent has
    source != "builtin". Project/user overrides shadow built-ins (the
    documented precedence), but they don't get to silently disable
    rubber-duck's independence. Operators can still customize via
    --review mode.

  - Plan-not-emitted is a hard refuse: if the decider's output has no
    parseable ```yaml decisions: block, the apply runner returns
    ok=false with a clear error. We can't audit what we can't read.

  - Disagreement is a clean pause, not an error: returns ok=false with
    agreed=false and both outputs preserved for operator review.

  - The triage-decider YAML's prompt now codifies the plan-only contract
    explicitly: "You do not call resolve_issue. You produce a structured
    decision plan."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 17:10:43 +02:00
Mikael Hugo
ab682ddd6e feat(subagent): built-in rubber-duck + triage-decider agent YAMLs
First slice of putting the triage/rubber-duck flow into SF itself
(sf-mp5lnlbc-ty5fec). Two built-in agent definitions ship with SF and
get auto-discovered alongside operator-defined ones — no setup needed.

agents/rubber-duck.agent.yaml
  Devil's-advocate critic. Tools: "*". Reviews any artifact (default
  consumer: triage --apply pipeline) and surfaces ONLY confidently-real
  concerns. High-signal output: "rubber-duck: agree" or `## Concern N:`
  sections with evidence citations. Never proposes fixes.

agents/triage-decider.agent.yaml
  Self-feedback queue decider. Tools: [resolve_issue, view, grep, glob,
  git_log] — read-only investigation plus the one mutating tool needed
  to close/promote entries. No edit/write/bash — code fixes go to the
  operator. Implements the existing buildInlineFixPrompt protocol
  (Fix/Promote/Close per entry).

Both YAMLs include the copilot-style promptParts block as intent
documentation. SF's prompt-composition runtime doesn't honor those
flags yet; the day it lands, the agents pick it up without a YAML edit.

discoverAgents now loads from a built-in directory (sibling agents/
to subagent/) with source: "builtin". User and project definitions
override built-ins by name, preserving the existing precedence model.

Tests assert: (1) both built-ins discovered with source=builtin in
scope=both, (2) project override wins over built-in. Full SF suite:
1637/1637.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:53:36 +02:00
Mikael Hugo
192129a69e fix(triage): drop defaultModel from triage candidate pool
Operator's settings.json defaultModel is for general dispatch (typically
a cheap/flash pick — gemini-3-flash-preview in current config). Mixing
it into the triage candidate pool gave it a chance to win on cost
tie-break against agentic-better but pricier options from the explicit
enabledModels allowlist.

Triage is agentic-heavy; restrict its candidate pool to the operator's
enabledModels (kimi-coding/* + minimax/* + zai/* + …) and let the
agentic-weighted router pick. Also fixes the wildcard expansion path
which was calling a non-existent ai.getModelsByProvider — now correctly
uses ai.getModels(provider).

Dogfood confirms: router now picks kimi-coding/kimi-for-coding
(agentic 90) instead of gemini-3-flash-preview (operator default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:40:19 +02:00
Mikael Hugo
98d1b2b258 feat(triage): route runTriage via model-router using operator allowlist
Drops the hardcoded "google-gemini-cli/gemini-3-pro-preview" default and
routes through SF's own model-router using a new
BASE_REQUIREMENTS["self-feedback-triage"] (agentic-heavy: coding 0.4,
instruction 0.8, reasoning 0.8, agentic 0.9).

Candidate selection priority:
  1. Explicit options.model override (operator --model)
  2. options.candidates (test injection)
  3. ~/.sf/agent/settings.json enabledModels (expanded against pi-ai
     MODELS catalog) + defaultProvider/defaultModel
  4. TRIAGE_FALLBACK_CANDIDATES — Chinese-provider set
     (kimi + minimax + zai). Gemini intentionally NOT in the fallback
     so operators who removed it from settings don't silently re-default.
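
The priority chain above can be sketched as follows (function name and option/settings shapes are assumptions):

```javascript
// Sketch of triage candidate selection: first non-empty source wins.
function selectTriageCandidates(options, settings, fallback) {
  if (options.model) return [options.model];                              // 1. explicit --model
  if (options.candidates && options.candidates.length > 0) return options.candidates; // 2. test injection
  if (settings && settings.enabledModels && settings.enabledModels.length > 0) {
    return settings.enabledModels;                                        // 3. settings.json allowlist
  }
  return fallback;                                                        // 4. TRIAGE_FALLBACK_CANDIDATES
}
```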

Dispatch walks the router-ranked list with retry-on-credential-error so
the top pick failing on missing API keys falls through to the next
candidate (caught the openai-no-key case in dogfood today).

Closes part 1 of sf-mp5khix3-9beona AC1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:29:56 +02:00
Mikael Hugo
e2dd625d7d sf snapshot: uncommitted changes after 383m inactivity 2026-05-14 16:03:35 +02:00
Mikael Hugo
2f0e5c8054 feat(subagent,run-unit): YAML agent loader + solver-pass tool scoping
Two coupled product changes from the working tree, validated together:

1. Agent YAML loader (subagent/agents.js + subagent-agent-yaml.test.mjs)
   .sf/agents/*.agent.yaml files now load as first-class agent
   definitions alongside the existing .agent.md frontmatter format.
   Adds `*` wildcard support for the tools field (unrestricted) and a
   parseAgentModel helper for the YAML-only model selector. Mirrors
   the copilot-style YAML format so SF can consume agent definitions
   shared across tools without forcing the markdown wrapping.

2. Solver-pass tool scoping (run-unit.js + phases-unit.js +
   run-unit.test.mjs)
   New scopeActiveToolsForRunUnit honors an explicit
   activeToolsAllowlist so callers can restrict a unit dispatch to a
   tighter tool set than the unit-type's default SF allowlist. The
   autonomous solver pass uses this to constrain the solver to just
   `checkpoint` — solver should reason and persist checkpoints, not
   edit files or dispatch tools. Keeps the solver inside its
   authority boundary.

Tests: 7/7 in the two affected files; full SF suite stays green.

Not in this commit: the sidekick-trigger event emission in
autonomous-solver.js and the external scripts/sidekick-runner.js +
.agents/policies/proactive-sidekick.yaml — that's an experiment
that stays in the working tree pending operator direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 09:40:13 +02:00
Mikael Hugo
7ea41b89ae feat(ai,coding-agent): wireModelId — provider deployment alias
Adds an optional wireModelId field to the Model interface and a
resolveWireModelId helper. Forge's canonical model.id stays stable for
selection, capability scoring, policy, and history; providers now send
model.wireModelId on the wire when set, model.id otherwise.

Use cases: Azure deployment names, vendor model slugs that differ
from Forge's canonical identity, A/B routing where the operator wants
canonical history but a specific deployment.
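
The stated contract reduces to a one-line helper; this is a sketch consistent with the commit text, not the verbatim implementation:

```javascript
// Sketch of resolveWireModelId: send the deployment alias when set,
// otherwise the canonical model id.
function resolveWireModelId(model) {
  return model.wireModelId ?? model.id;
}
```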

Wired through every provider in @singularity-forge/ai (anthropic,
amazon-bedrock, azure-openai-responses, google, google-vertex,
google-gemini-cli, mistral, openai-codex-responses, openai-completions,
openai-responses) plus @singularity-forge/coding-agent's
ModelRegistry (model definitions + per-model overrides).

Tests: openai-completions wireModelId payload coverage +
model-registry-auth-mode coverage for the override + definition fields.
Full pi-ai + coding-agent suite: 956/956 ✓ (7 unrelated skipped).

This realizes the model-registry contract drafted in 1d753af6b.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 09:25:21 +02:00
Mikael Hugo
a6c36a4b6b fix(headless-triage): --run takes precedence over --json/--list
Discovered via dogfood: `sf headless triage --run --json`
short-circuited to the candidate-list JSON before reaching the
dispatch path, so the run never happened.

--run is the action; --json/--list describe output format. Restructure
so --run always dispatches; --json then controls whether the run
result is JSON vs human text. Without --run, --json/--list still emit
the candidate digest as before.
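
The precedence fix can be sketched as follows (mode names are illustrative, not the real implementation):

```javascript
// Sketch: --run is the action and always dispatches; --json only picks
// the output format of whichever mode runs.
function triageMode(flags) {
  if (flags.run) return { action: "dispatch", json: Boolean(flags.json) };
  if (flags.list || flags.json) return { action: "inspect", json: Boolean(flags.json) };
  return { action: "emit-prompt", json: false };
}
```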

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 08:29:11 +02:00
Mikael Hugo
65c1914b1f test(idle-triage): lock in surfaceSelfFeedbackQueueOnIdle contract
Five unit tests covering the bail-time queue notifier landed in
001740680: notify-with-pointer when candidates exist, plural/singular
noun agreement, silent on empty queue, silent on non-forge basePath,
no-throw when downstream notify itself crashes (bail-path safety).

Locks in the contract for the partial-AC1 slice of sf-mp4rxkwb-l4baga
(autonomous loop surfaces the queue at idle) without yet touching the
larger remaining work (real self-feedback-triage unit type with
begin/dispatch/checkpoint/complete).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 08:11:10 +02:00
Mikael Hugo
fa9baf71d5 feat(secret-scan): SF_SECURITY_FAST contract for the regex-only fast path
Codifies AC4 of sf-mp4w2dij-xm6cwj: the regex-only path is the
today-default fast mode. SF_SECURITY_FAST=1 is the explicit opt-in for
callers that want to assert "regex-only, no LLM escalation, sub-100ms"
regardless of any future tiered reviewer landing in the script.

Today the env var changes only the trailing status line so operators
can verify the contract is observable. When the LLM-backed review hook
(AC1) lands, the absence of SF_SECURITY_FAST becomes the trigger for
escalation; setting it=1 keeps offline / pre-commit callers on the
fast path. Locked in by tests in both the .sh and .mjs scanners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:57:02 +02:00
Mikael Hugo
001740680b feat(headless,auto): surface self-feedback queue at autonomous-loop idle
Two thin slices toward sf-mp4rxkwb-l4baga:

1. Help text. The triage and reflect commands have shipped over the
   last few commits but neither was discoverable via `sf headless help`.
   Add both to the command list + add five usage examples covering the
   piping and --run patterns.

2. Bail-time queue notifier. When the autonomous loop is about to break
   for "no-active-milestone" or "milestone-complete" while open
   self-feedback entries still exist, surface the queue with a clear
   pointer to `sf headless triage --list` / `--run`. Best-effort wrapper
   that never throws — the proper fix (triage as a real unit type with
   begin/dispatch/checkpoint/complete lifecycle) is the larger remaining
   slice of the parent entry; this just makes the queue VISIBLE at the
   exact moment operators historically lost track of it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:44:34 +02:00
Mikael Hugo
34521814cc feat(headless): sf headless triage --run — dispatch via @singularity-forge/ai
Adds runTriage to self-feedback-drain.js, mirroring runReflection in
reflection.js: provider-agnostic dispatch via @singularity-forge/ai's
completeSimple, dependency-injectable for tests, 8-minute timeout race,
clean-finish detection on the canonical "Self-feedback triage complete"
terminator.

`sf headless triage --run [--model provider/modelId]` now dispatches the
canonical triage prompt and writes the model's decision text to
.sf/triage/decisions/<ts>.md. Operators apply the decisions (resolve_issue
calls, code edits) — a tool-enabled variant that lets the model close
entries directly is follow-up work.

Default model: google-gemini-cli/gemini-3-pro-preview (matches
DEFAULT_REFLECTION_MODEL).

Continues the bounded chip away at sf-mp4rxkwb-l4baga: triage now has
both an operator-pipe path (default) and a one-shot dispatch path (--run).
The full unit-type registration that wires this into the autonomous
dispatcher's idle path is the remaining slice of that entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:29:29 +02:00
Mikael Hugo
8fde12301f feat(headless): sf headless triage — operator-driven self-feedback drain
Adds a deterministic, turn-independent path to drain the self-feedback
queue. Modes:
  - default: emits the canonical buildInlineFixPrompt() output for
    piping into any model (sf headless triage | sf headless -p -)
  - --list:  human-readable digest sorted by impact↓ effort↑ ts↑
  - --json:  structured candidate list for tooling
  - --max N: cap candidates
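
The impact↓ effort↑ ts↑ ordering can be sketched as a comparator (field names assumed from the description):

```javascript
// Sketch of the --list sort: highest impact first, then lowest effort,
// then oldest timestamp.
function compareCandidates(a, b) {
  return (b.impact - a.impact) || (a.effort - b.effort) || (a.ts - b.ts);
}
```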

Why this matters (partial step toward sf-mp4rxkwb-l4baga): the existing
session_start drain queues triage as `triggerTurn:true,
deliverAs:"followUp"`. When autonomous mode bails at milestone
validation before any turn runs, the followUp gets dropped and the
queue stays unprocessed. This command sidesteps that by rendering the
prompt synchronously to stdout — operators can pipe it into any model
without depending on autonomous-loop turn semantics. The full
unit-type registration that fixes the underlying dispatcher gap is
larger work tracked in the parent entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:04:01 +02:00
Mikael Hugo
a342868068 feat(packages): extract @singularity-forge/openai-codex-provider
Mirrors the @singularity-forge/google-gemini-cli-provider package layout
for the codex CLI integration boundary. The new package owns:

- CodexAppServerClient (the JSON-RPC subprocess client; previously
  packages/ai/src/providers/codex-app-server-client.ts, no pi-ai
  internal coupling)
- snapshotCodexCliAccount / discoverCodexCliModels (reads
  ~/.codex/models_cache.json with visibility=list ∧ supported_in_api
  filter; previously inline in src/resources/extensions/sf/openai-codex-catalog.js)

openai-codex-responses.ts (the stream-shaping provider) intentionally
stays in @singularity-forge/ai because it depends on pi-ai stream-event
internals and is not reusable outside the provider — same scope as
google-gemini-cli.ts vs google-gemini-cli-provider.

The SF extension's openai-codex-catalog.js is now a thin SF-side cache
writer that delegates to discoverCodexCliModels, mirroring how
gemini-catalog.js delegates to discoverGeminiCliModels. readCodexAvailableModels
became async to match the dynamic-import path; tests updated.

Closes sf-mp4u5fcz-wh6ac9 (with documented AC2 narrowing — see
resolution).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:48:19 +02:00
Mikael Hugo
0694803df3 feat(model-router): explicit agentic score for every capability profile
Sweep MODEL_CAPABILITY_PROFILES so all 82 entries declare an explicit
agentic score; the agentic=50 fallback in scoreModel was silently
giving untouched profiles a generous default and letting weak agentic
models slip through execute-task routing. Anchors per the entry's
suggestedFix: coding-only ~25-40, very small/older ~30-40, older
generations ~55-70, frontier agentic ~85-95.

Adds an invariant test that asserts no profile relies on the default.

Closes sf-mp37p9u2-80f2gz.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:28:06 +02:00
Mikael Hugo
48e793c003 refactor(reflect): route reflection-pass through loadPrompt in extension
Move the loadPrompt("reflection-pass") call site from headless-reflect.ts
into a new renderReflectionPrompt helper in reflection.js. gap-audit
greps EXTENSION_SRC for loadPrompt call sites; without a hit there it
flagged the prompt as orphan even though the headless surface was using
it (sf-mp4warqc-y1u0b3).

Side benefits: fragment composition + variable validation now run via
the canonical path instead of the prior raw fs.readFile + string
substitution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:20:38 +02:00
Mikael Hugo
639dcde717 feat(self-feedback): outcomes-verification AC2 — check commit touches AC-mentioned files
Addresses sf-mp4vxusa-pn2tnd. Completes the outcomes-verification chain
filed as AC2 of the original sf-mp4rxkwn-jmp039 (AC1 was commit-exists,
shipped 4af10ac1b).

When an agent-fix resolution cites a commit_sha AND the entry has
acceptanceCriteria mentioning specific file paths, verify the cited
commit actually modifies at least one of those files. Without this
check, an agent could stamp ANY existing commit (e.g. the most recent
unrelated commit on main) as the fix evidence — the SHA exists but the
commit has nothing to do with the entry.

Implementation:

  extractFilesFromAcceptanceCriteria(acText)
    Two extraction strategies:
      1. Backticked code spans (most reliable): `src/foo.js`
      2. Bare path-like tokens (only when slash + dotted extension
         present, no whitespace, no http:// prefix, no leading digit)
    Returns [] when AC has no extractable paths — prose-only AC skips
    the check rather than rejecting (the silent-skip is the right
    failure mode here; we don't want to fabricate rejections when
    there's nothing to verify against).

  getCommitTouchedFiles(commitSha, basePath)
    Shells out to git diff-tree --no-commit-id --name-only -r <sha>.
    5-second timeout. Returns null on git failure or out-of-repo.

  Matching strategy: exact-path-set OR basename-set. The basename
  fallback tolerates the common operator informality where AC says
  "src/types.ts" but the actual change was at
  "packages/ai/src/types.ts". Exact match wins; basename match catches
  the typical case without over-trusting (still requires a file with
  that exact basename to be touched).

  Carve-out: skip the check when getCommitTouchedFiles returns null
  (git unavailable / not-a-repo) — same shape as AC1's "ungrokable"
  carve-out. The agent-fix-unverified evidence kind remains the
  explicit escape hatch for "I want agent-fix attribution but can't
  cite a verifiable commit."
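The extraction-plus-matching flow above can be sketched as follows. Names mirror the commit message but the regexes and bodies are illustrative, not the shipped implementation:

```javascript
// Hypothetical sketch of extractFilesFromAcceptanceCriteria plus the
// exact-path OR basename matching described above.
function extractFilesFromAcceptanceCriteria(acText) {
  const files = new Set();
  // Strategy 1: backticked code spans like `src/foo.js`
  for (const m of acText.matchAll(/`([^`\s]+)`/g)) {
    if (m[1].includes("/") && /\.[a-z0-9]+$/i.test(m[1])) files.add(m[1]);
  }
  // Strategy 2: bare path-like tokens (slash + dotted extension,
  // no whitespace, no http prefix, no leading digit)
  for (const m of acText.matchAll(/(?:^|\s)([a-zA-Z_][^\s`]*\/[^\s`]*\.[a-z0-9]+)/g)) {
    if (!m[1].startsWith("http")) files.add(m[1]);
  }
  return [...files]; // [] for prose-only AC → caller skips the check
}

function commitTouchesAcFiles(acFiles, touchedFiles) {
  const exact = new Set(touchedFiles);
  const basenames = new Set(touchedFiles.map((f) => f.split("/").pop()));
  // Exact match wins; basename match tolerates "src/types.ts" vs
  // "packages/ai/src/types.ts" operator informality.
  return acFiles.some((f) => exact.has(f) || basenames.has(f.split("/").pop()));
}
```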

Tests (3 new, 19 total):
  - rejects_agent_fix_when_commit_does_not_touch_AC_files: real git
    init, commit touches src/unrelated.js, AC mentions src/expected.js
    → markResolved returns false. Then commit that DOES touch expected
    → markResolved returns true.
  - skips_AC_file_check_when_AC_has_no_extractable_paths: prose-only
    AC accepts any commit.
  - AC_file_check_tolerates_basename_match: AC says src/types.ts but
    commit touches packages/ai/src/types.ts — accepted via basename.

1619/1619 SF extension tests pass; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:01:57 +02:00
Mikael Hugo
2b64f308cf feat(self-feedback): prioritization signal — impact_score + effort_estimate (v65)
Addresses sf-mp4rxkx0-fkt3e2 (gap:no-prioritization-signal-on-open-queue)
AND closes the consolidating reflection entry sf-mp4w89mv-3ulqp4 (all
four data-plane-isolation siblings now resolved: kind taxonomy,
causal-link relations, memory mirror, prioritization).

Schema v65 adds two columns to self_feedback:
  impact_score     INTEGER  (0-100; default by severity)
  effort_estimate  INTEGER  (1-5; default null → treated as 3 in selector)

Severity-derived default for impact_score, set by insertSelfFeedbackEntry
when no explicit value supplied:
  critical → 95
  high     → 80
  medium   → 50
  low      → 20

selectInlineFixCandidates now sorts by:
  1. impact desc — high-impact work first
  2. effort asc  — quick wins ahead of multi-day work at same impact
  3. ts asc      — older entries break ties (FIFO within priority)

Replaces the pure-FIFO ordering. Operators can override per-entry by
setting impact_score/effort_estimate explicitly at file time, so e.g.
a "low" severity entry with a critical real-world impact gets bumped
above routine "medium" entries.
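The three-key sort can be sketched as a comparator (hypothetical name; in the shipped code the severity default is applied at insert time by insertSelfFeedbackEntry — folding it in here just keeps the sketch self-contained):

```javascript
// Hypothetical sketch: impact desc, effort asc (null → 3), ts asc.
const DEFAULT_IMPACT_BY_SEVERITY = { critical: 95, high: 80, medium: 50, low: 20 };

function compareCandidates(a, b) {
  const impact =
    (b.impact_score ?? DEFAULT_IMPACT_BY_SEVERITY[b.severity]) -
    (a.impact_score ?? DEFAULT_IMPACT_BY_SEVERITY[a.severity]);
  if (impact !== 0) return impact;
  const effort = (a.effort_estimate ?? 3) - (b.effort_estimate ?? 3);
  if (effort !== 0) return effort;
  return a.ts < b.ts ? -1 : a.ts > b.ts ? 1 : 0; // FIFO within equal priority
}
```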

Migration is idempotent: ensureSelfFeedbackTables (the fresh-DB CREATE
path) already includes both columns, so the v65 ALTER probes via
PRAGMA table_info before adding to avoid "duplicate column" errors on
fresh DBs. Older fixtures still get the ALTER. Two ALTER guards needed
because the columns are added independently and the second probe must
see post-first-ALTER state.
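The probe-before-ALTER guard can be sketched as below (hypothetical helper; `db` is any handle exposing prepare().all() and exec(), better-sqlite3 shape):

```javascript
// Hypothetical sketch of the PRAGMA-probed ALTER described above.
function addColumnIfMissing(db, table, column, ddl) {
  const existing = db.prepare(`PRAGMA table_info(${table})`).all();
  if (existing.some((col) => col.name === column)) return false; // fresh-DB CREATE already had it
  db.exec(`ALTER TABLE ${table} ADD COLUMN ${column} ${ddl}`);
  return true;
}
```

Calling it once per column makes each probe see the current schema, so the second call observes post-first-ALTER state.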

Tests:
  sf-db-migration: assertion 64 → 65 + new impact_score/effort_estimate
                   column-exists checks
  self-feedback-drain: prioritization order test (5 entries spanning
                       all severities + explicit-effort overrides) +
                       explicit-impact-overrides-default test

1616/1616 SF extension tests pass; typecheck clean.

Note: the consolidating reflection entry sf-mp4w89mv-3ulqp4 (filed by
the reflection layer's deepest-architectural-concern finding) is now
fully addressed across 4 commits today: 2f8ee5725 (memory mirror),
83c28b756 (kind taxonomy), d40a3d21d (causal links), this commit
(prioritization). Resolves both entries in one go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:56:20 +02:00
Mikael Hugo
d40a3d21dd feat(self-feedback): causal-link relations between entries (v64 migration)
Addresses sf-mp4rxkwx-jz0soh (gap:no-causal-links-between-self-feedback-
entries). Third sibling of the consolidating reflection entry
sf-mp4w89mv-3ulqp4 (data-plane-isolation cluster).

Schema v64 adds self_feedback_relations:
  from_id        TEXT NOT NULL  (FK → self_feedback.id)
  to_id          TEXT NOT NULL  (FK → self_feedback.id)
  relation_kind  TEXT NOT NULL  (CHECK: closed enum of 5 kinds)
  created_at     TEXT NOT NULL
  PRIMARY KEY (from_id, to_id, relation_kind)
  CHECK (from_id != to_id)
  INDEX on (to_id, relation_kind) for inbound queries

Allowed kinds: supersedes, duplicate_of, blocks, root_cause_of,
partial_fix_of. The composite PK allows multiple kinds between the
same pair (e.g. "A supersedes B AND blocks B") but prevents exact
triple duplicates.

Helpers in sf-db-self-feedback.js:
  SELF_FEEDBACK_RELATION_KINDS  frozen array of allowed kinds
  linkEntries(from, to, kind)   inserts; returns true on new row,
                                 false on PK collision (idempotent),
                                 throws on FK / CHECK / unknown-kind
  getRelatedEntries(id)         returns [{id, relationKind,
                                 direction: 'outbound'|'inbound'}]
                                 — inbound + outbound in one call

Implementation note: linkEntries uses plain INSERT (NOT INSERT OR IGNORE)
so CHECK and FK violations surface as thrown errors. Idempotency for
PK collisions is implemented by catching the specific error message.
INSERT OR IGNORE would have silently swallowed self-loops and broken FKs
— exactly the kind of writer-layer bug we just fixed in 83c28b756 and
the upsertRequirement repair in f92022730.
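The plain-INSERT idempotency can be sketched as below — only the PK-collision error is swallowed; CHECK and FK violations rethrow (illustrative body, not the shipped helper):

```javascript
// Hypothetical sketch of linkEntries' error handling described above.
function linkEntries(db, fromId, toId, kind, now = new Date().toISOString()) {
  try {
    db.prepare(
      "INSERT INTO self_feedback_relations (from_id, to_id, relation_kind, created_at) VALUES (?, ?, ?, ?)"
    ).run(fromId, toId, kind, now);
    return true; // new row
  } catch (err) {
    if (/UNIQUE constraint failed/.test(String(err && err.message))) return false; // exact triple exists
    throw err; // self-loop CHECK, missing FK, unknown kind — surface to caller
  }
}
```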

Tests:
  sf-db-migration.test.mjs — 2 assertion bumps (63 → 64) + new
    self_feedback_relations table-exists check
  self-feedback-relations.test.mjs (new, 9 tests) —
    SELF_FEEDBACK_RELATION_KINDS enum shape
    linkEntries inserts new triple
    linkEntries idempotent on duplicate
    linkEntries allows multiple kinds same pair
    linkEntries throws on unknown kind (writer-layer)
    linkEntries throws on self-loop (CHECK)
    linkEntries throws on missing FK
    getRelatedEntries returns outbound + inbound
    getRelatedEntries empty for unlinked entries

1610/1610 SF extension tests pass; typecheck clean.

Note on dispatch: this work was first attempted via "sf headless -p"
to dogfood per memory rule. The dispatch ran 99s with 19 tool calls
but went off-script — modified 10+ files in packages/ai/providers/
(adding wireModelId field across all providers, separate refactor)
and never touched sf-db-schema.js or the relations table. Hand-coded
fallback applied; off-script-dispatch pattern logged as another
data point in sf-mp4rxkwb-l4baga (triage-not-a-first-class-unit-type).
The wireModelId provider changes remain uncommitted in the working
tree for operator review — they may be valuable but were not the
requested work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:46:56 +02:00
Mikael Hugo
f92022730b fix(promoter): cluster by domain:family + repair upsertRequirement field-binding
Two related fixes that complete AC4 of sf-mp4rxkwt-sfthez (kind taxonomy,
commit 83c28b756):

1. Cluster by domain:family prefix instead of exact kind string.

   The promoter was clustering on the full `kind` value, which after the
   taxonomy enforcement means every entry like
   gap:routing:tiebreak-cost-only and gap:routing:agentic-axis-partial-
   coverage stayed in cluster size 1. Empirical confirmation: live ledger
   2026-05-14 had 10 open entries, max cluster size 1 under exact-string
   matching — promoter could never fire on real diverse data.

   New behavior: extract first two segments as the cluster key. Entries
   sharing domain:family group together; legacy single-segment kinds
   cluster as themselves. With this change, the live ledger's gap:routing
   family would include 3 entries today.
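The cluster-key extraction can be sketched in a few lines (hypothetical helper name):

```javascript
// Hypothetical sketch of the domain:family cluster key described above.
function clusterKey(kind) {
  // First two segments group entries; legacy single-segment kinds
  // cluster as themselves.
  return String(kind).split(":").slice(0, 2).join(":");
}
```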

2. Repair the silently-broken upsertRequirement call (LATENT BUG).

   The promoter was calling upsertRequirement with only {id, title,
   description, status, class, source} — but the schema binds every
   column positionally including {why, primary_owner, supporting_slices,
   validation, notes, full_content, superseded_by}. SQLite cannot bind
   `undefined`, so EVERY upsert attempt threw — caught silently by the
   surrounding try/catch ("non-fatal") with no log line. Result: the
   promoter has never successfully created a requirement row in this
   project's history, regardless of clustering threshold.

   Fix: pass all schema columns explicitly with null defaults for unused
   ones. Also encode the human-readable cluster title into description's
   first line since the requirements table has no title column (separate
   schema-evolution concern, out of scope here).

Tests: new tests/requirement-promoter.test.mjs (5 tests) covers
domain:family clustering when count>=5, no cross-family clustering,
legacy single-segment kinds, below-threshold returns 0, non-forge bail.
The first test would have caught both the prefix clustering miss AND
the upsertRequirement field-binding bug — runs end-to-end through
upsertRequirement → getActiveRequirements.

1601/1601 SF extension tests pass; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:34:13 +02:00
Mikael Hugo
83c28b756c feat(self-feedback): enforce kind taxonomy at recordSelfFeedback
Addresses sf-mp4rxkwt-sfthez (gap:self-feedback-kind-vocabulary-unbounded).
The reflection report identified this as part of the deepest architectural
concern (4 entries clustered under data-plane isolation), and the
threshold-promoter was structurally unable to fire because every entry's
kind was a unique string (clusters by exact match).

Add a `domain:family[:specific]` taxonomy validated at recordSelfFeedback
write time:

  ALLOWED_KIND_DOMAINS  enum of allowed top-level domains (gap,
                        architecture-defect, architectural-risk,
                        inconsistency, runaway-loop, schema-drift,
                        janitor-gap, upstream-rollup, reflection,
                        copilot-parity-gaps, gap-audit-orphan-prompt,
                        gap-audit-orphan-command, flow-audit,
                        executor-refused, solver-missing-checkpoint,
                        runaway-guard-hard-pause,
                        self-feedback-resolution)

  KIND_SEGMENT_RE       /^[a-z][a-z0-9]*(?:-[a-z0-9]+)*$/  — kebab-case
                        per segment

  validateKind(kind)    accepts:
                          domain                      (1-segment legacy)
                          domain:family               (2-segment canonical)
                          domain:family:specific      (3-segment specific)
                        rejects: empty, non-string, >3 segments,
                                 unknown domain, non-kebab segments

recordSelfFeedback now returns null when validateKind fails, with a
warning logged via workflow-logger. Existing rows in the ledger are
grandfathered (validation only fires on NEW writes through this entry
point) so the migration is non-destructive.

This unblocks the threshold-promoter to cluster by domain:family
prefix once the requirement-promoter is updated to do so (separate
follow-up). Detectors and reflection passes can now reason about
domains rather than handfuls of unique strings.
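The validation contract can be sketched as below. The shipped ALLOWED_KIND_DOMAINS enum is the longer list above; a few entries stand in here, and the body is illustrative:

```javascript
// Hypothetical sketch of validateKind per the contract described above.
const ALLOWED_KIND_DOMAINS = ["gap", "architecture-defect", "reflection", "flow-audit"];
const KIND_SEGMENT_RE = /^[a-z][a-z0-9]*(?:-[a-z0-9]+)*$/; // kebab-case per segment

function validateKind(kind) {
  if (typeof kind !== "string" || kind.length === 0) return false;
  const segments = kind.split(":");
  if (segments.length > 3) return false;
  if (!ALLOWED_KIND_DOMAINS.includes(segments[0])) return false;
  return segments.every((s) => KIND_SEGMENT_RE.test(s));
}
```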

Tests: 3 new (canonical-shapes / malformed-rejected / non-string-rejected).
8 existing test fixtures updated to use canonical kinds (gap:test-feedback
etc.) — they were using bare slugs that the new validation correctly
rejects.

1596/1596 SF extension tests pass; typecheck clean.

Note on prior dispatch: this work was first attempted via "sf headless -p"
to dogfood the new memory rule (drive SF work through sf headless, not
parallel Claude Code agents). The dispatch ran 49s with 8 tool calls but
landed nothing — the same fragility documented in sf-mp4rxkwb-l4baga
(triage-not-a-first-class-unit-type). Hand-coding fallback applied;
fragility data point added to the open entry's evidence trail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:28:19 +02:00
Mikael Hugo
e2f631901f test(sf-db-migration): bump expected schema version 62 → 63
Schema head moved to v63 in commit 21d905461 (parallel agent's
"rem-agent-inspired memory discipline + always-in-context invariants
board" track) but the migration tests still asserted v62 — flagged in
the last 2 iterations as "pre-existing migration failures, not mine."

Update both schema-version assertions to 63 + add a context_board
table-exists check after the v63 migration so future schema bumps
explicitly require updating both the version assertion AND the
matching table-presence check (catches naked-version-bump skews).

11/11 migration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:19:09 +02:00
Mikael Hugo
2f8ee57256 feat(self-feedback): mirror resolutions into memory-store on success
Addresses sf-mp4rp6y2-31jfau (architecture-defect:self-feedback-not-
wired-to-memory-subsystem). The reflection layer surfaced this as part
of the deepest architectural concern in the 2026-05-14T02-49-45Z report:
"resolutions are hidden from the memory graph, SF will continue to
forget its own triaged solutions and fail to cluster identical root
causes."

When markResolved succeeds against the DB, also call memory-store's
createMemory to mirror the closure as a memory entry that detectors
and reflection passes can consult later via getRelevantMemoriesRanked.

Memory entry shape:
  category: "self-feedback-resolution"
  content: "[<entry.kind>] <entry.summary>\n→ <evidence.kind>: <reason>"
  confidence: 0.9
  source_unit_type: "self-feedback"
  source_unit_id: <entryId>
  tags: [
    <entry.kind>,
    "evidence:<evidence.kind>",
    "commit:<sha-12-prefix>",  // when commitSha present
    "requirement:<reqId>"     // when requirementId present
  ]

Best-effort: any memory-write failure is silently swallowed. The
resolution itself already landed via DB UPDATE + JSONL audit append +
markdown regen — the memory mirror is observability + future detector
consumption, not a correctness requirement. The try/catch ensures a
broken memory subsystem cannot roll back a valid resolution.
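The best-effort mirror can be sketched as below (hypothetical wrapper; createMemory injected, tag/content shapes per the entry shape above):

```javascript
// Hypothetical sketch of the best-effort memory mirror described above.
function mirrorResolutionToMemory(createMemory, entry, evidence) {
  try {
    const tags = [entry.kind, `evidence:${evidence.kind}`];
    if (evidence.commitSha) tags.push(`commit:${evidence.commitSha.slice(0, 12)}`);
    if (evidence.requirementId) tags.push(`requirement:${evidence.requirementId}`);
    createMemory({
      category: "self-feedback-resolution",
      content: `[${entry.kind}] ${entry.summary}\n→ ${evidence.kind}: ${evidence.reason ?? ""}`,
      confidence: 0.9,
      source_unit_type: "self-feedback",
      source_unit_id: entry.id,
      tags,
    });
  } catch {
    // Best-effort: a broken memory subsystem must never roll back the
    // already-landed DB resolution.
  }
}
```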

Tests (2 new, 13 total in self-feedback-db):
- agent-fix with commitSha → memory entry has [kind, evidence:agent-fix,
  commit:<sha-prefix>] tags + sourceUnitId pointing at the resolved entry
- human-clear without commit → memory entry has [kind, evidence:human-
  clear] tags only, no commit tag

Pre-existing migration failures in sf-db-migration.test.mjs (2 tests:
v27 spec backfill, v52 routing-history heal) are unrelated to this
commit; same failure mode as last iteration. Logged here so the
1591/1593 pass rate is auditable.

The other three siblings of the consolidating reflection entry
(sf-mp4w89mv-3ulqp4) remain open and need schema migration:
- sf-mp4rxkwt-sfthez kind vocabulary (domain:family[:specific])
- sf-mp4rxkwx-jz0soh causal links (self_feedback_relations table)
- sf-mp4rxkx0-fkt3e2 prioritization (impact_score + effort_estimate cols)
This commit lands the writer-layer-only piece (#4 in the rollup's
suggested fix), unlocking detector + reflection consumption immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:16:28 +02:00
Mikael Hugo
6a88ad2f00 refactor(reflection): route through @singularity-forge/ai, drop subprocess + gemini hardcoding
User-correctable architecture defect: runGeminiReflection shelled out to
the `gemini` CLI binary and hardcoded the gemini provider, duplicating
auth discovery and disconnecting the call from SF's metrics, cost
accounting, and provider abstraction. Should have routed through the
existing @singularity-forge/ai layer from the start.

Replace runGeminiReflection with runReflection that:

- Resolves an operator-supplied "provider/modelId" string via
  @singularity-forge/ai's getModel (the canonical accessor for the
  runtime model registry — MODELS itself isn't re-exported).
- Calls completeSimple from @singularity-forge/ai. Same provider routing
  every other SF LLM call uses (anthropic, openai, google-gemini-cli,
  openai-codex-responses, mistral, etc.). No subprocess.
- Default model is google-gemini-cli/gemini-3-pro-preview because that
  matches the operator's primary AI Ultra tier — but the default lives
  in a single named constant (DEFAULT_REFLECTION_MODEL), no provider
  hardcoding in the call path. Operators override per-call via --model.
- Returns { ok, content?, cleanFinish?, error?, provider, modelId } for
  observability into which provider actually answered.

runGeminiReflection kept as an alias for back-compat so the existing
headless-reflect.ts caller works unchanged. New code should use
runReflection directly.

Tests: switched from a fake-gemini-binary-on-PATH approach (5 tests)
to a clean dependency-injection approach via options.complete (5 tests
+ 1 new "rejects bare model strings"). Mock returns AssistantMessage
shape directly, no subprocess machinery.

Two pre-existing migration test failures in sf-db-migration.test.mjs
(openDatabase_migrates_v27, openDatabase_v52_db_heals_routing_history)
are unaffected by this commit — they fail in isolation too, likely
related to commit 7570aac4b's routing-metrics track. Logged here so the
1589/1591 pass rate is auditable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:11:07 +02:00
Mikael Hugo
21d9054611 feat(sf): rem-agent-inspired memory discipline + always-in-context invariants board
Two patterns lifted from Copilot CLI 1.0.47's rem-agent design.

1. add/prune-only consolidation surface (memory-store, memory-extractor)

   - applyConsolidationActions(): new export that gates the extractor path to
     two action kinds only — "add" (→ CREATE) and "prune" (→ SUPERSEDE with
     sentinel superseded_by = "pruned:<unitType>:<unitId>"). UPDATE / REINFORCE /
     SUPERSEDE actions are rejected with a descriptive error from the
     consolidation path; manual paths still use applyMemoryActions and keep
     full action surface.
   - memory-extractor.js EXTRACTION_SYSTEM prompt updated: model is told to
     emit add/prune only and to fix wrong entries by prune+readd, not edit.
   - Discipline win: every consolidation change is visible as an addition or
     removal — no silent revisions.
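   The gate can be sketched as below (hypothetical signature; the real export takes the consolidation action list, and create/supersede stand in for the CREATE and SUPERSEDE handlers):

```javascript
// Hypothetical sketch of the add/prune-only consolidation gate above.
// Manual paths keep the full action surface via applyMemoryActions.
function applyConsolidationActions(actions, { create, supersede }) {
  for (const action of actions) {
    if (action.kind === "add") {
      create(action.content);
    } else if (action.kind === "prune") {
      supersede(action.id, `pruned:${action.unitType}:${action.unitId}`);
    } else {
      throw new Error(`consolidation accepts add/prune only, got: ${action.kind}`);
    }
  }
}
```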

2. swarm member inheritance of parent memory view (swarm-dispatch)

   - SwarmDispatchLayer.dispatch() now fetches getActiveMemoriesRanked(30)
     and formatMemoriesForPrompt(memories, 2000, false) at dispatch time,
     attaches as memoryContext on both bus metadata and DispatchResult.
   - Snapshot semantics — members get the view at dispatch time, no live
     updates mid-task.
   - Resolves the TODO at swarm-dispatch.js:22.

3. always-in-context invariants board (new capability)

   - New src/resources/extensions/sf/context-board.js — SQLite-backed,
     per-repo/per-branch entries. Two ops: addBoardEntry, pruneBoardEntry
     (no update — same discipline as #1). 4 KB byte cap in
     formatBoardForPrompt with truncation marker.
   - New src/resources/extensions/sf/tools/context-board-tool.js +
     bootstrap/context-board-tool.js — registered via pi.registerTool with
     two ops: add(content, category?) and prune(id). Repository + branch
     auto-filled from git context.
   - Schema migration v62 → v63 in sf-db-schema.js adds context_board table
     + idx_context_board_repo_branch index. ensureContextBoardTable wired
     into initSchema for fresh databases.
   - System-prompt injection at auto/phases-dispatch.js runDispatch right
     after dispatchResult.prompt resolution: prepends board snapshot under
     a labeled section. Try/catch fail-open — board errors never break
     dispatch. Sidecar/custom-engine paths intentionally not covered (carry
     full unit context already + low frequency).

Why these complement existing infra rather than replace:
- memory-store remains queryable (recall on demand) for facts the agent
  references sometimes.
- context_board is always-rendered (small, prompt-injected) for invariants
  the agent should never operate without — current milestone scope,
  architectural rules, known-broken paths, in-flight migrations.

Comparison to Copilot rem-agent:
- We have what they have on consolidation (add/prune + board) plus what
  SF already had (queue + drain + memory-extractor + SLEEPTIME swarm
  topology that's richer than their single-agent rem-agent).

Tests: 40/40 pass across memory-consolidation-discipline.test.ts (18) and
context-board.test.ts (22). Full test:unit deferred — see follow-up.

Two parallel Sonnet 4.6 sub-agents in isolated worktrees produced the
work; integration adapted for the modular sf-db split (schema went into
sf-db/sf-db-schema.js, prompt injection into auto/phases-dispatch.js,
both of which got pulled out of their original files since the swarms
launched).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:08:31 +02:00
Mikael Hugo
f68ab20953 fix(ai): backfill MiniMax M2/M2.1 cacheRead pricing
2026-05-14 04:55:46 +02:00
Mikael Hugo
4af10ac1b2 fix(self-feedback): verify agent-fix commit_sha exists in repo
Partially addresses sf-mp4rxkwn-jmp039 (no-outcomes-verification): AC1
and AC3 land here. AC2 (cross-check that the cited commit's changed
files include the entry's referenced files) is filed separately as a
follow-up — different mechanism (semantic AC parsing).

Without this check, an agent could stamp ANY string as commit_sha and
markResolved would accept it under the writer-layer constraint shipped
in d477ce703. The credibility check at the reader caught the OBVIOUS
non-canonical shapes (null evidence, {file, line}) but a well-formed
{kind: "agent-fix", commitSha: "phantom-sha"} would have passed.

Implementation:

verifyCommitExists(commitSha, basePath) returns one of:
  - "verified"    — git is present and the commit is in the repo
  - "missing"     — git is present but the commit lookup failed
  - "ungrokable"  — git unavailable or basePath isn't a git repo
                    (carve-out: we can't verify, so don't punish)

markResolved policy: reject on "missing"; accept on the others. The
agent-fix-unverified kind (reserved in d477ce703) is the explicit
escape hatch for "I want to mark agent-fix but can't cite a verifiable
commit" — those resolutions remain re-includable under the credibility
check, which is what we want.

Implementation uses two shell-outs to git (rev-parse --verify, then
rev-parse --git-dir to distinguish missing from not-a-repo). Both are
guarded with 5-second timeouts and never throw — failure modes return
"ungrokable" so the carve-out kicks in.

Tests: 2 new (11 total in self-feedback-db).
  - rejects_agent_fix_with_nonexistent_commit_sha: initializes a real
    git repo, files an entry, rejects bogus SHA, accepts real HEAD SHA
  - accepts_agent_fix_with_no_commit_sha_or_ungrokable_path: covers
    both the carve-out (no-git) and agent-fix-without-commitSha
    (testPath/summaryNarrative path)

Full SF extension suite (1549 tests) passes; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 04:44:04 +02:00
Mikael Hugo
d477ce7039 fix(self-feedback): reject non-canonical evidence at the writer layer
Addresses sf-mp4qoby4-meiir7: the credibility check at the READER side
of self-feedback (selectInlineFixCandidates) was previously the only
gate. An agent that wrote DB rows directly via raw SQL or the wrong
tool could bypass it, landing resolutions like {file, line} or null
that the reader would then either trust (legacy carve-out) or quietly
re-open. Observed live in 2026-05-13 dogfood (5/5 sloppy resolutions
with non-canonical evidence shapes).

This commit makes the policy belt-and-suspenders: markResolved (and by
extension resolveSelfFeedbackEntry) refuse to write resolutions whose
evidence.kind is not in the accepted set:
  agent-fix, human-clear, promoted-to-requirement, auto-version-bump,
  agent-fix-unverified (reserved for outcomes-verification follow-up)

When evidence is missing, non-object, or its kind is outside the set,
markResolved returns false WITHOUT touching the DB or JSONL — caller
recovers by re-submitting with a valid kind. All existing callers
(resolve_issue tool, requirement-promoter, auto-version-bump resolver,
triage-self-feedback) already pass valid kinds; no breakage.
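The writer-layer gate reduces to a small predicate (hypothetical name; markResolved returns false without touching DB or JSONL when it fails):

```javascript
// Hypothetical sketch of the canonical-evidence gate described above.
const ACCEPTED_EVIDENCE_KINDS = new Set([
  "agent-fix",
  "human-clear",
  "promoted-to-requirement",
  "auto-version-bump",
  "agent-fix-unverified", // reserved for outcomes-verification follow-up
]);

function isCanonicalEvidence(evidence) {
  return (
    evidence !== null &&
    typeof evidence === "object" &&
    ACCEPTED_EVIDENCE_KINDS.has(evidence.kind)
  );
}
```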

Raw SQL bypass is a known limit documented in the entry — full
coverage needs a DB CHECK constraint on resolved_evidence_json (schema
migration, separate work).

Tests: 2 new (markResolved_rejects_non_canonical, accepts_each_canonical)
covering all four rejection paths (bad kind, missing kind, missing
evidence, unknown kind) and all five accepted kinds. Full SF extension
suite (1547 tests) passes; typecheck clean.

Plus inline cleanup: closed 3 stale upstream-rollup re-files
(sf-mp4qyotx, sf-mp4qyoub, sf-mp4qyouh) with human-clear evidence —
the bridge fix in 6d27cba06 now prevents recurrence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 04:40:52 +02:00
Mikael Hugo
6d27cba067 fix(upstream-bridge): suppress re-file of recently-closed rollup kinds
Addresses sf-mp4rp6xn-hpag5h: bridgeUpstreamFeedback's idempotency
check only looked at currently-OPEN upstream-rollup entries, so any
closure (human-clear or agent-fix) would let the bridge re-file the
same cluster on the next session_start. Observed live during 2026-05-13
dogfood: closed 3 upstream-rollup entries with human-clear, bridge
re-filed all 3 on the next run.

Change: extend the idempotency set to also exclude rollup kinds that
were RESOLVED within the last 30 days (matches the existing
THIRTY_DAYS_MS upstream-source cutoff — same window, same rationale).

Closures are treated as time-limited: after the window expires, a
re-cluster CAN re-file, because the original closure was made against
then-current state and later state may legitimately surface the same
kind again. This is the right balance — operators get respite from
re-files while the closure decision was fresh, without trapping the
ledger forever if conditions actually change.
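The extended idempotency set can be sketched as below (hypothetical helper; openKinds and resolvedAtByKind stand in for the bridge's open-entry and recent-closure lookups):

```javascript
// Hypothetical sketch of the time-limited re-file suppression above.
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function shouldSuppressRefile(kind, openKinds, resolvedAtByKind, now = Date.now()) {
  if (openKinds.has(kind)) return true; // original check: currently-open rollup
  const resolvedAt = resolvedAtByKind.get(kind);
  // New check: closed within the window → respite; after it, re-file is allowed.
  return resolvedAt !== undefined && now - resolvedAt < THIRTY_DAYS_MS;
}
```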

7 new tests cover the regression (files new / skips open / skips
recently-closed / allows re-file after window / threshold guards /
non-forge-repo bail). Full SF extension suite (1545 tests) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 04:37:10 +02:00
Mikael Hugo
62b19d7ba4 feat(reflection): wire LLM dispatch (sf headless reflect --run)
Phase 1B of the reflection layer: complete the operator-driven loop by
adding actual LLM dispatch. Phase 1A (commit e161a59e2) shipped the
corpus assembler + prompt template + the prompt-emit operator surface.
This commit wires the dispatch end so `sf headless reflect --run`
produces a real report on disk without manual model piping.

Why shell-out to the gemini CLI and not SF's provider abstraction:
reflection is a single-prompt one-shot inference. Going through SF's
full agent dispatch would require a session, model registry, tool
registration, recovery shell — overkill for "render this prompt,
capture text." The gemini CLI handles auth (~/.gemini/oauth_creds.json),
Code Assist project discovery, and protocol drift on SF's behalf.
Subprocess cost is paid once per reflection (rare).

Implementation:

- reflection.js: runGeminiReflection(prompt, options) spawns
  `gemini --yolo --model <model> -p "<directive>"` and pipes the giant
  rendered template via stdin (gemini -p reads stdin and appends).
  Returns { ok, content, cleanFinish, exitCode, error, stderr }; never
  throws. Defaults to gemini-3-pro-preview (0% used on AI Ultra,
  strongest agentic model with quota). 8-minute timeout.

  cleanFinish detected by REFLECTION_COMPLETE terminator (emitted by
  the prompt template's output contract) — operator gets a warning when
  the report is truncated.
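  The terminator check can be sketched as (hypothetical helper name):

```javascript
// Hypothetical sketch of terminator-based clean-finish detection.
function detectCleanFinish(content) {
  // The prompt template's output contract ends the report with
  // REFLECTION_COMPLETE; trailing CLI whitespace is tolerated.
  return typeof content === "string" && content.trimEnd().endsWith("REFLECTION_COMPLETE");
}
```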

- headless-reflect.ts: --run flag triggers dispatch + report write
  via writeReflectionReport. --model overrides the default. Errors
  surface as JSON or text per --json. Successful runs emit the report
  path on stdout; failures emit error + truncated stderr.

- help-text.ts: documents --run and --model flags.

- Tests (4 new, 13 total): use a fake `gemini` binary on PATH to
  exercise the spawn path without real OAuth/network — covers
  ok+cleanFinish, non-zero exit, hang/timeout, missing-terminator.

All 1538 SF extension tests pass; typecheck clean.

Phase 2 follow-up (still gated on sf-mp4rxkwb-l4baga
triage-not-a-first-class-unit-type landing): reflection-pass becomes a
real autonomous-loop unit type, milestone-close auto-triggers it, the
report's `Recommended new self-feedback entries` section gets parsed
and the entries auto-filed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 04:33:16 +02:00
Mikael Hugo
e161a59e2f feat(reflection): add Phase 1A reflection layer (corpus + prompt + sf headless reflect)
Addresses self-feedback entry sf-mp4uzvcd-pazg6v
(architecture-defect:no-reflection-layer-over-self-feedback-corpus): SF
detected symptoms and triaged individual entries but had no layer that
reasoned about the corpus to recognize recurring structural patterns.
The same architectural pressure expressed itself across multiple entries
with different exact-kind strings; nothing escalated the pattern to a
class. The cognitive work fell on the operator.

This commit ships Phase 1A — the data-assembly + prompt half of the
reflection layer + an operator-driven entry point. Phase 1B (LLM dispatch
via the autonomous loop as a real unit type) lands once
sf-mp4rxkwb-l4baga (triage-not-a-first-class-unit-type) is in.

Files:
- src/resources/extensions/sf/reflection.js (new)
  - assembleReflectionCorpus(basePath): bundles open + recent-resolved
    self-feedback (full json), last 50 commits via git log, milestone +
    slice + task state, all milestone validation verdicts, and prior
    reflection report into one struct. Returns null on prerequisite
    failure (DB closed) so callers downgrade gracefully.
  - renderReflectionCorpusBrief(corpus): renders the corpus into a
    markdown brief the LLM consumes in one turn.
  - writeReflectionReport(basePath, content): persists to
    .sf/reflection/<timestamp>-report.md so next pass detects "what
    changed since last reflection."

- src/resources/extensions/sf/prompts/reflection-pass.md (new)
  - {{include:working-directory}} prefix.
  - Reasoning order: cluster by structural shape (not exact kind),
    identify recurring patterns, identify commit/ledger gaps, identify
    stale validation drift, identify the deepest architectural concern,
    compare against prior report.
  - Output contract: structured markdown report with named sections,
    terminator REFLECTION_COMPLETE for clean-finish detection.
  - Constraints: don't fix anything (reflection layer not executor),
    don't resolve entries without commit-SHA evidence, don't invent IDs.

- src/headless-reflect.ts (new) — sf headless reflect [--json]
  - Pre-opens the project DB via auto-start.openProjectDbIfPresent
    (one-shot bypass path doesn't run the full SF agent bootstrap).
  - Default: emits the rendered prompt brief (template + corpus) for
    operators to pipe into any model. Lets the corpus-assembly layer
    ship and validate before the LLM-dispatch layer is wired.
  - --json: emits raw corpus snapshot for tooling.

- src/headless.ts: registers the new "reflect" command after the
  existing usage block.
- src/help-text.ts: documents it in the headless command list.

- src/resources/extensions/sf/tests/reflection.test.mjs (new, 9 tests):
  null-when-DB-closed; collects open + recent-resolved; excludes >30d
  resolutions; captures milestone/slice/task tree; captures validation
  verdicts; commits returned as array (best-effort tmpdir is ok); brief
  renders all major sections; entry IDs/severity/kind appear in brief;
  writeReflectionReport round-trips through assembleReflectionCorpus's
  previousReport read.
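
A minimal sketch of the corpus-to-brief rendering contract (field names
approximate the description above, not the real module's shape):

```javascript
// Illustrative sketch: render a corpus struct into the one-turn markdown
// brief, mirroring the null-on-prerequisite-failure contract.
function renderReflectionCorpusBrief(corpus) {
  if (!corpus) return null; // assembleReflectionCorpus returned null (DB closed)
  const lines = ['# Reflection corpus'];
  lines.push('', '## Open self-feedback');
  for (const e of corpus.openEntries) {
    lines.push(`- ${e.id} [${e.severity}] ${e.kind}`);
  }
  lines.push('', '## Recent commits');
  for (const c of corpus.commits.slice(0, 50)) lines.push(`- ${c}`);
  if (corpus.previousReport) {
    // lets the next pass reason about "what changed since last reflection"
    lines.push('', '## Prior reflection report', corpus.previousReport);
  }
  return lines.join('\n');
}
```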

Live smoke verified: sf headless reflect against the real .sf/sf.db
returns 15 open + 23 recent-resolved entries, 50 commits, 2 milestones,
1 validation file (correctly surfacing M001's stale needs-attention
verdict against actual 5/5 slices done — exactly the case that
motivated this layer).

Total: +848 LOC, full SF extension suite (1534 tests) passes,
typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 04:27:29 +02:00
Mikael Hugo
7570aac4b7 feat(sf): generation-aware failover + canonical-keyed metrics
Two parallel refactors building on the model-registry consolidation:

1. Generation-aware failover (model-route-failure.js, agent-end-recovery.js)

   - resolveNextModelRoute now takes unitType so it knows whether the
     caller is solver-pinned per ADR-0079 (autonomous-solver). When pinned,
     rejects candidates whose canonicalIdFor() differs from the failed
     route's canonical id — closes the latent solver-invariant violation
     where kimi-coding/kimi-k2.6 could silently fail over to
     ollama-cloud/kimi-k2.5:cloud (different generation).
   - Cross-generation failover in non-pinned units now emits a structured
     logWarning so generation downgrades are visible in traces instead of
     looking like an equivalent route switch.

2. Canonical-keyed performance metrics (model-learner.js)

   - .sf/model-performance.json now keys by canonical_id with an
     {aggregate, by_route} sub-shape instead of fused provider/wire-model
     strings. Cross-route history per model is now coherent — kimi-k2.6
     reached via kimi-coding accumulates into the same aggregate as
     reached via openrouter.
   - Migration runs at boot: detects old shape (no 'aggregate' key in
     unit-type blob values), distributes each entry into by_route,
     recomputes aggregate, writes a backup to
     .sf/model-performance.json.pre-canonical-backup. Unmappable route
     keys land in _unmapped so nothing is dropped.
   - getRouteStats(taskType, routeKey) added for per-route failover
     ordering; existing getRankedModels emits canonical IDs for
     cross-route strength queries.

3. Tests

   - model-registry.test.ts: bundled in this commit (Swarm A's test file
     was left untracked when the registry module was committed).
   - model-route-failure.test.ts: 12 tests covering solver-pin guard,
     same-canonical multi-route failover, generation-downgrade log emit.
   - model-learner-canonical.test.ts: 17 tests covering migration
     round-trip, aggregate invariant, _unmapped bucket, and zero-default
     reads.
   - model-learner.test.ts: one existing test updated for the new
     _unmapped.by_route shape on bare model IDs.

4. Results

   - Targeted tests: 147/147 across registry, route-failure, learner,
     learner-canonical.
   - Full npm run test:unit: 4707 pass, 0 fail, 83 skipped (no new
     regressions vs pre-edit baseline of 4669).
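
   The solver-pin guard from item 1 can be sketched as follows (the
   canonical-id table is a stand-in; the real code reads the model
   registry's canonicalIdFor):

   ```javascript
   // Sketch of the generation-aware failover guard. When the unit is
   // solver-pinned (ADR-0079), a failover candidate must share the failed
   // route's canonical id — no silent cross-generation downgrade.
   const CANONICAL = {
     'kimi-coding/kimi-k2.6': 'kimi-k2.6',
     'openrouter/kimi-k2.6': 'kimi-k2.6',
     'ollama-cloud/kimi-k2.5:cloud': 'kimi-k2.5',
   };
   const canonicalIdFor = (route) => CANONICAL[route] ?? route;

   function resolveNextModelRoute(failedRoute, candidates, unitType) {
     const pinned = unitType === 'autonomous-solver';
     const failedCanonical = canonicalIdFor(failedRoute);
     for (const route of candidates) {
       if (route === failedRoute) continue;
       if (pinned && canonicalIdFor(route) !== failedCanonical) continue;
       return route; // first acceptable failover target
     }
     return null; // pinned unit with no same-generation route left
   }
   ```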

Work parallelized across two Sonnet 4.6 sub-agents in isolated git
worktrees. Contract authored in docs/dev/drafts/model-registry-contract.md
(committed earlier in 1d753af6b) and consumed by both agents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 04:15:08 +02:00
Mikael Hugo
09bc50f0f6 feat(openai-codex): mirror codex CLI's models_cache.json into SF catalog
The static catalog in models.generated.ts carries phantom slugs like
gpt-5-codex / gpt-5.1-codex / gpt-5.1-codex-max / gpt-5.2-codex that the
ChatGPT-account API rejects with HTTP 400 ("model is not supported when
using Codex with a ChatGPT account"). Verified live on this machine:

  ERROR: "The 'gpt-5-codex' model is not supported when using Codex with
         a ChatGPT account."

Meanwhile the actually-supported slugs for a ChatGPT subscription
(gpt-5.5 default, gpt-5.4, gpt-5.4-mini, gpt-5.3-codex, gpt-5.2) are
not in SF's view at all — so the router scores phantoms, picks one,
dispatch fails, no successful routes record, and routing silently drifts.

The codex CLI itself maintains ~/.codex/models_cache.json with the
authoritative "what THIS account can actually serve" list (visibility +
supported_in_api flags). SF reads that file directly — no duplicate
discovery, no separate API call, single source of truth.

Changes:

- src/resources/extensions/sf/openai-codex-catalog.js (new) — pure file
  reader. Resolves CODEX_HOME (or ~/.codex), parses models_cache.json,
  filters by visibility==="list" AND supported_in_api===true, mirrors the
  result into .sf/runtime/model-catalog/openai-codex.json. Same cache
  shape as the generic model-catalog-cache and gemini-catalog modules
  so getKnownModelIds picks it up transparently.

- bootstrap/register-hooks.js — wire scheduleOpenaiCodexCatalogRefresh
  into session_start, parallel to the existing gemini and generic
  catalog refreshes.

- Tests (9): cache-missing, malformed, filter correctness against the
  real shape, no-pass-through, slug validation, refresh-writes-cache,
  cache-fresh-skips-refresh, and live discovery via the smoke probe
  returns exactly ["gpt-5.5", "gpt-5.4", "gpt-5.4-mini", "gpt-5.3-codex",
  "gpt-5.2"] on this machine.
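
The filter contract can be sketched like this (the `models` array and
`slug` field names are assumptions about the cache shape; the flags are
per the description above):

```javascript
// Sketch of the models_cache.json filter: keep only entries the account
// can actually serve, i.e. visible in the list AND supported in the API.
function filterCodexModels(cache) {
  const models = Array.isArray(cache?.models) ? cache.models : [];
  return models
    .filter((m) => m.visibility === 'list' && m.supported_in_api === true)
    .map((m) => m.slug);
}
```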

Asymmetry vs gemini-cli is appropriate: codex CLI caches locally so SF
just reads the file; gemini-cli does not, so SF's gemini path calls
setupUser + retrieveUserQuota over the wire. Each provider gets the
cheapest reliable discovery path.

Follow-up filed separately: extract codex transport
(codex-app-server-client.ts, openai-codex-responses.ts, this catalog
reader) into a dedicated @singularity-forge/openai-codex-provider
package mirroring the gemini-cli-provider structure for symmetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:53:34 +02:00
Mikael Hugo
383e495085 feat(headless,gemini-cli): add sf headless usage + unify gemini quota path
Adds a machine-readable headless surface for live LLM-provider usage and
unifies the gemini-cli quota fetch through one helper, removing the
duplication that existed between usage-bar.js and the new package.

1. snapshotGeminiCliAccount in @singularity-forge/google-gemini-cli-provider

   - Single source of truth for { projectId, userTierId, userTierName,
     paidTier, models[] } via setupUser + retrieveUserQuota.
   - Dedups buckets per modelId, keeping the worst (lowest remainingFraction)
     so consumers always see the most-restrictive window. Code Assist
     sometimes returns multiple buckets per model; the pessimistic choice
     is what every consumer needs.
   - discoverGeminiCliModels(cwd?) wraps it for catalog-cache callers that
     only need the IDs.

2. sf headless usage subcommand

   - New src/headless-usage.ts handler. text (default) and --json output.
     Uses the package's snapshot directly — no RPC child, no jiti
     gymnastics — matching the shape of headless-uok-status / headless-doctor.
   - Wired into src/headless.ts after the doctor block.
   - Help text adds the command line.

3. usage-bar.js refactored to delegate

   - fetchGeminiUsage no longer imports gemini-cli-core directly. It calls
     snapshotGeminiCliAccount and reshapes the result into the existing
     { provider, displayName, windows[] } UI contract.
   - Eliminates the duplicate setupUser + retrieveUserQuota code path.
   - The fast existsSync(~/.gemini/oauth_creds.json) pre-flight stays
     so unauth'd users get a friendly message without paying for OAuth
     bootstrap.

4. Model registry refactor (separate track committed alongside)

   - src/resources/extensions/sf/model-registry.ts (new) consolidates
     canonical model identity, capability tier, and generation tags into
     one source of truth that auto-model-selection, benchmark-selector,
     and model-router now consume instead of maintaining parallel maps.
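
   The pessimistic bucket dedup from item 1 can be sketched as (bucket
   field names follow the text above; the exact shape is assumed):

   ```javascript
   // Keep one bucket per modelId: the one with the lowest
   // remainingFraction, i.e. the most-restrictive quota window.
   function dedupWorstBucket(buckets) {
     const worst = new Map();
     for (const b of buckets) {
       const prev = worst.get(b.modelId);
       if (!prev || b.remainingFraction < prev.remainingFraction) {
         worst.set(b.modelId, b);
       }
     }
     return [...worst.values()];
   }
   ```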

All 1487 tests pass (151 files); typecheck clean for both the package
and the SF extensions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:42:53 +02:00
Mikael Hugo
c6a3fa6a6a feat(gemini-cli): discover account models via gemini-cli-core + retry on capacity errors
Two related fixes for the google-gemini-cli provider, both motivated by today's
dogfood diagnosis: SF was pinned to a single model (gemini-3-flash-preview)
even though the AI Ultra account has access to seven (verified via the live
gemini-cli-core probe), and a transient "No capacity available for model X
on the server" was classified as `unknown` so SF gave up instead of retrying.

1. Account snapshot + model discovery in @singularity-forge/google-gemini-cli-provider

   - Add `snapshotGeminiCliAccount(cwd?)` returning { projectId, userTierId,
     userTierName, paidTier, models } where `models[]` carries each modelId
     with usedFraction, remainingFraction, and resetTime. Built on the same
     setupUser + CodeAssistServer.retrieveUserQuota path usage-bar.js
     already uses, but extracted to the dedicated package so any consumer
     (model picker, capacity diagnostics, catalog cache) can call one helper.
   - Add `discoverGeminiCliModels(cwd?)` as a thin "just the IDs" wrapper.
   - Both are best-effort: any failure (OAuth expired, no project, network)
     returns null silently — never throws.

2. SF-side cache writer at src/resources/extensions/sf/gemini-catalog.js

   - Delegates discovery to the package; only handles cache file path,
     6-hour TTL, and the session_start lifecycle hook.
   - Cache lands at .sf/runtime/model-catalog/google-gemini-cli.json with
     the same shape as the generic model-catalog-cache, so getKnownModelIds
     and the model picker pick it up transparently.
   - Wired into bootstrap/register-hooks.js session_start in parallel with
     the existing scheduleModelCatalogRefresh (the generic REST + API-key
     path can't reach gemini-cli's OAuth-only Code Assist endpoint).

3. Capacity error classification fix

   - error-classifier.js SERVER_RE now matches "no capacity (available|left)",
     "capacity (unavailable|exhausted)", and "no capacity ... on the server".
     Previously these fell through to kind=unknown, which is not transient,
     so agent-end-recovery never retried — even though the same handler
     already caps gemini-cli rate-limit backoff at 30s for exactly this
     class of transient. With the pattern matched as `server`, the existing
     retry-with-backoff path covers it.
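
   The widened classification can be sketched as (an illustrative regex,
   not the exact SERVER_RE in error-classifier.js):

   ```javascript
   // Capacity errors now classify as kind=server (transient, retryable)
   // instead of falling through to kind=unknown.
   const SERVER_RE =
     /no capacity (available|left)|capacity (unavailable|exhausted)|no capacity .* on the server/i;

   function classifyKind(message) {
     return SERVER_RE.test(message) ? 'server' : 'unknown';
   }
   ```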

The full extension test suite (1386 tests) passes. Typecheck clean for both
the package and the SF extensions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:32:35 +02:00
Mikael Hugo
1d753af6b6 docs(dev): draft model registry contract for upcoming refactor
Spec for consolidating the three alias tables (benchmark-selector,
auto-model-selection, model-router) into a single SF-extension registry
that reads from @singularity-forge/ai's MODELS and enriches it with
canonical_id, generation, and tier. Shared interface for parallel
Swarm A/B/C work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 02:57:27 +02:00
Mikael Hugo
f0f31989fe refactor(autonomous-solver): extract prompt strings to .md templates
Lands the prompt extraction the triage worker performed in dogfood
round 5 on entry sf-mp37p9u6-eyobzb (inconsistency:prompts-monolithic-
not-modular).

Changes:
- prompts/autonomous-solver-contract.md (new): solver loop block, with
  {{include:working-directory}} for the shared prefix.
- prompts/autonomous-executor-contract.md (new): executor loop block,
  same fragment include.
- prompts/autonomous-solver-pass.md (new): solver-pass classifier.
- autonomous-solver.js: _buildAutonomousLoopPromptPrefix renamed to
  buildAutonomousLoopVars and returns the variables for the new
  templates instead of a pre-rendered string. Net -120/+60 lines.

The {{include:fragment}} syntax is already supported by prompt-loader.js
and the working-directory fragment already exists at
prompts/fragments/working-directory.md.

All 1386 tests pass; typecheck clean.

Resolves: sf-mp37p9u6-eyobzb (inconsistency:prompts-monolithic-not-modular)
Co-resolved: sf-mp37p9u0-hebruv (architectural-risk:single-transaction-
migration) — already verified-and-closed by the triage worker via
resolve_issue with kind=agent-fix, evidence "migrateSchema already
uses per-migration BEGIN/COMMIT via runMigrationStep". JSONL audit log
captured the resolution event end-to-end through the new
appendResolutionToJsonl path (commit ce58d3223).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 02:41:46 +02:00
Mikael Hugo
79db5704bc fix(self-feedback): require structured evidence kind for trusted resolution
Dogfood of the triage worker revealed that the agent can bypass the
resolve_issue tool (which hardcodes kind=agent-fix) and write DB rows
directly with non-canonical evidence shapes (null, or {file, line}).
The earlier credibility check trusted any resolution that had a prose
resolvedReason — a "legacy narrative" carve-out meant to preserve
operator clears predating structured evidence. Brand-new sloppy agent
resolutions slipped through that carve-out: 5/5 of today's triage
resolutions had non-canonical evidence and would have been treated as
authoritative under the old check.

Replace the denylist/legacy-carve-out with an allowlist:
- isSuspectlyResolved returns true unless resolvedEvidence.kind is
  in {agent-fix, human-clear, promoted-to-requirement}.
- SUSPECT_RESOLUTION_KINDS is kept as documentation of the
  auto-version-bump case but the allowlist makes it redundant for
  the actual policy decision.
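
The allowlist check can be sketched as (the entry shape is illustrative):

```javascript
// A resolution is trusted only when its structured evidence kind is one
// of the canonical three; everything else — prose-only, {file, line},
// null — re-includes the entry as suspect.
const TRUSTED_KINDS = new Set(['agent-fix', 'human-clear', 'promoted-to-requirement']);

function isSuspectlyResolved(entry) {
  return !TRUSTED_KINDS.has(entry?.resolvedEvidence?.kind);
}
```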

Tests now cover both failure modes: prose-only resolution (no kind)
and non-canonical evidence shape ({file, line}) both re-include the
entry as a candidate. Legacy entries that genuinely lack an evidence
kind are backfilled to kind=human-clear separately so they keep their
resolution under the stricter check.

A self-feedback entry (sf-mp4qoby4-meiir7, severity=high) was filed
about the underlying bypass — markResolved should ALSO reject or
auto-tag non-canonical writes at the writer layer, since the reader
is currently the only gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 02:17:47 +02:00
Mikael Hugo
6e95c3542c fix(bootstrap): always dispatch self-feedback triage on session_start
The session_start hook only invoked dispatchSelfFeedbackInlineFixIfNeeded
when triage.stillBlocked contained at least one high/critical entry.
After the previous commit rewired the worker as a triage queue that
returns every open forge-local entry (not just high/critical), this
gate stranded medium/low backlog forever at startup — the unit was
never given a chance to triage them.

The dispatcher's own selectInlineFixCandidates is now the source of
truth for eligibility, so the call site invokes the dispatcher unconditionally.
Keep the high/critical-specific notify (still useful operator signal
when the loud ones are present) but stop using it to gate the dispatch.

The turn_end hook at the bottom of register-hooks.js already calls
the dispatcher unconditionally, so this change aligns the two paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:59:13 +02:00
Mikael Hugo
ce58d32231 fix(self-feedback,state): close two state-drift gaps
1. Self-feedback JSONL is now a real append-only audit log. Previously
   markResolved updated the DB row in place but never echoed the
   resolution to JSONL, so a DB rebuild via importLegacyJsonlToDb would
   re-import all entries with their original pre-resolution state and
   silently lose every resolution that had ever landed. The JSONL was a
   half event log — creations yes, resolutions no.

   - Introduce a `recordType: "resolution"` JSONL record shape. Append
     one of these to the project JSONL whenever markResolved succeeds
     against the DB. Best-effort: failure to append never blocks the
     resolution itself.
   - Extend importLegacyJsonlToDb to handle both record types. Entry
     creations go through insertSelfFeedbackEntry (ON CONFLICT DO
     NOTHING — idempotent). Resolution events go through
     resolveSelfFeedbackEntry, which is already a no-op on missing or
     already-resolved rows, so replay is idempotent.
   - Tests cover: the appended record shape; a DB rebuild correctly
     reconstructing resolved_at/resolved_evidence_json from a JSONL
     audit trail; orphan resolution events (entry never existed) are a
     silent no-op.

   Closes self-feedback entry sf-mp4ikbta-2zcbhh.

2. The reconcile path at state-db.js:reconcileSliceTasks warns when an
   on-disk SUMMARY.md exists for a task whose DB row is still pending
   and refuses to silently import — a safety check so autonomous runs
   can't promote themselves to complete by writing a SUMMARY without a
   real DB transition. But operators had no remediation path when the
   drift was real (lost DB write, hand edit). They had to mutate the
   DB by hand.

   - New `state-reconcile.js` with `reconcileTaskFromSummary` exposes
     the remediation explicitly. Parses the SUMMARY via the existing
     parseSummary helper, validates via isValidTaskSummary, and writes
     status / completed_at / verification_result / blocker /
     key_files / full_summary_md into the DB row through a new
     `setTaskSummaryFields` helper in sf-db-tasks.
   - Returns structured { ok, reason, applied } outcomes — never
     throws — so operator tooling can branch on `db-unavailable`,
     `summary-missing`, `summary-invalid`, `task-not-in-db`,
     `already-done`.
   - The reconcile warning text now points at the helper.
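
The item-1 record shapes and idempotent replay can be sketched as
(field names are illustrative; the real import goes through the
insertSelfFeedbackEntry / resolveSelfFeedbackEntry DB helpers):

```javascript
// Replay a JSONL audit trail into an in-memory map. Creations are
// insert-if-absent; resolution events are a no-op on missing or
// already-resolved rows, so replaying the log twice is idempotent.
function replayJsonl(lines) {
  const db = new Map();
  for (const line of lines) {
    const rec = JSON.parse(line);
    if (rec.recordType === 'resolution') {
      const row = db.get(rec.id);
      if (row && !row.resolvedAt) {
        row.resolvedAt = rec.resolvedAt;
        row.resolvedEvidence = rec.resolvedEvidence;
      }
    } else if (!db.has(rec.id)) { // mirrors ON CONFLICT DO NOTHING
      db.set(rec.id, { ...rec, resolvedAt: null });
    }
  }
  return db;
}
```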

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:55:30 +02:00
Mikael Hugo
5f245b721d fix(self-feedback): rewire inline-fix worker as triage queue
The inline-fix worker was a partial repair queue — it picked only
high/critical+blocking entries plus my recent gap/architecture-defect
override and left everything else (medium inconsistencies, janitor gaps,
architectural-risks, low-severity gaps) sitting open forever. The
requirement-promoter clusters by exact `kind` string and never fires on
diverse forge-local entries (every open entry currently has a unique
kind), so there is no other sweep that ever touches these. They just
accumulate.

The point of the worker is triage, not just repair: every open entry
should get an eyes-on per session and reach one of three outcomes —
fix, promote to requirement, or close as not-of-value with reason.
Closing deliberately is a valid, expected outcome.

Changes:

- `selectInlineFixCandidates` now returns every open forge-local entry,
  modulo the existing credibility check that re-includes suspect
  resolutions. Severity and blocking filters are gone; the kind-based
  override is no longer needed because everything qualifies.
- The dispatch prompt is rewritten as a three-way triage protocol
  (Fix / Promote / Close) with explicit guidance per outcome and
  explicit prohibition on the `auto-version-bump` evidence kind (which
  would re-open under the credibility check).
- Tests collapse the three filter-coverage tests into a single
  "selects every open forge-local entry" assertion that exercises the
  full severity × blocking × kind matrix.
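
The widened selection can be sketched as (entry shape illustrative; the
credibility re-inclusion of suspect resolutions is omitted here):

```javascript
// Every open forge-local entry qualifies for triage — severity and
// blocking filters are gone; upstream-scoped entries stay excluded.
function selectInlineFixCandidates(entries) {
  return entries.filter((e) => e.scope === 'forge-local' && !e.resolvedAt);
}
```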

Upstream feedback is still excluded — those entries describe behavior
in other repos that the inline-fix unit cannot directly repair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:46:24 +02:00
Mikael Hugo
085beb5199 docs(sf-ace): restore parked location + keep ADR cross-references
SF's S05/T02 executor moved the doc back to docs/dev/sf-ace-patterns.md
while completing the slice (correctly: that was the task's stated
deliverable location). The doc belongs parked under docs/dev/drafts/
because ACE Coder has no active consumer for it; re-park it there.

Keep the ADR-019 / ADR-020 cross-references the executor added —
they are real content improvements over the previous version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:24:12 +02:00