ADR-0075: UOK Gate Architecture

Status: Accepted
Date: 2026-05-06
Deciders: UOK subsystem migration (M013 S04)

Context

The Unit Orchestration Kernel (UOK) post-unit verification flow originally had a single ad-hoc gate: the Security Gate (secret scanning). As the autonomous loop matured, we needed a structured, extensible way to enforce policy, verify correctness, learn from outcomes, and stress-test durability — without bloating the kernel loop with inline conditionals.

Decision

We adopt a gate-runner pattern with explicitly typed gates, a uniform execution contract, durable audit logging, and a configurable retry matrix.

Gate Contract

Every gate implements:

id: string — unique identifier (e.g. "cost-guard")
type: string — "security" | "policy" | "verification" | "learning" | "chaos"
execute(ctx: UokContext, attempt: number): Promise<GateResult>

The UokContext carries traceable identifiers (traceId, turnId, unitType, unitId, modelId, provider) plus runtime telemetry (tokenCount, costUsd, durationMs).

The GateResult is a record whose outcome and failureClass fields are closed string unions:

outcome: "pass" | "fail" | "retry" | "manual-attention"
failureClass: "policy" | "verification" | "execution" | "artifact" | "git" | "timeout" | "input" | "closeout" | "manual-attention" | "unknown"
rationale: string — human-readable explanation
findings?: string — structured output (diffs, logs, cost breakdowns)
recommendation?: string — actionable next step
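
The contract above can be written down as TypeScript types. The following is a minimal sketch, not the project's actual source: the alias names GateType, GateOutcome and FailureClass, the token-ceiling gate, and the MAX_TOKENS value are all hypothetical and exist only to show one gate satisfying the contract.

```typescript
// Alias names are illustrative; field names and literals come from the contract above.
type GateType = "security" | "policy" | "verification" | "learning" | "chaos";
type GateOutcome = "pass" | "fail" | "retry" | "manual-attention";
type FailureClass =
  | "policy" | "verification" | "execution" | "artifact" | "git"
  | "timeout" | "input" | "closeout" | "manual-attention" | "unknown";

interface UokContext {
  // Traceable identifiers
  traceId: string;
  turnId: string;
  unitType: string;
  unitId: string;
  modelId: string;
  provider: string;
  // Runtime telemetry
  tokenCount: number;
  costUsd: number;
  durationMs: number;
}

interface GateResult {
  outcome: GateOutcome;
  failureClass: FailureClass;
  rationale: string;
  findings?: string;
  recommendation?: string;
}

interface Gate {
  id: string;
  type: GateType;
  execute(ctx: UokContext, attempt: number): Promise<GateResult>;
}

// Hypothetical gate: fails a unit whose token count exceeds a fixed ceiling.
const MAX_TOKENS = 100000; // illustrative value, not taken from the ADR

function checkTokenCeiling(ctx: UokContext): GateResult {
  if (ctx.tokenCount > MAX_TOKENS) {
    return {
      outcome: "fail",
      failureClass: "policy",
      rationale: `unit ${ctx.unitId} used ${ctx.tokenCount} tokens (ceiling ${MAX_TOKENS})`,
      recommendation: "split the unit or raise the ceiling",
    };
  }
  // Assumption: the ADR does not say which failureClass a passing result carries.
  return { outcome: "pass", failureClass: "unknown", rationale: "token count within ceiling" };
}

const tokenCeilingGate: Gate = {
  id: "token-ceiling",
  type: "policy",
  execute: async (ctx) => checkTokenCeiling(ctx),
};
```

Keeping the decision in a synchronous helper and wrapping it in execute makes the gate easy to test without awaiting anything.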

Retry Matrix

The UokGateRunner consults a per-failure-class retry ceiling:

| failureClass | Max retries |
| --- | --- |
| policy, input, manual-attention | 0 |
| execution, artifact, verification, git | 1 |
| timeout | 2 |
| unknown | 0 |

Retries are persisted to the gate_runs SQLite table and emitted as audit events so operators can reconstruct the full retry chain.
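
A data-driven retry ceiling of this kind can be sketched as a lookup table plus a small loop. This is an illustration under stated assumptions, not the UokGateRunner itself: attempts are assumed to be 0-based, the runner is assumed to re-execute only when a gate returns "retry", "closeout" has no row in the matrix and is treated here as 0, and the in-memory auditLog array stands in for the gate_runs table and audit events.

```typescript
type FailureClass =
  | "policy" | "verification" | "execution" | "artifact" | "git"
  | "timeout" | "input" | "closeout" | "manual-attention" | "unknown";

// Ceilings copied from the retry matrix.
const RETRY_CEILING: Record<FailureClass, number> = {
  policy: 0,
  input: 0,
  "manual-attention": 0,
  execution: 1,
  artifact: 1,
  verification: 1,
  git: 1,
  timeout: 2,
  unknown: 0,
  closeout: 0, // assumption: "closeout" has no row in the matrix
};

// Assumption: attempts are 0-based, so `attempt` equals the retries already used.
function mayRetry(failureClass: FailureClass, attempt: number): boolean {
  return attempt < RETRY_CEILING[failureClass];
}

interface AttemptResult {
  outcome: "pass" | "fail" | "retry" | "manual-attention";
  failureClass: FailureClass;
  rationale: string;
}

// `auditLog` is an in-memory stand-in for the gate_runs table and audit events.
async function runWithRetries(
  execute: (attempt: number) => Promise<AttemptResult>,
  auditLog: AttemptResult[],
): Promise<AttemptResult> {
  for (let attempt = 0; ; attempt += 1) {
    const result = await execute(attempt);
    auditLog.push(result); // every attempt is recorded, including retried ones
    if (result.outcome !== "retry") return result;
    if (!mayRetry(result.failureClass, attempt)) {
      // Retry budget exhausted: this sketch reports the last attempt as a failure.
      return { ...result, outcome: "fail" };
    }
  }
}
```

With these ceilings a gate that keeps timing out runs three times in total (one initial attempt plus two retries), and all three attempts end up in the log.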

Implemented Gates

| Gate | Type | Purpose | Durable Store |
| --- | --- | --- | --- |
| SecurityGate | security | Run `scripts/secret-scan.sh` against uncommitted changes | N/A (external script) |
| CostGuardGate | policy | Enforce per-unit and per-hour USD budgets; detect high-tier model burn | `llm_task_outcomes` (SQLite) + `model-cost-table.js` |
| OutcomeLearningGate | learning | Detect failure patterns by model, unit type, and escalation rate | `llm_task_outcomes` (SQLite) |
| MultiPackageGate | verification | Verify only affected workspace packages and downstream dependents | N/A (git + package.json) |
| ChaosMonkey | chaos | Inject latency, partial failures, disk stress, memory pressure | N/A (ephemeral) |
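
To make the CostGuardGate row concrete, here is a rough sketch of a per-unit and per-hour budget check. It is not the gate's real implementation: the LedgerRow shape, the Budgets fields, and the in-memory array are hypothetical stand-ins for rows read from `llm_task_outcomes`, whose actual columns are not described here. High-tier model burn detection is left out.

```typescript
// Hypothetical row shape; the real llm_task_outcomes columns may differ.
interface LedgerRow {
  unitId: string;
  costUsd: number;
  finishedAtMs: number;
}

interface Budgets {
  perUnitUsd: number;
  perHourUsd: number;
}

interface BudgetVerdict {
  outcome: "pass" | "fail";
  rationale: string;
}

const HOUR_MS = 60 * 60 * 1000;

function sumCost(rows: LedgerRow[]): number {
  return rows.reduce((total, row) => total + row.costUsd, 0);
}

function checkBudgets(
  rows: LedgerRow[],
  unitId: string,
  nowMs: number,
  budgets: Budgets,
): BudgetVerdict {
  // Per-unit budget: everything ever spent on this unit.
  const unitSpend = sumCost(rows.filter((r) => r.unitId === unitId));
  if (unitSpend > budgets.perUnitUsd) {
    return {
      outcome: "fail",
      rationale: `unit ${unitId} spent USD ${unitSpend.toFixed(2)}, budget USD ${budgets.perUnitUsd.toFixed(2)}`,
    };
  }
  // Per-hour budget: spend across all units in the trailing hour.
  const hourSpend = sumCost(
    rows.filter((r) => r.finishedAtMs > nowMs - HOUR_MS && r.finishedAtMs <= nowMs),
  );
  if (hourSpend > budgets.perHourUsd) {
    return {
      outcome: "fail",
      rationale: `trailing hour spent USD ${hourSpend.toFixed(2)}, budget USD ${budgets.perHourUsd.toFixed(2)}`,
    };
  }
  return { outcome: "pass", rationale: "spend within unit and hourly budgets" };
}
```

An empty ledger passes trivially, since both sums are zero.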

Durable Message Bus

The MessageBus persists messages to .sf/sf.db (uok_messages and uok_message_reads) with at-least-once delivery. The old .sf/runtime/uok-messages.jsonl and per-agent inbox JSON files are legacy artifacts only; normal runtime message state is SQLite-backed. Messages are pruned by TTL (retentionDays, default 7) and inbox size is capped (maxInboxSize, default 1000).
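
The retention rules can be illustrated with a small in-memory sketch. The real bus stores messages in SQLite, so this shows only the pruning logic. The BusMessage shape and the pruneInbox function are hypothetical, and dropping the oldest messages when the cap is exceeded is an assumption; the ADR does not say which messages are evicted.

```typescript
// Hypothetical message shape for one agent's inbox.
interface BusMessage {
  id: string;
  sentAtMs: number;
}

interface RetentionOptions {
  retentionDays: number;
  maxInboxSize: number;
}

// Defaults as stated in the ADR.
const DEFAULT_RETENTION: RetentionOptions = { retentionDays: 7, maxInboxSize: 1000 };
const DAY_MS = 24 * 60 * 60 * 1000;

function pruneInbox(
  inbox: BusMessage[],
  nowMs: number,
  opts: RetentionOptions = DEFAULT_RETENTION,
): BusMessage[] {
  const cutoffMs = nowMs - opts.retentionDays * DAY_MS;
  // filter() returns a new array, so sorting it does not mutate the caller's inbox.
  const fresh = inbox
    .filter((m) => m.sentAtMs >= cutoffMs)
    .sort((a, b) => a.sentAtMs - b.sentAtMs);
  // Assumption: when over the cap, the oldest messages are dropped first.
  return fresh.slice(Math.max(0, fresh.length - opts.maxInboxSize));
}
```

The result is ordered oldest to newest, which keeps the cap a simple slice from the tail.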

Chaos Engineering Safety

ChaosMonkey is opt-in only (active: false by default). It injects recoverable faults only:

Latency delays (configurable max)
Retryable thrown errors (err.code = "CHAOS_INJECTED")
Disk stress (temp files written then immediately deleted)
Memory stress (buffers allocated then released)

It never sends SIGKILL or mutates production state.
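
The opt-in and recoverable-fault rules might look roughly like the sketch below. It covers only latency and retryable errors, not disk or memory stress. The ChaosConfig fields maxLatencyMs and failureRate, their default values, and the injectable random source are hypothetical; only the active flag and the "CHAOS_INJECTED" error code come from the text above.

```typescript
interface ChaosConfig {
  active: boolean;
  maxLatencyMs: number; // hypothetical field name
  failureRate: number;  // hypothetical field name, probability in [0, 1]
}

// Opt-in: inactive unless a caller explicitly sets active: true.
const DEFAULT_CHAOS: ChaosConfig = { active: false, maxLatencyMs: 250, failureRate: 0.1 };

function makeChaosError(): Error & { code: string } {
  // The code lets callers recognise the fault as injected and retryable.
  return Object.assign(new Error("chaos: injected retryable failure"), {
    code: "CHAOS_INJECTED",
  });
}

async function maybeInjectChaos(
  cfg: ChaosConfig = DEFAULT_CHAOS,
  random: () => number = Math.random,
): Promise<void> {
  if (!cfg.active) return; // no-op when chaos is not enabled

  // Latency fault: wait a random time below the configured maximum.
  const delayMs = Math.floor(random() * cfg.maxLatencyMs);
  await new Promise<void>((resolve) => setTimeout(resolve, delayMs));

  // Partial failure: throw a retryable error with the configured probability.
  if (random() < cfg.failureRate) throw makeChaosError();
}
```

Passing the random source as a parameter makes the fault paths deterministic in tests.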

Consequences

Positive:

Adding a new gate is a single file + registration line — no kernel loop changes.
Every gate execution is auditable in SQLite and parity JSONL.
Retry policy is data-driven, not hard-coded per gate.
Cost and outcome learning are grounded in real ledger data, not heuristics.

Negative / Mitigated:

Gate execution adds latency to the verification path. Mitigation: gates run in parallel where possible; timeout defaults are conservative (10s for git diff, 120s for typecheck).
SQLite queries on the critical path could block. Mitigation: queries are simple indexed SELECTs; the DB is local and runs in WAL mode.
ChaosMonkey in a CI environment could destabilize builds. Mitigation: it is explicitly opt-in and defaults to active: false.
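
The latency mitigation (parallel execution with per-gate timeouts) can be sketched with Promise.all and Promise.race. This is an illustration only: the TimedGate and TimedResult shapes, the "GATE_TIMEOUT" code, and the mapping of thrown errors to failure classes are assumptions, and the sketch does not cancel the underlying work when a deadline fires.

```typescript
// Hypothetical shapes for this sketch only.
interface TimedGate {
  id: string;
  timeoutMs: number;
  run(): Promise<"pass" | "fail">;
}

interface TimedResult {
  gateId: string;
  outcome: "pass" | "fail";
  failureClass?: "timeout" | "execution";
}

async function withTimeout<T>(work: Promise<T>, timeoutMs: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`${label} timed out after ${timeoutMs}ms`);
      reject(Object.assign(err, { code: "GATE_TIMEOUT" }));
    }, timeoutMs);
  });
  try {
    // Whichever settles first wins; the losing promise is simply ignored.
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer); // do not leave the deadline timer pending
  }
}

async function runGatesInParallel(gates: TimedGate[]): Promise<TimedResult[]> {
  return Promise.all(
    gates.map(async (gate): Promise<TimedResult> => {
      try {
        const outcome = await withTimeout(gate.run(), gate.timeoutMs, gate.id);
        return { gateId: gate.id, outcome };
      } catch (err: unknown) {
        const timedOut = (err as { code?: string } | null)?.code === "GATE_TIMEOUT";
        return {
          gateId: gate.id,
          outcome: "fail",
          failureClass: timedOut ? "timeout" : "execution",
        };
      }
    }),
  );
}
```

Because each gate catches its own errors, one slow or failing gate does not reject the whole batch, and total latency is bounded by the slowest gate's timeout rather than the sum of all gates.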

Alternatives Considered

Inline conditionals in auto-verification.js — rejected because it creates a monolithic, untestable verification block.
Plugin system with dynamic import() — rejected because ESM dynamic imports in an extension context add unnecessary complexity; static imports + a registry Map are sufficient.
Separate microservices for cost/outcome learning — rejected because the SF design principle keeps all state on disk in .sf/; adding network boundaries violates the single-writer invariant.
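
The chosen alternative, static imports plus a registry Map, could look roughly like this. The function names registerGate and gatesOfType and the RegisteredGate shape are hypothetical; the ADR only states that a Map-based registry is used.

```typescript
// Hypothetical minimal shape; a real entry would also carry execute().
interface RegisteredGate {
  id: string;
  type: string;
}

const gateRegistry = new Map<string, RegisteredGate>();

function registerGate(gate: RegisteredGate): void {
  // Gate ids are unique identifiers, so a duplicate is treated as a programming error.
  if (gateRegistry.has(gate.id)) {
    throw new Error(`duplicate gate id: ${gate.id}`);
  }
  gateRegistry.set(gate.id, gate);
}

function gatesOfType(type: string): RegisteredGate[] {
  // Map preserves insertion order, so gates come back in registration order.
  return Array.from(gateRegistry.values()).filter((g) => g.type === type);
}

// Adding a gate is then one registration line per statically imported module.
registerGate({ id: "cost-guard", type: "policy" });
registerGate({ id: "secret-scan", type: "security" });
```

Because registration happens at module load through static imports, the set of gates is fixed and inspectable at startup, with no dynamic import() resolution involved.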

Testing Strategy

Gates have dedicated behavioral tests in tests/uok-gates.test.mjs. The cases listed are:

SecurityGate: missing script, passing scan, failing scan.
CostGuardGate: empty ledger (pass), unit budget exceeded (fail), hourly budget exceeded (fail), high-tier failure pattern (fail).
OutcomeLearningGate: empty ledger (pass), unit failure rate high (fail), model failure rate high (fail), escalation pattern (fail).
ChaosMonkey: inactive (no-op), latency injection, partial failure, disk stress, event clearing.

uok-message-bus.test.mjs covers send/receive, broadcast, persistence across reconstruction, read-state persistence, compaction, conversation filtering, and max-size enforcement.

uok-unit-runtime.test.mjs covers FSM transitions, terminal-status classification, retry budgets, synthetic-unit blocking, and record IO (write/read/clear/list).
