ADR-0075: UOK Gate Architecture
Status: Accepted
Date: 2026-05-06
Deciders: UOK subsystem migration (M013 S04)
Context
The Unit Orchestration Kernel (UOK) post-unit verification flow originally had a single ad-hoc gate: the Security Gate (secret scanning). As the autonomous loop matured, we needed a structured, extensible way to enforce policy, verify correctness, learn from outcomes, and stress-test durability — without bloating the kernel loop with inline conditionals.
Decision
We adopt a gate-runner pattern with explicitly typed gates, a uniform execution contract, durable audit logging, and a configurable retry matrix.
Gate Contract
Every gate implements:
- id: string (unique identifier, e.g. "cost-guard")
- type: string, one of "security" | "policy" | "verification" | "learning" | "chaos"
- execute(ctx: UokContext, attempt: number): Promise<GateResult>
The UokContext carries traceable identifiers (traceId, turnId, unitType, unitId, modelId, provider) plus runtime telemetry (tokenCount, costUsd, durationMs).
The GateResult is a sealed union:
- outcome: "pass" | "fail" | "retry" | "manual-attention"
- failureClass: "policy" | "verification" | "execution" | "artifact" | "git" | "timeout" | "input" | "closeout" | "manual-attention" | "unknown"
- rationale: string (human-readable explanation)
- findings?: string (structured output: diffs, logs, cost breakdowns)
- recommendation?: string (actionable next step)
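The contract above can be written down as TypeScript types. The following is a minimal sketch, not the actual UOK source: the field names come from this ADR, while checkUnitCost, exampleCostCeilingGate, and MAX_UNIT_COST_USD are hypothetical names invented for illustration. The ADR lists failureClass without a "?"; the sketch makes it optional so a passing result can omit it, which is an assumption.

```typescript
// Sketch of the gate contract. Field names follow the ADR; the example gate
// and its threshold are hypothetical.
type GateType = "security" | "policy" | "verification" | "learning" | "chaos";
type FailureClass =
  | "policy" | "verification" | "execution" | "artifact" | "git"
  | "timeout" | "input" | "closeout" | "manual-attention" | "unknown";

interface UokContext {
  traceId: string;
  turnId: string;
  unitType: string;
  unitId: string;
  modelId: string;
  provider: string;
  tokenCount: number;
  costUsd: number;
  durationMs: number;
}

interface GateResult {
  outcome: "pass" | "fail" | "retry" | "manual-attention";
  failureClass?: FailureClass; // assumption: omitted when outcome is "pass"
  rationale: string;
  findings?: string;
  recommendation?: string;
}

interface Gate {
  id: string;
  type: GateType;
  execute(ctx: UokContext, attempt: number): Promise<GateResult>;
}

const MAX_UNIT_COST_USD = 1.0; // hypothetical ceiling, not a documented default

// Pure decision logic, kept synchronous so it can be tested without awaiting.
function checkUnitCost(ctx: UokContext): GateResult {
  if (ctx.costUsd <= MAX_UNIT_COST_USD) {
    return { outcome: "pass", rationale: "unit cost within ceiling" };
  }
  return {
    outcome: "fail",
    failureClass: "policy",
    rationale: `unit cost ${ctx.costUsd.toFixed(2)} USD exceeds ${MAX_UNIT_COST_USD.toFixed(2)} USD`,
    recommendation: "lower the model tier or split the unit",
  };
}

// The gate object only adapts the pure check to the async contract.
const exampleCostCeilingGate: Gate = {
  id: "example-cost-ceiling",
  type: "policy",
  execute: async (ctx, _attempt) => checkUnitCost(ctx),
};
```

Keeping the decision in a synchronous helper and wrapping it in execute is a design choice of this sketch; a real gate may need to await I/O inside execute.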
Retry Matrix
The UokGateRunner consults a per-failure-class retry ceiling:
| failureClass | max retries |
|---|---|
| policy, input, manual-attention | 0 |
| execution, artifact, verification, git | 1 |
| timeout | 2 |
| unknown | 0 |
Retries are persisted to the gate_runs SQLite table and emitted as audit events so operators can reconstruct the full retry chain.
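The retry matrix amounts to a lookup table plus a comparison. The sketch below is an illustration under assumptions, not the UokGateRunner implementation: RETRY_CEILING, maxRetries, canRetry, and maxExecutions are hypothetical names. The table has no row for the "closeout" failure class, so the sketch falls back to 0 retries for it, which is a conservative guess rather than documented behavior.

```typescript
type FailureClass =
  | "policy" | "verification" | "execution" | "artifact" | "git"
  | "timeout" | "input" | "closeout" | "manual-attention" | "unknown";

// Ceilings copied from the retry matrix. "closeout" is intentionally absent
// because the table does not list it.
const RETRY_CEILING: Partial<Record<FailureClass, number>> = {
  policy: 0,
  input: 0,
  "manual-attention": 0,
  execution: 1,
  artifact: 1,
  verification: 1,
  git: 1,
  timeout: 2,
  unknown: 0,
};

// Unlisted classes fall back to 0 retries (assumption).
function maxRetries(failureClass: FailureClass): number {
  const ceiling = RETRY_CEILING[failureClass];
  return ceiling === undefined ? 0 : ceiling;
}

// retriesSoFar counts retries already made; the initial run is not a retry.
function canRetry(failureClass: FailureClass, retriesSoFar: number): boolean {
  return retriesSoFar < maxRetries(failureClass);
}

// Total executions a gate may get for one failure class: first run plus retries.
function maxExecutions(failureClass: FailureClass): number {
  return 1 + maxRetries(failureClass);
}
```

Under this reading, a gate that keeps timing out runs at most three times (one initial run plus two retries), while a policy failure is never retried.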
Implemented Gates
| Gate | Type | Purpose | Durable Store |
|---|---|---|---|
| SecurityGate | security | Run scripts/secret-scan.sh against uncommitted changes | N/A (external script) |
| CostGuardGate | policy | Enforce per-unit and per-hour USD budgets; detect high-tier model burn | llm_task_outcomes (SQLite) + model-cost-table.js |
| OutcomeLearningGate | learning | Detect failure patterns by model, unit type, and escalation rate | llm_task_outcomes (SQLite) |
| MultiPackageGate | verification | Verify only affected workspace packages and downstream dependents | N/A (git + package.json) |
| ChaosMonkey | chaos | Inject latency, partial failures, disk stress, memory pressure | N/A (ephemeral) |
Durable Message Bus
The MessageBus persists messages to .sf/sf.db (uok_messages and uok_message_reads) with at-least-once delivery. The old .sf/runtime/uok-messages.jsonl and per-agent inbox JSON files are legacy artifacts only; normal runtime message state is SQLite-backed. Messages are pruned by TTL (retentionDays, default 7) and inbox size is capped (maxInboxSize, default 1000).
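The two retention rules (TTL and inbox cap) can be illustrated with an in-memory sketch. This is not the MessageBus code, which works against SQLite: BusMessage and pruneInbox are hypothetical names. The ADR does not say which messages are dropped when an inbox exceeds maxInboxSize; the sketch assumes the oldest go first.

```typescript
// Hypothetical in-memory message shape; the real schema lives in uok_messages.
interface BusMessage {
  id: string;
  to: string;
  sentAtMs: number;
  body: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the messages that survive pruning, oldest first.
function pruneInbox(
  messages: BusMessage[],
  nowMs: number,
  retentionDays: number = 7,   // default from the ADR
  maxInboxSize: number = 1000, // default from the ADR
): BusMessage[] {
  const cutoffMs = nowMs - retentionDays * DAY_MS;
  // Step 1: drop messages older than the TTL. filter() returns a new array,
  // so the in-place sort below does not touch the caller's input.
  const fresh = messages
    .filter((m) => m.sentAtMs >= cutoffMs)
    .sort((a, b) => a.sentAtMs - b.sentAtMs);
  // Step 2: enforce the cap by keeping only the newest maxInboxSize messages
  // (assumption: overflow evicts the oldest).
  return fresh.slice(Math.max(0, fresh.length - maxInboxSize));
}
```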
Chaos Engineering Safety
ChaosMonkey is opt-in only (active: false by default). It injects recoverable faults only:
- Latency delays (configurable max)
- Retryable thrown errors (err.code = "CHAOS_INJECTED")
- Disk stress (temp files written then immediately deleted)
- Memory stress (buffers allocated then released)
It never sends SIGKILL or mutates production state.
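As an illustration of the opt-in and retryable-error rules, here is a minimal sketch of the partial-failure fault. Only the active flag and the "CHAOS_INJECTED" code come from this ADR; ChaosConfig, failureRate, makeChaosError, and maybeInjectFailure are hypothetical, and the random source is injectable so the behavior is deterministic in tests.

```typescript
// Hypothetical config shape; only `active` (default false) is documented.
interface ChaosConfig {
  active: boolean;
  failureRate: number; // probability in [0, 1] that a call throws (assumption)
}

const DEFAULT_CHAOS_CONFIG: ChaosConfig = { active: false, failureRate: 0.1 };

// Builds an ordinary Error tagged with the code callers use to detect
// an injected, retryable fault.
function makeChaosError(message: string): Error & { code: string } {
  const err = new Error(message) as Error & { code: string };
  err.code = "CHAOS_INJECTED";
  return err;
}

// Throws a tagged error with probability failureRate, but only when active.
function maybeInjectFailure(
  config: ChaosConfig = DEFAULT_CHAOS_CONFIG,
  random: () => number = Math.random,
): void {
  if (!config.active) return; // opt-in: the default config never injects
  if (random() < config.failureRate) {
    throw makeChaosError("chaos: injected retryable failure");
  }
}
```

A caller can then tell injected faults apart from real ones by checking err.code before deciding whether to retry.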
Consequences
Positive:
- Adding a new gate is a single file + registration line — no kernel loop changes.
- Every gate execution is auditable in SQLite and parity JSONL.
- Retry policy is data-driven, not hard-coded per gate.
- Cost and outcome learning are grounded in real ledger data, not heuristics.
Negative / Mitigated:
- Gate execution adds latency to the verification path. Mitigation: gates run in parallel where possible; timeout defaults are conservative (10s for git diff, 120s for typecheck).
- SQLite queries on the critical path could block. Mitigation: queries are simple indexed SELECTs; the DB is local and WAL-mode.
- ChaosMonkey in a CI environment could destabilize builds. Mitigation: it is explicitly opt-in and defaults to active: false.
Alternatives Considered
- Inline conditionals in auto-verification.js: rejected because it creates a monolithic, untestable verification block.
- Plugin system with dynamic import(): rejected because ESM dynamic imports in an extension context add unnecessary complexity; static imports + a registry Map are sufficient.
- Separate microservices for cost/outcome learning: rejected because the SF design principle keeps all state on disk in .sf/; adding network boundaries violates the single-writer invariant.
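The chosen alternative, static imports plus a registry Map, might look roughly like the sketch below. It is an assumption about shape only: RegisteredGate, gateRegistry, registerGate, and gatesOfType are hypothetical names, the "security" id is invented, and the gates are inlined rather than imported so the example stays self-contained.

```typescript
interface RegisteredGate {
  id: string;
  type: string;
}

// In a real layout each gate would be a static import from its own file;
// they are declared inline here to keep the sketch self-contained.
const securityGate: RegisteredGate = { id: "security", type: "security" };
const costGuardGate: RegisteredGate = { id: "cost-guard", type: "policy" };

const gateRegistry = new Map<string, RegisteredGate>();

// Registration is the "one line per gate"; duplicate ids are rejected early.
function registerGate(gate: RegisteredGate): void {
  if (gateRegistry.has(gate.id)) {
    throw new Error(`duplicate gate id: ${gate.id}`);
  }
  gateRegistry.set(gate.id, gate);
}

registerGate(securityGate);
registerGate(costGuardGate);

// Looks up all registered gates of one type, in registration order.
function gatesOfType(type: string): RegisteredGate[] {
  const matches: RegisteredGate[] = [];
  gateRegistry.forEach((gate) => {
    if (gate.type === type) matches.push(gate);
  });
  return matches;
}
```

Because a Map preserves insertion order, the registration sequence doubles as a stable iteration order without any extra bookkeeping.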
Testing Strategy
Every gate has dedicated behavioral tests in tests/uok-gates.test.mjs:
- SecurityGate: missing script, passing scan, failing scan.
- CostGuardGate: empty ledger (pass), unit budget exceeded (fail), hourly budget exceeded (fail), high-tier failure pattern (fail).
- OutcomeLearningGate: empty ledger (pass), unit failure rate high (fail), model failure rate high (fail), escalation pattern (fail).
- ChaosMonkey: inactive (no-op), latency injection, partial failure, disk stress, event clearing.
uok-message-bus.test.mjs covers send/receive, broadcast, persistence across reconstruction, read-state persistence, compaction, conversation filtering, and max-size enforcement.
uok-unit-runtime.test.mjs covers FSM transitions, terminal-status classification, retry budgets, synthetic-unit blocking, and record IO (write/read/clear/list).