singularity-forge/docs/adr/0075-uok-gate-architecture.md
2026-05-07 03:09:55 +02:00

ADR-0075: UOK Gate Architecture

Status: Accepted
Date: 2026-05-06
Deciders: UOK subsystem migration (M013 S04)

Context

The Unit Orchestration Kernel (UOK) post-unit verification flow originally had a single ad-hoc gate: the Security Gate (secret scanning). As the autonomous loop matured, we needed a structured, extensible way to enforce policy, verify correctness, learn from outcomes, and stress-test durability — without bloating the kernel loop with inline conditionals.

Decision

We adopt a gate-runner pattern with explicitly typed gates, a uniform execution contract, durable audit logging, and a configurable retry matrix.

Gate Contract

Every gate implements:

  • id: string — unique identifier (e.g. "cost-guard")
  • type: string, one of "security" | "policy" | "verification" | "learning" | "chaos"
  • execute(ctx: UokContext, attempt: number): Promise<GateResult>

The UokContext carries traceable identifiers (traceId, turnId, unitType, unitId, modelId, provider) plus runtime telemetry (tokenCount, costUsd, durationMs).

The GateResult is a record whose outcome and failureClass fields are closed unions:

  • outcome: "pass" | "fail" | "retry" | "manual-attention"
  • failureClass: "policy" | "verification" | "execution" | "artifact" | "git" | "timeout" | "input" | "closeout" | "manual-attention" | "unknown"
  • rationale: string — human-readable explanation
  • findings?: string — structured output (diffs, logs, cost breakdowns)
  • recommendation?: string — actionable next step
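
The contract above can be written down as TypeScript types. The following is a minimal sketch, not the UOK source: the field names come from this ADR, while `checkUnitCost`, `exampleCostGate`, and the 0.50 USD ceiling are hypothetical and exist only to show a gate returning a `GateResult`. The ADR does not say which `failureClass` a passing result carries, so the sketch uses "unknown" as a placeholder.

```typescript
// Types transcribed from the Gate Contract section.
type GateType = "security" | "policy" | "verification" | "learning" | "chaos";
type Outcome = "pass" | "fail" | "retry" | "manual-attention";
type FailureClass =
  | "policy" | "verification" | "execution" | "artifact" | "git"
  | "timeout" | "input" | "closeout" | "manual-attention" | "unknown";

interface UokContext {
  traceId: string;
  turnId: string;
  unitType: string;
  unitId: string;
  modelId: string;
  provider: string;
  tokenCount: number;
  costUsd: number;
  durationMs: number;
}

interface GateResult {
  outcome: Outcome;
  failureClass: FailureClass;
  rationale: string;
  findings?: string;
  recommendation?: string;
}

interface Gate {
  id: string;
  type: GateType;
  execute(ctx: UokContext, attempt: number): Promise<GateResult>;
}

// Hypothetical ceiling, invented for this example only.
const UNIT_LIMIT_USD = 0.5;

// Pure check kept separate from the async wrapper so it is easy to test.
function checkUnitCost(ctx: UokContext): GateResult {
  if (ctx.costUsd > UNIT_LIMIT_USD) {
    return {
      outcome: "fail",
      failureClass: "policy",
      rationale: `unit ${ctx.unitId} cost ${ctx.costUsd.toFixed(2)} USD, limit ${UNIT_LIMIT_USD.toFixed(2)} USD`,
      recommendation: "switch to a cheaper model or split the unit",
    };
  }
  // Placeholder failureClass: the ADR does not define one for passing results.
  return { outcome: "pass", failureClass: "unknown", rationale: "cost within limit" };
}

// Hypothetical gate object satisfying the Gate interface.
const exampleCostGate: Gate = {
  id: "example-cost-ceiling",
  type: "policy",
  execute: async (ctx, _attempt) => checkUnitCost(ctx),
};
```

Keeping the decision logic in a synchronous function and wrapping it in `execute` is one way to make a gate testable without a runner; the actual gates may be structured differently.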

Retry Matrix

The UokGateRunner consults a per-failure-class retry ceiling:

| failureClass | max retries |
| --- | --- |
| policy, input, manual-attention | 0 |
| execution, artifact, verification, git | 1 |
| timeout | 2 |
| unknown | 0 |

Retries are persisted to the gate_runs SQLite table and emitted as audit events so operators can reconstruct the full retry chain.
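
A data-driven retry ceiling of this kind can be sketched as a lookup table plus a small loop. This is an illustration under stated assumptions, not the UokGateRunner implementation: attempts are assumed to be numbered from 1, "fail" and "retry" outcomes are both assumed to be retried up to the ceiling, and "closeout" (which has no row in the matrix) is assumed to fall back to 0 retries. The `record` callback is a hypothetical stand-in for the gate_runs insert and audit event, and the loop is synchronous for brevity even though the gate contract is Promise-based.

```typescript
type FailureClass =
  | "policy" | "verification" | "execution" | "artifact" | "git"
  | "timeout" | "input" | "closeout" | "manual-attention" | "unknown";

// Ceilings copied from the retry matrix. Classes without a row are absent here.
const RETRY_CEILING: Partial<Record<FailureClass, number>> = {
  policy: 0,
  input: 0,
  "manual-attention": 0,
  execution: 1,
  artifact: 1,
  verification: 1,
  git: 1,
  timeout: 2,
  unknown: 0,
};

// Assumption: any class missing from the matrix gets 0 retries.
function maxRetries(failureClass: FailureClass): number {
  return RETRY_CEILING[failureClass] ?? 0;
}

// Assumption: attempts start at 1, so attempt N has already used N - 1 retries.
function canRetry(failureClass: FailureClass, attempt: number): boolean {
  return attempt - 1 < maxRetries(failureClass);
}

interface AttemptResult {
  outcome: "pass" | "fail" | "retry" | "manual-attention";
  failureClass: FailureClass;
}

// Hypothetical runner loop: re-run until the gate passes, asks for manual
// attention, or exhausts the ceiling for its failure class.
function runWithRetries(
  run: (attempt: number) => AttemptResult,
  record: (attempt: number, result: AttemptResult) => void,
): AttemptResult {
  let attempt = 1;
  for (;;) {
    const result = run(attempt);
    record(attempt, result); // every attempt is recorded, including the last
    if (result.outcome === "pass" || result.outcome === "manual-attention") {
      return result;
    }
    if (!canRetry(result.failureClass, attempt)) {
      return result;
    }
    attempt += 1;
  }
}
```

Under these assumptions a gate that always times out runs three times (one initial attempt plus two retries), while a policy failure runs exactly once.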

Implemented Gates

| Gate | Type | Purpose | Durable Store |
| --- | --- | --- | --- |
| SecurityGate | security | Run scripts/secret-scan.sh against uncommitted changes | N/A (external script) |
| CostGuardGate | policy | Enforce per-unit and per-hour USD budgets; detect high-tier model burn | llm_task_outcomes (SQLite) + model-cost-table.js |
| OutcomeLearningGate | learning | Detect failure patterns by model, unit type, and escalation rate | llm_task_outcomes (SQLite) |
| MultiPackageGate | verification | Verify only affected workspace packages and downstream dependents | N/A (git + package.json) |
| ChaosMonkey | chaos | Inject latency, partial failures, disk stress, memory pressure | N/A (ephemeral) |
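
The "affected packages and downstream dependents" selection described for MultiPackageGate is, in essence, a reverse-dependency walk. The sketch below shows that idea only. The `DepGraph` shape and the `affectedPackages` function are hypothetical; how the real gate derives the changed set from git and the graph from package.json is not described in this ADR.

```typescript
// Hypothetical workspace graph: package name -> workspace packages it depends on.
type DepGraph = Record<string, string[]>;

function affectedPackages(graph: DepGraph, changed: string[]): string[] {
  // Invert the edges: dependency -> packages that depend on it.
  const dependents = new Map<string, string[]>();
  for (const [pkg, deps] of Object.entries(graph)) {
    for (const dep of deps) {
      const list = dependents.get(dep) ?? [];
      list.push(pkg);
      dependents.set(dep, list);
    }
  }

  // Breadth-first walk from the changed packages through their dependents.
  const seen = new Set<string>(changed);
  const queue = changed.slice();
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const next of dependents.get(current) ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }

  // Sorted so the result is stable regardless of traversal order.
  return Array.from(seen).sort();
}

// Example workspace, invented for illustration.
const exampleGraph: DepGraph = {
  core: [],
  ui: [],
  cli: ["core"],
  web: ["core", "ui"],
  docs: [],
};
```

With this example graph, a change to `core` selects `cli`, `core`, and `web`, while a change to `docs` selects only `docs`.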

Durable Message Bus

The MessageBus persists messages to .sf/sf.db (uok_messages and uok_message_reads) with at-least-once delivery. The old .sf/runtime/uok-messages.jsonl and per-agent inbox JSON files are legacy artifacts only; normal runtime message state is SQLite-backed. Messages are pruned by TTL (retentionDays, default 7) and inbox size is capped (maxInboxSize, default 1000).
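
The two retention rules (TTL pruning and the inbox cap) can be illustrated on an in-memory array. This is a sketch of the policy only: the real MessageBus applies it to SQLite tables, and the `BusMessage` shape and `pruneInbox` function are hypothetical. It also assumes that when an inbox exceeds the cap, the oldest messages are dropped first, which the ADR does not specify.

```typescript
// Hypothetical message shape; only the timestamp matters for pruning.
interface BusMessage {
  id: number;
  sentAtMs: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the messages that survive pruning for a single inbox.
// Defaults mirror the ADR: retentionDays = 7, maxInboxSize = 1000.
function pruneInbox(
  messages: BusMessage[],
  nowMs: number,
  retentionDays: number = 7,
  maxInboxSize: number = 1000,
): BusMessage[] {
  const cutoffMs = nowMs - retentionDays * DAY_MS;

  // Step 1: drop anything older than the TTL. filter() copies, so the
  // caller's array is not mutated by the sort below.
  const fresh = messages
    .filter((m) => m.sentAtMs >= cutoffMs)
    .sort((a, b) => a.sentAtMs - b.sentAtMs);

  // Step 2 (assumption): if still over the cap, keep the newest entries.
  return fresh.slice(Math.max(0, fresh.length - maxInboxSize));
}
```

For example, with a 7-day TTL and a cap of 2, an inbox holding messages from 9, 5, 1, and 0.5 days ago keeps only the last two.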

Chaos Engineering Safety

ChaosMonkey is opt-in only (active: false by default). It injects recoverable faults only:

  • Latency delays (configurable max)
  • Retryable thrown errors (err.code = "CHAOS_INJECTED")
  • Disk stress (temp files written then immediately deleted)
  • Memory stress (buffers allocated then released)

It never sends SIGKILL or mutates production state.
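
An opt-in fault injector along these lines might look like the sketch below. It covers only the latency and thrown-error faults and is not the ChaosMonkey source: the class name `ChaosSketch` and the option names `failureRate`, `maxLatencyMs`, and `random` are hypothetical. The `active: false` default and the `CHAOS_INJECTED` error code are taken from the text above.

```typescript
interface ChaosOptions {
  active?: boolean;       // opt-in; defaults to false as stated in the ADR
  failureRate?: number;   // hypothetical: probability in [0, 1] of a thrown fault
  maxLatencyMs?: number;  // hypothetical name for the configurable max delay
  random?: () => number;  // injectable RNG so tests can be deterministic
}

class ChaosSketch {
  private readonly active: boolean;
  private readonly failureRate: number;
  private readonly maxLatencyMs: number;
  private readonly random: () => number;

  constructor(options: ChaosOptions = {}) {
    this.active = options.active ?? false;
    this.failureRate = options.failureRate ?? 0;
    this.maxLatencyMs = options.maxLatencyMs ?? 0;
    this.random = options.random ?? Math.random;
  }

  // Throws a retryable error with probability failureRate; no-op when inactive.
  maybeFail(): void {
    if (!this.active) return;
    if (this.random() < this.failureRate) {
      throw Object.assign(new Error("chaos: injected failure"), {
        code: "CHAOS_INJECTED",
      });
    }
  }

  // Picks a delay in [0, maxLatencyMs]; always 0 when inactive.
  pickLatencyMs(): number {
    if (!this.active) return 0;
    return Math.floor(this.random() * (this.maxLatencyMs + 1));
  }

  // Sleeps for the picked delay. Nothing is killed and no state is mutated.
  async delay(): Promise<void> {
    const ms = this.pickLatencyMs();
    if (ms > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, ms));
    }
  }
}
```

Passing the random source in as an option keeps the injector deterministic under test; callers that want real randomness simply omit it.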

Consequences

Positive:

  • Adding a new gate is a single file + registration line — no kernel loop changes.
  • Every gate execution is auditable in SQLite and parity JSONL.
  • Retry policy is data-driven, not hard-coded per gate.
  • Cost and outcome learning are grounded in real ledger data, not heuristics.

Negative / Mitigated:

  • Gate execution adds latency to the verification path. Mitigation: gates run in parallel where possible; timeout defaults are conservative (10s for git diff, 120s for typecheck).
  • SQLite queries on the critical path could block. Mitigation: queries are simple indexed SELECTs; the DB is local and WAL-mode.
  • ChaosMonkey in a CI environment could destabilize builds. Mitigation: it is explicitly opt-in and defaults to active: false.

Alternatives Considered

  1. Inline conditionals in auto-verification.js — rejected because it creates a monolithic, untestable verification block.
  2. Plugin system with dynamic import() — rejected because ESM dynamic imports in an extension context add unnecessary complexity; static imports + a registry Map are sufficient.
  3. Separate microservices for cost/outcome learning — rejected because the SF design principle keeps all state on disk in .sf/; adding network boundaries violates the single-writer invariant.
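
The "static imports + a registry Map" approach chosen over alternative 2 can be pictured as follows. This is a minimal sketch with hypothetical names (`RegisteredGate`, `gateRegistry`, `registerGate`, `gatesOfType`); the gate id "cost-guard" is the example id from the Gate Contract section, and the id "security" is assumed for illustration.

```typescript
// Reduced gate shape: just enough to show registration and lookup.
interface RegisteredGate {
  id: string;
  type: string;
}

// A Map preserves insertion order, so gates are listed in registration order.
const gateRegistry = new Map<string, RegisteredGate>();

function registerGate(gate: RegisteredGate): void {
  if (gateRegistry.has(gate.id)) {
    throw new Error(`duplicate gate id: ${gate.id}`);
  }
  gateRegistry.set(gate.id, gate);
}

// In a real module each gate would arrive through a static import of its
// own file; adding a gate then costs one import and one registerGate call.
registerGate({ id: "security", type: "security" });
registerGate({ id: "cost-guard", type: "policy" });

function gatesOfType(type: string): string[] {
  return Array.from(gateRegistry.values())
    .filter((gate) => gate.type === type)
    .map((gate) => gate.id);
}
```

Rejecting duplicate ids at registration time is a design choice of this sketch, made so that a misconfigured registry fails at startup rather than during verification.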

Testing Strategy

Gates have dedicated behavioral tests in tests/uok-gates.test.mjs. The cases covered are:

  • SecurityGate: missing script, passing scan, failing scan.
  • CostGuardGate: empty ledger (pass), unit budget exceeded (fail), hourly budget exceeded (fail), high-tier failure pattern (fail).
  • OutcomeLearningGate: empty ledger (pass), unit failure rate high (fail), model failure rate high (fail), escalation pattern (fail).
  • ChaosMonkey: inactive (no-op), latency injection, partial failure, disk stress, event clearing.

uok-message-bus.test.mjs covers send/receive, broadcast, persistence across reconstruction, read-state persistence, compaction, conversation filtering, and max-size enforcement.

uok-unit-runtime.test.mjs covers FSM transitions, terminal-status classification, retry budgets, synthetic-unit blocking, and record IO (write/read/clear/list).