singularity-forge/docs/records/2026-05-07-metrics-central-vs-ra-aid-review.md


# Metrics Central vs RA.Aid Architecture Review
**Date**: 2026-05-07
**Reviewer**: Claude Code (SF)
**Scope**: `metrics-central.js` and its wiring, compared against RA.Aid patterns

---
## RA.Aid Architecture Summary
RA.Aid is a Python-based autonomous coding agent with these key architectural decisions:
| Layer | Pattern |
|-------|---------|
| **State** | Peewee ORM over SQLite (`.ra-aid/pk.db`), WAL mode, contextvars for connection scoping |
| **Agents** | LangGraph agents (research → planning → implementation) with explicit stage boundaries |
| **Memory** | Key facts, key snippets, research notes, trajectories — all DB-backed with repositories |
| **Trajectory** | Every tool call recorded: tool_name, parameters, result, cost, tokens, is_error, error_message |
| **Config** | JSON config file + runtime config repository with defaults |
| **Shell** | Interactive approval with cowboy_mode bypass, trajectory logging, timeout handling |
| **Reasoning** | Optional expert model consultation before each stage (reasoning_assist) |
| **Recovery** | Fallback handlers, retry with backoff, agent thread manager |
### RA.Aid's Observability Model
RA.Aid doesn't have a separate metrics system. Instead, observability is **embedded in the trajectory**:
- Every tool execution → `Trajectory` record with cost, tokens, timing
- Every stage transition → `Trajectory` record with `record_type="stage_transition"`
- Every human input → `HumanInput` record linked to trajectories
- Every error → `Trajectory` with `is_error=true`, `error_type`, `error_details`
This is **event-sourced observability**: the DB is the single source of truth for both state AND metrics.
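To make the difference concrete, here is a sketch of query-time aggregation over trajectory-style rows. The field names follow the table above; the records themselves are invented, and RA.Aid would hold them in SQLite rather than in memory:
```javascript
// Invented trajectory rows in the shape described above; RA.Aid stores these
// per tool call in SQLite rather than in a process-local array.
const trajectory = [
  { tool_name: "shell", is_error: false, current_cost: 0.002, input_tokens: 120, output_tokens: 40 },
  { tool_name: "shell", is_error: true, current_cost: 0.001, input_tokens: 80, output_tokens: 10 },
  { tool_name: "read_file", is_error: false, current_cost: 0.0005, input_tokens: 30, output_tokens: 5 },
];
// Event-sourced observability: aggregation happens at query time, so the same
// rows answer both "what happened?" (state) and "what did it cost?" (metrics).
function aggregate(rows) {
  return rows.reduce(
    (acc, r) => ({
      calls: acc.calls + 1,
      errors: acc.errors + (r.is_error ? 1 : 0),
      cost: acc.cost + r.current_cost,
    }),
    { calls: 0, errors: 0, cost: 0 }
  );
}
```
Because every event survives, the "metrics" view can be regenerated or re-sliced at any time; nothing is lost to write-time aggregation.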

---
## Our Metrics-Central.js Design
### What We Built
A Prometheus-compatible metrics collector with:
- Counter, Gauge, Histogram types
- In-memory aggregation with 60s flush to `.sf/runtime/sf-metrics.prom`
- Pre-defined metric metadata registry
- Wiring into subagent inheritance and mode transitions
### Design Decisions and Their Trade-offs
| Decision | Rationale | RA.Aid Comparison |
|----------|-----------|-------------------|
| **Prometheus text format** | Compatible with existing exposition tooling; Prometheus can scrape it, Grafana can chart it | RA.Aid uses DB queries; we support both |
| **In-memory aggregation** | Zero dependencies, fast | RA.Aid queries DB directly; we add a layer |
| **60s flush interval** | Batch writes, reduce I/O | RA.Aid writes per event; we batch |
| **Separate from trajectory/audit** | Metrics are aggregated views, not individual events | RA.Aid conflates events and metrics |
| **Metric metadata registry** | Pre-defined help text and labels | RA.Aid uses Peewee model definitions |

---
## The Review: 5 Lenses
### Lens 1: Data Model Consistency
**RA.Aid Pattern**: Single SQLite DB with typed models. Trajectory is the universal event log.
**Our Pattern**: Dual persistence:
- SQLite for operational state (UOK, sessions, tasks)
- Prometheus text file for metrics exposition
- JSONL for event durability
**Verdict**: ⚠️ **NEEDS WORK**
We have THREE observability sinks (SQLite, Prometheus file, JSONL) where RA.Aid has one. This creates:
- Risk of inconsistency between `sf-metrics.prom` and `sf.db`
- No unified query surface for "show me all subagent blocks in the last hour"
- Metrics file is write-only; no read path for programmatic consumption
**Recommendation**: Add a `metrics` table to `sf.db` that mirrors the Prometheus data model. The text file becomes a **projection**, not a source of truth.
```sql
CREATE TABLE metrics (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL,
  type TEXT NOT NULL CHECK(type IN ('counter', 'gauge', 'histogram')),
  labels TEXT,  -- JSON object
  value REAL NOT NULL,
  timestamp TEXT NOT NULL DEFAULT (datetime('now')),
  session_id TEXT
);
```
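Under that model, each flush would render `sf-metrics.prom` from the `metrics` rows. A minimal sketch of the projection, with invented rows mirroring the table's columns (`renderPromProjection` is a hypothetical helper, not existing SF code):
```javascript
// Render Prometheus exposition text from rows shaped like the `metrics` table
// above. In the real system the rows would come from sf.db, not a literal.
function renderPromProjection(rows) {
  const lines = [];
  const typed = new Set();
  for (const r of rows) {
    if (!typed.has(r.name)) {
      lines.push(`# TYPE ${r.name} ${r.type}`); // one TYPE line per metric
      typed.add(r.name);
    }
    const labels = Object.entries(JSON.parse(r.labels ?? "{}"))
      .map(([k, v]) => `${k}="${v}"`)
      .join(",");
    lines.push(labels ? `${r.name}{${labels}} ${r.value}` : `${r.name} ${r.value}`);
  }
  return lines.join("\n") + "\n";
}
const text = renderPromProjection([
  { name: "sf_subagent_dispatch_blocked", type: "counter", labels: '{"reason":"provider"}', value: 5 },
]);
```
If the file is ever lost or corrupted, it can simply be re-rendered from the DB, which is exactly what makes it a projection rather than a source of truth.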
### Lens 2: Event-Sourced vs Aggregated
**RA.Aid Pattern**: Every event is a row. Aggregation happens at query time.
**Our Pattern**: Aggregation happens at write time. Individual events are lost.
**Verdict**: ✅ **ACCEPTABLE for metrics, but incomplete for observability**
For counters and gauges, aggregation is correct. But for debugging "why was this subagent blocked?", we need the individual event, not just `sf_subagent_dispatch_blocked{reason="provider"} 5`.
**Recommendation**: Keep metrics-central for aggregated Prometheus output, but ALSO emit individual events to the audit/trajectory system. The metric is the summary; the trajectory is the detail.
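A sketch of that dual emission, using in-memory stand-ins: `recordCounter` approximates the real metrics-central helper, and `emitAuditEvent` is a hypothetical name for whatever the audit/trajectory system exposes:
```javascript
// In-memory stand-ins; the real recordCounter lives in metrics-central.js and
// the real audit sink is whatever the trajectory/audit system provides.
const counters = new Map();
const auditLog = [];
function recordCounter(name, labels = {}, amount = 1) {
  const key = `${name}|${JSON.stringify(labels)}`;
  counters.set(key, (counters.get(key) ?? 0) + amount);
}
function emitAuditEvent(type, payload) {
  auditLog.push({ type, ts: new Date().toISOString(), ...payload });
}
// The metric is the summary; the audit/trajectory event is the detail.
function recordBlockedDispatch(reason, detail) {
  recordCounter("sf_subagent_dispatch_blocked", { reason });
  emitAuditEvent("subagent_dispatch_blocked", { reason, ...detail });
}
recordBlockedDispatch("provider", { subagent: "researcher" });
```
The counter answers "how often?"; the audit record answers "which subagent, and why?".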
### Lens 3: Context and Session Scoping
**RA.Aid Pattern**: Every record has a `session_id` foreign key. Contextvars scope the DB connection.
**Our Pattern**: Metrics are global to the process. No session scoping.
**Verdict**: ❌ **GAP**
Our metrics can't answer: "How many subagent dispatches were blocked in session X?" This is critical for:
- Per-session cost attribution
- Debugging why a specific run failed
- Multi-tenant scenarios (if SF ever serves multiple users)
**Recommendation**: Add `session_id` label to all metrics. Use `ctx.sessionId` or `getAutoSession().currentTraceId`.
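One low-cost retrofit is a wrapper that injects the label on every call. In this sketch, `getCurrentSessionId` is a hypothetical stand-in for `ctx.sessionId` / `getAutoSession().currentTraceId`, and the counter store is an in-memory stub:
```javascript
// Stand-in counter store so the sketch is self-contained.
const counters = new Map();
function recordCounter(name, labels = {}, amount = 1) {
  const key = `${name}|${JSON.stringify(labels)}`;
  counters.set(key, (counters.get(key) ?? 0) + amount);
}
// Hypothetical session lookup; in SF this would read ctx.sessionId or
// getAutoSession().currentTraceId rather than an environment variable.
function getCurrentSessionId() {
  return process.env.SF_SESSION_ID ?? "unknown";
}
// Every metric recorded through this wrapper is attributable to a session.
function recordCounterScoped(name, labels = {}, amount = 1) {
  recordCounter(name, { session_id: getCurrentSessionId(), ...labels }, amount);
}
recordCounterScoped("sf_subagent_dispatch_blocked", { reason: "provider" });
```
With the label in place, per-session cost attribution becomes a label filter rather than a new subsystem.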
### Lens 4: Cost and Token Tracking
**RA.Aid Pattern**: Every trajectory record has `current_cost`, `input_tokens`, `output_tokens`.
**Our Pattern**: No cost/token metrics in metrics-central yet.
**Verdict**: ❌ **MISSING**
RA.Aid tracks cost per tool call. We track cost in `metrics.js` (SQLite + JSONL) but not in metrics-central. This means:
- No Prometheus-compatible cost metrics
- No cost alerts from Grafana
- No cost attribution by work mode or permission profile
**Recommendation**: Add cost/token metrics:
```javascript
"sf_cost_total": { help: "Total cost in USD", labels: ["work_mode", "model_id"] },
"sf_tokens_input_total": { help: "Total input tokens", labels: ["model_id"] },
"sf_tokens_output_total": { help: "Total output tokens", labels: ["model_id"] },
```
### Lens 5: Error Handling and Resilience
**RA.Aid Pattern**: Every error is caught, logged, and stored in the trajectory with full context.
**Our Pattern**: `flushMetrics()` catches and logs with `logWarning()`. No retry.
**Verdict**: ⚠️ **ACCEPTABLE but could be stronger**
Our flush failure is best-effort, which matches RA.Aid's philosophy. But RA.Aid also:
- Reopens closed DB connections automatically
- Has fallback handlers for agent failures
- Records error details in the trajectory
**Recommendation**:
1. Add retry with exponential backoff for flush failures
2. If flush fails 3 times, emit a `metrics_flush_failed` counter
3. On process exit, attempt a final synchronous flush
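A sketch of items 1 and 2, with a stand-in failure counter; the real `flushMetrics` would be passed in as `flush`. (Item 3 is deliberately omitted here: `exit` handlers cannot await, so the final flush must use a synchronous write path.)
```javascript
// Stand-in failure counter; the real one would go through metrics-central.
const counters = new Map();
function recordCounter(name, amount = 1) {
  counters.set(name, (counters.get(name) ?? 0) + amount);
}
// Retry `flush` with exponential backoff; after the final failure, record a
// metrics_flush_failed counter (recommendation 2) and rethrow.
async function flushWithRetry(flush, { retries = 3, baseMs = 10 } = {}) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await flush();
    } catch (err) {
      if (attempt === retries - 1) {
        recordCounter("metrics_flush_failed");
        throw err;
      }
      await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** attempt));
    }
  }
}
```
Backoff keeps transient I/O contention from turning into a burst of repeated failed writes, while the failure counter makes persistent trouble visible on the dashboard itself.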

---
## Specific Code Review Findings
### Finding 1: Unused Import
```javascript
import { isDbAvailable } from "./sf-db.js";
```
This is imported but never used. The JSDoc mentions "Optional SQLite persistence" but it's not implemented.
**Fix**: Either implement DB persistence or remove the import.
### Finding 2: Histogram Bucket Sorting
```javascript
this.buckets = [...buckets].sort((a, b) => a - b);
```
The spread copies the input array before sorting, so the caller's buckets are not mutated, and the numeric ascending sort guarantees the bucket order Prometheus expects.
**Verdict**: ✅ Correct.
### Finding 3: Label Key Serialization
```javascript
_key(labels) {
return this.labelNames.map((k) => `${k}=${labels[k] ?? ""}`).join(",");
}
```
If a label value contains `=` or `,`, the key parsing will break.
**Fix**: Add escaping or use a structured key format (e.g., JSON).
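A sketch of the structured-key variant, shown against a naive join like the current one. Serializing name/value pairs as JSON escapes `=` and `,` inside values, making the key unambiguous and reversible:
```javascript
// Ambiguous when values contain "=" or "," (the current scheme's failure mode).
function naiveKey(labelNames, labels) {
  return labelNames.map((k) => `${k}=${labels[k] ?? ""}`).join(",");
}
// Structured replacement: JSON escaping makes the key unambiguous, and mapping
// over the fixed labelNames array keeps the key order stable.
function labelKey(labelNames, labels) {
  return JSON.stringify(labelNames.map((k) => [k, String(labels[k] ?? "")]));
}
// Two distinct label sets that collide under the naive scheme.
const l1 = { x: "1,y=2", y: "" };
const l2 = { x: "1", y: "2,y=" };
```
Here `naiveKey` produces `x=1,y=2,y=` for both label sets, silently merging two different time series, while `labelKey` keeps them distinct.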
### Finding 4: No Validation on Metric Names
```javascript
export function recordCounter(name, labels = {}, amount = 1) {
const meta = getMetricMeta(name);
getRegistry().counter(name, meta.help, Object.keys(labels)).inc(labels, amount);
}
```
If `name` contains spaces or invalid Prometheus characters, the output will be malformed.
**Fix**: Add `validateMetricName(name)` that rejects invalid characters.
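A sketch of `validateMetricName`, using the metric-name grammar from the Prometheus data model (`[a-zA-Z_:][a-zA-Z0-9_:]*`):
```javascript
// Prometheus metric names must match this grammar; anything else would emit a
// malformed exposition file.
const METRIC_NAME_RE = /^[a-zA-Z_:][a-zA-Z0-9_:]*$/;
function validateMetricName(name) {
  if (typeof name !== "string" || !METRIC_NAME_RE.test(name)) {
    throw new TypeError(`invalid Prometheus metric name: ${JSON.stringify(name)}`);
  }
  return name;
}
```
`recordCounter` and its gauge/histogram siblings would call this before touching the registry, so a bad name fails loudly at record time instead of silently corrupting the flush output.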
### Finding 5: Timer Unref
```javascript
if (_flushTimer.unref) _flushTimer.unref();
```
In Node.js this correctly prevents the flush timer from keeping the process alive, and the guard handles runtimes whose timer objects lack `unref`.
**Verdict**: ✅ Acceptable with fallback.

---
## Overall Assessment
| Dimension | Grade | Notes |
|-----------|-------|-------|
| **Correctness** | B+ | Prometheus output is valid, but label escaping needs work |
| **Completeness** | B | Missing cost/token metrics, session scoping, DB persistence |
| **Consistency with SF** | A | Fits the extension model, uses existing patterns |
| **Consistency with RA.Aid** | C | RA.Aid would prefer event-sourced over aggregated |
| **Production Readiness** | B | Needs retry, validation, and DB projection before GA |
### Priority Fixes
1. **P0**: Add `session_id` label to all metrics
2. **P0**: Remove unused `isDbAvailable` import or implement DB persistence
3. **P1**: Add cost/token metrics
4. **P1**: Fix label value escaping
5. **P1**: Add metric name validation
6. **P2**: Add retry with backoff for flush failures
7. **P2**: Add final flush on process exit
8. **P2**: Consider a `metrics` table in `sf.db` as source of truth
### RA.Aid Patterns Worth Adopting
1. **Trajectory-style event logging**: Every metric should have a corresponding event in the audit/trajectory system
2. **Session-scoped connections**: All observability should be filterable by session
3. **Per-tool cost tracking**: Every tool call should record cost and tokens
4. **Error detail preservation**: When metrics indicate failure, the detail should be queryable

---
## Conclusion
`metrics-central.js` is a solid Prometheus-compatible metrics layer that fills a real gap in SF's observability. However, it prioritizes **exposition format** over **observability depth**. RA.Aid's trajectory model is superior for debugging and audit because it preserves every event.
The right path forward:
1. Keep metrics-central for Prometheus output (Grafana compatibility)
2. Add a `metrics` table to `sf.db` for queryable aggregation
3. Ensure every metric has a corresponding audit/trajectory event
4. Add session scoping and cost tracking
This gives us the best of both worlds: Prometheus for dashboards, SQLite for queries, and trajectory for debugging.