# Metrics Central vs RA.Aid Architecture Review

**Date**: 2026-05-07
**Reviewer**: Claude Code (SF)
**Scope**: `metrics-central.js` and its wiring, compared against RA.Aid patterns

---

## RA.Aid Architecture Summary

RA.Aid is a Python-based autonomous coding agent with these key architectural decisions:

| Layer | Pattern |
|-------|---------|
| **State** | Peewee ORM over SQLite (`.ra-aid/pk.db`), WAL mode, contextvars for connection scoping |
| **Agents** | LangGraph agents (research → planning → implementation) with explicit stage boundaries |
| **Memory** | Key facts, key snippets, research notes, trajectories — all DB-backed with repositories |
| **Trajectory** | Every tool call recorded: tool_name, parameters, result, cost, tokens, is_error, error_message |
| **Config** | JSON config file + runtime config repository with defaults |
| **Shell** | Interactive approval with cowboy_mode bypass, trajectory logging, timeout handling |
| **Reasoning** | Optional expert model consultation before each stage (reasoning_assist) |
| **Recovery** | Fallback handlers, retry with backoff, agent thread manager |

### RA.Aid's Observability Model

RA.Aid doesn't have a separate metrics system. Instead, observability is **embedded in the trajectory**:

- Every tool execution → `Trajectory` record with cost, tokens, timing
- Every stage transition → `Trajectory` record with `record_type="stage_transition"`
- Every human input → `HumanInput` record linked to trajectories
- Every error → `Trajectory` with `is_error=true`, `error_type`, `error_details`

This is **event-sourced observability**: the DB is the single source of truth for both state AND metrics.
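The event-sourced model can be sketched in a few lines of JavaScript: every event becomes a record, and counter-style metrics fall out of query-time aggregation. The store and function names here are illustrative, not RA.Aid's actual API; only the field names follow the Trajectory columns listed above.

```javascript
// Sketch of event-sourced observability: every event is a row, and
// "metrics" are aggregations computed at query time. The in-memory
// array stands in for the SQLite trajectory table.
const trajectory = [];

function recordToolCall({ toolName, params, cost, tokens, isError, errorMessage }) {
  trajectory.push({
    record_type: "tool_execution",
    tool_name: toolName,
    parameters: params,
    cost,
    tokens,
    is_error: Boolean(isError),
    error_message: errorMessage ?? null,
    timestamp: Date.now(),
  });
}

// Aggregation at query time: total cost per tool, like a counter metric,
// but with every underlying event still available for debugging.
function costByTool() {
  return trajectory
    .filter((r) => r.record_type === "tool_execution")
    .reduce((acc, r) => {
      acc[r.tool_name] = (acc[r.tool_name] ?? 0) + r.cost;
      return acc;
    }, {});
}

recordToolCall({ toolName: "run_shell", params: { cmd: "ls" }, cost: 0.5, tokens: 120 });
recordToolCall({ toolName: "run_shell", params: { cmd: "pwd" }, cost: 0.25, tokens: 80 });
// costByTool() → { run_shell: 0.75 }
```

The same query could compute error rates or token totals without any write-time aggregation, which is the property the review returns to below.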
---

## Our Metrics-Central.js Design

### What We Built

A Prometheus-compatible metrics collector with:

- Counter, Gauge, Histogram types
- In-memory aggregation with 60s flush to `.sf/runtime/sf-metrics.prom`
- Pre-defined metric metadata registry
- Wiring into subagent inheritance and mode transitions

### Design Decisions and Their Trade-offs

| Decision | Rationale | RA.Aid Comparison |
|----------|-----------|-------------------|
| **Prometheus text format** | Compatible with existing exposition, scrapeable by Grafana | RA.Aid uses DB queries; we support both |
| **In-memory aggregation** | Zero dependencies, fast | RA.Aid queries DB directly; we add a layer |
| **60s flush interval** | Batch writes, reduce I/O | RA.Aid writes per event; we batch |
| **Separate from trajectory/audit** | Metrics are aggregated views, not individual events | RA.Aid conflates events and metrics |
| **Metric metadata registry** | Pre-defined help text and labels | RA.Aid uses Peewee model definitions |

---

## The Review: 5 Lenses

### Lens 1: Data Model Consistency

**RA.Aid Pattern**: Single SQLite DB with typed models. Trajectory is the universal event log.

**Our Pattern**: Triple persistence:

- SQLite for operational state (UOK, sessions, tasks)
- Prometheus text file for metrics exposition
- JSONL for event durability

**Verdict**: ⚠️ **NEEDS WORK**

We have THREE observability sinks (SQLite, Prometheus file, JSONL) where RA.Aid has one. This creates:

- Risk of inconsistency between `sf-metrics.prom` and `sf.db`
- No unified query surface for "show me all subagent blocks in the last hour"
- Metrics file is write-only; no read path for programmatic consumption

**Recommendation**: Add a `metrics` table to `sf.db` that mirrors the Prometheus data model. The text file becomes a **projection**, not a source of truth.
```sql
CREATE TABLE metrics (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL,
  type TEXT NOT NULL CHECK(type IN ('counter', 'gauge', 'histogram')),
  labels TEXT, -- JSON object
  value REAL NOT NULL,
  timestamp TEXT NOT NULL DEFAULT (datetime('now')),
  session_id TEXT
);
```

### Lens 2: Event-Sourced vs Aggregated

**RA.Aid Pattern**: Every event is a row. Aggregation happens at query time.

**Our Pattern**: Aggregation happens at write time. Individual events are lost.

**Verdict**: ✅ **ACCEPTABLE for metrics, but incomplete for observability**

For counters and gauges, aggregation is correct. But for debugging "why was this subagent blocked?", we need the individual event, not just `sf_subagent_dispatch_blocked{reason="provider"} 5`.

**Recommendation**: Keep metrics-central for aggregated Prometheus output, but ALSO emit individual events to the audit/trajectory system. The metric is the summary; the trajectory is the detail.

### Lens 3: Context and Session Scoping

**RA.Aid Pattern**: Every record has a `session_id` foreign key. Contextvars scope the DB connection.

**Our Pattern**: Metrics are global to the process. No session scoping.

**Verdict**: ❌ **GAP**

Our metrics can't answer: "How many subagent dispatches were blocked in session X?" This is critical for:

- Per-session cost attribution
- Debugging why a specific run failed
- Multi-tenant scenarios (if SF ever serves multiple users)

**Recommendation**: Add a `session_id` label to all metrics. Use `ctx.sessionId` or `getAutoSession().currentTraceId`.

### Lens 4: Cost and Token Tracking

**RA.Aid Pattern**: Every trajectory record has `current_cost`, `input_tokens`, `output_tokens`.

**Our Pattern**: No cost/token metrics in metrics-central yet.

**Verdict**: ❌ **MISSING**

RA.Aid tracks cost per tool call. We track cost in `metrics.js` (SQLite + JSONL) but not in metrics-central.
This means:

- No Prometheus-compatible cost metrics
- No cost alerts from Grafana
- No cost attribution by work mode or permission profile

**Recommendation**: Add cost/token metrics:

```javascript
"sf_cost_total": { help: "Total cost in USD", labels: ["work_mode", "model_id"] },
"sf_tokens_input_total": { help: "Total input tokens", labels: ["model_id"] },
"sf_tokens_output_total": { help: "Total output tokens", labels: ["model_id"] },
```

### Lens 5: Error Handling and Resilience

**RA.Aid Pattern**: Every error is caught, logged, and stored in the trajectory with full context.

**Our Pattern**: `flushMetrics()` catches and logs with `logWarning()`. No retry.

**Verdict**: ⚠️ **ACCEPTABLE but could be stronger**

Our flush failure is best-effort, which matches RA.Aid's philosophy. But RA.Aid also:

- Reopens closed DB connections automatically
- Has fallback handlers for agent failures
- Records error details in the trajectory

**Recommendation**:

1. Add retry with exponential backoff for flush failures
2. If flush fails 3 times, emit a `metrics_flush_failed` counter
3. On process exit, attempt a final synchronous flush

---

## Specific Code Review Findings

### Finding 1: Unused Import

```javascript
import { isDbAvailable } from "./sf-db.js";
```

This is imported but never used. The JSDoc mentions "Optional SQLite persistence" but it's not implemented.

**Fix**: Either implement DB persistence or remove the import.

### Finding 2: Histogram Bucket Sorting

```javascript
this.buckets = [...buckets].sort((a, b) => a - b);
```

This sorts a copy of the input array, so the caller's array is never mutated, and the explicit comparator guarantees the ascending numeric order Prometheus expects (the default `sort` would compare as strings).

**Verdict**: ✅ Correct.

### Finding 3: Label Key Serialization

```javascript
_key(labels) {
  return this.labelNames.map((k) => `${k}=${labels[k] ?? ""}`).join(",");
}
```

If a label value contains `=` or `,`, the key parsing will break.

**Fix**: Add escaping or use a structured key format (e.g., JSON).
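One low-risk way to implement that fix is to serialize the ordered label values with `JSON.stringify`, which escapes `=`, `,`, and quotes for free. A minimal sketch, assuming the `labelNames` field from the snippet above; the class wrapper is illustrative, not the actual `metrics-central.js` code:

```javascript
// Sketch of a collision-safe label key: JSON-encode the ordered label
// values so delimiters inside a value cannot collide with the separator.
class LabeledMetric {
  constructor(labelNames) {
    // Sorting the names gives a stable key regardless of declaration order.
    this.labelNames = [...labelNames].sort();
  }

  _key(labels) {
    // JSON.stringify escapes '=', ',', and quotes inside values.
    return JSON.stringify(this.labelNames.map((k) => labels[k] ?? ""));
  }
}

const m = new LabeledMetric(["reason", "mode"]);
// Values containing '=' or ',' no longer collide with the separator:
m._key({ reason: "a=b", mode: "x" }); // '["x","a=b"]'
m._key({ reason: "a", mode: "b,x" }); // '["b,x","a"]'
```

The key is also trivially parseable back with `JSON.parse`, which the `=`/`,` format was not once values contained those characters.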
### Finding 4: No Validation on Metric Names

```javascript
export function recordCounter(name, labels = {}, amount = 1) {
  const meta = getMetricMeta(name);
  getRegistry().counter(name, meta.help, Object.keys(labels)).inc(labels, amount);
}
```

If `name` contains spaces or invalid Prometheus characters, the output will be malformed.

**Fix**: Add `validateMetricName(name)` that rejects invalid characters.

### Finding 5: Timer Unref

```javascript
if (_flushTimer.unref) _flushTimer.unref();
```

This is correct for Node.js but may not work in all environments (e.g., Bun).

**Verdict**: ✅ Acceptable with fallback.

---

## Overall Assessment

| Dimension | Grade | Notes |
|-----------|-------|-------|
| **Correctness** | B+ | Prometheus output is valid, but label escaping needs work |
| **Completeness** | B | Missing cost/token metrics, session scoping, DB persistence |
| **Consistency with SF** | A | Fits the extension model, uses existing patterns |
| **Consistency with RA.Aid** | C | RA.Aid would prefer event-sourced over aggregated |
| **Production Readiness** | B | Needs retry, validation, and DB projection before GA |

### Priority Fixes

1. **P0**: Add `session_id` label to all metrics
2. **P0**: Remove unused `isDbAvailable` import or implement DB persistence
3. **P1**: Add cost/token metrics
4. **P1**: Fix label value escaping
5. **P1**: Add metric name validation
6. **P2**: Add retry with backoff for flush failures
7. **P2**: Add final flush on process exit
8. **P2**: Consider a `metrics` table in `sf.db` as source of truth

### RA.Aid Patterns Worth Adopting

1. **Trajectory-style event logging**: Every metric should have a corresponding event in the audit/trajectory system
2. **Session-scoped connections**: All observability should be filterable by session
3. **Per-tool cost tracking**: Every tool call should record cost and tokens
4. **Error detail preservation**: When metrics indicate failure, the detail should be queryable

---

## Conclusion

`metrics-central.js` is a solid Prometheus-compatible metrics layer that fills a real gap in SF's observability. However, it prioritizes **exposition format** over **observability depth**. RA.Aid's trajectory model is superior for debugging and audit because it preserves every event.

The right path forward:

1. Keep metrics-central for Prometheus output (Grafana compatibility)
2. Add a `metrics` table to `sf.db` for queryable aggregation
3. Ensure every metric has a corresponding audit/trajectory event
4. Add session scoping and cost tracking

This gives us the best of both worlds: Prometheus for dashboards, SQLite for queries, and trajectory for debugging.
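The "metric plus trajectory event" pairing could be a thin wrapper around the existing recording API. A sketch under stated assumptions: the in-memory `counters` map stands in for the metrics-central registry, and `auditLog` is a hypothetical stand-in for the audit/trajectory sink, neither of which is the actual SF code:

```javascript
// Sketch: every metric increment also emits a queryable detail event,
// so the aggregated counter answers "how many?" and the event log
// answers "why, exactly?". Both stores here are illustrative stand-ins.
const counters = new Map();
const auditLog = [];

function recordCounter(name, labels, amount = 1) {
  const key = `${name}:${JSON.stringify(labels)}`;
  counters.set(key, (counters.get(key) ?? 0) + amount);
}

function recordCounterWithEvent(name, labels, detail) {
  recordCounter(name, labels, 1); // aggregated summary for Prometheus
  auditLog.push({
    metric: name,
    labels, // includes session_id per the review's recommendation
    detail, // the individual-event context the counter discards
    timestamp: new Date().toISOString(),
  });
}

recordCounterWithEvent(
  "sf_subagent_dispatch_blocked",
  { reason: "provider", session_id: "s-123" },
  { subagent: "researcher", provider: "anthropic", message: "quota exceeded" }
);
```

With this shape, the Prometheus projection and the per-session detail query both derive from a single recording call, which is the closest analogue to RA.Aid's single-source-of-truth trajectory.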