singularity-forge/docs/records/2026-05-07-metrics-central-vs-ra-aid-review.md


# Metrics Central vs RA.Aid Architecture Review
**Date**: 2026-05-07
**Reviewer**: Claude Code (SF)
**Scope**: `metrics-central.js` and its wiring, compared against RA.Aid patterns

---
## RA.Aid Architecture Summary
RA.Aid is a Python-based autonomous coding agent with these key architectural decisions:
| Layer | Pattern |
|-------|---------|
| **State** | Peewee ORM over SQLite (`.ra-aid/pk.db`), WAL mode, contextvars for connection scoping |
| **Agents** | LangGraph agents (research → planning → implementation) with explicit stage boundaries |
| **Memory** | Key facts, key snippets, research notes, trajectories — all DB-backed with repositories |
| **Trajectory** | Every tool call recorded: tool_name, parameters, result, cost, tokens, is_error, error_message |
| **Config** | JSON config file + runtime config repository with defaults |
| **Shell** | Interactive approval with cowboy_mode bypass, trajectory logging, timeout handling |
| **Reasoning** | Optional expert model consultation before each stage (reasoning_assist) |
| **Recovery** | Fallback handlers, retry with backoff, agent thread manager |
### RA.Aid's Observability Model
RA.Aid doesn't have a separate metrics system. Instead, observability is **embedded in the trajectory**:
- Every tool execution → `Trajectory` record with cost, tokens, timing
- Every stage transition → `Trajectory` record with `record_type="stage_transition"`
- Every human input → `HumanInput` record linked to trajectories
- Every error → `Trajectory` with `is_error=true`, `error_type`, `error_details`
This is **event-sourced observability**: the DB is the single source of truth for both state AND metrics.
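To make the difference concrete, here is a sketch of query-time aggregation over trajectory-style rows. The field names follow the table above; the records themselves are invented, and RA.Aid would hold them in SQLite rather than in memory:
```javascript
// Invented trajectory rows in the shape described above; RA.Aid stores these
// per tool call in SQLite rather than in a process-local array.
const trajectory = [
  { tool_name: "shell", is_error: false, current_cost: 0.002, input_tokens: 120, output_tokens: 40 },
  { tool_name: "shell", is_error: true, current_cost: 0.001, input_tokens: 80, output_tokens: 10 },
  { tool_name: "read_file", is_error: false, current_cost: 0.0005, input_tokens: 30, output_tokens: 5 },
];
// Event-sourced observability: aggregation happens at query time, so the same
// rows answer both "what happened?" (state) and "what did it cost?" (metrics).
function aggregate(rows) {
  return rows.reduce(
    (acc, r) => ({
      calls: acc.calls + 1,
      errors: acc.errors + (r.is_error ? 1 : 0),
      cost: acc.cost + r.current_cost,
    }),
    { calls: 0, errors: 0, cost: 0 }
  );
}
```
Because every event survives, the "metrics" view can be regenerated or re-sliced at any time; nothing is lost to write-time aggregation.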

---
## Our Metrics-Central.js Design
### What We Built
A Prometheus-compatible metrics collector with:
- Counter, Gauge, Histogram types
- In-memory aggregation with 60s flush to `.sf/runtime/sf-metrics.prom`
- Pre-defined metric metadata registry
- Wiring into subagent inheritance and mode transitions
### Design Decisions and Their Trade-offs
| Decision | Rationale | RA.Aid Comparison |
|----------|-----------|-------------------|
| **Prometheus text format** | Compatible with existing exposition tooling; Prometheus can scrape it, Grafana can chart it | RA.Aid uses DB queries; we support both |
| **In-memory aggregation** | Zero dependencies, fast | RA.Aid queries DB directly; we add a layer |
| **60s flush interval** | Batch writes, reduce I/O | RA.Aid writes per event; we batch |
| **Separate from trajectory/audit** | Metrics are aggregated views, not individual events | RA.Aid conflates events and metrics |
| **Metric metadata registry** | Pre-defined help text and labels | RA.Aid uses Peewee model definitions |

---
## The Review: 5 Lenses
### Lens 1: Data Model Consistency
**RA.Aid Pattern**: Single SQLite DB with typed models. Trajectory is the universal event log.
**Our Pattern**: Dual persistence:
- SQLite for operational state (UOK, sessions, tasks)
- Prometheus text file for metrics exposition
- JSONL for event durability
**Verdict**: ⚠️ **NEEDS WORK**
We have THREE observability sinks (SQLite, Prometheus file, JSONL) where RA.Aid has one. This creates:
- Risk of inconsistency between `sf-metrics.prom` and `sf.db`
- No unified query surface for "show me all subagent blocks in the last hour"
- Metrics file is write-only; no read path for programmatic consumption
**Recommendation**: Add a `metrics` table to `sf.db` that mirrors the Prometheus data model. The text file becomes a **projection**, not a source of truth.
```sql
CREATE TABLE metrics (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL,
  type TEXT NOT NULL CHECK(type IN ('counter', 'gauge', 'histogram')),
  labels TEXT,  -- JSON object
  value REAL NOT NULL,
  timestamp TEXT NOT NULL DEFAULT (datetime('now')),
  session_id TEXT
);
```
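Under that model, each flush would render `sf-metrics.prom` from the `metrics` rows. A minimal sketch of the projection, with invented rows mirroring the table's columns (`renderPromProjection` is a hypothetical helper, not existing SF code):
```javascript
// Render Prometheus exposition text from rows shaped like the `metrics` table
// above. In the real system the rows would come from sf.db, not a literal.
function renderPromProjection(rows) {
  const lines = [];
  const typed = new Set();
  for (const r of rows) {
    if (!typed.has(r.name)) {
      lines.push(`# TYPE ${r.name} ${r.type}`); // one TYPE line per metric
      typed.add(r.name);
    }
    const labels = Object.entries(JSON.parse(r.labels ?? "{}"))
      .map(([k, v]) => `${k}="${v}"`)
      .join(",");
    lines.push(labels ? `${r.name}{${labels}} ${r.value}` : `${r.name} ${r.value}`);
  }
  return lines.join("\n") + "\n";
}
const text = renderPromProjection([
  { name: "sf_subagent_dispatch_blocked", type: "counter", labels: '{"reason":"provider"}', value: 5 },
]);
```
If the file is ever lost or corrupted, it can simply be re-rendered from the DB, which is exactly what makes it a projection rather than a source of truth.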
### Lens 2: Event-Sourced vs Aggregated
**RA.Aid Pattern**: Every event is a row. Aggregation happens at query time.
**Our Pattern**: Aggregation happens at write time. Individual events are lost.
**Verdict**: ✅ **ACCEPTABLE for metrics, but incomplete for observability**
For counters and gauges, aggregation is correct. But for debugging "why was this subagent blocked?", we need the individual event, not just `sf_subagent_dispatch_blocked{reason="provider"} 5`.
**Recommendation**: Keep metrics-central for aggregated Prometheus output, but ALSO emit individual events to the audit/trajectory system. The metric is the summary; the trajectory is the detail.
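A sketch of that dual emission, using in-memory stand-ins: `recordCounter` approximates the real metrics-central helper, and `emitAuditEvent` is a hypothetical name for whatever the audit/trajectory system exposes:
```javascript
// In-memory stand-ins; the real recordCounter lives in metrics-central.js and
// the real audit sink is whatever the trajectory/audit system provides.
const counters = new Map();
const auditLog = [];
function recordCounter(name, labels = {}, amount = 1) {
  const key = `${name}|${JSON.stringify(labels)}`;
  counters.set(key, (counters.get(key) ?? 0) + amount);
}
function emitAuditEvent(type, payload) {
  auditLog.push({ type, ts: new Date().toISOString(), ...payload });
}
// The metric is the summary; the audit/trajectory event is the detail.
function recordBlockedDispatch(reason, detail) {
  recordCounter("sf_subagent_dispatch_blocked", { reason });
  emitAuditEvent("subagent_dispatch_blocked", { reason, ...detail });
}
recordBlockedDispatch("provider", { subagent: "researcher" });
```
The counter answers "how often?"; the audit record answers "which subagent, and why?".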
### Lens 3: Context and Session Scoping
**RA.Aid Pattern**: Every record has a `session_id` foreign key. Contextvars scope the DB connection.
**Our Pattern**: Metrics are global to the process. No session scoping.
**Verdict**: ❌ **GAP**
Our metrics can't answer: "How many subagent dispatches were blocked in session X?" This is critical for:
- Per-session cost attribution
- Debugging why a specific run failed
- Multi-tenant scenarios (if SF ever serves multiple users)
**Recommendation**: Add `session_id` label to all metrics. Use `ctx.sessionId` or `getAutoSession().currentTraceId`.
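One low-cost retrofit is a wrapper that injects the label on every call. In this sketch, `getCurrentSessionId` is a hypothetical stand-in for `ctx.sessionId` / `getAutoSession().currentTraceId`, and the counter store is an in-memory stub:
```javascript
// Stand-in counter store so the sketch is self-contained.
const counters = new Map();
function recordCounter(name, labels = {}, amount = 1) {
  const key = `${name}|${JSON.stringify(labels)}`;
  counters.set(key, (counters.get(key) ?? 0) + amount);
}
// Hypothetical session lookup; in SF this would read ctx.sessionId or
// getAutoSession().currentTraceId rather than an environment variable.
function getCurrentSessionId() {
  return process.env.SF_SESSION_ID ?? "unknown";
}
// Every metric recorded through this wrapper is attributable to a session.
function recordCounterScoped(name, labels = {}, amount = 1) {
  recordCounter(name, { session_id: getCurrentSessionId(), ...labels }, amount);
}
recordCounterScoped("sf_subagent_dispatch_blocked", { reason: "provider" });
```
With the label in place, per-session cost attribution becomes a label filter rather than a new subsystem.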
### Lens 4: Cost and Token Tracking
**RA.Aid Pattern**: Every trajectory record has `current_cost`, `input_tokens`, `output_tokens`.
**Our Pattern**: No cost/token metrics in metrics-central yet.
**Verdict**: ❌ **MISSING**
RA.Aid tracks cost per tool call. We track cost in `metrics.js` (SQLite + JSONL) but not in metrics-central. This means:
- No Prometheus-compatible cost metrics
- No cost alerts from Grafana
- No cost attribution by work mode or permission profile
**Recommendation**: Add cost/token metrics:
```javascript
"sf_cost_total": { help: "Total cost in USD", labels: ["work_mode", "model_id"] },
"sf_tokens_input_total": { help: "Total input tokens", labels: ["model_id"] },
"sf_tokens_output_total": { help: "Total output tokens", labels: ["model_id"] },
```
### Lens 5: Error Handling and Resilience
**RA.Aid Pattern**: Every error is caught, logged, and stored in the trajectory with full context.
**Our Pattern**: `flushMetrics()` catches and logs with `logWarning()`. No retry.
**Verdict**: ⚠️ **ACCEPTABLE but could be stronger**
Our flush failure is best-effort, which matches RA.Aid's philosophy. But RA.Aid also:
- Reopens closed DB connections automatically
- Has fallback handlers for agent failures
- Records error details in the trajectory
**Recommendation**:
1. Add retry with exponential backoff for flush failures
2. If flush fails 3 times, emit a `metrics_flush_failed` counter
3. On process exit, attempt a final synchronous flush
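A sketch of items 1 and 2, with a stand-in failure counter; the real `flushMetrics` would be passed in as `flush`. (Item 3 is deliberately omitted here: `exit` handlers cannot await, so the final flush must use a synchronous write path.)
```javascript
// Stand-in failure counter; the real one would go through metrics-central.
const counters = new Map();
function recordCounter(name, amount = 1) {
  counters.set(name, (counters.get(name) ?? 0) + amount);
}
// Retry `flush` with exponential backoff; after the final failure, record a
// metrics_flush_failed counter (recommendation 2) and rethrow.
async function flushWithRetry(flush, { retries = 3, baseMs = 10 } = {}) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await flush();
    } catch (err) {
      if (attempt === retries - 1) {
        recordCounter("metrics_flush_failed");
        throw err;
      }
      await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** attempt));
    }
  }
}
```
Backoff keeps transient I/O contention from turning into a burst of repeated failed writes, while the failure counter makes persistent trouble visible on the dashboard itself.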

---
## Specific Code Review Findings
### Finding 1: Unused Import
```javascript
import { isDbAvailable } from "./sf-db.js";
```
This is imported but never used. The JSDoc mentions "Optional SQLite persistence" but it's not implemented.
**Fix**: Either implement DB persistence or remove the import.
### Finding 2: Histogram Bucket Sorting
```javascript
this.buckets = [...buckets].sort((a, b) => a - b);
```
The spread copies the input array before sorting, so the caller's buckets are not mutated, and the numeric ascending sort guarantees the bucket order Prometheus expects.
**Verdict**: ✅ Correct.
### Finding 3: Label Key Serialization
```javascript
_key(labels) {
return this.labelNames.map((k) => `${k}=${labels[k] ?? ""}`).join(",");
}
```
If a label value contains `=` or `,`, the key parsing will break.
**Fix**: Add escaping or use a structured key format (e.g., JSON).
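A sketch of the structured-key variant, shown against a naive join like the current one. Serializing name/value pairs as JSON escapes `=` and `,` inside values, making the key unambiguous and reversible:
```javascript
// Ambiguous when values contain "=" or "," (the current scheme's failure mode).
function naiveKey(labelNames, labels) {
  return labelNames.map((k) => `${k}=${labels[k] ?? ""}`).join(",");
}
// Structured replacement: JSON escaping makes the key unambiguous, and mapping
// over the fixed labelNames array keeps the key order stable.
function labelKey(labelNames, labels) {
  return JSON.stringify(labelNames.map((k) => [k, String(labels[k] ?? "")]));
}
// Two distinct label sets that collide under the naive scheme.
const l1 = { x: "1,y=2", y: "" };
const l2 = { x: "1", y: "2,y=" };
```
Here `naiveKey` produces `x=1,y=2,y=` for both label sets, silently merging two different time series, while `labelKey` keeps them distinct.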
### Finding 4: No Validation on Metric Names
```javascript
export function recordCounter(name, labels = {}, amount = 1) {
const meta = getMetricMeta(name);
getRegistry().counter(name, meta.help, Object.keys(labels)).inc(labels, amount);
}
```
If `name` contains spaces or invalid Prometheus characters, the output will be malformed.
**Fix**: Add `validateMetricName(name)` that rejects invalid characters.
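A sketch of `validateMetricName`, using the metric-name grammar from the Prometheus data model (`[a-zA-Z_:][a-zA-Z0-9_:]*`):
```javascript
// Prometheus metric names must match this grammar; anything else would emit a
// malformed exposition file.
const METRIC_NAME_RE = /^[a-zA-Z_:][a-zA-Z0-9_:]*$/;
function validateMetricName(name) {
  if (typeof name !== "string" || !METRIC_NAME_RE.test(name)) {
    throw new TypeError(`invalid Prometheus metric name: ${JSON.stringify(name)}`);
  }
  return name;
}
```
`recordCounter` and its gauge/histogram siblings would call this before touching the registry, so a bad name fails loudly at record time instead of silently corrupting the flush output.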
### Finding 5: Timer Unref
```javascript
if (_flushTimer.unref) _flushTimer.unref();
```
In Node.js this correctly prevents the flush timer from keeping the process alive, and the guard handles runtimes whose timer objects lack `unref`.
**Verdict**: ✅ Acceptable with fallback.

---
## Overall Assessment
| Dimension | Grade | Notes |
|-----------|-------|-------|
| **Correctness** | B+ | Prometheus output is valid, but label escaping needs work |
| **Completeness** | B | Missing cost/token metrics, session scoping, DB persistence |
| **Consistency with SF** | A | Fits the extension model, uses existing patterns |
| **Consistency with RA.Aid** | C | RA.Aid would prefer event-sourced over aggregated |
| **Production Readiness** | B | Needs retry, validation, and DB projection before GA |
### Priority Fixes
1. **P0**: Add `session_id` label to all metrics
2. **P0**: Remove unused `isDbAvailable` import or implement DB persistence
3. **P1**: Add cost/token metrics
4. **P1**: Fix label value escaping
5. **P1**: Add metric name validation
6. **P2**: Add retry with backoff for flush failures
7. **P2**: Add final flush on process exit
8. **P2**: Consider a `metrics` table in `sf.db` as source of truth
### RA.Aid Patterns Worth Adopting
1. **Trajectory-style event logging**: Every metric should have a corresponding event in the audit/trajectory system
2. **Session-scoped connections**: All observability should be filterable by session
3. **Per-tool cost tracking**: Every tool call should record cost and tokens
4. **Error detail preservation**: When metrics indicate failure, the detail should be queryable

---
## Conclusion
`metrics-central.js` is a solid Prometheus-compatible metrics layer that fills a real gap in SF's observability. However, it prioritizes **exposition format** over **observability depth**. RA.Aid's trajectory model is superior for debugging and audit because it preserves every event.
The right path forward:
1. Keep metrics-central for Prometheus output (Grafana compatibility)
2. Add a `metrics` table to `sf.db` for queryable aggregation
3. Ensure every metric has a corresponding audit/trajectory event
4. Add session scoping and cost tracking
This gives us the best of both worlds: Prometheus for dashboards, SQLite for queries, and trajectory for debugging.