# Metrics Central vs RA.Aid Architecture Review

**Date**: 2026-05-07

**Reviewer**: Claude Code (SF)

**Scope**: `metrics-central.js` and its wiring, compared against RA.Aid patterns

---

## RA.Aid Architecture Summary

RA.Aid is a Python-based autonomous coding agent with these key architectural decisions:

| Layer | Pattern |
|-------|---------|
| **State** | Peewee ORM over SQLite (`.ra-aid/pk.db`), WAL mode, contextvars for connection scoping |
| **Agents** | LangGraph agents (research → planning → implementation) with explicit stage boundaries |
| **Memory** | Key facts, key snippets, research notes, trajectories — all DB-backed with repositories |
| **Trajectory** | Every tool call recorded: tool_name, parameters, result, cost, tokens, is_error, error_message |
| **Config** | JSON config file + runtime config repository with defaults |
| **Shell** | Interactive approval with cowboy_mode bypass, trajectory logging, timeout handling |
| **Reasoning** | Optional expert model consultation before each stage (reasoning_assist) |
| **Recovery** | Fallback handlers, retry with backoff, agent thread manager |

### RA.Aid's Observability Model

RA.Aid doesn't have a separate metrics system. Instead, observability is **embedded in the trajectory**:

- Every tool execution → `Trajectory` record with cost, tokens, timing
- Every stage transition → `Trajectory` record with `record_type="stage_transition"`
- Every human input → `HumanInput` record linked to trajectories
- Every error → `Trajectory` with `is_error=true`, `error_type`, `error_details`

This is **event-sourced observability**: the DB is the single source of truth for both state AND metrics.

---

## Our Metrics-Central.js Design

### What We Built

A Prometheus-compatible metrics collector with:

- Counter, Gauge, Histogram types
- In-memory aggregation with 60s flush to `.sf/runtime/sf-metrics.prom`
- Pre-defined metric metadata registry
- Wiring into subagent inheritance and mode transitions

### Design Decisions and Their Trade-offs

| Decision | Rationale | RA.Aid Comparison |
|----------|-----------|-------------------|
| **Prometheus text format** | Compatible with existing exposition, scrapeable by Grafana | RA.Aid uses DB queries; we support both |
| **In-memory aggregation** | Zero dependencies, fast | RA.Aid queries DB directly; we add a layer |
| **60s flush interval** | Batch writes, reduce I/O | RA.Aid writes per event; we batch |
| **Separate from trajectory/audit** | Metrics are aggregated views, not individual events | RA.Aid conflates events and metrics |
| **Metric metadata registry** | Pre-defined help text and labels | RA.Aid uses Peewee model definitions |

---

## The Review: 5 Lenses

### Lens 1: Data Model Consistency

**RA.Aid Pattern**: Single SQLite DB with typed models. Trajectory is the universal event log.

**Our Pattern**: Three persistence sinks:

- SQLite for operational state (UOK, sessions, tasks)
- Prometheus text file for metrics exposition
- JSONL for event durability

**Verdict**: ⚠️ **NEEDS WORK**

We have THREE observability sinks (SQLite, Prometheus file, JSONL) where RA.Aid has one. This creates:

- Risk of inconsistency between `sf-metrics.prom` and `sf.db`
- No unified query surface for "show me all subagent blocks in the last hour"
- Metrics file is write-only; no read path for programmatic consumption

**Recommendation**: Add a `metrics` table to `sf.db` that mirrors the Prometheus data model. The text file becomes a **projection**, not a source of truth.

```sql
CREATE TABLE metrics (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL,
  type TEXT NOT NULL CHECK(type IN ('counter', 'gauge', 'histogram')),
  labels TEXT, -- JSON object
  value REAL NOT NULL,
  timestamp TEXT NOT NULL DEFAULT (datetime('now')),
  session_id TEXT
);
```

### Lens 2: Event-Sourced vs Aggregated

**RA.Aid Pattern**: Every event is a row. Aggregation happens at query time.

**Our Pattern**: Aggregation happens at write time. Individual events are lost.

**Verdict**: ✅ **ACCEPTABLE for metrics, but incomplete for observability**

For counters and gauges, aggregation is correct. But for debugging "why was this subagent blocked?", we need the individual event, not just `sf_subagent_dispatch_blocked{reason="provider"} 5`.

**Recommendation**: Keep metrics-central for aggregated Prometheus output, but ALSO emit individual events to the audit/trajectory system. The metric is the summary; the trajectory is the detail.

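A minimal sketch of this summary-plus-detail pairing. `counters` and `events` are in-memory stand-ins for metrics-central and the audit/trajectory sink, and `recordBlockedDispatch` is an illustrative name, not an existing function:

```javascript
// Sketch only: in-memory stand-ins for metrics-central (the summary)
// and the audit/trajectory sink (the detail).
const counters = new Map();
const events = [];

function recordBlockedDispatch(reason, detail) {
  // Summary: bump the aggregated counter, keyed by metric name + labels.
  const key = `sf_subagent_dispatch_blocked{reason="${reason}"}`;
  counters.set(key, (counters.get(key) ?? 0) + 1);
  // Detail: append the individual event so "why was this blocked?" stays answerable.
  events.push({ ts: Date.now(), type: "subagent_dispatch_blocked", reason, detail });
}
```

The counter feeds the Prometheus projection; the event array represents what would become a trajectory row.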
### Lens 3: Context and Session Scoping

**RA.Aid Pattern**: Every record has a `session_id` foreign key. Contextvars scope the DB connection.

**Our Pattern**: Metrics are global to the process. No session scoping.

**Verdict**: ❌ **GAP**

Our metrics can't answer: "How many subagent dispatches were blocked in session X?" This is critical for:

- Per-session cost attribution
- Debugging why a specific run failed
- Multi-tenant scenarios (if SF ever serves multiple users)

**Recommendation**: Add `session_id` label to all metrics. Use `ctx.sessionId` or `getAutoSession().currentTraceId`.

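One low-cost shape for this is a helper that merges the session label into every label set before it reaches the metric calls. `withSession` and `getSessionId` are hypothetical names standing in for a lookup of `ctx.sessionId` / `getAutoSession().currentTraceId`:

```javascript
// Hypothetical helper: merge the current session id into a label set.
// getSessionId stands in for ctx.sessionId / getAutoSession().currentTraceId.
function withSession(labels, getSessionId) {
  const session_id = getSessionId() ?? "unknown"; // never emit a missing label
  return { ...labels, session_id };
}
```

Call sites would then pass `withSession({ reason: "provider" }, getSessionId)` wherever they currently pass a bare label object.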
### Lens 4: Cost and Token Tracking

**RA.Aid Pattern**: Every trajectory record has `current_cost`, `input_tokens`, `output_tokens`.

**Our Pattern**: No cost/token metrics in metrics-central yet.

**Verdict**: ❌ **MISSING**

RA.Aid tracks cost per tool call. We track cost in `metrics.js` (SQLite + JSONL) but not in metrics-central. This means:

- No Prometheus-compatible cost metrics
- No cost alerts from Grafana
- No cost attribution by work mode or permission profile

**Recommendation**: Add cost/token metrics:

```javascript
"sf_cost_total": { help: "Total cost in USD", labels: ["work_mode", "model_id"] },
"sf_tokens_input_total": { help: "Total input tokens", labels: ["model_id"] },
"sf_tokens_output_total": { help: "Total output tokens", labels: ["model_id"] },
```
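Illustrative call sites for these metrics, with a toy stand-in for metrics-central's `recordCounter` so the sketch is self-contained. The `resp` fields (`model`, `costUsd`, `inputTokens`, `outputTokens`) are assumptions about a model-response shape, not a known API:

```javascript
// Toy stand-in for metrics-central's recordCounter, for illustration only.
const totals = new Map();
function recordCounter(name, labels, amount = 1) {
  const key = name + JSON.stringify(labels);
  totals.set(key, (totals.get(key) ?? 0) + amount);
}

// Hypothetical hook: record cost/tokens once per model response.
function recordModelUsage(resp, workMode) {
  recordCounter("sf_cost_total", { work_mode: workMode, model_id: resp.model }, resp.costUsd);
  recordCounter("sf_tokens_input_total", { model_id: resp.model }, resp.inputTokens);
  recordCounter("sf_tokens_output_total", { model_id: resp.model }, resp.outputTokens);
}
```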

### Lens 5: Error Handling and Resilience

**RA.Aid Pattern**: Every error is caught, logged, and stored in the trajectory with full context.

**Our Pattern**: `flushMetrics()` catches and logs with `logWarning()`. No retry.

**Verdict**: ⚠️ **ACCEPTABLE but could be stronger**

Our flush failure is best-effort, which matches RA.Aid's philosophy. But RA.Aid also:

- Reopens closed DB connections automatically
- Has fallback handlers for agent failures
- Records error details in the trajectory

**Recommendation**:

1. Add retry with exponential backoff for flush failures
2. If flush fails 3 times, emit a `metrics_flush_failed` counter
3. On process exit, attempt a final synchronous flush

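A sketch of item 1, with the flush and sleep functions injectable for testing. The names (`flushWithRetry`, `maxAttempts`, `baseDelayMs`) are illustrative, not the existing metrics-central API:

```javascript
// Illustrative retry wrapper for a flush function. On final failure it
// returns a result rather than throwing; real code would also bump the
// metrics_flush_failed counter at that point.
function flushWithRetry(flushFn, { maxAttempts = 3, baseDelayMs = 100, sleepFn = busyWaitMs } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      flushFn();
      return { ok: true, attempts: attempt };
    } catch (err) {
      if (attempt === maxAttempts) return { ok: false, attempts: attempt, error: String(err) };
      sleepFn(baseDelayMs * 2 ** (attempt - 1)); // 100ms, 200ms, 400ms, ...
    }
  }
}

// Synchronous wait so the sketch needs no async plumbing; a real
// implementation would schedule a timer instead of blocking.
function busyWaitMs(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin */ }
}
```

Keeping the final flush synchronous also covers item 3: a process-exit handler can call `flushWithRetry` directly without awaiting anything.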
---

## Specific Code Review Findings

### Finding 1: Unused Import

```javascript
import { isDbAvailable } from "./sf-db.js";
```

This is imported but never used. The JSDoc mentions "Optional SQLite persistence" but it's not implemented.

**Fix**: Either implement DB persistence or remove the import.

### Finding 2: Histogram Bucket Sorting

```javascript
this.buckets = [...buckets].sort((a, b) => a - b);
```

The spread copies the input array before sorting, so the caller's array is not mutated. Prometheus requires buckets in ascending order, which the sort guarantees.

**Verdict**: ✅ Correct.

### Finding 3: Label Key Serialization

```javascript
_key(labels) {
  return this.labelNames.map((k) => `${k}=${labels[k] ?? ""}`).join(",");
}
```

If a label value contains `=` or `,`, the key parsing will break.

**Fix**: Add escaping or use a structured key format (e.g., JSON).

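A sketch of the structured-key option: JSON-encoding the `[name, value]` pairs makes `=` and `,` inside values harmless, because JSON escapes them. `labelKey` is a free-function stand-in for the `_key` method:

```javascript
// Collision-proof label key: JSON-encode [name, value] pairs so that
// "=" and "," inside values cannot be confused with separators.
function labelKey(labelNames, labels) {
  return JSON.stringify(labelNames.map((k) => [k, String(labels[k] ?? "")]));
}
```

With the original format, `{ x: "1,y=2", y: "" }` and `{ x: "1", y: "2,y=" }` both serialize to `x=1,y=2,y=`; the JSON keys stay distinct.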
### Finding 4: No Validation on Metric Names
```javascript
export function recordCounter(name, labels = {}, amount = 1) {
  const meta = getMetricMeta(name);
  getRegistry().counter(name, meta.help, Object.keys(labels)).inc(labels, amount);
}
```

If `name` contains spaces or invalid Prometheus characters, the output will be malformed.

**Fix**: Add `validateMetricName(name)` that rejects invalid characters.

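A sketch of such a guard, using Prometheus's documented metric-name rule `[a-zA-Z_:][a-zA-Z0-9_:]*`. Throwing on bad input is one policy choice; silently dropping the metric would be another:

```javascript
// Prometheus metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*.
const METRIC_NAME_RE = /^[a-zA-Z_:][a-zA-Z0-9_:]*$/;

// Returns the name unchanged if valid, throws otherwise.
function validateMetricName(name) {
  if (typeof name !== "string" || !METRIC_NAME_RE.test(name)) {
    throw new Error(`invalid Prometheus metric name: ${JSON.stringify(name)}`);
  }
  return name;
}
```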
### Finding 5: Timer Unref
```javascript
if (_flushTimer.unref) _flushTimer.unref();
```

This is correct for Node.js but may not work in all environments (e.g., Bun).

**Verdict**: ✅ Acceptable with fallback.

---

## Overall Assessment

| Dimension | Grade | Notes |
|-----------|-------|-------|
| **Correctness** | B+ | Prometheus output is valid, but label escaping needs work |
| **Completeness** | B | Missing cost/token metrics, session scoping, DB persistence |
| **Consistency with SF** | A | Fits the extension model, uses existing patterns |
| **Consistency with RA.Aid** | C | RA.Aid would prefer event-sourced over aggregated |
| **Production Readiness** | B | Needs retry, validation, and DB projection before GA |

### Priority Fixes

1. **P0**: Add `session_id` label to all metrics
2. **P0**: Remove unused `isDbAvailable` import or implement DB persistence
3. **P1**: Add cost/token metrics
4. **P1**: Fix label value escaping
5. **P1**: Add metric name validation
6. **P2**: Add retry with backoff for flush failures
7. **P2**: Add final flush on process exit
8. **P2**: Consider a `metrics` table in `sf.db` as source of truth

### RA.Aid Patterns Worth Adopting

1. **Trajectory-style event logging**: Every metric should have a corresponding event in the audit/trajectory system
2. **Session-scoped connections**: All observability should be filterable by session
3. **Per-tool cost tracking**: Every tool call should record cost and tokens
4. **Error detail preservation**: When metrics indicate failure, the detail should be queryable

---

## Conclusion

`metrics-central.js` is a solid Prometheus-compatible metrics layer that fills a real gap in SF's observability. However, it prioritizes **exposition format** over **observability depth**. RA.Aid's trajectory model is superior for debugging and audit because it preserves every event.

The right path forward:

1. Keep metrics-central for Prometheus output (Grafana compatibility)
2. Add a `metrics` table to `sf.db` for queryable aggregation
3. Ensure every metric has a corresponding audit/trajectory event
4. Add session scoping and cost tracking

This gives us the best of both worlds: Prometheus for dashboards, SQLite for queries, and trajectory for debugging.