# A2A Adoption Plan for Singularity-Forge — Production Grade

**Author:** Research synthesis  
**Date:** 2026-05-08  
**Status:** Draft — for review  
**Scope:** A2A as the internal agent communication protocol for SF dispatch layer

---

## Executive Summary

SF's 5 dispatch mechanisms + MessageBus are functionally complete but architecturally silos. A2A provides a standardized protocol that maps 1:1 onto SF's semantics. The existing MessageBus is preserved as the transport; A2A is the semantic layer on top.

**This is a production-grade plan.** Every section covers: error handling, failure modes, rollback procedures, observability, and testing strategy.

---

## Quick Reference

| Concern | Decision |
|---|---|
| A2A as internal protocol | YES — standardizes Task state, priority, capability discovery |
| MessageBus | Wrap as `A2AMessageService` transport; add `AgentRegistry` |
| Transport | SQLite-backed MessageBus (not HTTP/WebSocket) for local process agents |
| External A2A | Optional; wired later when HTTP exposure is needed |
| Migration | 6 phases; each phase is independently deployable and rollback-safe |
| Feature flag | `SF_A2A_ENABLED` — gates all new A2A behavior; default OFF until Phase 6 |

---

## 1. Architecture Overview

### 1.1 System Diagram

```
┌──────────────────────────────────────────────────────────────────────┐
│  Coordinator (UOK Kernel or subagent tool)                          │
│  ┌────────────────────────────────────────────────────────────┐   │
│  │  DispatchService                                               │   │
│  │  ├── A2AClient (send/receive)                               │   │
│  │  ├── AgentRegistry (capability lookup)                        │   │
│  │  └── AgentCard (self-description)                            │   │
│  └────────────────────────────────────────────────────────────┘   │
└───────────────────────────┬──────────────────────────────────────────┘
                            │ A2AMessageService (wraps MessageBus)
                            │ bus.send(), bus.broadcast(), bus.sendOnce()
                            ▼
┌──────────────────────────────────────────────────────────────────────┐
│  MessageBus (SQLite-backed, existing)                                  │
│  ├── Durable at-least-once delivery                                  │
│  ├── TTL-based auto-compaction                                      │
│  ├── AgentInbox per agent (per-queue)                              │
│  └── sendOnce for idempotent delivery                               │
└───────────────────────────┬──────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────────────┐
│  Worker Agents (git worktrees, one per milestone/slice)                 │
│  ├── AgentCard (role: worker, isolation: full)                        │
│  ├── AgentInbox subscription                                     │
│  ├── Project SQLite WAL (read/write)                               │
│  └── Emits: task_updated, cost, heartbeat                          │
└──────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────┐
│  Constrained Subagents (no project DB)                               │
│  ├── AgentCard (role: subagent, isolation: constrained)              │
│  ├── Limited tool scope (4 tools)                                   │
│  ├── AgentInbox (optional, opt-in via useMessageBus)                │
│  └── Returns structured output via A2A message                      │
└──────────────────────────────────────────────────────────────────────┘
```

### 1.2 A2A Semantic Mapping

| SF Concept | A2A Concept |
|---|---|
| milestone / slice / task | A2A Task (`id`, `status`, `metadata`) |
| UOK Kernel | A2A Client + Coordinator Agent |
| Worker (parallel orchestrator) | A2A Agent |
| MessageBus.send() | A2A MessageService.send() |
| MessageBus.sendOnce() | A2A idempotent delivery |
| MessageBus.broadcast() | A2A MessageService.broadcast() |
| AgentInbox per worker | A2A per-agent subscription queue |
| File-based status files | A2A AgentStatus (online/busy/idle/offline/error) |
| adversarial-partner/combatant/architect | A2A Agent with specialized capabilities |
| parallel / debate / chain modes | A2A CommunicationPattern |

---

## 2. A2A Type System

### 2.1 Core Types

```typescript
// File: src/resources/extensions/sf/dispatch/a2a-types.ts

import type {
  AgentCard,
  AgentCapabilities,
  Task,
  TaskStatus,
  Message,
} from "@a2a-js/sdk";

/**
 * A2A Task state — maps directly from SF unit runtime status.
 * These are the ONLY authoritative task states.
 */
export const A2A_TASK_STATES = [
  "submitted",
  "working",
  "completed",
  "failed",
  "cancelled",
] as const;
export type A2ATaskState = (typeof A2A_TASK_STATES)[number];

/**
 * SF-specific task extensions — runtime states that A2A doesn't model.
 * These live in task.metadata.sf_state and are NOT authoritative.
 * DB is the authority for these.
 */
export const SF_TASK_EXTENSIONS = [
  "verifying",
  "reviewing",
  "blocked",
  "paused",
  "retrying",
  "pending_input",
] as const;
export type SFTaskExtension = (typeof SF_TASK_EXTENSIONS)[number];

/**
 * Message priority levels — determines delivery urgency and retry budget.
 */
export const MESSAGE_PRIORITIES = ["low", "normal", "high", "urgent"] as const;
export type MessagePriority = (typeof MESSAGE_PRIORITIES)[number];

/**
 * Dispatch mode → A2A CommunicationPattern mapping.
 */
export const DISPATCH_TO_PATTERN: Record<string, string> = {
  single: "request_response",
  parallel: "notification",
  debate: "streaming",
  chain: "request_response",
};

/**
 * SF-specific capability extensions on top of A2A AgentCapabilities.
 */
export interface SFAgentCapabilities extends AgentCapabilities {
  /** Domain role */
  role: "coordinator" | "worker" | "subagent" | "reviewer" | "adversary" | "architect" | "researcher";
  /** Isolation level — determines DB access */
  isolation: "full" | "constrained";
  /** For constrained agents — which tools are permitted */
  toolScope?: Array<"file_read" | "file_write" | "execute" | "query" | "memory_read" | "memory_write">;
  /** Model tier for cost and routing decisions */
  modelTier: "primary" | "validation" | "worker";
  /** Domain specializations */
  specializations?: Array<
    | "milestone_planning"
    | "slice_planning"
    | "code_review"
    | "security_review"
    | "adversarial_review"
    | "architecture_analysis"
    | "research"
    | "verification"
  >;
}

/**
 * SF AgentCard — extends A2A AgentCard with SF-specific capabilities.
 * Published by each agent on startup; cached in AgentRegistry.
 */
export interface SFAgentCard extends AgentCard {
  capabilities: SFAgentCapabilities;
  metadata?: {
    basePath?: string;
    milestoneId?: string;
    sliceId?: string;
    worktreePath?: string;
    pid?: number;
    startedAt?: string;
  };
}

/**
 * SF Task metadata — stored in A2A Task.metadata.
 * sf_state is NOT authoritative — DB is the authority.
 */
export interface SFTaskMetadata {
  scope: "milestone" | "slice" | "task" | "inline";
  milestoneId: string;
  sliceId?: string;
  taskId?: string;
  title: string;
  /** Non-authoritative runtime hint — DB is authority */
  sf_state?: SFTaskExtension;
  /** Base path for DB access */
  basePath: string;
}

/**
 * A2A Message envelope used internally.
 * Wraps MessageBus messages with A2A metadata.
 */
export interface SFA2AMessage {
  id: string;
  type: "message" | "task_submitted" | "task_updated" | "task_completed" | "control" | "error";
  from: string;
  to: string | string[];
  body: Record<string, unknown>;
  priority: MessagePriority;
  sentAt: string;
  deliveredAt?: string;
  correlationId?: string;
  conversationId?: string;
  ttlMs?: number;
  taskId?: string;
  metadata?: Record<string, unknown>;
}
```

---

## 3. Error Handling

### 3.1 Message Delivery Errors

| Error | Detection | Response |
|---|---|---|
| Recipient offline | `AgentRegistry.getStatus() === "offline"` | Buffer message; deliver on reconnect |
| Inbox full (max 1000) | `AgentInbox.unreadCount >= maxInboxSize` | Reject with `TOO_MANY_PENDING`; caller retries with backoff |
| TTL exceeded | `Date.now() - sentAt > ttlMs` | Discard; caller notified via error response |
| DB write conflict | SQLite `SQLITE_BUSY` | Retry with exponential backoff (max 3 attempts, 100ms base) |
| Invalid recipient | `AgentRegistry.getCard(to) === undefined` | Return `AGENT_NOT_FOUND` error; do not retry |

### 3.2 Retry Strategy

```typescript
// File: src/resources/extensions/sf/dispatch/a2a-service.ts

const RETRY_CONFIG = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 5000,
  backoffMultiplier: 2.0,
  jitterFactor: 0.1, // 10% random jitter to prevent thundering herd
} as const;

export class DeliveryError extends Error {
  constructor(
    message: string,
    public readonly code: string,
    public readonly retryable: boolean,
    public readonly attempts: number,
  ) {
    super(message);
    this.name = "DeliveryError";
  }
}

async function sendWithRetry(
  params: SendParams,
  attempt = 1,
): Promise<string> {
  const { from, to, body, metadata = {} } = params;

  try {
    return await doSend(from, to, body, metadata);
  } catch (err) {
    const isRetryable =
      err instanceof DeliveryError && err.retryable && attempt < RETRY_CONFIG.maxAttempts;

    if (!isRetryable) {
      throw err;
    }

    const delay = Math.min(
      RETRY_CONFIG.baseDelayMs * Math.pow(RETRY_CONFIG.backoffMultiplier, attempt - 1),
      RETRY_CONFIG.maxDelayMs,
    );
    const jitter = delay * RETRY_CONFIG.jitterFactor * Math.random();
    await sleep(delay + jitter);

    return sendWithRetry(params, attempt + 1);
  }
}
```

### 3.3 Agent Crash Handling

```
Worker crash detection:
  1. Worker process exits → SIGCHLD handler
  2. Update AgentRegistry status: "offline"
  3. MessageBus retains undelivered messages (TTL not expired)
  4. Coordinator polls AgentRegistry.getStatus() every 30s
  5. On reconnect: worker re-registers AgentCard
  6. Buffered messages delivered to reconnected AgentInbox
  7. Coordinator re-sends any unacknowledged task_updated messages
```

### 3.4 Panic Mode

When `messageService` fails to deliver HIGH/URGENT messages 3 times consecutively:

1. Log `A2A_DELIVERY_PANIC` event to `.sf/journal/`
2. Fall back to file-based signal (`session-status-io.js`)
3. Emit `sf_dispatch_degraded` event
4. Dashboard shows "dispatch degraded" warning
5. Auto-recovery when MessageBus recovers

---

## 4. Backpressure and Flow Control

### 4.1 Per-Agent Inbox Backpressure

```typescript
// File: src/resources/extensions/sf/dispatch/a2a-service.ts

const INBOX_CONFIG = {
  maxInboxSize: 1000,         // Per-agent queue limit
  maxMessageSizeBytes: 64 * 1024, // 64 KB per message body
  highWaterMark: 800,         // Warn when inbox reaches 80%
  overflowAction: "reject",    // "reject" | "drop_oldest"
} as const;

interface SendParams {
  from: string;
  to: string;
  body: Record<string, unknown>;
  metadata?: {
    priority?: MessagePriority;
    ttlMs?: number;
    replyTo?: string;
    taskId?: string;
  };
}

function validateSend(params: SendParams): void {
  const bodySize = JSON.stringify(params.body).length;
  if (bodySize > INBOX_CONFIG.maxMessageSizeBytes) {
    throw new DeliveryError(
      `Message body ${bodySize} bytes exceeds limit ${INBOX_CONFIG.maxMessageSizeBytes}`,
      "MESSAGE_TOO_LARGE",
      false, // Not retryable
      0,
    );
  }

  const inbox = bus.getInbox(params.to);
  if (inbox.unreadCount >= INBOX_CONFIG.maxInboxSize) {
    throw new DeliveryError(
      `Inbox for ${params.to} is full (${inbox.unreadCount}/${INBOX_CONFIG.maxInboxSize})`,
      "INBOX_OVERFLOW",
      true, // Retryable after inbox drains
      0,
    );
  }

  if (inbox.unreadCount >= INBOX_CONFIG.highWaterMark) {
    logWarning("dispatch", `Inbox for ${params.to} at ${inbox.unreadCount}/${INBOX_CONFIG.maxInboxSize}`);
  }
}
```

### 4.2 Coordinator Outbox Backpressure

When the coordinator sends faster than workers can consume:

```typescript
// Coordinator: batch outgoing messages, flush on interval
const outbox = new Map<string, SFA2AMessage[]>();
const FLUSH_INTERVAL_MS = 500;

setInterval(() => {
  for (const [to, messages] of outbox) {
    if (messages.length === 0) continue;
    bus.broadcast(coordinatorId, [to], { batch: messages });
    messages.length = 0; // drain
  }
}, FLUSH_INTERVAL_MS);

// Caller adds to outbox instead of sending immediately
function scheduleSend(params: SendParams): void {
  const queue = outbox.get(params.to) ?? [];
  queue.push(wrapAsA2AMessage(params));
  outbox.set(params.to, queue);
}
```

### 4.3 Memory Budget Per Worker

Each worker has a memory budget for buffering messages it cannot process immediately:

```
MAX_BUFFERED_MESSAGES_PER_WORKER = 100
MAX_BUFFERED_BYTES_PER_WORKER = 10 * 1024 * 1024  // 10 MB
```

If a worker's inbox exceeds either limit, the oldest messages are dropped (not rejected — the sender already moved on).

---

## 5. Observability

### 5.1 Metrics

```typescript
// File: src/resources/extensions/sf/dispatch/metrics.ts

export const A2A_METRICS = {
  // Message throughput
  "sf_a2a_messages_sent_total": {
    type: "counter",
    help: "Total A2A messages sent",
    labels: ["priority", "from_role", "to_role"],
  },
  "sf_a2a_messages_delivered_total": {
    type: "counter",
    help: "Total A2A messages delivered to recipient inbox",
    labels: ["priority", "from_role", "to_role"],
  },
  "sf_a2a_messages_failed_total": {
    type: "counter",
    help: "Total A2A message delivery failures",
    labels: ["priority", "error_code"],
  },
  "sf_a2a_message_delivery_latency_ms": {
    type: "histogram",
    help: "End-to-end message delivery latency (send to inbox receipt)",
    buckets: [10, 50, 100, 500, 1000, 5000],
  },
  "sf_a2a_inbox_size": {
    type: "gauge",
    help: "Current inbox size per agent",
    labels: ["agent_id", "role"],
  },
  "sf_a2a_retry_total": {
    type: "counter",
    help: "Total retry attempts",
    labels: ["priority", "attempt_number"],
  },
  "sf_a2a_agent_status": {
    type: "gauge",
    help: "Agent status (1=online, 0.5=busy, 0.1=idle, 0=offline/error)",
    labels: ["agent_id", "role"],
  },
} as const;
```

### 5.2 Structured Logging

Every A2A operation emits structured log lines:

```typescript
// File: src/resources/extensions/sf/dispatch/logger.ts

type A2ALogEvent =
  | { event: "a2a.send"; from: string; to: string; priority: string; messageId: string; sizeBytes: number }
  | { event: "a2a.delivered"; messageId: string; to: string; latencyMs: number }
  | { event: "a2a.delivery_failed"; messageId: string; error: string; retryable: boolean; attempt: number }
  | { event: "a2a.agent_registered"; agentId: string; role: string; capabilities: string[] }
  | { event: "a2a.agent_offline"; agentId: string; reason: string }
  | { event: "a2a.inbox_overflow"; agentId: string; size: number; action: string }
  | { event: "a2a.panic_mode"; reason: string; fallback_used: boolean };

function logA2A(event: A2ALogEvent): void {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    ...event,
  });
  workflowLogger.log("dispatch", line);
}
```

### 5.3 Trace Context

Propagate trace context through A2A messages for debugging:

```typescript
interface TraceContext {
  traceId: string;   // ULID — unique per dispatch session
  spanId: string;    // Per-message ID
  parentSpanId?: string;
}

function injectTraceContext(msg: SFA2AMessage): SFA2AMessage {
  const spanId = ulid();
  return {
    ...msg,
    metadata: {
      ...msg.metadata,
      trace: {
        traceId: currentTraceId(),
        spanId,
        parentSpanId: currentSpanId(),
      },
    },
  };
}
```

Traces are stored in `.sf/journal/a2a-traces/{date}.jsonl` and queryable via `sf trace <traceId>`.

---

## 6. Security

### 6.1 Agent Authentication

Every A2A message must carry a valid agent identity. Identity is established at agent startup:

```typescript
// File: src/resources/extensions/sf/dispatch/auth.ts

/**
 * Agent identity token — HMAC-SHA256 of agent ID + basePath + startup timestamp.
 * Used to authenticate messages from agents.
 * Generated once at agent startup; stored in process.env.SF_AGENT_TOKEN.
 */
function generateAgentToken(agentId: string, basePath: string): string {
  const secret = process.env.SF_A2A_SHARED_SECRET ?? process.env.SF_DB_KEY ?? "sf-insecure-dev-secret";
  const payload = `${agentId}:${basePath}:${Date.now()}`;
  return createHmac("sha256", secret).update(payload).digest("hex").slice(0, 32);
}

function verifyAgentToken(token: string, agentId: string): boolean {
  // Tokens are single-use (generated per startup, not reusable)
  // Verification is membership check: token must have been issued for this agentId
  return validTokens.has(`${agentId}:${token}`);
}
```

### 6.2 Input Validation

```typescript
// File: src/resources/extensions/sf/dispatch/validation.ts

const MAX_BODY_DEPTH = 20;       // Nested object depth
const MAX_ARRAY_LENGTH = 1000;   // Max array items in body
const MAX_STRING_LENGTH = 100_000; // Max string value length
const FORBIDDEN_KEYS = ["__proto__", "constructor", "prototype"]; // Prototype pollution

function validateMessageBody(body: unknown, depth = 0): void {
  if (depth > MAX_BODY_DEPTH) throw new ValidationError("BODY_TOO_DEEP");
  if (Array.isArray(body)) {
    if (body.length > MAX_ARRAY_LENGTH) throw new ValidationError("ARRAY_TOO_LARGE");
    for (const item of body) validateMessageBody(item, depth + 1);
    return;
  }
  if (typeof body === "object" && body !== null) {
    for (const [k, v] of Object.entries(body)) {
      if (FORBIDDEN_KEYS.includes(k)) throw new ValidationError(`FORBIDDEN_KEY: ${k}`);
      validateMessageBody(v, depth + 1);
    }
    return;
  }
  if (typeof body === "string" && body.length > MAX_STRING_LENGTH) {
    throw new ValidationError("STRING_TOO_LONG");
  }
}
```

### 6.3 Capability Enforcement

The `AgentRegistry` enforces that agents only perform actions consistent with their registered capabilities:

```typescript
// File: src/resources/extensions/sf/dispatch/capability-enforcer.ts

function enforceCapabilities(agentId: string, action: string): void {
  const card = registry.getCard(agentId);
  if (!card) throw new DeliveryError(`Unknown agent: ${agentId}`, "AGENT_NOT_FOUND", false, 0);

  const caps = card.capabilities as SFAgentCapabilities;

  switch (action) {
    case "write_project_db":
      if (caps.isolation !== "full") {
        throw new DeliveryError(
          `${agentId} cannot write project DB (isolation: ${caps.isolation})`,
          "ISOLATION_VIOLATION",
          false,
          0,
        );
      }
      break;
    case "send_to_worker":
      if (caps.role === "subagent") {
        // Constrained subagents can only send to their parent
        throw new DeliveryError("Subagent cannot send to workers", "CAPABILITY_DENIED", false, 0);
      }
      break;
    case "read_project_context":
      // All agents can read project context (it's in the prompt)
      break;
  }
}
```

---

## 7. Testing Strategy

### 7.1 Unit Tests

```typescript
// File: src/resources/extensions/sf/dispatch/a2a-service.test.ts

describe("A2AMessageService", () => {
  let bus: MessageBus;
  let registry: AgentRegistry;
  let service: A2AMessageService;

  beforeEach(() => {
    bus = new MessageBus(tmpDir());
    registry = new AgentRegistry(tmpDir(), bus);
    service = new A2AMessageService(tmpDir(), registry);
  });

  test("send_delivers_to_recipient_inbox", async () => {
    registry.register(workerCard("worker:1"));
    const id = service.send({
      from: "coordinator",
      to: "worker:1",
      body: { type: "task_submitted", taskId: "M01" },
    });
    const inbox = service.getInbox("worker:1");
    const msgs = inbox.list();
    expect(msgs).toHaveLength(1);
    expect(msgs[0].id).toBe(id);
  });

  test("sendWithRetry_retries_on_retryable_error", async () => {
    // Simulate transient DB busy
    vi.spyOn(bus, "send").mockRejectedOnceOnce(new DeliveryError("busy", "DB_BUSY", true, 1));
    vi.spyOn(bus, "send").mockResolvedValueOnce("msg-1");

    const id = await sendWithRetry({ from: "c", to: "w", body: { test: true } });
    expect(id).toBe("msg-1");
    expect(bus.send).toHaveBeenCalledTimes(2);
  });

  test("sendWithRetry_does_not_retry_non_retryable_error", async () => {
    vi.spyOn(bus, "send").mockRejectedValueOnce(
      new DeliveryError("unknown agent", "AGENT_NOT_FOUND", false, 1),
    );
    await expect(sendWithRetry({ from: "c", to: "w", body: { test: true } }))
      .rejects.toThrow("AGENT_NOT_FOUND");
    expect(bus.send).toHaveBeenCalledTimes(1);
  });

  test("sendOnce_same_key_returns_same_id", async () => {
    const id1 = service.sendOnce({ from: "c", to: "w", body: { beat: 1 }, dedupeKey: "heartbeat" });
    const id2 = service.sendOnce({ from: "c", to: "w", body: { beat: 2 }, dedupeKey: "heartbeat" });
    expect(id1).toBe(id2); // Idempotent
  });

  test("validateMessageBody_rejects_deep_objects", () => {
    const deep = { a: { b: { c: { d: { e: {} } } } };
    expect(() => validateMessageBody(deep, 0, MAX_BODY_DEPTH)).toThrow("BODY_TOO_DEEP");
  });

  test("validateMessageBody_rejects_prototype_pollution", () => {
    expect(() => validateMessageBody({ "__proto__": { evil: true } }, 0))
      .toThrow("FORBIDDEN_KEY");
  });
});
```

### 7.2 Integration Tests

```typescript
// File: src/resources/extensions/sf/tests/a2a-integration.test.ts

describe("A2A Integration", () => {
  test("worker_registers_and_receives_task", async () => {
    const { coordinator, worker, service } = setupTwoAgentSystem();

    // Worker starts, registers
    await worker.start();
    await waitFor(() => registry.getStatus("worker:1") === "online");

    // Coordinator sends task
    service.send({
      from: "coordinator",
      to: "worker:1",
      body: { type: "task_submitted", taskId: "M01" },
    });

    // Worker receives
    const msg = await worker.waitForMessage("task_submitted");
    expect(msg.body.taskId).toBe("M01");
  });

  test("worker_crash_does_not_lose_messages", async () => {
    const { coordinator, worker, service } = setupTwoAgentSystem();
    await worker.start();

    service.send({ from: "coordinator", to: "worker:1", body: { type: "task_submitted" } });

    // Worker crashes and restarts
    await worker.kill();
    await worker.start();

    // Message should still be in inbox after restart
    const msg = await worker.waitForMessage("task_submitted");
    expect(msg).toBeDefined();
  });

  test("coordinator_receives_worker_heartbeat", async () => {
    const { coordinator, worker, service } = setupTwoAgentSystem();
    await worker.start();

    worker.sendHeartbeat();

    const msg = await coordinator.waitForMessage("worker.heartbeat");
    expect(msg.from).toBe("worker:1");
  });
});
```

### 7.3 Chaos Tests

```typescript
// File: src/resources/extensions/sf/tests/a2a-chaos.test.ts

describe("A2A Chaos", () => {
  test("messages_delivered_despite_slow_worker", async () => {
    // Worker is slow to process (simulate 10s processing time)
    worker.simulateSlowProcessing(10_000);

    // Send 100 messages while worker is slow
    const sends = Array.from({ length: 100 }, (_, i) =>
      service.send({ from: "c", to: "w", body: { seq: i } }),
    );
    const results = await Promise.allSettled(sends);

    // All succeed (buffered, not rejected)
    expect(results.filter(r => r.status === "fulfilled")).toHaveLength(100);

    // Worker processes all after recovery
    worker.simulateFastProcessing();
    await worker.processAllBuffered();

    const received = await worker.getAllMessages();
    expect(received).toHaveLength(100);
  });

  test("panic_mode_activates_on_repeated_failure", async () => {
    bus.simulatePermanentFailure();

    for (let i = 0; i < 3; i++) {
      try {
        await service.send({ from: "c", to: "w", body: { test: true } });
      } catch {}
    }

    // Panic mode should be active
    expect(service.isPanicMode).toBe(true);
    // File-based fallback should be active
    expect(sessionStatusSignalWasUsed()).toBe(true);
  });
});
```

---

## 8. Rollback Procedures

### 8.1 Feature Flag

All A2A behavior is gated by `SF_A2A_ENABLED`:

```typescript
// File: src/resources/extensions/sf/dispatch/service.ts

const A2A_ENABLED = process.env.SF_A2A_ENABLED === "1";

export class DispatchService {
  private messageService: A2AMessageService | null = null;

  constructor(opts: DispatchOptions) {
    if (A2A_ENABLED) {
      this.messageService = new A2AMessageService(opts.basePath, this.registry);
    }
    // ...
  }

  async pause(workerId: string): Promise<void> {
    if (this.messageService && A2A_ENABLED) {
      await this.messageService.send({
        from: "coordinator",
        to: workerId,
        body: { type: "control", action: "pause" },
        metadata: { priority: "high" },
      });
    } else {
      // Legacy file-based signal
      sendSignal(this.basePath, workerId, "pause");
    }
  }
}
```

### 8.2 Per-Phase Rollback

| Phase | Rollback |
|---|---|
| Phase 1: A2A adapter types | Delete `a2a-types.ts`, `a2a-task.ts`. No behavior change — code not wired yet. |
| Phase 2: AgentRegistry | Delete `capability-registry.ts`. Remove registry from `DispatchService` constructor. No behavior change. |
| Phase 3: MessageBus wiring | Set `SF_A2A_ENABLED=0`. File-based IPC (`sendSignal`) is the automatic fallback. |
| Phase 4: Subagent A2A | Delete `subagent/a2a.ts`. Restore original `subagent/index.js` from git. |
| Phase 5: UOK kernel A2A | Revert `uok/kernel.js` to pre-Phase-5 state from git. |
| Phase 6: Fallback removal | `session-status-io.js` is never removed — it stays as crash-recovery fallback permanently. |

### 8.3 Emergency Rollback

```bash
# Emergency: disable A2A entirely
SF_A2A_ENABLED=0 sf headless autonomous

# Emergency: revert to specific phase
git stash
git checkout phase2-end  # tag or branch at end of Phase 2
SF_A2A_ENABLED=0 sf headless autonomous

# Verify rollback
npx vitest run src/resources/extensions/sf/tests/uok-message-bus.test.mjs
```

---

## 9. Migration Phases (Detailed)

### Phase 1: A2A Type Definitions (Week 1-2)
**Risk: Zero | Behavior: identical**

```
Files created:
  dispatch/a2a-types.ts        — A2A types + SF extensions
  dispatch/a2a-task.ts         — Task creation + state mapping
  dispatch/a2a-errors.ts      — DeliveryError + error codes

Files modified:
  None (types are additive, not wired)
```

**Verification:**
```bash
npx tsc --noEmit src/resources/extensions/sf/dispatch/a2a-types.ts
npx vitest run src/resources/extensions/sf/dispatch/a2a-task.test.ts
```

---

### Phase 2: AgentRegistry (Week 2-3)
**Risk: Low | Behavior: additive**

```
Files created:
  dispatch/capability-registry.ts  — AgentRegistry + SF_CAPABILITY_DEFINITIONS

Files modified:
  dispatch/service.ts             — Add registry to DispatchService (opt-in via feature flag)
  dispatch/index.ts               — Export new types
```

**Verification:**
```bash
npx vitest run src/resources/extensions/sf/dispatch/capability-registry.test.ts
SF_A2A_ENABLED=0 npm run test:unit  # existing tests pass
```

---

### Phase 3: MessageBus Wiring (Week 3-4)
**Risk: Medium | Behavior: pause/resume/stop now use MessageBus**

```
Files created:
  dispatch/a2a-service.ts   — A2AMessageService wrapping MessageBus

Files modified:
  dispatch/service.ts        — Wire MessageBus into pause/resume/stop
  dispatch/worker-*.ts       — Register AgentCard on spawn
  session-status-io.ts        — Mark as crash-recovery fallback (never primary)
```

**Before:** `sendSignal(basePath, id, "pause")` → signal file  
**After:** `messageService.send({ from, to, body: { type: "control", action: "pause" }, priority: HIGH })`  
**Fallback:** File signal if MessageBus delivery fails 3 times

**Verification:**
```bash
SF_A2A_ENABLED=1 npx vitest run src/resources/extensions/sf/tests/a2a-integration.test.ts
SF_A2A_ENABLED=0 npm run test:unit  # existing tests pass
```

---

### Phase 4: Subagent A2A (Week 4-5)
**Risk: Medium | Behavior: subagent modes unchanged**

```
Files modified:
  subagent/index.ts           — Use DispatchService internally
  dispatch/service.ts         — Handle isolation: constrained
```

**Verification:**
```bash
SF_A2A_ENABLED=1 npx vitest run src/resources/extensions/sf/tests/subagent-a2a.test.ts
SF_A2A_ENABLED=0 npm run test:unit  # existing tests pass
```

---

### Phase 5: UOK Kernel A2A (Week 5-6)
**Risk: Medium | Behavior: UOK autonomous loop uses A2A**

```
Files modified:
  uok/kernel.ts               — Use DispatchService + A2AMessageService
  uok/index.ts               — Export new A2A types
```

**Verification:**
```bash
SF_A2A_ENABLED=1 npm run test:integration  # Full integration suite
SF_A2A_ENABLED=0 npm run test:integration  # Legacy still works
```

---

### Phase 6: A2A Default On (Week 6-7)
**Risk: Low | Behavior: A2A is now the default**

```
Actions:
  1. Set SF_A2A_ENABLED=1 as default in preferences
  2. Document in CHANGELOG.md
  3. Monitor for 1 week before declaring stable
```

---

## 10. Operational Runbooks

### 10.1 Dispatch Degraded

**Symptoms:** Dashboard shows "dispatch degraded"; `sf_dispatch_degraded` events in journal

**Diagnosis:**
```bash
# Check MessageBus health
node -e "import('./src/resources/extensions/sf/uok/message-bus.js').then(m => {
  const metrics = m.getUokMessageBusMetrics();
  console.log(JSON.stringify(metrics, null, 2));
}')

# Check for panic mode
cat .sf/journal/*.jsonl | jq 'select(.event == "a2a.panic_mode")' | tail -5
```

**Fix:**
```bash
# Switch to file-based IPC temporarily
SF_A2A_ENABLED=0 sf headless autonomous

# Restart with A2A off
sf headless autonomous

# After fix: re-enable A2A
sf config set SF_A2A_ENABLED=1
```

### 10.2 Worker Not Receiving Messages

**Symptoms:** Worker shows "offline" but process is running

**Diagnosis:**
```bash
# Check worker AgentCard registration
curl -s http://localhost:3030/api/dispatch/agents | jq '.[] | select(.role == "worker")'

# Check worker inbox size
node -e "const m = require('./src/resources/extensions/sf/dispatch/metrics'); m.getInboxMetrics('worker:M01')"

# Check MessageBus delivery latency
cat .sf/journal/*.jsonl | jq 'select(.event == "a2a.delivery_failed")' | tail -20
```

**Fix:**
```bash
# Restart the worker process
sf parallel stop M01
sf parallel start M01

# Or: send SIGUSR1 to worker to re-register its AgentCard
kill -USR1 $(pgrep -f "sf.*M01")
```

### 10.3 Inbox Overflow

**Symptoms:** `"INBOX_OVERFLOW"` errors in logs; workers missing messages

**Diagnosis:**
```bash
# Find overflowing inboxes
node -e "import('./src/resources/extensions/sf/dispatch/metrics').then(m => {
  Object.entries(m.getAllInboxSizes()).forEach(([id, size]) => {
    if (size > 900) console.log(id, size);
  });
})"
```

**Fix:**
```bash
# Compact all message buses (removes messages older than retention)
sf uok messages compact

# Or: increase inbox size limit temporarily
SF_INBOX_MAX_SIZE=5000 sf headless autonomous
```

---

## 11. Performance Targets

| Metric | Target | Critical Threshold |
|---|---|---|
| Message delivery latency (local) | < 50ms p50, < 500ms p99 | > 2000ms |
| Inbox delivery for 100 parallel workers | < 5s end-to-end | > 15s |
| Agent registration time | < 100ms | > 1000ms |
| Message throughput | > 1000 msg/s per coordinator | < 100 msg/s |
| Memory per worker (idle) | < 50 MB | > 200 MB |
| Memory per coordinator (10 workers) | < 200 MB | > 500 MB |
| DB WAL size growth | < 10 MB/day | > 100 MB/day |
| Recovery time after coordinator crash | < 5s | > 30s |

---

## 12. File Manifest

### New Files

| File | Lines (est) | Purpose |
|---|---|---|
| `dispatch/a2a-types.ts` | 120 | Core A2A types + SF extensions |
| `dispatch/a2a-task.ts` | 80 | Task creation + state mapping |
| `dispatch/a2a-errors.ts` | 60 | DeliveryError + error codes |
| `dispatch/a2a-service.ts` | 250 | A2AMessageService wrapping MessageBus |
| `dispatch/capability-registry.ts` | 180 | AgentRegistry + SF_CAPABILITY_DEFINITIONS |
| `dispatch/metrics.ts` | 60 | A2A Prometheus metrics |
| `dispatch/logger.ts` | 40 | A2A structured logging |
| `dispatch/validation.ts` | 70 | Message body validation |
| `dispatch/auth.ts` | 50 | Agent token generation + verification |
| `dispatch/index.ts` | 30 | Barrel exports |
| `dispatch/a2a-service.test.ts` | 200 | Unit tests |
| `tests/a2a-integration.test.ts` | 300 | Integration tests |
| `tests/a2a-chaos.test.ts` | 150 | Chaos tests |
| **Total new** | **~1600 LOC** | |

### Modified Files

| File | Change |
|---|---|
| `dispatch/service.ts` | Add registry + messageService; wire pause/resume/stop |
| `dispatch/worker-orchestrator.ts` | Register AgentCard on spawn; open AgentInbox |
| `uok/kernel.ts` | Register coordinator AgentCard; use DispatchService |
| `uok/message-bus.js` | Add AgentCard types (no behavior change) |
| `uok/index.ts` | Export A2A types |
| `subagent/index.ts` | Use DispatchService; remove ~600 LOC spawn management |
| `session-status-io.ts` | Mark as crash-recovery fallback only |

---

## Summary

| Question | Answer |
|---|---|
| A2A as internal protocol | YES — Task state, priority, capability discovery |
| Transport | SQLite MessageBus (not HTTP/WebSocket) |
| External A2A | Optional; wired later |
| Feature flag | `SF_A2A_ENABLED` gates all behavior |
| Migration | 6 phases; each independently rollback-safe |
| Error handling | Retry with exponential backoff; panic mode with file-based fallback |
| Backpressure | Per-inbox limits; coordinator outbox batching |
| Observability | Prometheus metrics + structured JSONL logging |
| Security | Agent tokens, input validation, capability enforcement |
| Testing | Unit + integration + chaos tests for every phase |
| Rollback | `SF_A2A_ENABLED=0` disables all new behavior instantly |