diff --git a/BUILD_PLAN.md b/BUILD_PLAN.md
new file mode 100644
index 000000000..24c84df3e
--- /dev/null
+++ b/BUILD_PLAN.md
@@ -0,0 +1,168 @@
+# sf v3 Build Plan
+
+A practical cut of the 56 NEW items in `SPEC.md` into tiers. Not every spec item is worth building for v3 — some were polish from late-stage adversarial review iterations and only matter at scale or in deployments we don't have.
+
+This document is the answer to: **what should we actually ship for v3?**
+
+It is opinionated. Each item has a tier and a one-line rationale. Reorder freely.
+
+---
+
+## Tier 1 — ESSENTIAL (block v3 ship)
+
+These resolve real product or correctness gaps. v3 isn't v3 without them.
+
+### 1.1 Vault secret resolver
+**Spec:** § 24, C-38, C-83.
+**What:** `vault://secret/path#field` URI resolver, replacing any plaintext provider keys in current config. Auth chain: `VAULT_TOKEN` → `~/.vault-token` → AppRole.
+**Why essential:** sf is a real tool used against real models with real billing. Plaintext keys in config files are a security regression we should not ship past.
+**Effort:** 1–2 days. `pi-ai` config layer adds a resolver.
+
+### 1.2 Singularity Memory integration decision + execution
+**Spec:** § 16, § 24, C-94, C-95, K-01 through K-06.
+**What:** Decide whether sm replaces sf's existing memory layer, layers on top, or stays absent — then execute. The repo at `singularity-ng/singularity-memory` exists; integrating means replacing or augmenting `memory-store.ts`, `memory-extractor.ts`, `memory-relations.ts`, `tools/memory-tools.ts`, `bootstrap/memory-tools.ts`.
+**Why essential:** the spec leans heavily on sm (anti-patterns, two-bank recall, cross-tool sharing). Either commit to it or rewrite §16 to match what sf actually has.
+**Recommended path:** **keep sf's local memory as a hot cache + use sm as durable cross-tool store**. This is the layered model — sf's local memory becomes the operational fast-path; sm holds long-term cross-session, cross-project, cross-tool memories.
+**Effort:** 1–2 weeks for the integration; 1 day to decide.
+
+### 1.3 Schema reconciliation: `units` vs `milestones`/`slices`/`tasks`
+**Spec:** § 3.1.
+**What:** sf has 3 tables, spec has 1 with a `type` column. Either:
+- **(a)** Migrate sf to single `units` table (data migration; touches many files).
+- **(b)** Update spec to 3-table model (no code change; spec rewrite).
+**Recommended path:** **(b) — keep what sf has.** The 3-table shape is more granular and integrates with `decisions`, `requirements`, `artifacts`, `assessments`, `replan_history` which have rich schemas of their own. Forcing them into one `units` table loses information.
+**Effort:** 2–3 days for spec rewrite, 0 days code.
+
+### 1.4 Config schema alignment
+**Spec:** § 14.2, C-25, C-26, C-73.
+**What:** `config-overlay.ts` exposes whatever keys sf has today. Spec specifies `context_compact_at`, `context_hard_limit`, `unit_timeout`, `unit_timeout_by_phase`, `max_agents_by_phase`, `turn_input_required`, `worktree_mode`, `tool_abort_grace`, `max_turns_per_attempt`, `hot_cache_turns`, etc. Add missing keys with defaults; document each.
+**Why essential:** users can't tune behavior they can't configure. Spec promises configurability that doesn't exist yet.
+**Effort:** 3–5 days. Add keys, plumb through, write doctor checks.
+
+---
+
+## Tier 2 — STRONG (ship with v3 if possible, otherwise v3.1)
+
+Real value-add. Deferring is allowed but disappointing.
+
+### 2.1 Persistent agents v1 (basic, no messaging)
+**Spec:** § 17, A-01, A-02, A-03, A-04, A-09, A-10. **Defer:** A-05, A-06, A-07, A-08 (messaging) to v3.1.
+**What:** named agents with their own memory blocks, system prompt, message history, durable across sessions. `core_memory_append` / `core_memory_replace` tools. `/sf agent run|reset|delete|inspect` commands.
+**Why strong:** the persistent-agent pattern was the main draw from Letta and a recurring user interest throughout this spec process. Shipping basic persistent agents in v3 unlocks the architecture; messaging can come in v3.1.
+**Effort:** 2 weeks for basic; +1–2 weeks for messaging.
+
+### 2.2 Doc-sync sub-step
+**Spec:** § 10.5, C-20, C-45, C-68.
+**What:** at the end of the last code-mutating phase (Merge or, for spike workflows, Execute), run a `fast`-tier dispatch to check whether `ARCHITECTURE.md`/`CONVENTIONS.md`/`STACK.md` need updates and propose a diff for user approval.
+**Why strong:** project docs rotting is the most predictable failure mode of long autopilot runs. Catching it costs ~5 minutes per merge.
+**Effort:** 3–5 days.
+
+### 2.3 Intent chapters
+**Spec:** § 19.4, C-34.
+**What:** spans grouped into named "what was the agent trying to do" chapters. Inferred from phase transitions or agent-declared via `chapter_open(name)`. Used for crash-resume context and Hindsight recall.
+**Why strong:** crash-resume reconstruction is currently weak. Chapters give the resumed agent a coherent "what was I doing" header instead of replaying raw tool calls.
+**Effort:** 1 week.
+
+### 2.4 PhaseReview 3-pass review
+**Spec:** § 13.3, C-39, C-63.
+**What:** establish-context pass (single fast dispatch) → parallel chunked review (per-file, ≤300 lines each, standard tier) → synthesis pass.
+**Why strong:** the current single-pass review on large diffs is known to gloss the tail. The 3-pass shape catches more.
+**Effort:** 1 week.
+
+### 2.5 `turn_status` marker
+**Spec:** § 5.4.1, C-81.
+**What:** parse `complete|blocked|giving_up` from end of agent output. `blocked` triggers `SignalPause`; `giving_up` transitions to `PhaseReassess` immediately.
+**Why strong:** a per-turn semantic checkpoint between transport-success and phase-boundary. Currently the harness has no way to know "the agent thinks it's stuck" except by waiting for stuck-loop timeout.
+**Effort:** 2–3 days.
+
+### 2.6 `last_error` cap
+**Spec:** § 7.3, C-74.
+**What:** truncate `last_error` to 4 KB head+tail; full payload to `.sf/active/{unit-id}/last-error-full.txt`. Agent reads the file if needed.
+**Why strong:** lint output / traceback dumps can blow the prompt. Current behavior is "inject and pray."
+**Effort:** 1 day.
+
+### 2.7 Cost stored as integer micro-USD
+**Spec:** C-69.
+**What:** rename `cost_usd REAL` → `cost_micro_usd INTEGER` in `runs`, `benchmark_results`. Float drift on accumulated costs is real over thousands of runs.
+**Why strong:** small change, real correctness improvement, easier reasoning about totals.
+**Effort:** 1 day with the migration.
+
+---
+
+## Tier 3 — NICE (v3.1 or v3.2)
+
+Worth building, just not blocking. Ship after Tier 2 if calendar allows.
+
+| Item | Spec | One-line |
+|---|---|---|
+| Inter-agent messaging | § 18, A-05..A-08 | send_message + inbox + wait_for_reply + handoff. Builds on Tier 2.1 persistent agents. ~1–2 weeks. |
+| Workflow content pinning | § 4.5, C-71 | SHA-256 hash of template content stored per unit; in-flight units use pinned content. Defends against operator editing the template mid-run. ~3 days. |
+| Trace `_meta` record | § 19.3, C-79 | First line of each daily JSONL is a schema-version record. Forward-compatible. ~1 day. |
+| `runs` table | § 3.1, C-48, C-49, C-59 | Unifies unit_attempt and agent_run history. sf has `audit_events` already; either repurpose or add a new view. Decision required. ~1 week. |
+| `pending_retain` queue | § 16.1, C-51 | sm retain failures queue locally and retry with backoff. Required if and only if sm is integrated (Tier 1.2). |
+| Capability-tag handoff | § 18.4, C-82, C-90 | `handoff("capability:go,testing", ...)` resolves to any matching agent. Adds `agent_capabilities` index. Builds on Tier 2.1 + Tier 3 inter-agent messaging. ~3 days. |
+| `agent_run` budget + termination | § 17.5, C-54, C-65 | When does an agent run end? (inbox drained / explicit stop / budget hard-limit / supervisor signal / timeout). Compaction preserves wake message. ~1 week. |
+
+---
+
+## Tier 4 — DEFER (only if a deployment actually demands it)
+
+Spec sections that landed during late-stage adversarial review and only matter at scale or in specific deployments.
+
+| Item | Spec | Why deferred |
+|---|---|---|
+| SSH worker extension | § 22, C-64, C-75, E-02 | Real for fleet deployments (bunker, inference-fabric scaling). Not real for daily-driver development. Build when a user actually needs to dispatch to a remote box. |
+| HTTP API auth | § 19.5, C-77 | Only needed if the HTTP API ships. The MCP server (`packages/mcp-server`) is the more likely remote interface. |
+| `trace_index` SQL | § 19.3.1, C-80 | Forensics over JSONL is fine until grep gets slow. Build the index when you have months of trace files, not before. |
+| PhaseUAT | § 4.6, C-53, C-76 | Only matters for "release" workflows where humans sign off before merge. Add when needed. |
+| Multi-orchestrator atomic claim | C-47 | The single-process `run.lock` is sufficient. The atomic UPDATE pattern matters when two orchestrators race against the same DB; sf doesn't deploy that way today. |
+| `specs.check` JSDoc CI | C-37 | Useful but not blocking. Add when JSDoc rot becomes a real issue. |
+
+---
+
+## Tier 5 — DROP from spec
+
+These crept in during adversarial review iterations and don't earn their keep.
+
+| Item | Spec | Why drop |
+|---|---|---|
+| `cost_per_1k_micro_usd` field type rename | C-69 (partial) | If we accept `cost_micro_usd` for runs (Tier 2.7), the `benchmark_results.cost_per_1k_micro_usd` rename is internally consistent — but the user-facing pricing model that benchmark uses already varies per provider; the integer-micro-USD constraint there is over-engineered. Keep `REAL` for benchmark, integer for runs. |
+| `runs` snap_ columns (`unit_id_snap`, `agent_name_snap`) | C-59 | If we use soft-delete (`archived_at`) and never hard-delete, snapshots are unnecessary. Drop the columns. |
+| `workflow_pins` content snapshot table | C-71 | If we just hash the file at first dispatch and store the hash on the unit (`units.workflow_hash`), we don't need a separate pins table. The hash is enough; the content can be re-read from disk. Simplify. |
+| `agent_capabilities` separate indexed table | C-90 | At fleet sizes <100 agents, the JSON-array-LIKE scan is fine. Add the index when you have a measurement showing it's slow. |
+
+---
+
+## Suggested v3 milestone breakdown
+
+**v3.0 — ship target: ~6–8 weeks**
+
+- Tier 1.1 Vault (1–2d)
+- Tier 1.2 sm integration, layered model (2 weeks)
+- Tier 1.3 spec schema rewrite to 3-table (3d)
+- Tier 1.4 config alignment (1 week)
+- Tier 2.2 doc-sync (1 week)
+- Tier 2.5 turn_status marker (3d)
+- Tier 2.6 last_error cap (1d)
+- Tier 2.7 cost_micro_usd (1d)
+
+That's **~5 weeks of work** for the must-haves.
+
+**v3.1 — ~4 weeks after v3.0**
+
+- Tier 2.1 persistent agents v1 (2 weeks)
+- Tier 2.3 intent chapters (1 week)
+- Tier 2.4 PhaseReview 3-pass (1 week)
+
+**v3.2 — when ready**
+
+- Tier 3 items as appetite allows.
+
+---
+
+## Decisions needed before starting v3.0
+
+1. **sm: replace, layer, or leave out?** Recommended: layer (sf local cache + sm durable).
+2. **Schema: migrate to single `units` or update spec to 3-table?** Recommended: update spec.
+3. **Persistent agents in v3.0 or v3.1?** Recommended: v3.1 — too much new surface to land alongside Tier 1 + 2.
+4. **Does any deployment actually need SSH workers in v3.x?** If not, drop §22 from spec entirely; re-add when needed.
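
Note on item 2.7 (cost stored as integer micro-USD): the float-drift claim can be illustrated with a minimal TypeScript sketch. The helper names `toMicroUsd`, `sumMicroUsd`, and `formatMicroUsd` are hypothetical and not taken from sf's codebase; the sketch only assumes that costs arrive as USD floats and are converted once, at the boundary, before being stored in an `INTEGER` column.

```typescript
// Hypothetical helpers illustrating item 2.7; names are assumptions, not sf's API.
const MICRO_PER_USD = 1000000;

/** Convert a USD amount to integer micro-USD, rounding to the nearest unit. */
function toMicroUsd(usd: number): number {
  if (!isFinite(usd) || usd < 0) {
    throw new RangeError("invalid USD amount: " + usd);
  }
  return Math.round(usd * MICRO_PER_USD);
}

/** Sum integer micro-USD costs; exact while the total stays below 2^53. */
function sumMicroUsd(costs: number[]): number {
  let total = 0;
  for (const c of costs) {
    if (Math.floor(c) !== c) {
      throw new TypeError("cost must be an integer number of micro-USD");
    }
    total += c;
  }
  return total;
}

/** Render micro-USD as a USD string for display only; storage stays integer. */
function formatMicroUsd(micro: number): string {
  return (micro / MICRO_PER_USD).toFixed(6);
}

// Ten runs at $0.10 each, accumulated both ways.
let floatTotal = 0;
const microCosts: number[] = [];
for (let i = 0; i < 10; i++) {
  floatTotal += 0.1;
  microCosts.push(toMicroUsd(0.1));
}

console.log(floatTotal);                              // 0.9999999999999999
console.log(sumMicroUsd(microCosts));                 // 1000000
console.log(formatMicroUsd(sumMicroUsd(microCosts))); // 1.000000
```

The float total is already off after ten additions, while the integer total is exact. Converting once per run and summing integers keeps accumulated totals reproducible regardless of summation order.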