singularity-forge/BUILD_PLAN.md
Mikael Hugo 7a6169705a docs: lock in fork stance, reframe cherry-pick list as reference-only
After attempting cluster B (4 surgical agent-session fixes), even the
first commit conflicted because of structural namespace divergence
(gsd_*→sf_* rename, @sf-run/*→@singularity-forge/* rename, prior
pi-mono direct cherry-picks). The conflicts are real semantic
divergence, not noise.

Conclusion: sf is a fork; we do not periodically sync from
gsd-build/gsd-2. Pretending we still track upstream means weeks of
merge work for diminishing return.

BUILD_PLAN.md adds an explicit "Upstream stance" section documenting
the fork posture and the rationale for the three irreversible naming
choices.

UPSTREAM_CHERRY_PICK_CANDIDATES.md is reframed as a reference list,
not an action plan. The clusters and SHAs remain useful as an
intelligence source — port specific fixes by hand when one bites us;
do not run automated cherry-picks against the list.

Pi-mono SDK syncs continue separately — that path doesn't have the
same divergence problem.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 12:57:44 +02:00

188 lines
12 KiB
Markdown

# sf v3 Build Plan
A practical cut of the 56 NEW items in `SPEC.md` into tiers. Not every spec item is worth building for v3 — some were polish from late-stage adversarial review iterations and only matter at scale or in deployments we don't have.
This document is the answer to: **what should we actually ship for v3?**
It is opinionated. Each item has a tier and a one-line rationale. Reorder freely.
---
## Upstream stance
**sf is a fork.** We do not periodically sync from `gsd-build/gsd-2`.
We tried (see attempt log in `UPSTREAM_CHERRY_PICK_CANDIDATES.md`). The conflicts run deep because of three structural choices that are intentional and won't be reverted:
- We renamed `gsd_*` tool names → `sf_*` (`421fccd89`).
- We renamed `@sf-run/*` → `@singularity-forge/*` package scope (`f92ee8d64`).
- We've cherry-picked tool fixes from `pi-mono` upstream directly (`f153521c2`), which address some bugs that `gsd-2` fixed differently.
Pretending we still track gsd-2 means weeks of merge work for diminishing return. Better to:
- **Treat `gsd-build/gsd-2` upstream as an intelligence source.** We read it. We hand-port fixes when one specifically bites us. `UPSTREAM_CHERRY_PICK_CANDIDATES.md` is a reference list of what's available, not an action plan.
- **Pull from `pi-mono` directly for SDK improvements.** We've already been doing this; continue.
- **Track our own roadmap** via `SPEC.md` and this file.
If a specific upstream fix matters (e.g. a CVE, a bug we hit), port it manually and credit upstream in the commit message. Don't try to sync the whole tree.
---
## Tier 1 — ESSENTIAL (block v3 ship)
These resolve real product or correctness gaps. v3 isn't v3 without them.
### 1.1 Vault secret resolver
**Spec:** § 24, C-38, C-83.
**What:** `vault://secret/path#field` URI resolver, replacing any plaintext provider keys in current config. Auth chain: `VAULT_TOKEN` → `~/.vault-token` → AppRole.
**Why essential:** sf is a real tool used against real models with real billing. Plaintext keys in config files are a security regression we should not ship past.
**Effort:** 1-2 days. `pi-ai` config layer adds a resolver.
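
A minimal sketch of the parsing and auth-ordering half of such a resolver. Only the URI shape and the order of the auth chain come from this plan; `parseVaultRef`, `pickAuthMethod`, and the `VaultRef` shape are hypothetical names, and no real Vault client is called.

```typescript
// Hypothetical helper names; only the URI shape and auth order come from the plan.
interface VaultRef {
  path: string;  // e.g. "secret/path"
  field: string; // e.g. "field"
}

// Returns null for anything that is not a well-formed vault:// reference,
// so the caller can decide whether a plain value is acceptable.
function parseVaultRef(uri: string): VaultRef | null {
  const prefix = "vault://";
  if (!uri.startsWith(prefix)) return null;
  const rest = uri.slice(prefix.length);
  const hash = rest.indexOf("#");
  // Require a non-empty path before "#" and a non-empty field after it.
  if (hash <= 0 || hash === rest.length - 1) return null;
  return { path: rest.slice(0, hash), field: rest.slice(hash + 1) };
}

type AuthMethod = "env-token" | "token-file" | "approle";

// The inputs describe what is available; reading the environment or the
// token file is left to the caller.
function pickAuthMethod(available: { envToken: boolean; tokenFile: boolean }): AuthMethod {
  if (available.envToken) return "env-token";   // VAULT_TOKEN
  if (available.tokenFile) return "token-file"; // ~/.vault-token
  return "approle";                             // last step in the chain
}
```

Returning `null` for non-vault strings lets a doctor check flag plaintext keys instead of silently accepting them.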
### 1.2 Singularity Memory integration decision + execution
**Spec:** § 16, § 24, C-94, C-95, K-01 through K-06.
**What:** Decide whether sm replaces sf's existing memory layer, layers on top, or stays absent — then execute. The repo at `singularity-ng/singularity-memory` exists; integrating means replacing or augmenting `memory-store.ts`, `memory-extractor.ts`, `memory-relations.ts`, `tools/memory-tools.ts`, `bootstrap/memory-tools.ts`.
**Why essential:** the spec leans heavily on sm (anti-patterns, two-bank recall, cross-tool sharing). Either commit to it or rewrite §16 to match what sf actually has.
**Recommended path:** **keep sf's local memory as a hot cache + use sm as durable cross-tool store**. This is the layered model — sf's local memory becomes the operational fast-path; sm holds long-term cross-session, cross-project, cross-tool memories.
**Effort:** 1-2 weeks for the integration; 1 day to decide.
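
The layered model can be sketched as a read-through, write-through cache. This is an illustration under assumptions: `DurableStore` stands in for whatever client sm exposes (its real API is not described here), and `LayeredMemory` is a hypothetical name, not an existing sf module.

```typescript
// Hypothetical interface: stands in for the sm client, whose real API is not specified in this plan.
interface DurableStore {
  get(key: string): string | undefined;
  put(key: string, value: string): void;
}

class LayeredMemory {
  // Local hot cache: the operational fast path.
  private hot = new Map<string, string>();
  private durable: DurableStore;

  constructor(durable: DurableStore) {
    this.durable = durable;
  }

  // Read-through: serve from the hot cache, fall back to the durable store,
  // and remember the result locally for the next call.
  recall(key: string): string | undefined {
    const cached = this.hot.get(key);
    if (cached !== undefined) return cached;
    const value = this.durable.get(key);
    if (value !== undefined) this.hot.set(key, value);
    return value;
  }

  // Write-through: update both layers so other tools can see the memory.
  retain(key: string, value: string): void {
    this.hot.set(key, value);
    this.durable.put(key, value);
  }
}
```

A real integration would also need eviction and failure handling for the durable write; both are omitted here.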
### 1.3 Schema reconciliation: `units` vs `milestones`/`slices`/`tasks`
**Spec:** § 3.1.
**What:** sf has 3 tables, spec has 1 with a `type` column. Either:
- **(a)** Migrate sf to single `units` table (data migration; touches many files).
- **(b)** Update spec to 3-table model (no code change; spec rewrite).
**Recommended path:** **(b) — keep what sf has.** The 3-table shape is more granular and integrates with `decisions`, `requirements`, `artifacts`, `assessments`, `replan_history` which have rich schemas of their own. Forcing them into one `units` table loses information.
**Effort:** 2-3 days for spec rewrite, 0 days code.
### 1.4 Config schema alignment
**Spec:** § 14.2, C-25, C-26, C-73.
**What:** `config-overlay.ts` exposes whatever keys sf has today. Spec specifies `context_compact_at`, `context_hard_limit`, `unit_timeout`, `unit_timeout_by_phase`, `max_agents_by_phase`, `turn_input_required`, `worktree_mode`, `tool_abort_grace`, `max_turns_per_attempt`, `hot_cache_turns`, etc. Add missing keys with defaults; document each.
**Why essential:** users can't tune behavior they can't configure. Spec promises configurability that doesn't exist yet.
**Effort:** 3-5 days. Add keys, plumb through, write doctor checks.
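
One possible shape for "add missing keys with defaults", shown for three of the keys listed above. The key names come from the spec list; every type and default value below is a placeholder assumption, and `resolveConfig` is a hypothetical function, not the current `config-overlay.ts` API.

```typescript
// Key names are from the plan; all types and default values are placeholder assumptions.
interface SfConfig {
  unit_timeout: number;
  max_turns_per_attempt: number;
  hot_cache_turns: number;
}

const DEFAULTS: SfConfig = {
  unit_timeout: 600,         // placeholder
  max_turns_per_attempt: 20, // placeholder
  hot_cache_turns: 5,        // placeholder
};

interface ResolvedConfig {
  config: SfConfig;
  rejectedKeys: string[]; // unknown or mistyped keys, for a doctor check to report
}

function resolveConfig(overlay: Record<string, unknown>): ResolvedConfig {
  const config: SfConfig = { ...DEFAULTS };
  const rejectedKeys: string[] = [];
  for (const key of Object.keys(overlay)) {
    const value = overlay[key];
    const known = Object.prototype.hasOwnProperty.call(DEFAULTS, key);
    if (known && typeof value === "number") {
      config[key as keyof SfConfig] = value; // user value overrides the default
    } else {
      rejectedKeys.push(key);
    }
  }
  return { config, rejectedKeys };
}
```

Collecting rejected keys rather than throwing keeps startup tolerant while still giving the doctor check something concrete to surface.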
---
## Tier 2 — STRONG (ship with v3 if possible, otherwise v3.1)
Real value-add. Deferring is allowed but disappointing.
### 2.1 Persistent agents v1 (basic, no messaging)
**Spec:** § 17, A-01, A-02, A-03, A-04, A-09, A-10. **Defer:** A-05, A-06, A-07, A-08 (messaging) to v3.1.
**What:** named agents with their own memory blocks, system prompt, message history, durable across sessions. `core_memory_append` / `core_memory_replace` tools. `/sf agent run|reset|delete|inspect` commands.
**Why strong:** the persistent-agent pattern was the main draw from Letta and a recurring user interest throughout this spec process. Shipping basic persistent agents in v3 unlocks the architecture; messaging can come in v3.1.
**Effort:** 2 weeks for basic; +1-2 weeks for messaging.
### 2.2 Doc-sync sub-step
**Spec:** § 10.5, C-20, C-45, C-68.
**What:** at the end of the last code-mutating phase (Merge or, for spike workflows, Execute), run a `fast`-tier dispatch to check whether `ARCHITECTURE.md`/`CONVENTIONS.md`/`STACK.md` need updates and propose a diff for user approval.
**Why strong:** project docs rotting is the most predictable failure mode of long autopilot runs. Catching it costs ~5 minutes per merge.
**Effort:** 3-5 days.
### 2.3 Intent chapters
**Spec:** § 19.4, C-34.
**What:** spans grouped into named "what was the agent trying to do" chapters. Inferred from phase transitions or agent-declared via `chapter_open(name)`. Used for crash-resume context and Hindsight recall.
**Why strong:** crash-resume reconstruction is currently weak. Chapters give the resumed agent a coherent "what was I doing" header instead of replaying raw tool calls.
**Effort:** 1 week.
### 2.4 PhaseReview 3-pass review
**Spec:** § 13.3, C-39, C-63.
**What:** establish-context pass (single fast dispatch) → parallel chunked review (per-file, ≤300 lines each, standard tier) → synthesis pass.
**Why strong:** the current single-pass review on large diffs is known to gloss the tail. The 3-pass shape catches more.
**Effort:** 1 week.
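
The chunking step of the middle pass could look roughly like this, reading "≤300 lines each" as a cap on chunk size within a file. `chunkFile` and `ReviewChunk` are hypothetical names; dispatching the chunks in parallel and the synthesis pass are not shown.

```typescript
// Hypothetical names; only the per-file, <=300-line chunk shape comes from the plan.
interface ReviewChunk {
  file: string;
  startLine: number; // 1-based line number of the first line in the chunk
  lines: string[];
}

function chunkFile(file: string, content: string, maxLines: number = 300): ReviewChunk[] {
  const lines = content.split("\n");
  const chunks: ReviewChunk[] = [];
  // Step through the file in fixed-size windows; the last chunk may be shorter.
  for (let i = 0; i < lines.length; i += maxLines) {
    chunks.push({ file, startLine: i + 1, lines: lines.slice(i, i + maxLines) });
  }
  return chunks;
}
```

Carrying `startLine` on each chunk lets the synthesis pass map findings back to positions in the original file.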
### 2.5 `turn_status` marker
**Spec:** § 5.4.1, C-81.
**What:** parse `<turn_status>complete|blocked|giving_up</turn_status>` from end of agent output. `blocked` triggers `SignalPause`; `giving_up` transitions to `PhaseReassess` immediately.
**Why strong:** a per-turn semantic checkpoint between transport-success and phase-boundary. Currently the harness has no way to know "the agent thinks it's stuck" except by waiting for stuck-loop timeout.
**Effort:** 2-3 days.
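
A sketch of the marker parse and the resulting transition. The tag format and the two transitions come from the text above; the function names are hypothetical, and treating a missing or malformed marker as "continue" is an assumption this plan does not settle.

```typescript
type TurnStatus = "complete" | "blocked" | "giving_up";

// Only a marker at the very end of the output counts (trailing whitespace allowed).
function parseTurnStatus(output: string): TurnStatus | null {
  const match = /<turn_status>(complete|blocked|giving_up)<\/turn_status>\s*$/.exec(output);
  if (match === null) return null;
  return match[1] as TurnStatus;
}

type NextAction = "continue" | "SignalPause" | "PhaseReassess";

function nextAction(status: TurnStatus | null): NextAction {
  switch (status) {
    case "blocked":
      return "SignalPause";
    case "giving_up":
      return "PhaseReassess";
    default:
      // "complete" or no marker: assumed to mean the normal flow continues.
      return "continue";
  }
}
```

Anchoring the regex at the end of the output avoids acting on a marker the agent merely quoted earlier in its reply.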
### 2.6 `last_error` cap
**Spec:** § 7.3, C-74.
**What:** truncate `last_error` to 4 KB head+tail; full payload to `.sf/active/{unit-id}/last-error-full.txt`. Agent reads the file if needed.
**Why strong:** lint output / traceback dumps can blow the prompt. Current behaviour is "inject and pray."
**Effort:** 1 day.
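
Head+tail truncation can be sketched as below. `capLastError` and the marker text are hypothetical, and the limit is counted in UTF-16 code units rather than bytes for simplicity; writing the full payload to `last-error-full.txt` is left to the caller.

```typescript
// Hypothetical helper; counts string length (UTF-16 code units), not bytes.
function capLastError(text: string, limit: number = 4096): string {
  if (text.length <= limit) return text;
  const marker = "\n...[truncated]...\n";
  const keep = limit - marker.length;
  const head = Math.ceil(keep / 2); // first part of the error (usually the command and first failure)
  const tail = keep - head;         // last part (usually the final exception or summary)
  return text.slice(0, head) + marker + text.slice(text.length - tail);
}
```

Keeping both ends matters because tracebacks tend to put the cause at the bottom while lint output puts the first failure at the top.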
### 2.7 Cost stored as integer micro-USD
**Spec:** C-69.
**What:** rename `cost_usd REAL` → `cost_micro_usd INTEGER` in `runs`, `benchmark_results`. Float drift on accumulated costs is real over thousands of runs.
**Why strong:** small change, real correctness improvement, easier reasoning about totals.
**Effort:** 1 day with the migration.
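
A small illustration of the drift argument: summing ten costs of $0.10 as floats does not give exactly 1, while the same sum in integer micro-USD is exact. `usdToMicro`, `microToUsd`, and `sumTenDimes` are hypothetical helpers for this demonstration only.

```typescript
// Hypothetical conversion helpers; 1 USD = 1,000,000 micro-USD.
function usdToMicro(usd: number): number {
  return Math.round(usd * 1e6);
}

function microToUsd(micro: number): number {
  return micro / 1e6;
}

// Accumulate the same ten costs both ways.
function sumTenDimes(): { floatTotal: number; microTotal: number } {
  let floatTotal = 0;
  let microTotal = 0;
  for (let i = 0; i < 10; i++) {
    floatTotal += 0.1;             // binary floating point cannot represent 0.1 exactly
    microTotal += usdToMicro(0.1); // 100000 each time, exact integer addition
  }
  return { floatTotal, microTotal };
}

const totals = sumTenDimes();
console.log(totals.floatTotal === 1);       // false
console.log(totals.microTotal === 1000000); // true
```

Converting back with `microToUsd` only at display time keeps every stored and summed value an exact integer.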
---
## Tier 3 — NICE (v3.1 or v3.2)
Worth building, just not blocking. Ship after Tier 2 if calendar allows.
| Item | Spec | One-line |
|---|---|---|
| Inter-agent messaging | § 18, A-05..A-08 | send_message + inbox + wait_for_reply + handoff. Builds on Tier 2.1 persistent agents. ~1-2 weeks. |
| Workflow content pinning | § 4.5, C-71 | SHA-256 hash of template content stored per unit; in-flight units use pinned content. Defends against operator editing the template mid-run. ~3 days. |
| Trace `_meta` record | § 19.3, C-79 | First line of each daily JSONL is a schema-version record. Forward-compatible. ~1 day. |
| `runs` table | § 3.1, C-48, C-49, C-59 | Unifies unit_attempt and agent_run history. sf has `audit_events` already; either repurpose or add a new view. Decision required. ~1 week. |
| `pending_retain` queue | § 16.1, C-51 | sm retain failures queue locally and retry with backoff. Required if and only if sm is integrated (Tier 1.2). |
| Capability-tag handoff | § 18.4, C-82, C-90 | `handoff("capability:go,testing", ...)` resolves to any matching agent. Adds `agent_capabilities` index. Builds on Tier 2.1 + Tier 3 inter-agent messaging. ~3 days. |
| `agent_run` budget + termination | § 17.5, C-54, C-65 | When does an agent run end? (inbox drained / explicit stop / budget hard-limit / supervisor signal / timeout). Compaction preserves wake message. ~1 week. |
---
## Tier 4 — DEFER (only if a deployment actually demands it)
Spec sections that landed during late-stage adversarial review and only matter at scale or in specific deployments.
| Item | Spec | Why deferred |
|---|---|---|
| SSH worker extension | § 22, C-64, C-75, E-02 | Real for fleet deployments (bunker, inference-fabric scaling). Not real for daily-driver development. Build when a user actually needs to dispatch to a remote box. |
| HTTP API auth | § 19.5, C-77 | Only needed if the HTTP API ships. The MCP server (`packages/mcp-server`) is the more likely remote interface. |
| `trace_index` SQL | § 19.3.1, C-80 | Forensics over JSONL is fine until grep gets slow. Build the index when you have months of trace files, not before. |
| PhaseUAT | § 4.6, C-53, C-76 | Only matters for "release" workflows where humans sign off before merge. Add when needed. |
| Multi-orchestrator atomic claim | C-47 | The single-process `run.lock` is sufficient. The atomic UPDATE pattern matters when two orchestrators race against the same DB; sf doesn't deploy that way today. |
| `specs.check` JSDoc CI | C-37 | Useful but not blocking. Add when JSDoc rot becomes a real issue. |
---
## Tier 5 — DROP from spec
These crept in during adversarial review iterations and don't earn their keep.
| Item | Spec | Why drop |
|---|---|---|
| Cost-`per_1k_micro_usd` field type rename | C-69 (partial) | If we accept `cost_micro_usd` for runs (Tier 2.7), the `benchmark_results.cost_per_1k_micro_usd` rename is internally consistent — but the user-facing pricing model that benchmark uses already varies per provider; the integer-micro-USD constraint there is over-engineered. Keep `REAL` for benchmark, integer for runs. |
| `runs` snap_ columns (`unit_id_snap`, `agent_name_snap`) | C-59 | If we use soft-delete (`archived_at`) and never hard-delete, snapshots are unnecessary. Drop the columns. |
| `workflow_pins` content snapshot table | C-71 | If we just hash the file at first dispatch and store the hash on the unit (`units.workflow_hash`), we don't need a separate pins table. The hash is enough; the content can be re-read from disk. Simplify. |
| `agent_capabilities` separate indexed table | C-90 | At fleet sizes <100 agents, the JSON-array-LIKE scan is fine. Add the index when you have a measurement showing it's slow. |
---
## Suggested v3 milestone breakdown
**v3.0 — ship target: ~6-8 weeks**
- Tier 1.1 Vault (1-2d)
- Tier 1.2 sm integration, layered model (2 weeks)
- Tier 1.3 spec schema rewrite to 3-table (3d)
- Tier 1.4 config alignment (1 week)
- Tier 2.2 doc-sync (1 week)
- Tier 2.5 turn_status marker (3d)
- Tier 2.6 last_error cap (1d)
- Tier 2.7 cost_micro_usd (1d)
That's **~5 weeks of work** for the must-haves.
**v3.1 — ~4 weeks after v3.0**
- Tier 2.1 persistent agents v1 (2 weeks)
- Tier 2.3 intent chapters (1 week)
- Tier 2.4 PhaseReview 3-pass (1 week)
**v3.2 — when ready**
- Tier 3 items as appetite allows.
---
## Decisions needed before starting v3.0
1. **sm: replace, layer, or keep?** Recommended: layer (sf local cache + sm durable).
2. **Schema: migrate to single `units` or update spec to 3-table?** Recommended: update spec.
3. **Persistent agents in v3.0 or v3.1?** Recommended: v3.1; too much new surface to land alongside Tier 1 + 2.
4. **Does any deployment actually need SSH workers in v3.x?** If not, drop §22 from spec entirely; re-add when needed.