Mikael Hugo a3c487c918 docs: add Tier 0 (pi-mono ports) and Tier 0.5 (gsd-2 manual ports) — sf does these first

Tier 0 (pi-mono — should land cleanly via cherry-pick, no namespace divergence):
9 items ranked security → bug-fixes → infra → features.

  Critical:
    1. HTML export escape (security)
    2. Empty tools array fix (provider compatibility)
    3. Anthropic SSE proxy event tolerance
    4. Long local-LLM SSE 5min timeout fix

  Infrastructure:
    5. Bedrock inference profile normalization
    6. Symlinked packages dedup
    7. ctx.ui.setWorkingVisible() extension API

  Features:
    8. Cloudflare Workers AI provider
    9. Azure Cognitive Services endpoint

Tier 0.5 (gsd-2 — must be MANUALLY ported; cherry-pick fails on namespace):

  Critical fixes (11):
    1-6.  bash race, security hardening, web_search injection narrowing,
          symlinked staging self-heal, KNOWLEDGE budget, mcp-server deadlock
    7-10. agent_end transition fixes (4 commits)
    11.   claude-code-cli Always-Allow persistence

  Normal-value features (6):
    12. /gsd eval-review slim port (prompt + tool + template)
    13. Workflow state machine hardening (5 commits as unit)
    14. Proactive rate limiting (min_request_interval_ms)
    15. Per-call token telemetry (opt-in pi-coding-agent hooks)
    16. Worktree TUI commands
    17. Doctor check for orphan milestone directories

Skipped from each upstream is documented. All in BUILD_PLAN.md so sf
can work the list systematically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-29 14:04:31 +02:00


sf v3 Build Plan

A practical cut of the 56 NEW items in SPEC.md into tiers. Not every spec item is worth building for v3 — some were polish from late-stage adversarial review iterations and only matter at scale or in deployments we don't have.

This document is the answer to: what should we actually ship for v3?

Tier 0 — Pi-mono ports (sf: do these FIRST)

Pi-mono (badlogic/pi-mono) has shipped 4 releases (v0.70.3 → v0.70.6) since our last vendor sync. These should be picked up before other v3 work because:

They're security/correctness fixes for code we already use.
They land cleanly (no namespace divergence — packages/pi-* were vendored from pi-mono with same paths and type names).
Skipping them means dragging known bugs into v3 work.

Order: security first → real bugs → infra → features.

Order	Pi-mono fix	Why	Reference (pi-mono PR/issue, release)
1	HTML export: escape image data + session metadata	Security: crafted session content could inject markup in exported HTML	PRs #3819, #3883 (in v0.70.6)
2	Empty `tools` array fix for providers that reject it	Correctness bug: some providers reject a call that carries an empty `tools` array	PR #3650 (in v0.70.3)
3	Anthropic SSE: ignore unknown proxy events	Correctness bug: proxies emit OpenAI-style `done` events that crash our parser	issue #3708 (in v0.70.3)
4	Long local-LLM SSE timeout (5-min undici cutoff)	Correctness bug: local Ollama / LM Studio sessions over 5 min die with `UND_ERR_BODY_TIMEOUT`	issue #3715 (in v0.70.3)
5	Bedrock inference profile normalization	Bedrock prompt-caching + adaptive-thinking checks fail on inference profile ARNs	PR #3527 (in v0.70.3)
6	Symlinked packages/resources/skills/sessions dedup	Selectors and loaders show duplicates when paths are symlinked	PR #3818 (in v0.70.3)
7	`ctx.ui.setWorkingVisible()` extension API	Lets extensions hide the built-in working-loader row; useful for autopilot UX	issue #3674 (in v0.70.3)
8	Cloudflare Workers AI provider	New provider option (`CLOUDFLARE_API_KEY`/`CLOUDFLARE_ACCOUNT_ID`)	PR #3851 (in v0.70.6)
9	Azure Cognitive Services endpoint	Azure OpenAI Responses base URL support	PR #3799 (in v0.70.3)

Process for each: read the pi-mono commit, port the fix to our packages/pi-* (cherry-pick should work cleanly here — same namespace as upstream); commit with port(pi-mono): <description> (refs <pi-mono SHA>) style.

Skip from pi-mono (not applicable to us):

pi update --self, pi.dev update endpoint, Windows self-update — we vendor; no pi-binary auto-update path
Bun startup / sandbox /proc/self/environ fixes — we run on Node, not Bun
Packaged session selector import — our dist layout differs

Tier 0.5 — gsd-2 high-value manual ports (after Tier 0)

gsd-build/gsd-2 has 4,589 commits we're missing. Cherry-pick fails on virtually all of them because of our namespace divergence (gsd_* → sf_* rename, extensions/gsd/ → extensions/sf/ rename, prior pi-mono direct cherry-picks). These have to be manually ported — read the commit, write equivalent code against our paths and naming.

Process for each:

Read the commit at gsd-build/gsd-2 (we have it as upstream/main).
Find the equivalent file(s) in our extensions/sf/ tree.
Apply the fix manually with gsd_* → sf_* and .gsd/ → .sf/ translations.
Commit with port(gsd-2): <description> (refs <gsd-2 SHA>) style.
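
The mechanical part of steps 3 and 4 can be sketched as below. This is a minimal illustration, not existing sf tooling: `translateGsdToSf`, `portCommitSubject`, and the `gsd_plan` tool name in the sample are hypothetical, and a real port still needs a human to read the commit and adapt the logic.

```typescript
// Hypothetical helpers for illustration; not part of sf or gsd-2.

/** Apply the mechanical gsd -> sf renames to a snippet of ported code. */
function translateGsdToSf(source: string): string {
  return source
    .replace(/extensions\/gsd\//g, "extensions/sf/") // extension tree rename
    .replace(/\.gsd\//g, ".sf/") // state directory rename
    .replace(/\bgsd_/g, "sf_"); // tool-name prefix rename
}

/** Build a commit subject in the port(<upstream>): <description> (refs <SHA>) style. */
function portCommitSubject(
  upstream: "pi-mono" | "gsd-2",
  description: string,
  sha: string,
): string {
  return `port(${upstream}): ${description} (refs ${sha})`;
}

const translatedSample = translateGsdToSf(
  "gsd_plan reads .gsd/active from extensions/gsd/tools.ts",
);
const subjectSample = portCommitSubject(
  "gsd-2",
  "persist bash evidence at tool_call",
  "da7dd56e7",
);
console.log(translatedSample);
console.log(subjectSample);
```

The rename pass only gets identifiers and paths right; behavioural differences caused by the earlier pi-mono cherry-picks still have to be reconciled by hand.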

Critical fixes worth porting (limit to security + correctness; skip parallel-evolution churn):

Order	gsd-2 fix	Why	gsd-2 SHA
1	`fix(safety): persist bash evidence at tool_call` (close mid-unit re-dispatch race)	Real race condition; bash tool calls can lose evidence between dispatch and re-dispatch	`da7dd56e7` (PR #5056 → #5058)
2	`fix(security): harden project-controlled surfaces`	We have a partial cherry-pick at `66ff949c1`; supersede with the full fix	`65ca5aa2e`
3	`fix(search): narrow native web_search injection`	Only inject web_search context when the provider accepts it	`4370bedf3`
4	`fix(gsd): self-heal symlinked .gsd staging`	Data-loss prevention — symlinked staging dir was being treated as the wrong scope	`9340f1e9b` (#4423)
5	`fix(knowledge): scope + budget milestone KNOWLEDGE injection`	Prevents milestone-scope knowledge from blowing the context budget	`58d3d4d6c` (#4721)
6	`fix(mcp-server): prevent defaultExecFn stdout-buffer deadlock`	Real deadlock — large-output MCP tools could hang the agent	`bb747ec57`
7	`fix(agent-session): guard synthetic agent_end transitions`	Session-transition race when agent_end was synthesised	`71114fccf`
8	`fix(agent-session): skip idle wait after agent_end`	Idle wait was burning time on a session that was already ending	`6d7e4ccb5`
9	`Fix agent_end session switch handoff`	Session handoff during agent_end could drop the next session	`c162c44bf`
10	`Fix session transition during agent_end`	Companion to the above	`e3bd04551`
11	`fix(claude-code-cli): persist Always Allow for non-Bash tools`	Always-Allow grants didn't persist for non-Bash tools	`a88baeae9` (PR #5096)

Normal-value features worth porting (not critical, but real):

Order	gsd-2 feature	Why	Effort	gsd-2 SHA(s)
12	`/gsd eval-review` (slim, like product-audit)	New milestone-end evaluation review command + frontmatter schema. We don't have it. Slim port pattern: prompt + tool + workflow template; skip parallel rewrites of dispatch/prompts.	2 hrs	`979487735` `6971f4333` `a2f8f0e08` `83bcb054c` `a686d22cb` (+11 polish commits)
13	Workflow state machine hardening (5 commits as a unit)	`harden workflow state transitions`, `persist workflow retry and summary state`, `fail closed on unreadable milestone summaries`, `restore slice dependency fallback`. Reliability of long auto runs.	2 hrs	`f2377eedd` `b9a1c6743` `153fb328a` `381ccdef5` `371b2eb31` (PR #4758)
14	Proactive rate limiting via `min_request_interval_ms`	Self-throttle to avoid 429s — model-side rate-limit data is observability-only (per SPEC.md §19.6); this is the per-dispatch knob.	1 hr	`f980929f1` `73bc4d2f1` (PR #5007)
15	Per-call token telemetry (opt-in)	pi-coding-agent gains opt-in per-call token telemetry hooks. Useful for cost dashboards.	0.5 hr	`b4d4725ad` (PR #5023)
16	Worktree TUI commands (`worktree {list,merge,clean,remove}`)	Adds these to the TUI dispatcher. We may have parts of this; check before porting.	1 hr	`2361ceeb1` (PR #5055)
17	Doctor check for orphan milestone directories	Diagnostic — flags `.sf/active/` artifacts whose milestones are gone. Aligns with SPEC.md C-24 startup cleanup.	0.5 hr	`420354f99` (PR #4998)
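
For item 14, the core of a proactive throttle is a single calculation: how long to wait so that consecutive requests are at least `min_request_interval_ms` apart. The sketch below assumes that reading of the knob; the function name and signature are hypothetical and the gsd-2 commits may implement it differently.

```typescript
// Hypothetical sketch; the actual gsd-2 implementation may differ.

/**
 * Milliseconds the dispatcher should wait before sending the next request,
 * given when the previous one was sent and the configured minimum interval.
 */
function delayBeforeNextRequest(
  lastRequestAtMs: number | undefined,
  nowMs: number,
  minRequestIntervalMs: number,
): number {
  // No previous request, or throttling disabled: send immediately.
  if (lastRequestAtMs === undefined || minRequestIntervalMs <= 0) return 0;
  const elapsedMs = nowMs - lastRequestAtMs;
  return Math.max(0, minRequestIntervalMs - elapsedMs);
}

// Previous request at t=1000 ms, now t=1200 ms, interval 500 ms.
console.log(delayBeforeNextRequest(1000, 1200, 500)); // 300
```

A caller would sleep for the returned duration, send the request, and record the send time for the next call.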

Skip from gsd-2 (parallel evolution; we have our own implementations):

auto-dispatch.ts, auto-prompts.ts, benchmark-selector.ts rewrites — we have these and ours are richer (e.g. our benchmark-selector has more eval types).
UnitContextManifest / Composer rewrite (~15 commits, PRs #4782 / #4924 / #4925 / #4926) — major architectural refactor that conflicts heavily; revisit during v3 §3 schema reconciliation.
xiaomi/minimax/product-audit features — already ported in commits ae0bbe32f, 2eebeccb9, a8cf2cd94.
All headless UX, prompt edits (DeepWiki/Context7), Serena hints, and global MCP loading: already addressed in our session (commits c41912ff5, dff0df5fd); we have our own equivalents.

See UPSTREAM_CHERRY_PICK_CANDIDATES.md for the full audit (all 4,589 commits surveyed; this Tier 0.5 list is the 17 worth porting — 11 critical + 6 normal value).

Tier 1+ active follow-ups (after Tier 0 lands)

These came up during recent ports and refactor passes — tracked here so they don't get lost.

Follow-up	Why	Tier	Effort
Minimax search tests	Search agent ported the feature but explicitly skipped tests because bunker's tests don't match our preferences/provider export shape. Need: `getMiniMaxSearchApiKey()` priority order, `resolveSearchProvider()` returning "minimax", `/search-provider minimax` CLI behavior, no-key error messages, `executeMiniMaxSearch` request shape.	1	0.5 day
Product-audit phase machine wire-up	Slim port (commit `a8cf2cd94`) shipped the prompt + `sf_product_audit` tool + workflow template, but doesn't yet dispatch into PhaseMerge or PhaseComplete. The tool is callable; the phase doesn't auto-fire.	2	0.5 day
Headless assistant-text preview	Headless UX commit (`dff0df5fd`) covered notification spam, categorization, and phase/status tag distinction. The fourth bunker improvement — separating `assistantTextBuffer` from `thinkingBuffer` and flushing both as concise previews on tool-execution-start / message-end — was deferred because it's a meatier change in `headless.ts`.	2	0.5 day
Search provider registry refactor	Adding minimax took 9 files because the provider list is duplicated across `provider.ts` (type + VALID_PREFERENCES), `native-search.ts`, `command-search-provider.ts` (CLI), `tool-search.ts` + `tool-llm-context.ts` (two separate execute paths!), `preferences-types.ts`, `preferences-validation.ts`, manifest, docs. A single `SearchProviderRegistry` array would let everything iterate.	2	3-5 days
Pi-mono SDK sync	We pull from pi-mono directly (separate from gsd-2 sync stance). Periodically check `pi-mono/main` for SDK improvements worth taking. The remote is set up; cadence is not.	3	recurring
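
The search provider registry refactor could take roughly the following shape. Everything here is an assumption made for illustration: the entry fields, the `"auto"` preference, the `example-provider` entry, and the environment variable names are placeholders, and `resolveSearchProvider` is shown with a guessed signature.

```typescript
// Hypothetical shape; only the "minimax" id comes from the plan above.
interface SearchProviderEntry {
  id: string;
  label: string;
  apiKeyEnvVars: readonly string[]; // checked in priority order
}

// Single source of truth that CLI, validation, and tools would iterate.
const SEARCH_PROVIDER_REGISTRY: readonly SearchProviderEntry[] = [
  { id: "example-provider", label: "Example", apiKeyEnvVars: ["EXAMPLE_SEARCH_API_KEY"] },
  { id: "minimax", label: "MiniMax", apiKeyEnvVars: ["MINIMAX_API_KEY"] },
];

/** Valid preference values, derived from the registry instead of hand-maintained. */
const VALID_SEARCH_PREFERENCES: readonly string[] = [
  "auto",
  ...SEARCH_PROVIDER_REGISTRY.map((p) => p.id),
];

/** Use the preferred provider if it has a key; for "auto", the first provider with a key. */
function resolveSearchProvider(
  preference: string,
  env: Record<string, string | undefined>,
): string | undefined {
  const hasKey = (p: SearchProviderEntry): boolean =>
    p.apiKeyEnvVars.some((envName) => Boolean(env[envName]));
  const preferred = SEARCH_PROVIDER_REGISTRY.find((p) => p.id === preference);
  if (preferred) return hasKey(preferred) ? preferred.id : undefined;
  return SEARCH_PROVIDER_REGISTRY.find(hasKey)?.id;
}

console.log(resolveSearchProvider("minimax", { MINIMAX_API_KEY: "k" })); // "minimax"
```

With this layout, adding a provider is one new array entry rather than edits across nine files.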

This plan is opinionated. Each item has a tier and a one-line rationale. Reorder freely.

Upstream stance

sf is a fork. We do not periodically sync from gsd-build/gsd-2.

We tried (see attempt log in UPSTREAM_CHERRY_PICK_CANDIDATES.md). The conflicts run deep because of three structural choices that are intentional and won't be reverted:

We renamed gsd_* tool names → sf_* (421fccd89).
We renamed @sf-run/* → @singularity-forge/* package scope (f92ee8d64).
We've cherry-picked tool fixes from pi-mono upstream directly (f153521c2), which addresses some bugs that gsd-2 fixed differently.

Pretending we still track gsd-2 means weeks of merge work for diminishing return. Better to:

Treat gsd-build/gsd-2 upstream as an intelligence source. We read it. We hand-port fixes when one specifically bites us. UPSTREAM_CHERRY_PICK_CANDIDATES.md is a reference list of what's available, not an action plan.
Pull from pi-mono directly for SDK improvements. We've already been doing this; continue.
Track our own roadmap via SPEC.md and this file.

If a specific upstream fix matters (e.g. a CVE, a bug we hit), port it manually and credit upstream in the commit message. Don't try to sync the whole tree.

Tier 1 — ESSENTIAL (block v3 ship)

These resolve real product or correctness gaps. v3 isn't v3 without them.

1.1 Vault secret resolver

Spec: § 24, C-38, C-83.
What: vault://secret/path#field URI resolver, replacing any plaintext provider keys in current config. Auth chain: VAULT_TOKEN → ~/.vault-token → AppRole.
Why essential: sf is a real tool used against real models with real billing. Plaintext keys in config files are a security regression we should not ship past.
Effort: 1–2 days. pi-ai config layer adds a resolver.
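
A minimal sketch of the two pieces named above: parsing the URI and walking the auth chain in order. It assumes that everything between `vault://` and `#` is the secret path; the type and function names are hypothetical, and the token readers are stubbed so the example makes no filesystem or network calls.

```typescript
// Hypothetical sketch; names and the path/field split are assumptions.
interface VaultRef {
  path: string; // e.g. "secret/path"
  field: string; // key inside the secret
}

/** Parse vault://<path>#<field>; returns undefined for plain (non-vault) values. */
function parseVaultUri(value: string): VaultRef | undefined {
  const match = /^vault:\/\/([^#]+)#(.+)$/.exec(value);
  if (!match) return undefined;
  const [, path = "", field = ""] = match;
  return { path, field };
}

interface TokenSource {
  label: string;
  read: () => string | undefined;
}

/** Return the first source in the chain that yields a non-empty token. */
function firstAvailableToken(
  sources: readonly TokenSource[],
): { label: string; token: string } | undefined {
  for (const source of sources) {
    const token = source.read()?.trim();
    if (token) return { label: source.label, token };
  }
  return undefined;
}

// Chain order from the plan: VAULT_TOKEN, then ~/.vault-token, then AppRole.
const exampleChain: TokenSource[] = [
  { label: "VAULT_TOKEN", read: () => undefined }, // env var not set
  { label: "~/.vault-token", read: () => "example-token\n" }, // token file present
  { label: "AppRole", read: () => undefined }, // login never reached
];

console.log(parseVaultUri("vault://secret/path#field"));
console.log(firstAvailableToken(exampleChain)?.label); // "~/.vault-token"
```

Returning `undefined` for non-vault strings lets the config layer pass ordinary values through untouched while a doctor check flags any plaintext keys that remain.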

1.2 Singularity Memory integration decision + execution

Spec: § 16, § 24, C-94, C-95, K-01 through K-06.
What: Decide whether sm replaces sf's existing memory layer, layers on top, or stays absent — then execute. The repo at singularity-ng/singularity-memory exists; integrating means replacing or augmenting memory-store.ts, memory-extractor.ts, memory-relations.ts, tools/memory-tools.ts, bootstrap/memory-tools.ts.
Why essential: the spec leans heavily on sm (anti-patterns, two-bank recall, cross-tool sharing). Either commit to it or rewrite §16 to match what sf actually has.
Recommended path: keep sf's local memory as a hot cache + use sm as durable cross-tool store. This is the layered model — sf's local memory becomes the operational fast-path; sm holds long-term cross-session, cross-project, cross-tool memories.
Effort: 1–2 weeks for the integration; 1 day to decide.

1.3 Schema reconciliation: `units` vs `milestones`/`slices`/`tasks`

Spec: § 3.1.
What: sf has 3 tables, spec has 1 with a type column. Either:

(a) Migrate sf to single units table (data migration; touches many files).
(b) Update spec to 3-table model (no code change; spec rewrite).
Recommended path: (b) — keep what sf has. The 3-table shape is more granular and integrates with decisions, requirements, artifacts, assessments, replan_history which have rich schemas of their own. Forcing them into one units table loses information.
Effort: 2–3 days for spec rewrite, 0 days code.

1.4 Config schema alignment

Spec: § 14.2, C-25, C-26, C-73.
What: config-overlay.ts exposes whatever keys sf has today. Spec specifies context_compact_at, context_hard_limit, unit_timeout, unit_timeout_by_phase, max_agents_by_phase, turn_input_required, worktree_mode, tool_abort_grace, max_turns_per_attempt, hot_cache_turns, etc. Add missing keys with defaults; document each.
Why essential: users can't tune behavior they can't configure. Spec promises configurability that doesn't exist yet.
Effort: 3–5 days. Add keys, plumb through, write doctor checks.
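
One way to add keys with defaults and give doctor something to check is sketched below. The key names come from the spec list above, but the value types and the default values are placeholders invented for the example; SPEC.md remains the authority for both. `resolveConfig` and `rejectedKeys` are hypothetical names.

```typescript
// Placeholder types and values only; not the real sf defaults.
interface SfTuningConfig {
  context_compact_at: number;
  context_hard_limit: number;
  max_turns_per_attempt: number;
  hot_cache_turns: number;
}

const TUNING_DEFAULTS: SfTuningConfig = {
  context_compact_at: 0.8, // placeholder
  context_hard_limit: 0.95, // placeholder
  max_turns_per_attempt: 50, // placeholder
  hot_cache_turns: 10, // placeholder
};

/** Overlay user values on defaults; collect keys a doctor check should report. */
function resolveConfig(
  overlay: Record<string, unknown>,
): { config: SfTuningConfig; rejectedKeys: string[] } {
  const config: SfTuningConfig = { ...TUNING_DEFAULTS };
  const rejectedKeys: string[] = [];
  for (const [key, value] of Object.entries(overlay)) {
    const known = Object.prototype.hasOwnProperty.call(TUNING_DEFAULTS, key);
    if (known && typeof value === "number") {
      config[key as keyof SfTuningConfig] = value;
    } else {
      rejectedKeys.push(key); // unknown key or wrong value type
    }
  }
  return { config, rejectedKeys };
}

const resolvedExample = resolveConfig({ hot_cache_turns: 20, bogus: 1 });
console.log(resolvedExample.config.hot_cache_turns, resolvedExample.rejectedKeys);
```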

Tier 2 — STRONG (ship with v3 if possible, otherwise v3.1)

Real value-add. Deferring is allowed but disappointing.

2.1 Persistent agents v1 (basic, no messaging)

Spec: § 17, A-01, A-02, A-03, A-04, A-09, A-10. Defer: A-05, A-06, A-07, A-08 (messaging) to v3.1.
What: named agents with their own memory blocks, system prompt, message history, durable across sessions. core_memory_append / core_memory_replace tools. /sf agent run|reset|delete|inspect commands.
Why strong: the persistent-agent pattern was the main draw from Letta and a recurring user interest throughout this spec process. Shipping basic persistent agents in v3 unlocks the architecture; messaging can come in v3.1.
Effort: 2 weeks for basic; +1–2 weeks for messaging.

2.2 Doc-sync sub-step

Spec: § 10.5, C-20, C-45, C-68.
What: at the end of the last code-mutating phase (Merge or, for spike workflows, Execute), run a fast-tier dispatch to check whether ARCHITECTURE.md/CONVENTIONS.md/STACK.md need updates and propose a diff for user approval.
Why strong: project docs rotting is the most predictable failure mode of long autopilot runs. Catching it costs ~5 minutes per merge.
Effort: 3–5 days.

2.3 Intent chapters

Spec: § 19.4, C-34.
What: spans grouped into named "what was the agent trying to do" chapters. Inferred from phase transitions or agent-declared via chapter_open(name). Used for crash-resume context and Hindsight recall.
Why strong: crash-resume reconstruction is currently weak. Chapters give the resumed agent a coherent "what was I doing" header instead of replaying raw tool calls.
Effort: 1 week.
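
The phase-transition inference can be illustrated with a small grouping function. This is a sketch under assumptions: the span fields, the rule "a new chapter starts whenever the phase changes", and the header format are all hypothetical, and agent-declared chapters via `chapter_open(name)` are not modelled.

```typescript
// Hypothetical span and chapter shapes for illustration.
interface TraceSpan {
  phase: string;
  tool: string;
}

interface IntentChapter {
  title: string;
  spans: TraceSpan[];
}

/** Group consecutive spans that share a phase into one chapter. */
function inferChapters(spans: readonly TraceSpan[]): IntentChapter[] {
  const chapters: IntentChapter[] = [];
  for (const span of spans) {
    const current = chapters[chapters.length - 1];
    if (current && current.title === span.phase) {
      current.spans.push(span);
    } else {
      chapters.push({ title: span.phase, spans: [span] });
    }
  }
  return chapters;
}

/** Short "what was I doing" header for a resumed agent. */
function resumeHeader(chapters: readonly IntentChapter[]): string {
  return chapters
    .map((c, i) => `${i + 1}. ${c.title} [tool calls: ${c.spans.length}]`)
    .join("\n");
}

const exampleChapters = inferChapters([
  { phase: "plan", tool: "read" },
  { phase: "plan", tool: "grep" },
  { phase: "execute", tool: "bash" },
]);
console.log(resumeHeader(exampleChapters));
```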

2.4 PhaseReview 3-pass review

Spec: § 13.3, C-39, C-63.
What: establish-context pass (single fast dispatch) → parallel chunked review (per-file, ≤300 lines each, standard tier) → synthesis pass.
Why strong: the current single-pass review on large diffs is known to gloss over the tail of the diff. The 3-pass shape catches more.
Effort: 1 week.
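
The middle pass depends on splitting each file into review chunks of at most 300 lines. A minimal sketch of that split follows; the `ReviewChunk` shape and function name are hypothetical, and the dispatch of each chunk to a standard-tier reviewer is left out.

```typescript
// Hypothetical chunk shape for illustration.
interface ReviewChunk {
  file: string;
  startLine: number; // 1-based, inclusive
  endLine: number; // 1-based, inclusive
  lines: string[];
}

/** Split one file's lines into consecutive chunks of at most maxLines lines. */
function chunkFileForReview(
  file: string,
  lines: readonly string[],
  maxLines: number = 300,
): ReviewChunk[] {
  if (maxLines < 1) throw new Error("maxLines must be at least 1");
  const chunks: ReviewChunk[] = [];
  for (let start = 0; start < lines.length; start += maxLines) {
    const slice = lines.slice(start, start + maxLines);
    chunks.push({
      file,
      startLine: start + 1,
      endLine: start + slice.length,
      lines: slice,
    });
  }
  return chunks;
}

// A 650-line file yields chunks covering lines 1-300, 301-600, and 601-650.
const sampleLines = Array.from({ length: 650 }, (_, i) => `line ${i + 1}`);
const sampleChunks = chunkFileForReview("src/example.ts", sampleLines);
console.log(sampleChunks.map((c) => `${c.startLine}-${c.endLine}`));
```

Each chunk can then be reviewed in parallel with the context produced by the first pass, and the synthesis pass merges the per-chunk findings.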

2.5 `turn_status` marker

Spec: § 5.4.1, C-81.
What: parse <turn_status>complete|blocked|giving_up</turn_status> from end of agent output. blocked triggers SignalPause; giving_up transitions to PhaseReassess immediately.
Why strong: a per-turn semantic checkpoint between transport-success and phase-boundary. Currently the harness has no way to know "the agent thinks it's stuck" except by waiting for stuck-loop timeout.
Effort: 2–3 days.
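
Parsing the marker and mapping it to a harness action could look like the sketch below. The function names are hypothetical, and treating a missing or mid-output marker as "continue" is an assumption; the spec section is the authority on that case.

```typescript
type TurnStatus = "complete" | "blocked" | "giving_up";
type HarnessAction = "continue" | "SignalPause" | "PhaseReassess";

/** Read the marker only if it closes the agent output (trailing whitespace allowed). */
function parseTurnStatus(output: string): TurnStatus | undefined {
  const match =
    /<turn_status>(complete|blocked|giving_up)<\/turn_status>\s*$/.exec(output);
  return match ? (match[1] as TurnStatus) : undefined;
}

/** blocked pauses the run; giving_up reassesses; anything else continues. */
function actionForTurn(turnStatus: TurnStatus | undefined): HarnessAction {
  switch (turnStatus) {
    case "blocked":
      return "SignalPause";
    case "giving_up":
      return "PhaseReassess";
    default:
      return "continue"; // assumption: missing marker is not an error
  }
}

const sampleOutput = "Need credentials to proceed.\n<turn_status>blocked</turn_status>\n";
console.log(actionForTurn(parseTurnStatus(sampleOutput))); // "SignalPause"
```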

2.6 `last_error` cap

Spec: § 7.3, C-74.
What: truncate last_error to 4 KB head+tail; full payload to .sf/active/{unit-id}/last-error-full.txt. Agent reads the file if needed.
Why strong: lint output / traceback dumps can blow the prompt. Current behaviour is "inject and pray."
Effort: 1 day.
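
A head+tail cap can be sketched as follows. Assumptions are labelled in the comments: the 4 KB limit is measured in string length rather than bytes for simplicity, the omission marker is added on top of the cap, and the function name is hypothetical. Writing the full payload to disk is left to the caller.

```typescript
// "4 KB" approximated as 4096 UTF-16 code units; a byte-exact cap would need encoding.
const LAST_ERROR_CAP = 4096;

/**
 * Keep the first and last halves of an oversized error and point at the full file.
 * The marker line is extra, so the result is slightly longer than the cap.
 */
function capLastError(
  full: string,
  fullPath: string,
  cap: number = LAST_ERROR_CAP,
): string {
  if (full.length <= cap) return full;
  const half = Math.floor(cap / 2);
  const omitted = full.length - 2 * half;
  const marker = `\n[... ${omitted} chars omitted; full output in ${fullPath} ...]\n`;
  return full.slice(0, half) + marker + full.slice(full.length - half);
}

const longError = "H".repeat(3000) + "M".repeat(4000) + "T".repeat(3000);
const cappedError = capLastError(longError, ".sf/active/unit-1/last-error-full.txt");
console.log(cappedError.length);
```

Keeping both ends matters because compilers and linters tend to put the first failure at the top while stack traces put the root cause at the bottom.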

2.7 Cost stored as integer micro-USD

Spec: C-69.
What: rename cost_usd REAL → cost_micro_usd INTEGER in runs, benchmark_results. Float drift on accumulated costs is real over thousands of runs.
Why strong: small change, real correctness improvement, easier reasoning about totals.
Effort: 1 day with the migration.
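
The drift is easy to reproduce. The sketch below accumulates a cost of 0.1 USD ten times, once as a float and once as integer micro-USD; the helper names are hypothetical and the per-run cost is an arbitrary example value.

```typescript
const MICRO_PER_USD = 1_000_000;

/** Convert a USD amount to integer micro-USD, rounding to the nearest unit. */
function usdToMicro(usd: number): number {
  return Math.round(usd * MICRO_PER_USD);
}

/** Render integer micro-USD for display. */
function formatMicroUsd(micro: number): string {
  return `$${(micro / MICRO_PER_USD).toFixed(6)}`;
}

let floatTotal = 0;
let microTotal = 0;
for (let i = 0; i < 10; i++) {
  floatTotal += 0.1; // REAL-style accumulation
  microTotal += usdToMicro(0.1); // INTEGER-style accumulation
}

console.log(floatTotal === 1); // false: the float sum is 0.9999999999999999
console.log(microTotal === MICRO_PER_USD); // true
console.log(formatMicroUsd(microTotal)); // "$1.000000"
```

Integer sums stay exact up to `Number.MAX_SAFE_INTEGER` micro-USD, and SQLite's INTEGER columns hold 64-bit values, so conversion is only needed at the display boundary.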

Tier 3 — NICE (v3.1 or v3.2)

Worth building, just not blocking. Ship after Tier 2 if calendar allows.

Item	Spec	One-line
Inter-agent messaging	§ 18, A-05..A-08	send_message + inbox + wait_for_reply + handoff. Builds on Tier 2.1 persistent agents. ~1–2 weeks.
Workflow content pinning	§ 4.5, C-71	SHA-256 hash of template content stored per unit; in-flight units use pinned content. Defends against operator editing the template mid-run. ~3 days.
Trace `_meta` record	§ 19.3, C-79	First line of each daily JSONL is a schema-version record. Forward-compatible. ~1 day.
`runs` table	§ 3.1, C-48, C-49, C-59	Unifies unit_attempt and agent_run history. sf has `audit_events` already; either repurpose or add a new view. Decision required. ~1 week.
`pending_retain` queue	§ 16.1, C-51	Sm retain failures queue locally and retry with backoff. Required if and only if sm is integrated (Tier 1.2).
Capability-tag handoff	§ 18.4, C-82, C-90	`handoff("capability:go,testing", ...)` resolves to any matching agent. Adds `agent_capabilities` index. Builds on Tier 2.1 + Tier 3 inter-agent messaging. ~3 days.
`agent_run` budget + termination	§ 17.5, C-54, C-65	When does an agent run end? (inbox drained / explicit stop / budget hard-limit / supervisor signal / timeout). Compaction preserves wake message. ~1 week.

Tier 4 — DEFER (only if a deployment actually demands it)

Spec sections that landed during late-stage adversarial review and only matter at scale or in specific deployments.

Item	Spec	Why deferred
SSH worker extension	§ 22, C-64, C-75, E-02	Real for fleet deployments (bunker, inference-fabric scaling). Not real for daily-driver development. Build when a user actually needs to dispatch to a remote box.
HTTP API auth	§ 19.5, C-77	Only needed if the HTTP API ships. The MCP server (`packages/mcp-server`) is the more likely remote interface.
`trace_index` SQL	§ 19.3.1, C-80	Forensics over JSONL is fine until grep gets slow. Build the index when you have months of trace files, not before.
PhaseUAT	§ 4.6, C-53, C-76	Only matters for "release" workflows where humans sign off before merge. Add when needed.
Multi-orchestrator atomic claim	C-47	The single-process `run.lock` is sufficient. The atomic UPDATE pattern matters when two orchestrators race against the same DB; sf doesn't deploy that way today.
`specs.check` JSDoc CI	C-37	Useful but not blocking. Add when JSDoc rot becomes a real issue.

Tier 5 — DROP from spec

These crept in during adversarial review iterations and don't earn their keep.

Item	Spec	Why drop
`cost_per_1k_micro_usd` field type rename	C-69 (partial)	Renaming `benchmark_results.cost_per_1k_micro_usd` would be consistent with `cost_micro_usd` for runs (Tier 2.7), but the user-facing pricing model that benchmark uses already varies per provider, so the integer-micro-USD constraint there is over-engineered. Keep `REAL` for benchmark, integer for runs.
`runs` snap_ columns (`unit_id_snap`, `agent_name_snap`)	C-59	If we use soft-delete (`archived_at`) and never hard-delete, snapshots are unnecessary. Drop the columns.
`workflow_pins` content snapshot table	C-71	If we just hash the file at first dispatch and store the hash on the unit (`units.workflow_hash`), we don't need a separate pins table. The hash is enough; the content can be re-read from disk. Simplify.
`agent_capabilities` separate indexed table	C-90	At fleet sizes <100 agents, the JSON-array-LIKE scan is fine. Add the index when you have a measurement showing it's slow.

Suggested v3 milestone breakdown

v3.0 — ship target: ~6–8 weeks

Tier 1.1 Vault (1–2d)
Tier 1.2 sm integration, layered model (2 weeks)
Tier 1.3 spec schema rewrite to 3-table (3d)
Tier 1.4 config alignment (1 week)
Tier 2.2 doc-sync (1 week)
Tier 2.5 turn_status marker (3d)
Tier 2.6 last_error cap (1d)
Tier 2.7 cost_micro_usd (1d)

That's ~6 weeks of work for the must-haves as listed (29–30 working days at 5 days per week).

v3.1 — ~4 weeks after v3.0

Tier 2.1 persistent agents v1 (2 weeks)
Tier 2.3 intent chapters (1 week)
Tier 2.4 PhaseReview 3-pass (1 week)

v3.2 — when ready

Tier 3 items as appetite allows.

Decisions needed before starting v3.0

sm: replace, layer, or keep? Recommended: layer (sf local cache + sm durable).
Schema: migrate to single units or update spec to 3-table? Recommended: update spec.
Persistent agents in v3.0 or v3.1? Recommended: v3.1 — too much new surface to land alongside Tier 1 + 2.
Does any deployment actually need SSH workers in v3.x? If not, drop §22 from spec entirely; re-add when needed.
