diff --git a/copilot-thoughts.md b/copilot-thoughts.md
index 0df1aaf5d..93dd4f2e9 100644
--- a/copilot-thoughts.md
+++ b/copilot-thoughts.md
@@ -455,6 +455,256 @@ This complements, not replaces:
 Copilot's `/tasks` and `/session` are less powerful internally, but clearer as
 control surfaces. SF should keep its deeper state and expose it better.
 
+## Actual Source Pass: Awesome CLI Agent Repos
+
+Checked locally under `/tmp/sf-agent-research`:
+
+- `bradAGI/awesome-cli-coding-agents`
+- `plandex-ai/plandex`
+- `leonardcser/smelt`
+- `mikeyobrien/ralph-orchestrator`
+- `subsy/ralph-tui`
+- `oxgeneral/ORCH`
+- `LucasDuys/forge`
+- `ramarlina/agx`
+- `youwangd/SageCLI`
+- `jcast90/relay`
+- `basilisk-labs/agentplane`
+- `amaar-mc/wit`
+- `fastxyz/skill-optimizer`
+- `0xmariowu/AgentLint`
+- `ZENG3LD/gate4agent`
+
+`arosstale/pi-builder` was listed but the GitHub repository was not found when
+cloned on 2026-05-08.
+
+### Smelt
+
+Smelt's source has four modes:
+
+```text
+normal -> plan -> apply -> yolo
+```
+
+It also has separate reasoning effort:
+
+```text
+off | low | medium | high | max
+```
+
+Useful:
+
+- mode cycling is explicit and configurable
+- permissions differ by mode
+- read-only commands are allowed, writes usually ask, deny wins
+- approval scopes are explicit: once, session, workspace
+- workspace approvals persist under a workspace hash
+
+Do not copy:
+
+- `yolo` as a name
+- putting work kind and trust level into one mode axis
+
+SF should keep Smelt's visible cycling and approval scopes, but preserve SF's
+separate axes: `workMode`, `runControl`, `permissionProfile`, and `modelMode`.
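The difference between one mode axis and SF's separate axes can be pictured with a small type sketch. Only the four axis names and the three approval scopes come from the notes above. The `Rule` shape, the `decide` and `badge` helpers, and every example value are hypothetical, chosen to illustrate "deny wins" and a visible badge rather than to describe SF's or Smelt's actual code.

```typescript
// Hypothetical sketch. Axis names come from the notes; every value is illustrative.
type Verdict = "allow" | "ask" | "deny";
type ApprovalScope = "once" | "session" | "workspace";

interface SessionAxes {
  workMode: string;          // kind of work, e.g. planning vs applying
  runControl: string;        // how far the agent may run unattended
  permissionProfile: string; // trust level, kept apart from work kind
  modelMode: string;         // model posture, e.g. reasoning effort
}

interface Rule {
  pattern: RegExp;       // matched against the command line
  verdict: Verdict;
  scope?: ApprovalScope; // present when the rule came from a stored approval
}

// Deny wins; otherwise any matching allow applies; otherwise fall back to ask.
function decide(command: string, rules: Rule[]): Verdict {
  const matches = rules.filter((r) => r.pattern.test(command));
  if (matches.some((r) => r.verdict === "deny")) return "deny";
  if (matches.some((r) => r.verdict === "allow")) return "allow";
  return "ask";
}

// One visible badge built from four independent axes.
function badge(a: SessionAxes): string {
  return [a.workMode, a.runControl, a.permissionProfile, a.modelMode].join(" / ");
}

const rules: Rule[] = [
  { pattern: /^git status\b/, verdict: "allow" },
  { pattern: /^git /, verdict: "allow", scope: "session" },
  { pattern: /^git push\b/, verdict: "deny" },
];
```

Because trust lives in `permissionProfile` and the rule list, changing `workMode` in this sketch never changes what `decide` returns, which is the separation the notes argue for.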
+
+### ORCH
+
+ORCH has the cleanest small task state machine:
+
+```text
+todo -> in_progress -> review -> done
+                   \-> retrying -> in_progress
+                   \-> failed
+review -> todo
+* -> cancelled
+```
+
+It also keeps runtime state separately:
+
+- `running`
+- `claimed`
+- `retry_queue`
+- total run/task/token/runtime stats
+
+Useful for SF:
+
+- `/tasks` should show both durable task status and ephemeral running state
+- successful completion should pass through review, even when auto-approved
+- dependency blockers should be computed, not implied from ordering
+- retrying should be an explicit state, not hidden inside logs
+
+### AgentPlane
+
+AgentPlane's strongest idea is schema-first task artifacts. Task README
+frontmatter includes:
+
+- `risk_level`
+- `status`
+- `depends_on`
+- `task_kind`
+- `mutation_scope`
+- `risk_flags`
+- `blueprint_request`
+- `verify`
+- `plan_approval`
+- `verification`
+- `runner`
+
+Its workflow file also makes operational policy explicit:
+
+- workflow mode
+- status commit policy
+- workspace isolation
+- retry policy
+- scheduler concurrency
+- required evaluator checks
+- event log location
+
+Useful for SF:
+
+- task artifacts should have schema-backed frontmatter, not loose markdown
+- plan approval and verification state deserve durable fields
+- mutation scope and risk flags should feed `permissionProfile`
+- workflow policy should be inspectable by `/status` and `/tasks`
+
+### Relay
+
+Relay's useful concepts:
+
+- a channel is the workspace for one piece of work
+- tickets are parallelizable units with dependency DAGs, retry budgets,
+  specialty tags, optional repo routing, and verification commands
+- decisions are first-class durable records
+- crosslink lets agents discover and message other sessions
+- complexity tiers drive approval behavior
+- CLI/TUI/GUI all read the same state
+
+Useful for SF:
+
+- keep decisions as first-class records, not buried in summaries
+- remote steering should become full-session steering
+  and cross-session messaging, not only remote questions
+- multi-repo work needs explicit repo routing on tasks
+- one state store should power TUI, web, headless, and RPC
+
+### Ralph
+
+Ralph's hat system is useful as a coordination topology:
+
+- hats declare triggers, publishes, instructions, backend overrides, max
+  activations, and disallowed tools
+- events flow through a bus
+- scope violations are detected when hats publish undeclared topics
+- exhaustion emits explicit events
+
+Useful for SF:
+
+- specialized helpers should declare trigger/publish contracts
+- helper activation should have max activation limits
+- helper output should be checked against declared output topics
+- mode transitions can be modeled as events, not ad hoc flags
+
+### Sage
+
+Sage's real value is runtime-neutral orchestration:
+
+- agents are processes
+- messages are files
+- tasks are templates with frontmatter
+- plans decompose into dependency waves
+- tasks in a wave execute in parallel
+- resume skips done tasks and resets stale running tasks
+- runtime fallback is explicit
+- bench-as-code compares actual agent CLIs on actual tasks
+
+Useful for SF:
+
+- `/tasks` should be file/DB-backed enough that headless tools can read it
+  without attaching to a live TUI
+- dependency waves should be visible in planning output
+- stale running work should be reset or surfaced clearly on resume
+- model/provider benchmarking should use actual SF workflows, not isolated
+  model prompts
+
+### AGX
+
+AGX has useful low-level patterns:
+
+- graph scheduler with hard, soft, failure, and always dependency conditions
+- max concurrent work slots
+- checkpoints with patch files and bounded history
+- deterministic verify gate before LLM fallback
+- repeated verification failure count that forces action
+
+Useful for SF:
+
+- dependency edges should support more than "depends on success"
+- checkpoints should store patch references and bounded summaries
+- deterministic verification should
+  always run before semantic/LLM review
+- repeated verify failures should force a mode transition to `repair` or
+  `review`, not keep retrying indefinitely
+
+### Wit
+
+Wit is the strongest coordination pattern for parallel edits:
+
+- agents declare intent before editing
+- agents acquire symbol-level locks
+- conflicts are warnings, not always hard blocks
+- contracts can be enforced by git hooks
+- Tree-sitter provides symbol ranges and call edges
+- a `coordinate` skill auto-loads when `.wit/` exists
+
+Useful for SF:
+
+- parallel SF workers should declare intent before editing
+- conflict detection should eventually be symbol-aware, not only file-aware
+- warnings can steer agents away from collisions without freezing work
+- accepted interface contracts should be enforceable before commit
+
+### skill-optimizer
+
+Skill optimizer has the best pattern for making skills real:
+
+- a case is a user-like task plus deterministic graders
+- a suite is a case/model matrix
+- references are copied into `/work`
+- the agent sees only `/work`, not graders or hidden answers
+- graders inspect files, artifacts, `answer.json`, `trace.jsonl`, and result
+  state
+- failed trials preserve workspace for debugging
+
+Useful for SF:
+
+- auto-created skills need eval cases
+- skill acceptance should be grader-backed, not vibes-backed
+- negative cases should check that irrelevant skills were not loaded
+- skill optimization should test across model modes/providers
+
+### Plandex And Forge Loop
+
+Plandex reinforces:
+
+- chat/tell split
+- configurable autonomy levels
+- cumulative diff sandbox before applying changes
+- model packs for planning vs execution
+
+Forge Loop reinforces:
+
+- R-numbered acceptance criteria
+- task DAGs with tiered parallelism
+- per-task worktrees
+- per-task and session token budgets
+- structural completion markers
+- backpropagation from runtime failure to spec gap
+- state on disk as the recovery source
+
+SF already has many of these ideas.
+The part to tighten is the explicit product
+surface: direct commands, visible modes, `/tasks`, schema-backed state, and
+skill evals.
+
 ## Status And Mode Badge
 
 The active state should always be visible, especially during full autonomy.
@@ -502,9 +752,14 @@ Still needed:
 - make `--autonomous` chain into direct `/autonomous`
 - add visible mode/status surface for TUI and web
 - expose autonomous continuation limits in settings and status
-- add `/tasks` as the unified background work surface
+- add `/tasks` as the unified background work surface with durable task state,
+  ephemeral running state, retries, blockers, checkpoints, budget, and steering
 - make `repair` a first-class workflow over doctor
 - add policy-aware project skill suggestion/generation
+- add skill eval cases for generated project skills
+- add schema-backed task/frontmatter fields for risk, mutation scope,
+  verification, plan approval, and runner status
+- add intent/claim records for parallel workers before editing
 - audit subagent provider/model/permission inheritance
 - audit remote steering as a full-session steering surface, not only remote
   question delivery
@@ -546,3 +801,242 @@ sf headless --autonomous ...
 The target model is simple: direct commands for humans, headless commands for
 machines, durable state for autonomous execution, and explicit axes for mode,
 control, trust, model posture, and surface.
+
+## Runtime Target: Node 26
+
+SF should treat Node 26 as the target runtime, with Node 24 kept as the current
+compatibility floor until the Node 26 lane is proven clean.
+
+Source notes checked 2026-05-08:
+
+- Node 24 is the current LTS line and this repo already requires `>=24.15.0`.
+- Node 25 is a short-lived current line. It is useful as a compatibility probe,
+  but not a target.
+- Node 26 is the next meaningful target: current now, LTS-bound, and useful for
+  SF's own runtime model.
+- Bun is closer to Node every release and supports many Node APIs plus
+  Node-API, but its compatibility target and partial API areas do not match
+  SF's risk surface yet.
+- Deno supports Node/npm compatibility, package.json, local node_modules, and
+  Node-API addons with FFI permission, but that means SF would still be running
+  a Node-compatibility workload.
+- LLRT is experimental and serverless-oriented, not a local CLI/runtime fit.
+
+### Why Node 26 Makes SF Stronger
+
+Node 26 is not just "newer Node." It gives SF a better platform for long-running
+agent work:
+
+- `Temporal` is enabled by default.
+- V8 14.6 is the JavaScript engine baseline.
+- Undici 8 is the HTTP/fetch baseline.
+- Node 26 removes and deprecates more legacy APIs, so it hardens SF against old
+  loader, stream, HTTP, crypto, and dependency assumptions.
+
+### Temporal Is More Than Better Dates
+
+Temporal gives SF the vocabulary it already needs for durable autonomous work.
+
+Important Temporal concepts:
+
+- `Temporal.Instant`: an exact point in history. Use for journal events,
+  checkpoint timestamps, lock leases, provider call start/end, and trace order.
+- `Temporal.ZonedDateTime`: an exact instant plus time zone and calendar. Use
+  for reminders, schedules, adoption reviews, audits, and "run this at local
+  business time" semantics.
+- `Temporal.PlainDate`: a calendar date without time or time zone. Use for
+  daily reports, milestone review dates, and human-facing due dates.
+- `Temporal.PlainTime`: a wall-clock time without date or zone. Use for
+  recurring "at 09:00" style policies.
+- `Temporal.PlainDateTime`: a date and wall-clock time before binding it to a
+  zone. Use only when the zone is deliberately chosen later.
+- `Temporal.Duration`: a typed amount of time. Use for budgets, leases,
+  cooldowns, retry delays, schedule offsets, and age checks.
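The concept list above can be made concrete with a short sketch. It assumes a runtime that exposes the global `Temporal` (which these notes say Node 26 enables by default); on a runtime without it the function simply returns `undefined`. The dates, the `Europe/Oslo` zone, and the name `temporalExamples` are illustrative and not taken from SF's code.

```typescript
// Assumption: `Temporal` is a global, as the notes state for Node 26.
// It is read through `globalThis` so the sketch still loads on older runtimes.
const TemporalApi = (globalThis as any).Temporal;

function temporalExamples() {
  if (!TemporalApi) return undefined;

  // Exact point in history: journal events, lease timestamps.
  const started = TemporalApi.Instant.from("2026-05-08T12:00:00Z");
  // Typed amount of time: a retry delay.
  const retryAfter = TemporalApi.Duration.from({ minutes: 30 });
  // Calendar date with no time or zone: a review date.
  const review = TemporalApi.PlainDate.from("2026-06-01");
  // Wall-clock time with no date or zone: a recurring policy time.
  const at = TemporalApi.PlainTime.from("09:00");
  // Bind date and time to a zone only once the zone is known.
  const reminder = review.toZonedDateTime({ timeZone: "Europe/Oslo", plainTime: at });

  return {
    retryAt: started.add(retryAfter).toString(),
    reminderInstant: reminder.toInstant().toString(),
    reviewDate: review.toString(),
  };
}
```

The review date stays a plain calendar date until a zone is attached, so the same record could be rebound to another zone without rewriting the stored value.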
+
+That split matters because SF currently has many different meanings hidden
+behind timestamps and strings:
+
+- exact event ordering
+- local user reminders
+- project schedule dates
+- lease expiry
+- retry backoff
+- adoption review windows
+- elapsed runtime
+- "next business day" style planning
+
+`Date` collapses those into one weak type. Temporal lets SF store and validate
+the real intent.
+
+### SF Runtime Places That Should Use Temporal
+
+Use Temporal first in the areas where wrong time semantics create real
+operational mistakes:
+
+- `sf schedule`: due dates, relative offsets, local-time reminders, audit
+  windows, and recurrence-ready storage.
+- autonomous locks and leases: exact `Instant` plus typed `Duration`, not
+  implicit millisecond math scattered through code.
+- journals and traces: exact event instants with stable ordering and explicit
+  serialization.
+- session reports: elapsed durations and grouped daily summaries without local
+  timezone drift.
+- adoption reviews and decision audits: calendar dates and wall-clock reminders
+  that survive DST and timezone changes.
+- background work surface: task age, stale-running detection, retry-after, and
+  next-action time should be typed.
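For the lock and lease item in the list above, a minimal sketch of an expiry check built on an exact `Instant` plus a typed `Duration` might look like this. `LeaseRecord` and `isLeaseExpired` are hypothetical names, the field layout is an assumption, and the function throws when the runtime has no global `Temporal`.

```typescript
// Hypothetical lease record: an exact acquisition instant plus a typed TTL,
// both stored as ISO 8601 strings.
interface LeaseRecord {
  holder: string;
  acquiredAt: string; // Instant, e.g. "2026-05-08T12:00:00Z"
  ttl: string;        // Duration, e.g. "PT15M"
}

// Returns true when the lease has expired at `nowIso`.
// Assumption: `Temporal` is a global; the function throws when it is missing.
function isLeaseExpired(lease: LeaseRecord, nowIso: string): boolean {
  const TemporalApi = (globalThis as any).Temporal;
  if (!TemporalApi) throw new Error("Temporal is not available in this runtime");

  const ttl = TemporalApi.Duration.from(lease.ttl);
  const expiresAt = TemporalApi.Instant.from(lease.acquiredAt).add(ttl);
  const now = TemporalApi.Instant.from(nowIso);
  // Instant.compare returns -1, 0, or 1; a lease is expired at or after expiry.
  return TemporalApi.Instant.compare(now, expiresAt) >= 0;
}
```

Passing `nowIso` in rather than reading the clock inside the function keeps the check deterministic, which also suits stale-running detection on resume.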
+
+### Temporal Design Rule For SF
+
+Store the semantic type, not just the formatted string:
+
+```text
+event happened exactly now        -> Instant
+run at 09:00 in Europe/Oslo       -> ZonedDateTime or PlainTime + timeZone
+review on 2026-06-01              -> PlainDate
+retry after 30 minutes            -> Duration
+lease expires at exact timestamp  -> Instant
+```
+
+Serialization should stay explicit and boring:
+
+- store ISO strings plus a field that says which Temporal type they represent
+- include timezone when wall-clock semantics matter
+- do not infer local timezone at read time unless the record explicitly asks
+  for it
+- validate schedule and lease records at DB boundaries
+
+### Node 26 Adoption Path
+
+Target policy:
+
+```text
+current compatibility floor: Node 24.15+
+internal target runtime:     Node 26
+canonical future baseline:   Node 26 after canary is clean
+Node 25:                     skip except quick probes
+```
+
+### Runtime Alternatives
+
+Other JavaScript runtimes are useful comparators, but none should replace Node
+as SF's primary runtime right now.
+
+SF's current runtime shape is Node-native:
+
+- npm workspaces and `package-lock.json`
+- Next.js standalone web host
+- Vitest and Node test-runner compatibility scripts
+- Rust N-API `.node` addons
+- `node-pty` native assets in the web host
+- `node:` built-ins across CLI, scripts, packages, and web services
+- child process, TTY, stream, module loader, and extension-loader behavior
+- installed runtime sync into `~/.sf/agent`
+
+#### Bun
+
+Bun is the strongest speed and developer-experience competitor.
+
+Useful:
+
+- fast package install and script startup
+- broad Node API compatibility
+- built-in TypeScript, test runner, shell, SQLite, YAML, TOML, JSONL, and other
+  convenience APIs
+- Node-API support is substantial enough to use as a compatibility probe
+
+Not primary for SF:
+
+- Bun's own docs say compatibility reflects Node v23, while SF is targeting
+  Node 26.
+- Some core APIs are partial or behaviorally different: `child_process`, module
+  loader hooks, `node:v8`, `node:test`, `node:sqlite`, `worker_threads`, and
+  inspector/debugger areas are not exact Node.
+- SF's highest-risk paths are exactly the places where "almost Node" can hurt:
+  TTY, child processes, native addons, Next standalone output, loaders, and
+  extension runtime.
+
+Decision: use Bun only for optional speed probes or isolated tooling. Do not
+make it the SF runtime until full `npm test`, web build, native build, smoke
+tests, and installed extension runtime all pass under Bun without special
+cases.
+
+#### Deno
+
+Deno has the best security and integrated-toolchain story.
+
+Useful:
+
+- explicit permissions model
+- first-class TypeScript and web standards
+- npm/package.json compatibility
+- Node-API support when local `node_modules` and FFI permission are enabled
+- good target for thinking about sandboxing and permission profiles
+
+Not primary for SF:
+
+- Deno still becomes a Node-compatibility mode for a repo like SF.
+- Deno docs recommend local `node_modules` for frameworks like Next.js and for
+  Node-API addons, which means SF would keep most Node/npm complexity anyway.
+- Native addons require local `node_modules` plus `--allow-ffi`.
+- The value would be security posture and packaging experiments, not simpler
+  runtime execution.
+
+Decision: study Deno for permission-profile design and maybe future packaged
+headless workers. Do not switch the core SF runtime to Deno.
+
+#### LLRT, WinterJS, Edge Runtimes
+
+These are not fits for SF's primary runtime.
+
+Useful:
+
+- serverless cold-start research
+- constrained worker/edge execution ideas
+- tiny isolated helper tasks
+
+Not primary for SF:
+
+- SF is a long-running local CLI/runtime, not a small stateless Lambda handler.
+- SF needs native addons, process control, TTY, filesystem state, git, shell,
+  Next web host, and Node-compatible package behavior.
+- LLRT is explicitly experimental and evaluation-oriented.
+
+Decision: ignore as primary runtime. Only revisit for isolated future worker
+surfaces.
+
+### Runtime Decision
+
+Node 26 is the target because SF is a Node-native agent runtime, not a generic
+JavaScript app.
+
+Use alternatives this way:
+
+```text
+Node 26 -> primary internal target and future baseline
+Bun     -> speed/compatibility probe, not runtime
+Deno    -> permission/sandbox design reference, not runtime
+LLRT    -> ignore except tiny serverless worker research
+```
+
+The rule is simple: if a runtime cannot run the exact SF stack without special
+cases, it is not stronger for SF. Node 26 makes the existing SF stack stronger;
+alternative runtimes mostly make a different stack.
+
+Required Node 26 gate:
+
+```text
+node@26 --version
+npm run lint
+npm run typecheck:extensions
+npm run build
+npm test
+sf --version
+sf --help
+sf --print "ping"
+```
+
+If Node 26 passes those gates, SF should run itself on Node 26 internally even
+before raising public `engines.node`. Once stable, raise the repo baseline and
+start replacing fragile `Date`/millisecond logic with Temporal in the schedule,
+lease, journal, and background task surfaces.
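The floor-versus-target policy in these notes could be checked at startup with something like the sketch below. The floor (`24.15`) and target (`26`) values are taken from the notes. The lane names and the `classifyNode` helper are illustrative assumptions, not an existing SF function.

```typescript
type RuntimeLane = "below-floor" | "compatibility" | "target";

// Values from the notes: compatibility floor Node 24.15+, target Node 26.
const FLOOR = { major: 24, minor: 15 };
const TARGET_MAJOR = 26;

// Classifies a Node version string such as "v26.0.0" or "24.15.0".
// Unparseable input falls through to "below-floor".
function classifyNode(version: string): RuntimeLane {
  const [major = 0, minor = 0] = version.replace(/^v/, "").split(".").map(Number);
  if (major >= TARGET_MAJOR) return "target";
  const meetsFloor =
    major > FLOOR.major || (major === FLOOR.major && minor >= FLOOR.minor);
  return meetsFloor ? "compatibility" : "below-floor";
}
```

In a real check the input would presumably be `process.versions.node`. A Node 25 version lands in the `compatibility` lane here, which matches its role in the notes as a quick probe and not a target.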