From 76923afb919681e0210a5828b40dd66c852b78a2 Mon Sep 17 00:00:00 2001
From: Mikael Hugo
Date: Mon, 11 May 2026 19:46:38 +0200
Subject: [PATCH] TODO: md-tracking needs a version reference, not just a content sha

Without storing snapshots we lose the ability to diff against "what SF
last saw". The fix is hybrid: store the git commit SHA1 that contained
the observed content (cheap, no DB blob), and only fall back to a
gzipped snapshot when the file was observed with uncommitted changes
(no git ref exists for that exact content).

For ".sf/-generated, untracked, in .gitignore" the right answer is to
not track them in this table at all.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 TODO.md | 49 +++++++++++++++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/TODO.md b/TODO.md
index 68699fd60..ba24fa82b 100644
--- a/TODO.md
+++ b/TODO.md
@@ -118,32 +118,45 @@ Explicit out of scope:
   is expected, no signal in tracking.
 - `node_modules`, `dist`, vendored copies — irrelevant.
 
-Storage in `sf.db` — **shas only, no content snapshots**. SF generates
-many of these files itself; caching their contents in the DB would
-duplicate disk + git for no benefit:
+Storage in `sf.db` — sha + git ref, with **snapshot only as a fallback
+for uncommitted observations**. SF generates many of these files
+itself; storing every version in the DB would duplicate disk + git
+for no benefit. But we still need a reference point to compute diffs
+against — that's the versioning question.
 
 ```sql
 CREATE TABLE tracked_md_files (
-  relpath      TEXT PRIMARY KEY,   -- repo-relative path
-  sha256       TEXT NOT NULL,      -- hash of last-seen content
-  size_bytes   INTEGER NOT NULL,
-  last_seen_at TEXT NOT NULL,
-  category     TEXT                -- 'meta'|'wiki'|'milestone'|'adr'|'plan'
+  relpath              TEXT PRIMARY KEY,   -- repo-relative path
+  sha256               TEXT NOT NULL,      -- hash of last-seen content
+  size_bytes           INTEGER NOT NULL,
+  last_seen_at         TEXT NOT NULL,
+  last_seen_commit     TEXT,               -- git SHA1 of HEAD when we saw it
+  uncommitted_snapshot BLOB,               -- gzipped, ONLY if observed in working tree
+  category             TEXT                -- 'meta'|'wiki'|'milestone'|'adr'|'plan'
 );
 ```
 
-For diff source, use **git** (these are all tracked files; if they're
-not, the agent should add them or skip tracking that path):
+Versioning + diff source decision tree per file:
 
-```
-git show HEAD: ← what was committed
- ← what's on disk now
-diff the two ← what changed since the last commit
-```
+1. **Observed at commit X (file was clean at the time)** →
+   store `last_seen_commit = X`, `uncommitted_snapshot = NULL`. Diff
+   later = `git show X:` vs current. Cheap, no DB blob.
 
-This naturally handles "the operator edited but hasn't committed yet"
-(diff shows the working-tree change) and "another agent committed and
-SF wasn't running" (diff shows the new commit).
+2. **Observed with uncommitted changes (working-tree state at time of
+   observation)** → store `uncommitted_snapshot = gzip(content)`,
+   `last_seen_commit = HEAD-at-the-time-anyway`. Diff later = unpack
+   the snapshot vs current. Necessary because there is no git ref
+   that ever held that exact content.
+
+3. **File untracked or in .gitignore** (transient SF state, generated
+   artifacts) → either skip tracking entirely (preferred), or treat
+   it like case 2 (always store snapshot). Don't pretend a git ref
+   exists when it doesn't.
+
+In practice most md SF deals with is case 1 — committed at
+observation time — so the snapshot blob stays NULL for most rows. The
+DB stays small; the working-tree-edit corner case still has a clean
+diff.
 
 On session start + each autonomous-cycle entry, walk the configured
 glob set, hash each file, diff against `tracked_md_files.sha256`.
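
Reviewer sketch (not part of the patch): the three-case decision tree in the hunk could be written roughly as below. This is a minimal illustration under stated assumptions, not SF's implementation. The function name `record_observation` and its parameters are hypothetical; the caller is assumed to have already worked out `is_tracked` and `is_clean` from git (for example via `git ls-files` and `git status --porcelain`) and to pass in the current HEAD SHA. The schema is copied from the new side of the diff.

```python
import gzip
import hashlib
import sqlite3
from typing import Optional

# Schema as proposed on the "+" side of the hunk.
SCHEMA = """
CREATE TABLE tracked_md_files (
  relpath              TEXT PRIMARY KEY,
  sha256               TEXT NOT NULL,
  size_bytes           INTEGER NOT NULL,
  last_seen_at         TEXT NOT NULL,
  last_seen_commit     TEXT,
  uncommitted_snapshot BLOB,
  category             TEXT
);
"""


def record_observation(
    db: sqlite3.Connection,
    relpath: str,
    content: bytes,
    head_commit: str,
    is_tracked: bool,   # assumption: caller derived this from git
    is_clean: bool,     # assumption: True if content matches head_commit
    seen_at: str,
    category: Optional[str] = None,
) -> bool:
    """Store one observation; return False if the file was skipped."""
    if not is_tracked:
        # Case 3 (preferred branch): untracked or ignored, so no row at all.
        return False
    # Case 1: clean at HEAD, the commit SHA alone is a sufficient reference.
    # Case 2: dirty working tree, keep a gzipped copy because no git ref
    # ever held this exact content. HEAD is recorded in both cases.
    snapshot = None if is_clean else gzip.compress(content)
    db.execute(
        "INSERT OR REPLACE INTO tracked_md_files VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            relpath,
            hashlib.sha256(content).hexdigest(),
            len(content),
            seen_at,
            head_commit,
            snapshot,
            category,
        ),
    )
    return True
```

Taking the git state as plain booleans keeps the storage rule testable without a repository; a later diff would read `uncommitted_snapshot` when it is non-NULL and fall back to the recorded commit otherwise.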