TODO: md-tracking needs a version reference, not just a content sha

Without storing snapshots we lose the ability to diff against "what SF last saw". The fix is hybrid: store the git commit SHA1 that contained the observed content (cheap, no DB blob), and fall back to a gzipped snapshot only when the file was observed with uncommitted changes (no git ref exists for that exact content). For files that are ".sf/-generated, untracked, in .gitignore", the right answer is not to track them in this table at all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 19:46:38 +02:00
commit 76923afb91
parent 296054b1d4
1 changed file with 31 additions and 18 deletions
--- a/TODO.md
+++ b/TODO.md
@@ -118,32 +118,45 @@ Explicit out of scope:
  is expected, no signal in tracking.
 - `node_modules`, `dist`, vendored copies — irrelevant.

-Storage in `sf.db` — **shas only, no content snapshots**. SF generates
-many of these files itself; caching their contents in the DB would
-duplicate disk + git for no benefit:
+Storage in `sf.db` — sha + git ref, with **snapshot only as a fallback
+for uncommitted observations**. SF generates many of these files
+itself; storing every version in the DB would duplicate disk + git
+for no benefit. But we still need a reference point to compute diffs
+against — that's the versioning question.

 ```sql
 CREATE TABLE tracked_md_files (
-  relpath        TEXT PRIMARY KEY,        -- repo-relative path
-  sha256         TEXT NOT NULL,           -- hash of last-seen content
-  size_bytes     INTEGER NOT NULL,
-  last_seen_at   TEXT NOT NULL,
-  category       TEXT                     -- 'meta'|'wiki'|'milestone'|'adr'|'plan'
+  relpath              TEXT PRIMARY KEY,     -- repo-relative path
+  sha256               TEXT NOT NULL,        -- hash of last-seen content
+  size_bytes           INTEGER NOT NULL,
+  last_seen_at         TEXT NOT NULL,
+  last_seen_commit     TEXT,                 -- git SHA1 of HEAD when we saw it
+  uncommitted_snapshot BLOB,                 -- gzipped, ONLY if observed in working tree
+  category             TEXT                  -- 'meta'|'wiki'|'milestone'|'adr'|'plan'
 );
 ```

-For diff source, use **git** (these are all tracked files; if they're
-not, the agent should add them or skip tracking that path):
+Versioning + diff source decision tree per file:

-```
-git show HEAD:<relpath>     ← what was committed
-<relpath>                   ← what's on disk now
-diff the two                ← what changed since the last commit
-```
+1. **Observed at commit X (file was clean at the time)** →
+   store `last_seen_commit = X`, `uncommitted_snapshot = NULL`. Diff
+   later = `git show X:<path>` vs current. Cheap, no DB blob.

-This naturally handles "the operator edited but hasn't committed yet"
-(diff shows the working-tree change) and "another agent committed and
-SF wasn't running" (diff shows the new commit).
+2. **Observed with uncommitted changes (working-tree state at time of
+   observation)** → store `uncommitted_snapshot = gzip(content)`,
+   `last_seen_commit = HEAD-at-the-time-anyway`. Diff later = unpack
+   the snapshot vs current. Necessary because there is no git ref
+   that ever held that exact content.
+
+3. **File untracked or in .gitignore** (transient SF state, generated
+   artifacts) → either skip tracking entirely (preferred), or treat
+   it like case 2 (always store snapshot). Don't pretend a git ref
+   exists when it doesn't.
+
+In practice most md SF deals with is case 1 — committed at
+observation time — so the snapshot blob stays NULL for most rows. The
+DB stays small; the working-tree-edit corner case still has a clean
+diff.

 On session start + each autonomous-cycle entry, walk the configured
 glob set, hash each file, diff against `tracked_md_files.sha256`.
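
The three-case decision tree and the sha256 comparison described in this hunk could be sketched roughly as follows. This is a minimal illustration, not code from the commit: the function names (`record_observation`, `baseline_for_diff`, `has_changed`, `git_show`) are hypothetical, and the `tracked` and `clean` flags are assumed to be computed by the caller (for example from `git status`). Case 3 takes the preferred option of skipping the file.

```python
import gzip
import hashlib
import sqlite3
import subprocess
from typing import Callable, Optional

# Same columns as the tracked_md_files table in the diff above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tracked_md_files (
  relpath              TEXT PRIMARY KEY,
  sha256               TEXT NOT NULL,
  size_bytes           INTEGER NOT NULL,
  last_seen_at         TEXT NOT NULL,
  last_seen_commit     TEXT,
  uncommitted_snapshot BLOB,
  category             TEXT
);
"""


def record_observation(conn: sqlite3.Connection, relpath: str, content: bytes,
                       head_commit: Optional[str], tracked: bool, clean: bool,
                       seen_at: str, category: Optional[str] = None) -> bool:
    """Store one observation; return False when the file is skipped."""
    if not tracked:
        return False  # case 3: untracked or ignored, no row at all
    # Case 1 (clean at commit): no blob. Case 2 (dirty): gzipped snapshot.
    snapshot = None if clean else gzip.compress(content)
    conn.execute(
        "INSERT OR REPLACE INTO tracked_md_files "
        "(relpath, sha256, size_bytes, last_seen_at, last_seen_commit, "
        " uncommitted_snapshot, category) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (relpath, hashlib.sha256(content).hexdigest(), len(content),
         seen_at, head_commit, snapshot, category),
    )
    return True


def git_show(commit: str, relpath: str) -> bytes:
    """Read a file's content at a commit via `git show <commit>:<path>`."""
    result = subprocess.run(["git", "show", f"{commit}:{relpath}"],
                            capture_output=True, check=True)
    return result.stdout


def baseline_for_diff(conn: sqlite3.Connection, relpath: str,
                      show: Callable[[str, str], bytes] = git_show
                      ) -> Optional[bytes]:
    """Return the content SF last saw, or None if the path has no row."""
    row = conn.execute(
        "SELECT last_seen_commit, uncommitted_snapshot "
        "FROM tracked_md_files WHERE relpath = ?", (relpath,)).fetchone()
    if row is None:
        return None
    commit, snapshot = row
    if snapshot is not None:
        return gzip.decompress(snapshot)  # case 2: unpack the stored blob
    return show(commit, relpath)          # case 1: ask git for the content


def has_changed(conn: sqlite3.Connection, relpath: str, content: bytes) -> bool:
    """Compare the current content hash with the stored sha256."""
    row = conn.execute("SELECT sha256 FROM tracked_md_files WHERE relpath = ?",
                       (relpath,)).fetchone()
    return row is None or row[0] != hashlib.sha256(content).hexdigest()
```

Passing `show` as a parameter keeps the git call swappable, so the lookup logic can be exercised against an in-memory database without a real repository. A path with no row is reported as changed, on the assumption that a never-seen file should be treated as new.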