TODO: simplify md-tracking — drop snapshot blob, accept mid-edit corner

Final settled design: sha + git ref only, no DB content snapshots at all. The mid-edit case (file observed dirty) loses the ability to reconstruct the intermediate working-tree state, but the change-detection signal is preserved and the operator can commit first if intermediate fidelity matters. Trades a corner-case fidelity loss for a much simpler schema and no DB-vs-disk content duplication. Git remains the only version store; the DB row is a pure "where I left off" pointer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 19:49:25 +02:00 · eacbbaac82
commit eacbbaac82
parent 76923afb91
6 changed files with 23 additions and 29 deletions
--- a/.sf/wiki/ARCHITECTURE.md
+++ b/.sf/wiki/ARCHITECTURE.md
--- a/.sf/wiki/GLOSSARY.md
+++ b/.sf/wiki/GLOSSARY.md
--- a/.sf/wiki/INDEX.md
+++ b/.sf/wiki/INDEX.md
--- a/.sf/wiki/SUBSYSTEMS.md
+++ b/.sf/wiki/SUBSYSTEMS.md
--- a/.sf/wiki/WORKFLOWS.md
+++ b/.sf/wiki/WORKFLOWS.md
--- a/TODO.md
+++ b/TODO.md
@@ -118,45 +118,39 @@ Explicit out of scope:
  is expected, no signal in tracking.
 - `node_modules`, `dist`, vendored copies — irrelevant.

-Storage in `sf.db` — sha + git ref, with **snapshot only as a fallback
-for uncommitted observations**. SF generates many of these files
-itself; storing every version in the DB would duplicate disk + git
-for no benefit. But we still need a reference point to compute diffs
-against — that's the versioning question.
+Storage in `sf.db` — sha + git ref, no content snapshots. Git is the
+version store; the DB is just a pointer:

 ```sql
 CREATE TABLE tracked_md_files (
-  relpath              TEXT PRIMARY KEY,     -- repo-relative path
-  sha256               TEXT NOT NULL,        -- hash of last-seen content
-  size_bytes           INTEGER NOT NULL,
-  last_seen_at         TEXT NOT NULL,
-  last_seen_commit     TEXT,                 -- git SHA1 of HEAD when we saw it
-  uncommitted_snapshot BLOB,                 -- gzipped, ONLY if observed in working tree
-  category             TEXT                  -- 'meta'|'wiki'|'milestone'|'adr'|'plan'
+  relpath           TEXT PRIMARY KEY,  -- repo-relative path
+  sha256            TEXT NOT NULL,     -- hash of last-seen content
+  size_bytes        INTEGER NOT NULL,
+  last_seen_at      TEXT NOT NULL,
+  last_seen_commit  TEXT,              -- git SHA1 of HEAD when observed
+  category          TEXT               -- 'meta'|'wiki'|'milestone'|'adr'|'plan'
 );
 ```

-Versioning + diff source decision tree per file:
+Diff source priority:

-1. **Observed at commit X (file was clean at the time)** →
-   store `last_seen_commit = X`, `uncommitted_snapshot = NULL`. Diff
-   later = `git show X:<path>` vs current. Cheap, no DB blob.
+1. **Tracked + committed at observation** (the common case):
+   `git diff <last_seen_commit> -- <path>` shows everything since.
+   Cheap, no blob, perfect history via `git log <path>` if needed.

-2. **Observed with uncommitted changes (working-tree state at time of
-   observation)** → store `uncommitted_snapshot = gzip(content)`,
-   `last_seen_commit = HEAD-at-the-time-anyway`. Diff later = unpack
-   the snapshot vs current. Necessary because there is no git ref
-   that ever held that exact content.
+2. **Tracked + uncommitted at observation** (mid-edit corner): no git
+   ref points at that exact content. Diff shows "changed since
+   `<last_seen_commit>`" but the prior intermediate working-tree state
+   isn't reconstructable. Acceptable trade-off — the main signal is
+   "changed", and the operator can commit before letting SF observe
+   if intermediate fidelity matters.

-3. **File untracked or in .gitignore** (transient SF state, generated
-   artifacts) → either skip tracking entirely (preferred), or treat
-   it like case 2 (always store snapshot). Don't pretend a git ref
-   exists when it doesn't.
+3. **Untracked / gitignored**: not tracked in this table. SF-generated
+   transient files don't belong in version control or in this audit.

-In practice most md SF deals with is case 1 — committed at
-observation time — so the snapshot blob stays NULL for most rows. The
-DB stays small; the working-tree-edit corner case still has a clean
-diff.
+History per file = `git log <relpath>` (already there, free). SF's DB
+just records "where I left off." No `md_observation_log` history
+table unless someone has a concrete need for an SF-side timeline.

 On session start + each autonomous-cycle entry, walk the configured
 glob set, hash each file, diff against `tracked_md_files.sha256`.
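
Reviewer sketch (not part of the commit): the pointer-only design above could be exercised roughly as follows. This is a minimal illustration assuming Python's `sqlite3` and the `tracked_md_files` schema from the hunk; the helper names `classify`, `diff_argv`, and `record` are hypothetical and do not appear in SF. The `git diff` invocation is only built as an argument list, not executed.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone
from typing import List, Optional

# Schema copied from the diff; IF NOT EXISTS is added here for convenience.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tracked_md_files (
  relpath           TEXT PRIMARY KEY,
  sha256            TEXT NOT NULL,
  size_bytes        INTEGER NOT NULL,
  last_seen_at      TEXT NOT NULL,
  last_seen_commit  TEXT,
  category          TEXT
);
"""


def classify(conn: sqlite3.Connection, relpath: str, content: bytes) -> str:
    """Return 'new', 'unchanged', or 'changed' by comparing content hashes."""
    row = conn.execute(
        "SELECT sha256 FROM tracked_md_files WHERE relpath = ?", (relpath,)
    ).fetchone()
    if row is None:
        return "new"
    current = hashlib.sha256(content).hexdigest()
    return "unchanged" if row[0] == current else "changed"


def diff_argv(conn: sqlite3.Connection, relpath: str) -> Optional[List[str]]:
    """Build the `git diff <last_seen_commit> -- <path>` command, if a ref exists."""
    row = conn.execute(
        "SELECT last_seen_commit FROM tracked_md_files WHERE relpath = ?",
        (relpath,),
    ).fetchone()
    if row is None or row[0] is None:
        return None  # nothing recorded, so there is no git ref to diff against
    return ["git", "diff", row[0], "--", relpath]


def record(
    conn: sqlite3.Connection,
    relpath: str,
    content: bytes,
    head_commit: Optional[str],
    category: Optional[str] = None,
) -> None:
    """Overwrite the pointer row: hash, size, timestamp, and HEAD at observation."""
    conn.execute(
        "INSERT OR REPLACE INTO tracked_md_files "
        "(relpath, sha256, size_bytes, last_seen_at, last_seen_commit, category) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            relpath,
            hashlib.sha256(content).hexdigest(),
            len(content),
            datetime.now(timezone.utc).isoformat(),
            head_commit,
            category,
        ),
    )
```

Note that in the mid-edit corner, `record` stores the hash of the dirty content while `last_seen_commit` still names HEAD, so a later `git diff` reports everything since that commit rather than since the observed working-tree state. That is the fidelity loss this commit accepts.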