TODO: md-tracking needs a version reference, not just a content sha

Without storing snapshots we lose the ability to diff against
"what SF last saw". The fix is hybrid: store the git commit SHA1
that contained the observed content (cheap, no DB blob), and only
fall back to a gzipped snapshot when the file was observed with
uncommitted changes (no git ref exists for that exact content).

Files that are .sf/-generated, untracked, or listed in .gitignore
should not be tracked in this table at all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mikael Hugo 2026-05-11 19:46:38 +02:00
parent 296054b1d4
commit 76923afb91

TODO.md

@@ -118,32 +118,45 @@ Explicit out of scope:
is expected, no signal in tracking.
- `node_modules`, `dist`, vendored copies — irrelevant.
Storage in `sf.db`: **shas only, no content snapshots**. SF generates
many of these files itself; caching their contents in the DB would
duplicate disk + git for no benefit:
Storage in `sf.db` — sha + git ref, with **snapshot only as a fallback
for uncommitted observations**. SF generates many of these files
itself; storing every version in the DB would duplicate disk + git
for no benefit. But we still need a reference point to compute diffs
against — that's the versioning question.
```sql
CREATE TABLE tracked_md_files (
relpath TEXT PRIMARY KEY, -- repo-relative path
sha256 TEXT NOT NULL, -- hash of last-seen content
size_bytes INTEGER NOT NULL,
last_seen_at TEXT NOT NULL,
category TEXT -- 'meta'|'wiki'|'milestone'|'adr'|'plan'
relpath TEXT PRIMARY KEY, -- repo-relative path
sha256 TEXT NOT NULL, -- hash of last-seen content
size_bytes INTEGER NOT NULL,
last_seen_at TEXT NOT NULL,
last_seen_commit TEXT, -- git SHA1 of HEAD when we saw it
uncommitted_snapshot BLOB, -- gzipped, ONLY if observed in working tree
category TEXT -- 'meta'|'wiki'|'milestone'|'adr'|'plan'
);
```
For diff source, use **git** (these are all tracked files; if they're
not, the agent should add them or skip tracking that path):
Versioning + diff source decision tree per file:
```
git show HEAD:<relpath> ← what was committed
<relpath> ← what's on disk now
diff the two ← what changed since the last commit
```
1. **Observed at commit X (file was clean at the time)** →
store `last_seen_commit = X`, `uncommitted_snapshot = NULL`. Diff
later = `git show X:<path>` vs current. Cheap, no DB blob.
This naturally handles "the operator edited but hasn't committed yet"
(diff shows the working-tree change) and "another agent committed and
SF wasn't running" (diff shows the new commit).
2. **Observed with uncommitted changes (working-tree state at time of
observation)** → store `uncommitted_snapshot = gzip(content)`,
`last_seen_commit = HEAD-at-the-time-anyway`. Diff later = unpack
the snapshot vs current. Necessary because there is no git ref
that ever held that exact content.
3. **File untracked or in .gitignore** (transient SF state, generated
artifacts) → either skip tracking entirely (preferred), or treat
it like case 2 (always store snapshot). Don't pretend a git ref
exists when it doesn't.
In practice, most of the md that SF deals with falls under case 1
(committed at observation time), so the snapshot blob stays NULL for
most rows. The DB stays small, and the working-tree-edit corner case
still gets a clean diff.
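The three cases above could be sketched roughly as follows. This is an illustrative Python sketch, not SF's actual code: the function names `observe` and `diff_base`, the `tracked` and `clean` flags, and the injected `git_show` callable (which would wrap `git show <commit>:<relpath>`) are all assumptions made for the example. Only the column names come from the schema.

```python
import gzip
import hashlib
from typing import Callable, Optional


def observe(content: bytes, head_commit: Optional[str],
            tracked: bool, clean: bool) -> Optional[dict]:
    """Build the column values for one observed file, or None to skip it.

    `tracked` and `clean` are hypothetical inputs the caller would derive
    from git (for example from `git ls-files` and `git status --porcelain`).
    """
    if not tracked or head_commit is None:
        return None  # case 3: no git ref exists, so the preferred answer is to skip
    row = {
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
        "last_seen_commit": head_commit,
        "uncommitted_snapshot": None,  # case 1: the commit already holds this content
    }
    if not clean:
        # case 2: no commit ever held this exact content, so keep a gzipped copy
        row["uncommitted_snapshot"] = gzip.compress(content)
    return row


def diff_base(row: dict, relpath: str,
              git_show: Callable[[str, str], bytes]) -> bytes:
    """Return the content SF last saw, to be diffed against the file on disk."""
    if row["uncommitted_snapshot"] is not None:
        return gzip.decompress(row["uncommitted_snapshot"])
    return git_show(row["last_seen_commit"], relpath)
```

Passing `git_show` in as a parameter keeps the decision logic separate from the git invocation, so the branch selection can be exercised without a repository.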
On session start + each autonomous-cycle entry, walk the configured
glob set, hash each file, diff against `tracked_md_files.sha256`.
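A minimal sketch of that scan step, in Python for illustration only. The name `changed_files`, the `patterns` argument, and the `known` mapping (assumed to be loaded from something like `SELECT relpath, sha256 FROM tracked_md_files`) are hypothetical; the TODO does not say how SF configures its glob set.

```python
import hashlib
from pathlib import Path


def changed_files(root: Path, patterns: list[str],
                  known: dict[str, str]) -> dict[str, str]:
    """Map relpath -> 'new' or 'modified' for files whose hash differs from `known`."""
    result = {}
    for pattern in patterns:
        for path in sorted(root.glob(pattern)):
            if not path.is_file():
                continue
            # Repo-relative, forward-slash path, matching the `relpath` column
            relpath = path.relative_to(root).as_posix()
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if relpath not in known:
                result[relpath] = "new"
            elif known[relpath] != digest:
                result[relpath] = "modified"
    return result
```

Files that appear in `known` but no longer exist on disk are not reported by this sketch; detecting deletions would need a separate pass over the stored rows.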