Problem
dolt_diff_<table> full-history queries walk every commit in the graph. For tables that change infrequently, most of that work is wasted.
What's already done
- Streaming results:
dolt_diff_<table> uses prollyDiffIterOpen / prollyDiffIterStep — a lazy iterator over the prolly tree diff. Results flow through xNext one change at a time with no intermediate buffer.
- Skip unchanged commits:
appendCurrentDiffPair (doltlite_diff_table.c:377-379) compares the table root + schema hash between each commit and its child. If both match, the commit is skipped and no prolly diff iterator is opened.
Remaining bottleneck
Even with the skip, the walker still loads every commit in the graph to check: doltliteLoadCommit + doltliteLoadCatalog + linear scan for the table name. For a table that changed in 100 out of 10K commits, that's 9900 unnecessary commit loads + catalog deserializations.
Proposed fix: per-table commit index
At commit time, for each table whose root hash changed from parent, write a (table_id, commit_hash) entry into a side index in the chunk store. Then dolt_diff_<table> walks only the indexed commits instead of every commit in the graph.
The index is small (one 20-byte hash per table per commit that touched it) and the write-time cost is one root-hash comparison per table per commit — which we already compute during the commit flow anyway.
This is the same optimization git log -- path uses: git can skip commits where the subtree hash for the directory didn't change because it has tree objects with hashes at each level.
Problem
dolt_diff_<table>full-history queries walk every commit in the graph. For tables that change infrequently, most of that work is wasted.What's already done
dolt_diff_<table>usesprollyDiffIterOpen/prollyDiffIterStep— a lazy iterator over the prolly tree diff. Results flow through xNext one change at a time with no intermediate buffer.appendCurrentDiffPair(doltlite_diff_table.c:377-379) compares the table root + schema hash between each commit and its child. If both match, the commit is skipped and no prolly diff iterator is opened.Remaining bottleneck
Even with the skip, the walker still loads every commit in the graph to check:
doltliteLoadCommit+doltliteLoadCatalog+ linear scan for the table name. For a table that changed in 100 out of 10K commits, that's 9900 unnecessary commit loads + catalog deserializations.Proposed fix: per-table commit index
At commit time, for each table whose root hash changed from parent, write a (table_id, commit_hash) entry into a side index in the chunk store. Then
dolt_diff_<table>walks only the indexed commits instead of every commit in the graph.The index is small (one 20-byte hash per table per commit that touched it) and the write-time cost is one root-hash comparison per table per commit — which we already compute during the commit flow anyway.
This is the same optimization
git log -- pathuses: git can skip commits where the subtree hash for the directory didn't change because it has tree objects with hashes at each level.