Skip to content

Diff performance: add per-table commit index to skip unchanged-table commit loads #224

@timsehn

Description

@timsehn

Problem

dolt_diff_<table> full-history queries walk every commit in the graph. For tables that change infrequently, most of that work is wasted.

What's already done

  • Streaming results: dolt_diff_<table> uses prollyDiffIterOpen / prollyDiffIterStep — a lazy iterator over the prolly tree diff. Results flow through xNext one change at a time with no intermediate buffer.
  • Skip unchanged commits: appendCurrentDiffPair (doltlite_diff_table.c:377-379) compares the table root + schema hash between each commit and its child. If both match, the commit is skipped and no prolly diff iterator is opened.

Remaining bottleneck

Even with the skip, the walker still loads every commit in the graph to check: doltliteLoadCommit + doltliteLoadCatalog + linear scan for the table name. For a table that changed in 100 out of 10K commits, that's 9900 unnecessary commit loads + catalog deserializations.

Proposed fix: per-table commit index

At commit time, for each table whose root hash changed from parent, write a (table_id, commit_hash) entry into a side index in the chunk store. Then dolt_diff_<table> walks only the indexed commits instead of every commit in the graph.

The index is small (one 20-byte hash per table per commit that touched it) and the write-time cost is one root-hash comparison per table per commit — which we already compute during the commit flow anyway.

This is the same optimization git log -- path uses: git can skip commits where the subtree hash for the directory didn't change because it has tree objects with hashes at each level.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions