perf(provider): optimize RocksDB history pruning with changeset-based approach #20674
meetrick wants to merge 8 commits into paradigmxyz:main
Conversation
@Rjected shouldn't this work with static file changesets too?
Yes, I think you would just need to use static file changeset iterators for this |
… approach
Use MDBX changesets to identify affected (address, storage_key) pairs instead of iterating the entire StoragesHistory/AccountsHistory tables. This significantly improves pruning performance for mainnet-scale databases.
Closes paradigmxyz#20417
Signed-off-by: Hwangjae Lee <meetrick@gmail.com>
Verify no excess entries remain after optimized pruning via `last()`. Falls back to full scan if entries are missed due to incomplete changesets. Signed-off-by: Hwangjae Lee <meetrick@gmail.com>
Add tests verifying the defensive check catches RocksDB entries missed by changeset-based pruning when MDBX data is incomplete.
- test_storages_history_defensive_check_catches_missed_entries
- test_accounts_history_defensive_check_catches_missed_entries
Signed-off-by: Hwangjae Lee <meetrick@gmail.com>
Signed-off-by: Hwangjae Lee <meetrick@gmail.com>
Force-pushed from 1394843 to 443efc6
I took a closer look at the codebase and here's what I found.
In this PR, the optimized path relies on MDBX changesets, with a fallback to a full table scan. Could you clarify what you meant by "static file changeset iterators" in this context?
Thanks!
Signed-off-by: Hwangjae Lee <meetrick@gmail.com>
Force-pushed from 3722adc to 95524ed
@Rjected @gakonst I see that #18882 has been merged. I took a look at the code, and it seems that no additional refactoring is needed for this PR.
Please let me know if I'm missing anything, and I'd appreciate your thoughts on this.
The iter_from method was incorrectly accessing `.db` field directly on RocksDBProviderInner, which is an enum and doesn't expose db as a public field. This was introduced during merge conflict resolution. Use the existing iterator_cf() helper method that properly handles both ReadWrite and ReadOnly variants. Signed-off-by: Hwangjae Lee <meetrick@gmail.com>
Fixes several issues in changeset-based pruning:
1. Use HashSet instead of Vec for O(1) deduplication (was O(n²))
2. Fix defensive check: the `last()` approach only checks the lexicographically last entry, missing entries for earlier addresses. Now always does a full verification scan after the optimized path.
3. Remove verbose AI-generated comments that just restate the code
4. Simplify doc comments to be concise
Net reduction: 60 lines of code (-100/+40)
Addresses review feedback from PR paradigmxyz#20674
Hey @meetrick, for this PR and RocksDB we have decided to take it in-house. Sorry about that.
@yongkangc Understood, thanks for the update.
Summary
Optimizes `StoragesHistory` and `AccountsHistory` pruning in RocksDB by using
MDBX changesets instead of iterating the entire history tables.
Problem
During consistency checks (e.g. after crash recovery), RocksDB history pruning
currently scans the entire `StoragesHistory`/`AccountsHistory` tables to
delete entries above a given block.
On mainnet-scale databases, this results in unnecessary full table scans and
slow startup times.
Solution
Use MDBX `StorageChangeSets` and `AccountChangeSets` to identify only the
(address, key) pairs that changed in the excess block range, and prune the
corresponding history entries using `iter_from` seeks.
If changeset data is unavailable or incomplete, the logic safely falls back
to a full table scan to preserve correctness.
Key Changes
- Add `iter_from` to `RocksDBProvider` for efficient seek-based iteration

Testing
- `test_storages_history_defensive_check_catches_missed_entries`
- `test_accounts_history_defensive_check_catches_missed_entries`