amesgen (Member) commented May 19, 2025

This is in preparation for #1424

This PR is intended to be reviewed commit-by-commit.

Currently, we prune the LedgerDB (i.e. remove all but the last k+1 states) every time we adopt a longer chain. This means that we cannot rely on other threads (like the copyAndSnapshot ChainDB background thread) actually observing all immutable ledger states, just as described in the caveats of our Watcher abstraction.

However, a predictable ledger snapshotting rule (#1424) requires this property; otherwise, when the node is under high load and/or we are adopting multiple blocks in quick succession, the node might not be able to create a snapshot for its desired block.

This PR changes this: when adopting new blocks, the LedgerDB is no longer pruned immediately. Instead, a new dedicated background thread for ledger maintenance tasks (flushing/snapshotting/garbage collection) in the ChainDB wakes up periodically (on every new immutable block) and, in particular, garbage collects the LedgerDB based on a slot number.

This also makes the semantics more consistent with the existing garbage collection of previously applied blocks in the LedgerDB, and with how the ChainDB works: there, too, we do not immediately delete blocks from the VolatileDB once they are buried beneath k+1 blocks.
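To make the difference concrete, here is a minimal toy model (not the actual LedgerDB code). It assumes a ledger state can be identified by a plain `Int` slot and that the sequence is stored oldest-first; the names `adoptEager`, `adoptDeferred` and `immutableStates` are made up for this illustration.

```haskell
-- Toy model only: a ledger state is represented by the slot of its block,
-- and the sequence is ordered oldest first (the last element is the tip).
-- None of these names exist in ouroboros-consensus.
type Slot = Int

-- Old behaviour: adopting a block immediately prunes to the last k+1 states.
adoptEager :: Int -> [Slot] -> Slot -> [Slot]
adoptEager k states new =
  let extended = states ++ [new]
   in drop (length extended - (k + 1)) extended

-- New behaviour: adopting a block only appends; pruning is left to a
-- background thread that garbage collects by slot later on.
adoptDeferred :: [Slot] -> Slot -> [Slot]
adoptDeferred states new = states ++ [new]

-- States that have at least k successors, i.e. the ones a background
-- thread could still observe as immutable when it wakes up.
immutableStates :: Int -> [Slot] -> [Slot]
immutableStates k states = take (length states - k) states
```

With `k = 2`, starting from `[1,2,3]` and adopting slots 4, 5 and 6 in quick succession, `immutableStates 2 (foldl (adoptEager 2) [1,2,3] [4,5,6])` is `[4]`, whereas `immutableStates 2 (foldl adoptDeferred [1,2,3] [4,5,6])` is `[1,2,3,4]`. In the deferred model, a thread that wakes up late can still see every state that became immutable in the meantime.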

See #1513 (comment) for benchmarks demonstrating that the peak memory usage does not increase while syncing (where we now briefly might hold more than k+1 ledger states in memory).

@amesgen amesgen changed the base branch from cardano-node-10.4-backports to main May 20, 2025 15:03
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch 2 times, most recently from 8b48bb3 to 045f1cc Compare May 20, 2025 15:15
@amesgen amesgen changed the base branch from main to amesgen/v2-ledgerseq-close May 20, 2025 15:15
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch 4 times, most recently from 13e5533 to 68402ed Compare May 20, 2025 17:25
jasagredo (Contributor) left a comment

Looks good.

amesgen (Member, Author) commented Jun 5, 2025

Sync benchmarks are looking good (mainnet, first 1e6 slots/blocks):

[Image: sync-inmem]

LMDB benchmark (of course, this is a bit degenerate as Byron doesn't have tables, but this still serves as a regression test for the DbChangelog aspects which are touched by this PR).

[Image: sync-lmdb]

Note that d1b6215 is crucial; otherwise, there is a significant (2x) regression in max heap size.

Base automatically changed from amesgen/v2-ledgerseq-close to main June 5, 2025 21:18
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch 2 times, most recently from 2e01b1c to b9e25f5 Compare June 10, 2025 15:47
@amesgen amesgen changed the base branch from main to amesgen/ledgerdb-v2-locking June 10, 2025 15:49
@amesgen amesgen force-pushed the amesgen/ledgerdb-v2-locking branch from 19faf20 to 4010598 Compare June 10, 2025 17:54
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch from b9e25f5 to 894940c Compare June 10, 2025 18:09
Base automatically changed from amesgen/ledgerdb-v2-locking to main June 11, 2025 09:07
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch from 894940c to a8fa7e2 Compare June 30, 2025 08:11
@amesgen amesgen marked this pull request as ready for review June 30, 2025 08:22
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch from b503dc3 to 6c78fad Compare July 2, 2025 08:14
@amesgen amesgen changed the base branch from main to amesgen/ledgerdb-state-machine-precondition-bug July 2, 2025 08:16
amesgen (Member, Author) commented Jul 2, 2025

Thanks for the great reviews, I hope I addressed your comments. Interesting changes:

Base automatically changed from amesgen/ledgerdb-state-machine-precondition-bug to main July 2, 2025 14:48
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch 2 times, most recently from 48bb1fe to ad7acfa Compare July 9, 2025 12:26
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch from ad7acfa to d791dfb Compare August 4, 2025 11:27
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch from d791dfb to 247c489 Compare August 8, 2025 14:13
It is not necessary to perform the garbage collection of the LedgerDB and the
map of invalid blocks in the same STM transaction. In the past, this was
important, but it no longer is; see #1507.

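A rough sketch of what "separate STM transactions" means here, using plain `GHC.Conc` STM from `base`. The `Env` record, its fields and `garbageCollect` are hypothetical stand-ins, not the ChainDB types, and it is an assumption of this sketch that both structures are collected by the same slot.

```haskell
import GHC.Conc (TVar, atomically, newTVarIO, readTVar, readTVarIO, writeTVar)

type Slot = Int

-- Hypothetical stand-ins for the two pieces of state mentioned above.
data Env = Env
  { ledgerStates  :: TVar [Slot]            -- slots of retained ledger states, oldest first
  , invalidBlocks :: TVar [(String, Slot)]  -- (block hash, slot) of known-invalid blocks
  }

-- Each structure is collected in its own transaction, so other threads may
-- observe the intermediate state where only the first one has been pruned.
garbageCollect :: Env -> Slot -> IO ()
garbageCollect env slot = do
  atomically $ do
    states <- readTVar (ledgerStates env)
    writeTVar (ledgerStates env) (dropWhile (< slot) states)
  atomically $ do
    invalid <- readTVar (invalidBlocks env)
    writeTVar (invalidBlocks env) (filter ((>= slot) . snd) invalid)

-- Small driver: build an environment, collect before slot 4, read back.
demo :: IO ([Slot], [(String, Slot)])
demo = do
  env <- Env <$> newTVarIO [1 .. 6] <*> newTVarIO [("a", 2), ("b", 5)]
  garbageCollect env 4
  (,) <$> readTVarIO (ledgerStates env) <*> readTVarIO (invalidBlocks env)
```

Running `demo` in GHCi yields `([4,5,6],[("b",5)])`. Splitting the work keeps each transaction small, at the cost of no longer updating both structures atomically, which is exactly the property the commit message says is no longer needed.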
Primarily, this is an optimization to reduce the maximum memory usage (more
relevant with the in-memory backend) when pruning happens on garbage collection
instead of while adding new blocks to the LedgerDB, see the added commit and the
benchmark in the pull request. Previously, LedgerDB garbage collection happened
as part of VolatileDB garbage collection, which was intentionally rate-limited.

Also, it resolves the current (somewhat weird) behavior that we do not copy any
blocks to the ImmutableDB while we are taking a snapshot (which can take >2
minutes), and consequently also do not garbage-collect the VolatileDB.

It also synergizes with the planned feature to add a random delay when taking
snapshots.

Also make sure to account for the fact that the DbChangelog might have gotten
pruned between opening and committing the forker.

regarding the previous few commits
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch from 247c489 to 08bda65 Compare August 13, 2025 14:41
It was already superseded in the most important places due to
`LedgerDbPruneBeforeSlot`. Its remaining use cases are non-essential:

 - Replay on startup.

   In this case, we never roll back, so not maintaining k states is actually an
   optimization here. We can also remove the now-redundant `InitDB.pruneDb`
   function.

 - Internal functions used for db-analyser.

   Here, we can just as well use `LedgerDbPruneAll` (which is used by
   `pruneToImmTipOnly`) as we never need to roll back.

 - Testing.

   In particular, we remove some DbChangelog tests that previously ensured that
   only at most @k@ states are kept. This is now no longer true; that property
   is instead enforced by the LedgerDB built on top of the DbChangelog.

   A follow-up commit in this PR enriches the LedgerDB state machine test to
   make sure that the public API functions behave appropriately, ensuring that
   we don't lose test coverage (and also testing V2, which previously didn't
   have any such tests).
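
As an illustration of the two remaining pruning strategies named above, here is a simplified model. The constructor names `LedgerDbPruneAll` and `LedgerDbPruneBeforeSlot` come from the commit message; the type name `LedgerDbPrune`, the `prune` function, the `Int` slots and the rule that the newest state is never dropped are assumptions made for this sketch and may differ from the real definitions.

```haskell
type Slot = Int

-- Constructor names as in the commit message; everything else is assumed.
data LedgerDbPrune
  = LedgerDbPruneAll              -- keep only the newest state
  | LedgerDbPruneBeforeSlot Slot  -- drop states older than the given slot
  deriving (Eq, Show)

-- States are ordered oldest first. Assumption of this sketch: the newest
-- state is always retained, even if the slot lies beyond it.
prune :: LedgerDbPrune -> [Slot] -> [Slot]
prune _ [] = []
prune LedgerDbPruneAll states = [last states]
prune (LedgerDbPruneBeforeSlot slot) states =
  case dropWhile (< slot) states of
    []   -> [last states]
    kept -> kept
```

For example, `prune LedgerDbPruneAll [1..6]` gives `[6]`, which matches the replay and db-analyser use cases where no rollback is ever needed, while `prune (LedgerDbPruneBeforeSlot 4) [1..6]` gives `[4,5,6]`.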
Make sure that we correctly fail when trying to roll back too far.
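
One way to picture this check, as a sketch under assumptions: since the sequence may now temporarily hold more than k+1 states, the rollback limit has to be enforced explicitly rather than following from the length of the sequence. `rollbackStates` and its `Maybe` result are hypothetical and do not mirror the real API.

```haskell
-- States are ordered oldest first. Rolling back n states must fail if n
-- exceeds the security parameter k, even when more than k+1 states happen
-- to be held in memory, and must always leave at least one state behind.
rollbackStates :: Int -> Int -> [s] -> Maybe [s]
rollbackStates k n states
  | n < 0 || n > k     = Nothing
  | n >= length states = Nothing
  | otherwise          = Just (take (length states - n) states)
```

With `k = 2` and six retained states, `rollbackStates 2 2 [1..6]` returns `Just [1,2,3,4]`, while `rollbackStates 2 3 [1..6]` returns `Nothing` although enough states are physically present.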
@amesgen amesgen force-pushed the amesgen/ledgerdb-garbage-collect-states branch from 08bda65 to dec284f Compare August 13, 2025 15:27
jasagredo (Contributor) left a comment

I took another look at the PR and approve it again.

@amesgen amesgen added this pull request to the merge queue Sep 2, 2025
Merged via the queue into main with commit df88019 Sep 2, 2025
16 of 17 checks passed
@amesgen amesgen deleted the amesgen/ledgerdb-garbage-collect-states branch September 2, 2025 14:46
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Consensus Team Backlog Sep 2, 2025