Decide on mitigation of missed leadership checks due to ledger snapshots #868

@amesgen

Problem description

John Lotoski informed us that currently on Cardano mainnet, adequately resourced nodes (well above minimum specs) are missing lots of leadership checks during ledger snapshots.

Concretely, during every ledger snapshot (performed every 2k seconds by default, where k = 2160 is the security parameter, i.e. every 4320s = 72min), which takes about 2min, the node misses ~30 leadership checks with 32GB RAM, and ~100 with 16GB RAM. With one leadership check per one-second slot, this means that the node is missing ~0.7-2.3% of its leadership opportunities, and without mitigations, this number will likely grow as the size of the ledger state increases over time.
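The percentages above can be reproduced with a quick calculation. This is only a sketch: it assumes one leadership check per slot and one-second slots, and the names used are illustrative.

```python
# Security parameter k on Cardano mainnet; the default snapshot interval is 2k seconds.
K = 2160
SNAPSHOT_INTERVAL_S = 2 * K  # 4320 s = 72 min

# Assumption: one leadership check per slot, and one slot per second.
CHECKS_PER_INTERVAL = SNAPSHOT_INTERVAL_S


def missed_fraction(missed_checks: int) -> float:
    """Fraction of leadership checks missed within one snapshot interval."""
    return missed_checks / CHECKS_PER_INTERVAL


if __name__ == "__main__":
    # Missed-check counts as reported in this issue.
    for ram_gb, missed in [(32, 30), (16, 100)]:
        print(f"{ram_gb}GB RAM: {missed_fraction(missed):.1%} of checks missed")
```

Note that this is an average over the whole interval; the misses themselves are concentrated in the ~2min during which the snapshot is written.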

This problem is not new; it has existed since at least node 8.0.0 (and likely earlier).

Analysis

Various experiments (credits to John Lotoski) indicate that this problem is due to high GC load/long GC pauses while taking a ledger snapshot (current mainnet size is ~2.6GB serialized). The main reasons for this belief are:

  • Using --nonmoving-gc fixes the problem for some time (see footnote 1).

  • Judging from a 6h log excerpt, both GC time and missed slots increase greatly during a ledger snapshot:

    [Figure: chart-gc-snapshot]

    (GC time comes from gc_cpu_ns)

  • Changing other aspects of the machine running the node (compute, IOPS) has no effect.
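Since gc_cpu_ns is a cumulative counter, a per-interval GC load has to be derived from differences between consecutive samples. The following is a sketch of one plausible way to do that; how the chart above was actually produced is not stated here, and the sample values are invented purely for illustration.

```python
from typing import List, Tuple

# (wall-clock time in seconds, cumulative gc_cpu_ns at that time)
Sample = Tuple[float, int]


def gc_share_per_interval(samples: List[Sample]) -> List[float]:
    """GC CPU seconds spent per wall-clock second between consecutive samples."""
    shares = []
    for (t0, gc0), (t1, gc1) in zip(samples, samples[1:]):
        gc_seconds = (gc1 - gc0) / 1e9  # nanoseconds to seconds
        shares.append(gc_seconds / (t1 - t0))
    return shares


if __name__ == "__main__":
    # Invented numbers, NOT taken from the 6h log excerpt.
    samples = [(0.0, 0), (60.0, 3_000_000_000), (120.0, 48_000_000_000)]
    print(gc_share_per_interval(samples))
```

With a parallel GC, CPU time can exceed wall-clock time, so a share above 1 is possible.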

Potential mitigations

Several orthogonal mitigation options have been raised:

  1. Try to find a set of RTS options that fixes the problem without any code changes. In particular, --nonmoving-gc on a more recent GHC might be enough; see footnote 1.
  2. Incrementalise the creation of ledger snapshots. Specifically, this could be achieved by introducing a delay between writing individual chunks of the snapshot to disk. The hope is that spreading the allocation work over a longer period (the default snapshot interval is 72min) reduces the total time spent in GC. A priori, however, we might miss just as many slots as before, only spread more evenly over a longer period.
  3. Don't take ledger snapshots around the time the node is elected. Since we know in advance when the node will be elected, we can take this into account when deciding whether to take a ledger snapshot, and skip it if an election opportunity is imminent (e.g. in <5min). This way, we are guaranteed not to miss slots due to this problem when it actually matters.
  4. Try to optimize the allocation behavior of ledger snapshot serialization. There might be opportunities to improve the existing code so that it allocates less, or in a gentler way, e.g. by using unpinned instead of pinned (ByteString) chunks.
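To make option 2 concrete, here is a minimal sketch of paced chunk writing. It is not the node's implementation (which is written in Haskell); write_paced and its parameters are hypothetical names chosen for illustration.

```python
import io
import time
from typing import BinaryIO, Callable, Iterable


def write_paced(chunks: Iterable[bytes],
                out: BinaryIO,
                delay_s: float,
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Write chunks to `out`, pausing `delay_s` seconds between chunks.

    Returns the total number of bytes written.
    """
    written = 0
    first = True
    for chunk in chunks:
        if not first:
            sleep(delay_s)  # pause between chunks, not before the first one
        first = False
        out.write(chunk)
        written += len(chunk)
    return written


if __name__ == "__main__":
    buf = io.BytesIO()
    # A generator produces each chunk only when the writer asks for it.
    lazy_chunks = (bytes([i]) * 4 for i in range(3))
    print(write_paced(lazy_chunks, buf, delay_s=0.01))
```

The chunks should come from a lazy producer, as in the usage example, so that the serialization work (and hence the allocation) is spread out together with the writes. If the whole snapshot were serialized up front, the delay would only pace the disk I/O.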

Note that UTxO HD will also help, but it will likely not be used for some time by block producers (where this issue is actually important).
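The decision in option 3 could look roughly like the following sketch. It assumes the node's upcoming leader slots are available as a collection of slot numbers and that slots are one second long; should_take_snapshot and SAFETY_MARGIN_SLOTS are hypothetical names, not existing node APIs.

```python
from typing import Iterable

# Assumption: the "<5min" margin from option 3, expressed in one-second slots.
SAFETY_MARGIN_SLOTS = 5 * 60


def should_take_snapshot(current_slot: int,
                         leader_slots: Iterable[int],
                         margin: int = SAFETY_MARGIN_SLOTS) -> bool:
    """Return False if any upcoming leader slot is fewer than `margin` slots away."""
    return not any(0 <= slot - current_slot < margin for slot in leader_slots)


if __name__ == "__main__":
    print(should_take_snapshot(1000, [1100, 5000]))  # a leader slot is 100 slots away
    print(should_take_snapshot(1000, [5000]))        # next leader slot is far away
```

The margin has to exceed the time a snapshot takes (about 2min at present), otherwise a snapshot started just outside the margin could still overlap with the leader slot.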


The goal of this ticket is to interact with other teams/stakeholders to identify the best way forward here.

Footnotes

  1. Quoting from John Lotoski:

    Trying the non-moving GC out did in fact resolve the missed slots, at least for about 5 days, after which missed slots started happening again, and eventually they surpassed the default copying GC missed slots in quantity of about 30 per hour and looked to continue increasing further over time.
    I suspect this is due to increasing fragmentation which the default copying GC is better at minimizing. There have been several improvements and bug fixes in the non-moving garbage collector through GHC 9.4.X, including improving the non-moving GC's ability to return allocated memory back to the OS which doesn't seem to happen at all on 8.10.7, so perhaps there might be better results with the non-moving GC once node is compiled on GHC >= ~9.4.X. In any case, when using the non-moving GC on 8.10.7, there are no new observed segfaults or other immediately obvious problem related to the use of the non-moving GC.

