
collator-protocol: Advertised collations are fetched only after reconnection #11625

@lexnv

Description


Validators ignore collation advertisements or delay fetching them by up to 30s.
This degrades block confidence, causing ~48 blocks to be reverted.

High rates of SecondedLimitReached errors suggest that the is_slot_available logic is overly restrictive, particularly during core rotations, leading to false rejections and stalled fetch operations.

Code Logic

Collation advertisements are inserted into a hashmap for spam protection and tracking purposes:

.insert_advertisement(
scheduling_parent,
Some(candidate_hash),
&state.implicit_view,
&per_scheduling_parent,
&state.leaf_claim_queues,
)

Before the validator attempts to fetch the collation, it checks whether it has a free slot for the collation:

is_slot_available(&scheduling_parent, para_id, state).inspect_err(|error| {
gum::debug!(
target: LOG_TARGET,
?peer_id,
?scheduling_parent,
?para_id,
?error,
"Slot is not available",
);
})?;

If the validator decides that it has no free slot available, the collation is not fetched until a reconnection happens.

Timeline

The investigation was made possible by the block-confidence-monitor tool.

  • T0: collation 0x0969 is produced and advertised (collator is connected to 1 out of 4 authorities)
  • T1: The remaining 3 authorities connect and the advertisement is sent to them as well
  • T2: after ~24s (collator eviction policy), the authorities disconnect from our collator
  • T3: after ~5s (backoff period on the notification protocol), the connection with the authorities is reestablished
  • T4: ~30s after the collation was advertised, the validators start fetching the collation

Root Cause

I believe the issue is around the is_slot_available function.
We are seeing too many "Slot is not available" logs with SecondedLimitReached.

The following PR was deployed on validators: #11610

Full grafana logs for yap-3428: https://grafana.teleport.parity.io/goto/6N_nNdpDR?orgId=1

Full details here: #11377 (comment)

Core Rotation Protection

Consider the path from the logs, including scheduling parents and core indices:

[0x3ca8(core56), 0xcb16, 0xfc1c, 0x8f2b, 0x6da6, 0x067b(core55), 0xecf2(core55), 0x757f(leaf)]

The current implementation checks the claim queue for core 56 but counts candidates from the core 55 claim queue.
If core 55 has multiple candidates (0x067b and 0xecf2), they consume 2 slots allocated for core 56, leading to a SecondedLimitReached error.

The following diff isolates the core rotation and ensures candidates are only counted if they belong to the core currently being evaluated. This prevents a core switch from blocking valid slots:

  for (idx, ancestor) in path.iter().enumerate() {
      // Only count candidates on the same core
      let ancestor_core = state.per_scheduling_parent.get(ancestor)
          .map(|sp| sp.current_core);

      let (seconded_pending, waiting) = if ancestor_core == Some(current_core) {
          (state.seconded_and_pending_for_para(ancestor, &para_id),
           state.in_waiting_queue_for_para(ancestor, &para_id))
      } else {
          (0, 0)
      };
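
The effect of the same-core filter can be illustrated with a standalone sketch. This is not the collator-protocol code: `ParentState`, `claimed_slots`, and the per-ancestor counts are hypothetical stand-ins, and the hashes are shortened to 16 bits as in the log above. It assumes one seconded candidate each at 0x067b and 0xecf2, and that the leaf is evaluated for core 56.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the per-scheduling-parent state.
struct ParentState {
    /// Core assigned at this scheduling parent.
    core: u32,
    /// Candidates seconded or pending for the para at this parent.
    seconded_and_pending: usize,
    /// Advertisements for the para still waiting to be fetched.
    waiting: usize,
}

/// Sums the slots claimed along `path`. With `same_core_only` set, ancestors
/// on a different core than `current_core` contribute nothing.
/// Ancestors without recorded state are skipped in both modes.
fn claimed_slots(
    path: &[u16],
    parents: &HashMap<u16, ParentState>,
    current_core: u32,
    same_core_only: bool,
) -> usize {
    path.iter()
        .filter_map(|ancestor| parents.get(ancestor))
        .filter(|sp| !same_core_only || sp.core == current_core)
        .map(|sp| sp.seconded_and_pending + sp.waiting)
        .sum()
}

/// Replays the path from the log and returns (unfiltered, filtered) counts.
fn core_rotation_example() -> (usize, usize) {
    let mut parents = HashMap::new();
    parents.insert(0x3ca8, ParentState { core: 56, seconded_and_pending: 0, waiting: 0 });
    // Assumed counts: one seconded candidate at each core 55 ancestor.
    parents.insert(0x067b, ParentState { core: 55, seconded_and_pending: 1, waiting: 0 });
    parents.insert(0xecf2, ParentState { core: 55, seconded_and_pending: 1, waiting: 0 });

    let path = [0x3ca8, 0xcb16, 0xfc1c, 0x8f2b, 0x6da6, 0x067b, 0xecf2, 0x757f];
    let unfiltered = claimed_slots(&path, &parents, 56, false);
    let filtered = claimed_slots(&path, &parents, 56, true);
    (unfiltered, filtered)
}
```

Under these assumptions the unfiltered count is 2, which is enough to exhaust a two-slot budget for core 56, while the filtered count is 0.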

Deferred queue for stalled advertisements

An advertisement is inserted for tracking, but can fail in two code paths before fetching is initiated:

// Check if there's a free slot accounting for obsolete positions and capacity.
// This happens AFTER hold-off logic (for AssetHub) has run, so held-off advertisements
// can be queued even when capacity is temporarily full.
is_slot_available(&scheduling_parent, para_id, state).inspect_err(|error| {

let can_second =
can_second(sender, para_id, scheduling_parent, candidate_hash, parent_head_data_hash)
.await;

Instead of discarding advertisements that fail the initial checks (due to missing scheduling parents or unavailable slots), implement a deferred queue for stalled advertisements.
Advertisements are stored in this buffer and re-evaluated once slots become available. This ensures the collation is eventually fetched without relying on peer reconnection.
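
A minimal sketch of what such a buffer could look like is given here. It is only an illustration of the proposal: `DeferredAdvertisements`, `Advertisement`, and the method names are hypothetical, the identifier types are simplified integers rather than the real hash and peer types, and the slot check is passed in as a closure standing in for is_slot_available. The second advertisement and the para id 1000 are made-up values.

```rust
use std::collections::VecDeque;

// Hypothetical, simplified identifiers for illustration only.
type SchedulingParent = u16;
type ParaId = u32;

#[derive(Debug, Clone, PartialEq)]
struct Advertisement {
    scheduling_parent: SchedulingParent,
    para_id: ParaId,
    candidate: u16,
}

/// Buffer for advertisements that failed the initial checks.
#[derive(Default)]
struct DeferredAdvertisements {
    queue: VecDeque<Advertisement>,
}

impl DeferredAdvertisements {
    /// Park an advertisement instead of dropping it.
    fn defer(&mut self, ad: Advertisement) {
        self.queue.push_back(ad);
    }

    /// Drop entries whose scheduling parent is no longer in the view.
    fn prune(&mut self, is_in_view: impl Fn(SchedulingParent) -> bool) {
        self.queue.retain(|ad| is_in_view(ad.scheduling_parent));
    }

    /// Re-run the slot check on every parked advertisement. Entries that pass
    /// are returned in arrival order; the rest stay parked.
    fn drain_ready(
        &mut self,
        mut slot_available: impl FnMut(&Advertisement) -> bool,
    ) -> Vec<Advertisement> {
        let mut ready = Vec::new();
        let mut still_stalled = VecDeque::new();
        for ad in self.queue.drain(..) {
            if slot_available(&ad) {
                ready.push(ad);
            } else {
                still_stalled.push_back(ad);
            }
        }
        self.queue = still_stalled;
        ready
    }
}

/// Returns (fetchable while slots are full, candidates fetched later, left in queue).
fn deferred_queue_example() -> (usize, Vec<u16>, usize) {
    let mut deferred = DeferredAdvertisements::default();
    deferred.defer(Advertisement { scheduling_parent: 0x757f, para_id: 1000, candidate: 0x0969 });
    deferred.defer(Advertisement { scheduling_parent: 0x3ca8, para_id: 1000, candidate: 0x0001 });

    // 0x3ca8 leaves the view, so its advertisement is discarded.
    deferred.prune(|sp| sp != 0x3ca8);
    // Slots are still full: nothing becomes fetchable, nothing is lost.
    let while_full = deferred.drain_ready(|_| false).len();
    // A slot frees up for para 1000: the parked advertisement is released.
    let fetched = deferred
        .drain_ready(|ad| ad.para_id == 1000)
        .into_iter()
        .map(|ad| ad.candidate)
        .collect();
    (while_full, fetched, deferred.queue.len())
}
```

In a design along these lines, `drain_ready` would be called whenever a slot may have been freed (for example on a new leaf or after a fetch completes), and `prune` on every view update so the buffer stays bounded.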

cc @eskimor @sandreim @tdimitrov @skunert
