collator-protocol: Advertised collations are fetched only after reconnection

Validators ignore collation advertisements or delay fetching them by up to 30s.
This degrades block confidence, causing ~48 blocks to be reverted.

High rates of `SecondedLimitReached` errors suggest that the `is_slot_available` logic is overly restrictive, particularly during core rotations, leading to false rejections and stalled fetch operations.

## Code Logic

Collation advertisements are inserted into a hashmap for spam protection and tracking purposes:

https://github.com/paritytech/polkadot-sdk/blob/33e6f752b393431850f6c845e9704d13a27a2f00/polkadot/node/network/collator-protocol/src/validator_side/mod.rs#L1839-L1845

Before attempting to fetch the collation, the validator checks whether it has a free slot for it:

https://github.com/paritytech/polkadot-sdk/blob/33e6f752b393431850f6c845e9704d13a27a2f00/polkadot/node/network/collator-protocol/src/validator_side/mod.rs#L1891-L1900

If the validator decides that it has no free slot available, nothing retries the advertisement, so the collation is not fetched until a reconnection happens.
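The stall described above can be reproduced in a toy model. This is only a sketch: `ValidatorModel`, `Outcome`, and the `u64` ids are hypothetical stand-ins, not types from polkadot-sdk. It uses a `HashSet` instead of the real tracking map, and it assumes per-peer tracking is cleared when the peer disconnects, which is what lets a re-advertisement through after reconnection in this model.

```rust
use std::collections::HashSet;

// Hypothetical stand-ins for the real identifier types.
type PeerId = u64;
type CandidateHash = u64;

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// A slot was free and the fetch was started.
    FetchStarted,
    /// The advertisement is tracked, but nothing will ever retry it.
    Stalled,
    /// Already tracked for this peer (spam protection).
    Duplicate,
}

struct ValidatorModel {
    advertised: HashSet<(PeerId, CandidateHash)>,
    free_slots: usize,
}

impl ValidatorModel {
    fn new(free_slots: usize) -> Self {
        Self { advertised: HashSet::new(), free_slots }
    }

    fn on_advertisement(&mut self, peer: PeerId, candidate: CandidateHash) -> Outcome {
        // Step 1: track the advertisement before any slot check.
        if !self.advertised.insert((peer, candidate)) {
            return Outcome::Duplicate;
        }
        // Step 2: slot check. On failure we return early and schedule no retry.
        if self.free_slots == 0 {
            return Outcome::Stalled;
        }
        self.free_slots -= 1;
        Outcome::FetchStarted
    }

    /// Assumption of this model: tracking for a peer is dropped on disconnect.
    fn on_peer_disconnected(&mut self, peer: PeerId) {
        self.advertised.retain(|(p, _)| *p != peer);
    }
}
```

In this model, an advertisement that arrives while `free_slots == 0` returns `Stalled`. Freeing a slot afterwards changes nothing, and the same advertisement is only accepted again after `on_peer_disconnected` has run.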

### Timeline

The investigation was made possible using the [block-confidence-monitor tool](https://github.com/lexnv/block-confidence-monitor)

- T0: collation `0x0969` is produced and advertised (collator is connected to 1 out of 4 authorities)
- T1: The remaining 3 authorities connect and the advertisement is sent to them as well
- T2: after ~24s (collator eviction policy), the authorities disconnect from our collator
- T3: after ~5s (backoff period on the notification protocol), the connection with the authorities is reestablished
- T4: ~30s after the collation was advertised, the validators start fetching the collation


## Root Cause

I believe the issue is around the `is_slot_available` function.
We are seeing too many "Slot is not available" with `SecondedLimitReached`:
- https://grafana.teleport.parity.io/goto/OLJ9HOpvR?orgId=1

The following PR was deployed on validators: https://github.com/paritytech/polkadot-sdk/pull/11610

Full grafana logs for yap-3428: https://grafana.teleport.parity.io/goto/6N_nNdpDR?orgId=1

Full details here: https://github.com/paritytech/polkadot-sdk/issues/11377#issuecomment-4170503625

### Core Rotation Protection

When analyzing the path from logs (including scheduling parents and core indices):

```
[0x3ca8(core56), 0xcb16, 0xfc1c, 0x8f2b, 0x6da6, 0x067b(core55), 0xecf2(core55), 0x757f(leaf)]
```

The current implementation checks the claim queue for `core 56` but counts candidates from the `core 55` claim queue.
If `core 55` has multiple candidates (`0x067b` and `0xecf2`), they consume the 2 slots allocated for `core 56`, leading to a `SecondedLimitReached` error.


The following diff isolates the core rotation and ensures candidates are only counted if they belong to the core currently being evaluated. This prevents a core switch from blocking valid slots:

```rust
for (idx, ancestor) in path.iter().enumerate() {
    // Only count candidates whose scheduling parent was on the core being evaluated.
    let ancestor_core = state.per_scheduling_parent.get(ancestor)
        .map(|sp| sp.current_core);

    let (seconded_pending, waiting) = if ancestor_core == Some(current_core) {
        (
            state.seconded_and_pending_for_para(ancestor, &para_id),
            state.in_waiting_queue_for_para(ancestor, &para_id),
        )
    } else {
        // Ancestor was on a different core: it must not consume slots here.
        (0, 0)
    };
    // ... remainder of the loop body unchanged
}
```
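To show the effect of the filter in isolation, here is a minimal self-contained model of the counting. Everything in it is an assumption made for illustration: `Ancestor`, `occupied_slots`, and `slot_available` are hypothetical names, not polkadot-sdk APIs. The per-ancestor counts and the two claims are assumed numbers, chosen to mirror the case above where two candidates sit under `core 55` ancestors.

```rust
/// Hypothetical, simplified view of one scheduling parent on the path.
#[derive(Debug, Clone, Copy)]
struct Ancestor {
    /// Core the validator group was assigned to at this scheduling parent.
    core: u32,
    /// Candidates seconded or pending for the para at this scheduling parent.
    seconded_and_pending: usize,
    /// Advertisements for the para still waiting in the fetch queue.
    waiting: usize,
}

/// Sums the occupied slots along the path. With `same_core_only` set,
/// ancestors assigned to a core other than `current_core` contribute nothing.
fn occupied_slots(path: &[Ancestor], current_core: u32, same_core_only: bool) -> usize {
    path.iter()
        .filter(|a| !same_core_only || a.core == current_core)
        .map(|a| a.seconded_and_pending + a.waiting)
        .sum()
}

/// A slot is free while fewer slots are occupied than the claim queue grants.
fn slot_available(
    path: &[Ancestor],
    current_core: u32,
    claims: usize,
    same_core_only: bool,
) -> bool {
    occupied_slots(path, current_core, same_core_only) < claims
}

/// Builds a path shaped like the labelled entries from the logs
/// (intermediate blocks are omitted because their cores are not stated).
fn example_path() -> [Ancestor; 3] {
    [
        Ancestor { core: 56, seconded_and_pending: 0, waiting: 0 }, // 0x3ca8
        Ancestor { core: 55, seconded_and_pending: 1, waiting: 0 }, // 0x067b
        Ancestor { core: 55, seconded_and_pending: 1, waiting: 0 }, // 0xecf2
    ]
}
```

With two claims assumed for `core 56`, `slot_available(&example_path(), 56, 2, false)` is `false` because the two `core 55` candidates fill both slots. With `same_core_only` set to `true`, the same call returns `true`.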

### Deferred queue for stalled advertisements

An advertisement is inserted for tracking, but can fail in two code paths before fetching is initiated:

https://github.com/paritytech/polkadot-sdk/blob/33e6f752b393431850f6c845e9704d13a27a2f00/polkadot/node/network/collator-protocol/src/validator_side/mod.rs#L1888-L1891

https://github.com/paritytech/polkadot-sdk/blob/33e6f752b393431850f6c845e9704d13a27a2f00/polkadot/node/network/collator-protocol/src/validator_side/mod.rs#L1906-L1908

Instead of discarding advertisements that fail the initial checks (due to missing scheduling parents or unavailable slots), implement a deferred queue for stalled advertisements.
Advertisements are stored in this buffer and re-evaluated once slots become available. This ensures the collation is eventually fetched without relying on peer reconnection.
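A rough sketch of what such a buffer could look like follows. It is not a proposal for the final API: `DeferredAdvertisement`, `DeferredQueue`, `defer`, and `drain_ready` are hypothetical names, the ids are plain integers, and the capacity bound and time-to-live are assumptions added so the buffer cannot grow without limit or hold stale entries.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for an advertisement that failed its initial checks.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DeferredAdvertisement {
    peer_id: u64,
    para_id: u32,
    scheduling_parent: u64,
    deferred_at: Instant,
}

/// Bounded FIFO buffer of stalled advertisements.
struct DeferredQueue {
    entries: VecDeque<DeferredAdvertisement>,
    capacity: usize,
    ttl: Duration,
}

impl DeferredQueue {
    fn new(capacity: usize, ttl: Duration) -> Self {
        Self { entries: VecDeque::new(), capacity, ttl }
    }

    fn len(&self) -> usize {
        self.entries.len()
    }

    /// Stores an advertisement, evicting the oldest entry when full.
    fn defer(&mut self, adv: DeferredAdvertisement) {
        if self.capacity == 0 {
            return;
        }
        if self.entries.len() >= self.capacity {
            self.entries.pop_front();
        }
        self.entries.push_back(adv);
    }

    /// Re-evaluates every entry. Expired entries are dropped, entries that now
    /// pass `slot_available` are returned for fetching, and the rest stay
    /// queued in their original order.
    fn drain_ready(
        &mut self,
        now: Instant,
        mut slot_available: impl FnMut(&DeferredAdvertisement) -> bool,
    ) -> Vec<DeferredAdvertisement> {
        let ttl = self.ttl;
        let mut ready = Vec::new();
        let mut still_waiting = VecDeque::new();
        for adv in self.entries.drain(..) {
            if now.saturating_duration_since(adv.deferred_at) > ttl {
                continue; // too old to be worth fetching
            }
            if slot_available(&adv) {
                ready.push(adv);
            } else {
                still_waiting.push_back(adv);
            }
        }
        self.entries = still_waiting;
        ready
    }
}
```

In this sketch, `drain_ready` would be called whenever a slot may have been freed, with the real slot check passed in as the closure. Where exactly to hook this into the subsystem is left open.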

cc @eskimor @sandreim @tdimitrov @skunert 

