
Candidate descriptor v3 cumulus changes#10742

Draft
eskimor wants to merge 139 commits into master from rk-cumulus-v3-integration

Conversation

@eskimor
Member

@eskimor eskimor commented Jan 7, 2026

Building on top of #10472

Re-submissions are not yet handled - that is not a goal of this PR. The goal so far was to be able to have e2e tests for #10472. Verifying that we have everything to support re-submissions is necessary though - especially in case we forgot something needed from the relay chain.

Full re-submissions are more involved and I would leave them for a separate PR, so we can get this one merged quickly.

eskimor and others added 30 commits December 17, 2025 16:00
For simplicity, reasoning and efficiency.
+ Refactor/cleanup + some docs.
parents

Remove per-parachain tracking of allowed relay parents in backing
implicit view. Allowed relay parents are now computed per relay parent
(globally) based on scheduling lookahead, rather than per parachain.

Changes:
- Remove `collating_for` parameter and para_id tracking from View
- Update all callers to remove para_id arguments
- Refactor tests with helper functions to reduce ~150 lines of
  duplication
- Add comprehensive documentation to tests explaining expected behavior
- Clarify `paths_via_relay_parent` returns full paths from oldest block
  to leaf

This simplification aligns with the reality that all parachains share
the same allowed relay parent windows at a given relay chain block.
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
@iulianbarbu
Contributor

/cmd fmt

github-actions bot and others added 10 commits March 29, 2026 17:01
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
@iulianbarbu
Contributor

/cmd fmt

github-actions bot and others added 12 commits March 30, 2026 15:22
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Contributor

@alindima alindima left a comment


Finally managed to root-cause all bugs and get the tests running well with older relay parents.

Here is a hacky implementation of the fixes: a681cae

Besides the review comments I left, here are the most critical issues that were preventing us from using older relay parents:

During the parent search, several things need to change:

  1. we use an ancestry_lookback value equal to the scheduling lookahead. This is not sufficiently large if v3 is enabled. With v3 enabled, we allow building on top of relay parents that are at most max_relay_parent_session_age sessions old. A good fix for this would be to disregard the result of build_relay_parent_ancestry and instead verify, during find_deepest_valid_parent, that the session indices of the candidates' relay parents are not older than max_relay_parent_session_age. We probably also want to limit this ancestry by the configured RelayParentOffset.
  2. add a new parameter: best_hash, the current relay chain best hash that we're picking as a scheduling parent. This new value needs to be used inside find_parent_for_building when calling fetch_included_from_relay_chain and persisted_validation_data. Without this change, we attempt to create parachain forks on already-finalized blocks (which is caught and prevented by the substrate backend: "Potential long-range attack: block not in finalized chain.").
  3. This one was very nasty to discover: find_deepest_valid_parent is currently causing a very large lag in backing candidates after a session change. With v3 candidates, the relay parent can differ from the scheduling parent. Moreover, the relay parent can be from a previous session, but the scheduling parent cannot. Since we don't have re-submissions, we need to stop the parent search on a branch if the scheduling parent of a candidate we're visiting is from a session different from the current one (that of the best relay chain block). This is tricky: at this point we only know the parachain block, not the candidate, and the parachain block does not contain any information about the scheduling parent of the candidate it was included in. We need to temporarily persist the scheduling parent of yet-unincluded candidates on the collator side. Luckily, the candidate receipt is part of the BlockAnnounceData, so we can access this data right when receiving the block (because of pre-async-backing compatibility). I don't see this being propagated to the block import process though, so we may need to either keep track of this state somewhere in the collator or modify some substrate node interfaces to also pass the announcement data during block import.
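A minimal sketch of the branch-cutting rule from point 3, assuming the collator persists the scheduling-parent session of unincluded blocks (the `ParaBlock` struct and `find_deepest_valid_parent` shape here are illustrative stand-ins, not the actual polkadot-sdk types):

```rust
struct ParaBlock {
    number: u32,
    // Session of the candidate's scheduling parent. Assumption: tracked
    // collator-side for yet-unincluded blocks, e.g. from BlockAnnounceData.
    scheduling_parent_session: u32,
    children: Vec<ParaBlock>,
}

/// Depth-first search for the deepest para block we may build on. A branch is
/// cut as soon as a visited candidate's scheduling parent belongs to a session
/// other than the current one, since re-submissions are not handled yet.
fn find_deepest_valid_parent(root: &ParaBlock, current_session: u32) -> u32 {
    let mut best = root.number;
    for child in &root.children {
        if child.scheduling_parent_session != current_session {
            continue; // stop this branch: it would need a re-submission
        }
        best = best.max(find_deepest_valid_parent(child, current_session));
    }
    best
}

fn main() {
    // Chain: 10 -> 11 (session 5) -> 12 (scheduling parent in session 4).
    let tree = ParaBlock {
        number: 10,
        scheduling_parent_session: 5,
        children: vec![ParaBlock {
            number: 11,
            scheduling_parent_session: 5,
            children: vec![ParaBlock {
                number: 12,
                scheduling_parent_session: 4, // pre-session-change candidate
                children: vec![],
            }],
        }],
    };
    // Block 12 is skipped because its scheduling parent is from session 4.
    assert_eq!(find_deepest_valid_parent(&tree, 5), 11);
    println!("deepest valid parent: 11");
}
```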

.await
else {
continue;
};
Contributor


We should also check if v3 is enabled on the relay chain network

let best_hash = para_client.info().best_hash;
let v3_enabled =
para_client.runtime_api().scheduling_v3_enabled(best_hash).unwrap_or(false);
slot_timer.set_v3_enabled(v3_enabled);
Contributor


the relay parent offset needs to be validated against the relay chain constraints:

The relay chain has a new configuration item accessible via a runtime API: max_relay_parent_session_age (available if the v3 node feature is enabled on the relay chain).
By default it will be 0, meaning that the relay chain will only allow relay parents from the current session.

If it is larger than 0, we need to start taking relay parents from previous sessions into account as well (how many sessions depends on the value returned by the runtime API)
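The window this implies can be sketched as follows (the function names are assumptions for illustration, not the actual runtime API surface):

```rust
/// Oldest acceptable relay-parent session. With `max_session_age == 0` (the
/// default), only relay parents from the current session are allowed.
fn oldest_allowed_relay_parent_session(current_session: u32, max_session_age: u32) -> u32 {
    current_session.saturating_sub(max_session_age)
}

/// Collator-side sanity check for a candidate's relay parent: a relay parent
/// reaching back further than the relay chain permits must be rejected.
fn relay_parent_session_ok(
    relay_parent_session: u32,
    current_session: u32,
    max_session_age: u32,
) -> bool {
    relay_parent_session >= oldest_allowed_relay_parent_session(current_session, max_session_age)
}

fn main() {
    // Default (age 0): only the current session is acceptable.
    assert!(relay_parent_session_ok(10, 10, 0));
    assert!(!relay_parent_session_ok(9, 10, 0));
    // With max_relay_parent_session_age = 2, sessions 8..=10 are acceptable.
    assert!(relay_parent_session_ok(8, 10, 2));
    assert!(!relay_parent_session_ok(7, 10, 2));
    println!("session-age checks passed");
}
```

The configured RelayParentOffset would then be validated (or clamped) against this window before being used for block building.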

client: Arc<Client>,
time_offset: Duration,
relay_slot_duration: Duration,
v3_enabled: bool,
Contributor


can't we instead just make the offset 0 if v3 is enabled?

relay_parent_num = %relay_parent_header.number(),
relay_parent_offset,
claim_queue_offset,
v3_enabled,
Contributor


commenting here because I cannot on the untouched code:

relay_proof_request is going to be ignored by the runtime if v3 is enabled, so we can skip supplying it in that case

@@ -550,7 +612,9 @@ where
.await?
Contributor


you need to stop the search if you reach the genesis block (otherwise you will get an error).
you also need to stop the search if you reach a block where this parachain did not even exist on the relay chain. Otherwise, the candidates will have PVDs (PersistedValidationData) that do not exist on the relay chain and you'll build useless candidates
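The two stop conditions can be sketched like this (the helper shapes are hypothetical; in practice the registration check would be a runtime API query at each ancestor):

```rust
struct RelayBlockInfo {
    number: u32,
    para_registered: bool, // would come from a runtime API query in practice
}

/// Walk backwards through `ancestry` (newest first) and collect usable relay
/// parents, stopping at genesis or once the para did not exist on the relay
/// chain (older blocks would yield PVDs that do not exist on chain).
fn usable_relay_parents(ancestry: &[RelayBlockInfo]) -> Vec<u32> {
    let mut out = Vec::new();
    for block in ancestry {
        if !block.para_registered {
            break; // para did not exist here: candidates would be useless
        }
        out.push(block.number);
        if block.number == 0 {
            break; // genesis: no further ancestors to query
        }
    }
    out
}

fn main() {
    let ancestry = vec![
        RelayBlockInfo { number: 5, para_registered: true },
        RelayBlockInfo { number: 4, para_registered: true },
        RelayBlockInfo { number: 3, para_registered: false }, // onboarded at 4
        RelayBlockInfo { number: 2, para_registered: false },
    ];
    assert_eq!(usable_relay_parents(&ancestry), vec![5, 4]);
    println!("stopped before para onboarding, as expected");
}
```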

Contributor


however, if you don't supply RelayParentOffset blocks in the validation function, you will fail block validation. In that case you need to wait until enough relay chain blocks have been built that you have a sufficiently old relay parent.

This is also a zombienet test worth adding (later, once older relay parents work is all merged): a newly-onboarded v3 chain, which is not added at genesis time

Comment on lines +102 to +103
assert_validator_backed_candidates(relay_node, 30).await?;
for i in 3..=5 {
Contributor


why don't we test for all of the validators?


#[rstest]
#[case::zero_offset("async-backing-v3")]
#[case::with_rpo("async-backing-v3-rpo")]
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rpo is very cryptic. I assume it stands for relay parent offset. Also, it's best practice to add some comments on what the test is doing (like you did for the other test)

});

// Experimental collator protocol validators.
(6..10).fold(r, |acc, i| {
Contributor


the number of validators is excessive, considering that you only have 3 cores and 2 validators per core.


if sc_consensus_babe::contains_epoch_change::<RelayBlock>(&relay_header) {
tracing::debug!(target: LOG_TARGET, ?relay_best_block, relay_best_block_number = relay_header.number(), "Relay parent is in previous session.");
tracing::debug!(target: LOG_TARGET, ?relay_best_block, relay_best_block_number = relay_header.number(), "RC tip is a session change block, skipping.");
Contributor


this needs to be skipped if v3 is enabled. We don't care if the relay chain best block is the last one in a session. We can build legitimate candidates on it because these candidates will have relay parents from the previous session
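A minimal sketch of the suggested change (the function name is illustrative): the session-change guard should only apply when scheduling v3 is disabled, because v3 candidates may legitimately carry relay parents from the previous session.

```rust
fn should_skip_relay_tip(is_session_change_block: bool, v3_enabled: bool) -> bool {
    // With v3, a session-change tip is fine: candidates built on it will have
    // relay parents from the previous session, which the runtime accepts.
    is_session_change_block && !v3_enabled
}

fn main() {
    assert!(should_skip_relay_tip(true, false)); // pre-v3 behaviour: skip
    assert!(!should_skip_relay_tip(true, true)); // v3: build anyway
    assert!(!should_skip_relay_tip(false, false)); // not a boundary: build
    println!("guard behaves as discussed");
}
```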

Contributor

@iulianbarbu iulianbarbu Apr 1, 2026


Sure. IIUC, submitting legitimate candidates shouldn't care whether we're close to a session boundary, at the risk of forks. The assumption is that the next authors will soon be able to pick up those blocks and re-submit, in case the seconded/backed off-chain signals don't arrive in time (once the req/resp prospective-parachains protocol can be used), or in case we don't see an inclusion event within the unincluded segment's relay chain blocks as they are imported (this way we avoid forking, by not needing to rebuild that same block).

const RELAY_PARENT_OFFSET: u32 = 0;

#[cfg(any(feature = "async-backing-v3", feature = "elastic-scaling-v3"))]
const SCHEDULING_V3_ENABLED: bool = true;
Contributor


UNINCLUDED_SEGMENT_CAPACITY needs to be a function of RELAY_PARENT_OFFSET as well, if scheduling v3 is enabled
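A sketch of the suggested relationship (the concrete formula and constants here are assumptions for illustration, not the final definition): under v3 the unincluded segment must additionally absorb the para blocks produced while the relay parent lags the scheduling parent by RELAY_PARENT_OFFSET relay blocks.

```rust
const BASE_UNINCLUDED_SEGMENT_CAPACITY: u32 = 3;
const RELAY_PARENT_OFFSET: u32 = 2;
const BLOCK_PROCESSING_VELOCITY: u32 = 1; // para blocks per relay block

const fn unincluded_segment_capacity(v3_enabled: bool) -> u32 {
    if v3_enabled {
        // Extra headroom for the blocks built while the relay parent trails
        // the scheduling parent by RELAY_PARENT_OFFSET relay blocks.
        BASE_UNINCLUDED_SEGMENT_CAPACITY + RELAY_PARENT_OFFSET * BLOCK_PROCESSING_VELOCITY
    } else {
        BASE_UNINCLUDED_SEGMENT_CAPACITY
    }
}

fn main() {
    assert_eq!(unincluded_segment_capacity(false), 3);
    assert_eq!(unincluded_segment_capacity(true), 5);
    println!("capacity grows with the offset under v3");
}
```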

@alindima
Contributor

alindima commented Apr 1, 2026

btw, if you want to test with a large relay parent offset, you need the following fixes on the relay chain: https://github.com/paritytech/polkadot-sdk/commits/alindima/prospective-parachains-older-rps/ (still in draft but working)

Will open a proper PR with the relay chain fixes soon

@iulianbarbu
Contributor

iulianbarbu commented Apr 1, 2026

Leaving here some thoughts to discuss tomorrow:

With v3 enabled, we allow building on top of relay parents that are at most max_relay_parent_session_age sessions old. A good fix for this would be to disregard the result of build_relay_parent_ancestry and instead verify, during find_deepest_valid_parent, that the session indices of the candidates' relay parents are not older than max_relay_parent_session_age. We probably also want to limit this ancestry by the configured RelayParentOffset.

Which ancestry do you think we need to limit by RelayParentOffset? If we drop build_relay_parent_ancestry, find_deepest_valid_parent will attempt a DFS starting from the last included/pending block. The slot-based collator cannot build unincluded chains longer than a certain value, and forks shouldn't increase the width of the graph to a worrying size (we don't have a protection against this right now either). I'd explore the children list for each descendant and update the parent (we'll build against) until the graph is exhausted, with the condition that any children list added should be for a parent whose associated relay parent's session is >= scheduling parent's session - max_relay_parent_session_age (more precisely, the scheduling parent's session for the candidate representing the para block should be >= scheduling parent's session - max_relay_parent_session_age). In practice this shouldn't hit the relay parent offset or scheduling lookahead limits, and if those are hit before we exhaust the graph, I'd say something is misbehaving.

add a new parameter: best_hash, the current relay chain best hash that we're picking as a scheduling parent. This new value needs to be used inside find_parent_for_building when calling fetch_included_from_relay_chain and persisted_validation_data. Without this change, we attempt to create parachain forks on already-finalized blocks (which is caught and prevented by the substrate backend: "Potential long-range attack: block not in finalized chain.").

Can you detail this a bit more with an example, e.g. how you hit it? Especially how a fork on already-finalized blocks could happen when querying the PVD at the relay parent. I mean, yes, if the relay parent is quite old, pointing to a PVD with para heads that have already been finalized, issues can arise, but I'd expect something really weird to have happened to end up in this situation. Para heads getting finalized should also mean the para head advances, the collator sees that via relay chain consensus, and the block-building para best hash should reflect it when building the next block. Re-submission should not happen for a finalized relay parent either - so we'd need to keep a view over announced blocks versus what becomes visible as included later on, when importing relay chain blocks.

With v3 candidates, the relay parent can differ from the scheduling parent. Moreover, the relay parent can be from a previous session, but the scheduling parent cannot. Since we don't have re-submissions, we need to stop the parent search on a branch if the scheduling parent of a candidate we're visiting is from a session different from the current one (that of the best relay chain block). This is tricky: at this point we only know the parachain block, not the candidate, and the parachain block does not contain any information about the scheduling parent of the candidate it was included in. We need to temporarily persist the scheduling parent of yet-unincluded candidates on the collator side. Luckily, the candidate receipt is part of the BlockAnnounceData, so we can access this data right when receiving the block (because of pre-async-backing compatibility). I don't see this being propagated to the block import process though, so we may need to either keep track of this state somewhere in the collator or modify some substrate node interfaces to also pass the announcement data during block import.

Ugh, this sounds nasty. Maybe there is another (req/resp-based) way to find out the scheduling parent session of the parents we're traversing via DFS while building against a certain scheduling parent. We might be able to use the fact that, for a para parent, querying the para client at it gives us the relay parent offset. If we add this to the relay parent number of the para parent (I think we can find its hash in the para block digest, then its header with the relay client, and then the number), we'll find out the scheduling parent number, then be able to fetch its header with the relay client as well, and in the end query the session from the relay client based on header.hash().
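The arithmetic at the core of that lookup can be sketched as follows (names are hypothetical; the surrounding header and session queries are left out):

```rust
/// The scheduling parent's relay chain block number is the para parent's
/// relay parent number plus the relay parent offset the para client reports
/// at that block. Its header (and thus session) can then be fetched from the
/// relay client.
fn scheduling_parent_number(relay_parent_number: u32, relay_parent_offset: u32) -> u32 {
    relay_parent_number + relay_parent_offset
}

fn main() {
    // Para parent built on relay block 100 with an offset of 2: its candidate
    // was scheduled against relay block 102.
    assert_eq!(scheduling_parent_number(100, 2), 102);
    println!("scheduling parent: 102");
}
```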


Labels

A5-run-CI Run CI on draft PR T20-low-latency for issues and PRs related to Low Latency work


9 participants