Conversation
For simplicity, reasoning and efficiency.
+ Refactor/cleanup + some docs.
Remove per-parachain tracking of allowed relay parents in backing implicit view. Allowed relay parents are now computed per relay parent (globally) based on scheduling lookahead, rather than per parachain.

Changes:
- Remove `collating_for` parameter and para_id tracking from View
- Update all callers to remove para_id arguments
- Refactor tests with helper functions to reduce ~150 lines of duplication
- Add comprehensive documentation to tests explaining expected behavior
- Clarify `paths_via_relay_parent` returns full paths from oldest block to leaf

This simplification aligns with the reality that all parachains share the same allowed relay parent windows at a given relay chain block.
+ Fix a typo.
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
/cmd fmt
alindima left a comment:
Finally managed to root-cause all bugs and get the tests running well with older relay parents.
Here is a hacky implementation of the fixes: a681cae
Besides the review comments I left, here are the most critical issues that were preventing us from using older relay parents:
During the parent search, several things need to change:
- We use an `ancestry_lookback` value equal to the scheduling lookahead. This is not sufficiently large if v3 is enabled. If v3 is enabled, we allow building on top of relay parents that are not more than `max_relay_parent_session_age` sessions old. A good fix for this would be to overlook the result of `build_relay_parent_ancestry` and instead verify the session indices of the relay parents of the candidates during `find_deepest_valid_parent`, so that they are not older than `max_relay_parent_session_age`. We probably also want to limit this ancestry by the configured RelayParentOffset.
- Add a new parameter, `best_hash`: the current relay chain best hash that we're picking as a scheduling parent. This new value needs to be used inside `find_parent_for_building`, for calling `fetch_included_from_relay_chain` and `persisted_validation_data`. At the moment, the lack of this change causes attempts to fork the parachain on already finalized blocks (caught and prevented by the Substrate backend: "Potential long-range attack: block not in finalized chain.").
- This one was very nasty to discover: `find_deepest_valid_parent` is currently causing a very large lag in backing candidates after a session change. With v3 candidates, the relay parent can differ from the scheduling parent. Moreover, the relay parent can be from a previous session, but the scheduling parent cannot. Since we don't have resubmissions, we need to stop the parent search on a branch if the scheduling parent of a candidate we're visiting is from a session different from the current one (that of the best relay chain block). This is tricky: at this point, we only know the parachain block, not the candidate. The parachain block does not contain any information about the scheduling parent of the candidate it was included in. We need to temporarily persist the scheduling parent of yet-unincluded candidates on the collator side. Luckily, the candidate receipt is part of the `BlockAnnounceData`, so we can access this data right when getting the block (because of pre-async-backing stuff). I don't see it being propagated to the block import process though, so we may need to either do some hacks and keep track of this state somewhere in the collator, or modify some Substrate node interfaces to also get the announcement data during block import.
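The stop condition described in the last point could be sketched roughly as follows (all type and function names here are hypothetical, not the actual polkadot-sdk API):

```rust
// Hypothetical sketch: while walking a branch of candidate parents
// depth-first, stop descending as soon as the scheduling parent of the
// visited candidate belongs to a session other than that of the current
// relay chain best block.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct SessionIndex(pub u32);

pub struct VisitedCandidate {
    pub scheduling_parent_session: SessionIndex,
    pub depth: u32,
}

/// Returns the depth of the deepest candidate we may still build on, or
/// `None` if nothing on this branch is usable.
pub fn deepest_valid_depth(
    branch: &[VisitedCandidate],
    current_session: SessionIndex,
) -> Option<u32> {
    let mut deepest = None;
    for candidate in branch {
        if candidate.scheduling_parent_session != current_session {
            // Without resubmissions, anything past this point on the branch
            // can never be included, so the search stops here.
            break;
        }
        deepest = Some(candidate.depth);
    }
    deepest
}
```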
```rust
.await
else {
    continue;
};
```
We should also check if v3 is enabled on the relay chain network
```rust
let best_hash = para_client.info().best_hash;
let v3_enabled =
    para_client.runtime_api().scheduling_v3_enabled(best_hash).unwrap_or(false);
slot_timer.set_v3_enabled(v3_enabled);
```
The relay parent offset needs to be validated against the relay chain constraints: the relay chain has a new configuration item accessible via runtime API, `max_relay_parent_session_age` (if the v3 node feature is enabled on the relay chain). By default it will be 0, meaning that the relay chain will allow any relay parent from the current session. If it is larger than 0, we need to start taking relay parents from previous sessions into account as well (how many sessions depends on the value returned by the runtime API).
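As a rough sketch of such a validation (function and parameter names are assumptions for illustration, not the real API):

```rust
// Hypothetical sketch of validating a configured relay parent offset against
// `max_relay_parent_session_age`. A value of 0 means the relay chain only
// accepts relay parents from the current session, so the offset must not
// reach back past the session's first block.
pub fn offset_is_valid(
    relay_parent_offset: u32,
    best_number: u32,
    current_session_start: u32,
    max_session_age: u32,
) -> bool {
    if max_session_age > 0 {
        // Older sessions are allowed; a precise check would compare session
        // indices via the runtime API, which is elided in this sketch.
        return true;
    }
    // Only the current session: the targeted relay parent must not precede
    // the first block of the session.
    best_number.saturating_sub(relay_parent_offset) >= current_session_start
}
```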
```rust
client: Arc<Client>,
time_offset: Duration,
relay_slot_duration: Duration,
v3_enabled: bool,
```
can't we instead just make the offset 0 if v3 is enabled?
```rust
relay_parent_num = %relay_parent_header.number(),
relay_parent_offset,
claim_queue_offset,
v3_enabled,
```
Commenting here because I cannot comment on the untouched code: `relay_proof_request` is going to be ignored by the runtime if v3 is enabled, so we can skip supplying it in that case.
```rust
// @@ -550,7 +612,9 @@ where
    .await?
```
You need to stop the search if you reach the genesis block (otherwise you will get an error).
You also need to stop the search if you reach a block where this parachain did not even exist on the relay chain. Otherwise, the candidates will have PVDs (`PersistedValidationData`) that do not even exist on the relay chain and you'll build useless candidates.
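The two extra stop conditions could be sketched like this (types and names are hypothetical, for illustration only):

```rust
// Hypothetical sketch: the ancestry walk (newest-first) stops at genesis and
// at the first relay block where the parachain's head data does not exist
// yet, since candidates anchored there would carry a PersistedValidationData
// unknown to the relay chain.
pub struct RelayBlockInfo {
    pub number: u32,
    pub para_head: Option<Vec<u8>>,
}

pub fn usable_ancestry(chain: &[RelayBlockInfo]) -> Vec<u32> {
    let mut usable = Vec::new();
    for block in chain {
        // Para not onboarded at this block: stop, nothing older is usable.
        if block.para_head.is_none() {
            break;
        }
        usable.push(block.number);
        // Genesis reached: there is nothing below it to query.
        if block.number == 0 {
            break;
        }
    }
    usable
}
```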
However, if you don't supply `RelayParentOffset` blocks in the validation function, you will fail the validation of the block. In that case you need to wait until enough relay chain blocks have been built so that you have a sufficiently old relay parent.
This is also a zombienet test worth adding (later, once the older-relay-parents work is all merged): a newly onboarded v3 chain, which is not added at genesis time.
```rust
assert_validator_backed_candidates(relay_node, 30).await?;
for i in 3..=5 {
```
why don't we test for all of the validators?
```rust
#[rstest]
#[case::zero_offset("async-backing-v3")]
#[case::with_rpo("async-backing-v3-rpo")]
```
"rpo" is very cryptic; I assume it stands for relay parent offset. Also, it's best practice to add some comments on what the test is doing (like you did for the other test).
```rust
});

// Experimental collator protocol validators.
(6..10).fold(r, |acc, i| {
```
the number of validators is excessive, considering that you only have 3 cores and 2 validators per core.
```rust
if sc_consensus_babe::contains_epoch_change::<RelayBlock>(&relay_header) {
    tracing::debug!(target: LOG_TARGET, ?relay_best_block, relay_best_block_number = relay_header.number(), "Relay parent is in previous session.");
    tracing::debug!(target: LOG_TARGET, ?relay_best_block, relay_best_block_number = relay_header.number(), "RC tip is a session change block, skipping.");
```
this needs to be skipped if v3 is enabled. We don't care if the relay chain best block is the last one in a session. We can build legitimate candidates on it because these candidates will have relay parents from the previous session
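The suggested condition could be captured roughly like this (a sketch with assumed flag names, not the actual code):

```rust
// Sketch: only skip building on a session-boundary relay block when v3 is
// disabled. With v3 enabled, the candidates built there are legitimate
// because their relay parents can come from the previous session.
pub fn skip_session_change_tip(v3_enabled: bool, is_epoch_change_block: bool) -> bool {
    !v3_enabled && is_epoch_change_block
}
```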
Sure. IIUC, submitting legitimate candidates shouldn't care whether we're close to a session boundary now, risking forks. The assumption is that the next authors will soon be able to pick up those blocks and resubmit, in case the seconded/backed offchain signals don't arrive in time (once the req/resp prospective-parachains protocol can be used), or in case we don't see an inclusion event within the unincluded segment's relay chain blocks imported (this way we avoid forking, by not needing to rebuild that same block).
```rust
const RELAY_PARENT_OFFSET: u32 = 0;

#[cfg(any(feature = "async-backing-v3", feature = "elastic-scaling-v3"))]
const SCHEDULING_V3_ENABLED: bool = true;
```
`UNINCLUDED_SEGMENT_CAPACITY` needs to be a function of the `RELAY_PARENT_OFFSET` too, if scheduling v3 is enabled.
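A minimal sketch of what that could look like (the base constant and the concrete values below are example assumptions, not the repo's real numbers):

```rust
// Hypothetical sketch: with scheduling v3, each block of relay parent offset
// delays inclusion by one relay block, so the unincluded segment needs room
// for that many extra parachain blocks on top of the base capacity.
const RELAY_PARENT_OFFSET: u32 = 2; // example value
const BASE_UNINCLUDED_SEGMENT_CAPACITY: u32 = 3; // example value
const SCHEDULING_V3_ENABLED: bool = true;

const UNINCLUDED_SEGMENT_CAPACITY: u32 = if SCHEDULING_V3_ENABLED {
    BASE_UNINCLUDED_SEGMENT_CAPACITY + RELAY_PARENT_OFFSET
} else {
    BASE_UNINCLUDED_SEGMENT_CAPACITY
};
```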
btw, if you want to test with a large relay parent offset, you need the following fixes on the relay chain: https://github.com/paritytech/polkadot-sdk/commits/alindima/prospective-parachains-older-rps/ (still in draft but working). Will open a proper PR with the relay chain fixes soon.
Leaving here some thoughts to discuss tomorrow:
Which ancestry do you think we need to limit by RelayParentOffset? If we drop the
Can you detail this a bit more with an example, e.g. how you've hit it? Especially how a fork on already finalized blocks could happen when querying the PVD at the relay parent. Yes, if the relay parent is quite old, pointing to a PVD with para heads that have already been finalized, issues can arise, but I'd expect something really weird to have happened to end up in this situation. Getting para heads finalized should also mean the para head advances; the collator sees that via relay chain consensus, and the block-building para best hash should reflect it when building the next block. Resubmission should not happen for a finalized relay parent either, so we'd need to keep a view over announced blocks minus what becomes visible as included later on, when importing a relay chain block.
Ugh, this sounds nasty. Maybe there is another (req/resp based) way to find out the scheduling parent session of the parents we're traversing via DFS while building against a certain scheduling parent. We might be able to use the fact that, for a para parent, querying the para client at it gives us the relay parent offset. If we add this to the relay parent number of the para parent (I think we can find its hash in the para block digest, then its header with the relay client, and then the number), we'll find out the scheduling parent number, then be able to fetch its header with the relay client as well, and in the end we can query the session from the relay client based on `header.hash()`.
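The arithmetic described in that idea could be sketched as follows (the function name and session lookup are hypothetical; in practice the session would come from the relay client):

```rust
// Sketch: the scheduling parent sits `relay_parent_offset` blocks above the
// relay parent recorded in the para block digest, so its session can be
// looked up from that number.
pub fn scheduling_parent_session(
    relay_parent_number: u32,
    relay_parent_offset: u32,
    session_at: impl Fn(u32) -> u32,
) -> u32 {
    session_at(relay_parent_number + relay_parent_offset)
}
```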
Building on top of #10472
Re-submissions are not yet handled, and they are not a goal of this PR; the goal so far was to be able to have e2e tests for #10472. Verifying that we have everything needed to support them is necessary though, especially in case we forgot something needed from the relay chain.
Full re-submissions are more involved and I would leave them out for a separate PR, so we can get this one merged quickly.