
Conversation


@kchojn kchojn commented May 28, 2025

No description provided.

kchojn added 3 commits May 28, 2025 16:28
…ning to finality-based event syncing for stronger cryptographic guarantees and reliability in event processing.
…y in the document by removing unnecessary details and simplifying descriptions
…activation and define specific fork mechanism and activation epochs separately

feat(finality_based_event_syncing.md): update behavioral changes to include finality-based syncing and define pre-fork and post-fork behaviors
feat(finality_based_event_syncing.md): simplify key implementation points for one-way transition and automatic detection of fork activation
Copilot AI review requested due to automatic review settings May 28, 2025 13:42

Copilot AI left a comment


Pull Request Overview

This PR introduces a new SIP that replaces the current follow-distance syncing approach with a finality-based mechanism post-fork, aiming to reduce reorg risks and align with Ethereum’s security model.

  • Proposes switching from probabilistic follow-distance to finalized-block syncing after a configurable fork epoch
  • Defines behavioral changes, transition flow, performance trade-offs, and migration steps
  • Includes diagrams, performance impact analysis, and open discussion points
Comments suppressed due to low confidence (3)

sips/finality_based_event_syncing.md:1

  • The Author field is currently a placeholder ('xxx'). Please replace it with the actual author name or remove the placeholder.
| xxx    | Finality-Based Event Syncing | Core     | open-for-discussion | 2025-05-27 |

sips/finality_based_event_syncing.md:47

  • [nitpick] The code name 'Alan' isn’t defined or explained elsewhere. Consider clarifying its meaning or using a more descriptive label for the pre-fork mode.
**Pre-Fork (Alan)**:

sips/finality_based_event_syncing.md:131

  • [nitpick] There’s an extra leading space before 'typically'. Removing it will keep list formatting consistent.
  typically 2 epochs, so ~12.8 minutes) represents a significant change.

…ime-based and computed as `finalized_block.timestamp - processed_block.timestamp` for improved accuracy

feat(finality_based_event_syncing.md): adjust event processing to occur only when latency is less than or equal to staleness threshold
feat(finality_based_event_syncing.md): redefine node health as `finalized_timestamp - processed_timestamp ≤ staleness_threshold` for clarity and consistency
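For illustration, a minimal sketch of the time-based staleness check these commits describe. The threshold value and names are placeholders, not the SSV node's actual configuration or API:

```go
package main

import (
	"fmt"
	"time"
)

// stalenessThreshold is a placeholder; the actual threshold would be
// implementation-defined.
const stalenessThreshold = 5 * time.Minute

// isHealthy applies the proposed definition of node health:
// finalized_timestamp - processed_timestamp <= staleness_threshold.
// Events would only be processed while this latency stays within the threshold.
func isHealthy(finalizedTimestamp, processedTimestamp time.Time) bool {
	latency := finalizedTimestamp.Sub(processedTimestamp)
	return latency <= stalenessThreshold
}

func main() {
	finalized := time.Now()
	processed := finalized.Add(-3 * time.Minute)
	fmt.Println(isHealthy(finalized, processed)) // true: 3m <= 5m threshold
}
```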
@GalRogozinski GalRogozinski mentioned this pull request Jun 29, 2025
Contributor

@GalRogozinski GalRogozinski left a comment


Very well written SIP :-)

I just don't think the staleness part should be here (I understand you crash the node if it is stale).
I am NOT saying you shouldn't implement this in the SSV node.
The thing is that I think this is a subjective approach an implementation can take, i.e. operators can rely on external monitoring tools to understand the health of their nodes.

So unless the Sigma Prime guys want to standardize this, it should be out.

…taleness check and health monitoring criteria for finality-based syncing
Collaborator

@dknopik dknopik left a comment


Looks good!

Comment on lines +154 to +160
1. **Pre-Fork**: Node operators must update their SSV node software to a version supporting the "FinalityConsensus" fork
well in advance of the announced activation epoch for their respective network. Monitor official announcements for
activation epoch details.
2. **During Fork Activation**: The transition should be automatic once the node's current epoch reaches the
"FinalityConsensus" activation epoch. Operators should monitor node logs for confirmation of the new operational mode.
3. **Post-Fork**: Verify that the node is processing events based on finalized blocks. Node operators may implement
their own monitoring solutions as needed.

This is very vague. When exactly do we stop syncing with the follow method? There are multiple ways this could be interpreted:

Let X be the first slot of the activation epoch. Let B(slot) be the head block of slot (which may be a block in a previous slot if slot is empty). Let E(slot) be the epoch of slot.

a) We simply stop syncing at X: the last block processed will be B(X - 1) - 8 until E(X - 1) finalizes.
b) We stop syncing after processing B(X - 1): basically syncing until the follow distance has caught up with the epoch end.

Which one do you mean?
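For concreteness, the two readings could be expressed roughly as follows (a hedged sketch only; the types are placeholders and the slot/block distinction is glossed over):

```go
package cutover

// Slot is a placeholder type for illustration.
type Slot = uint64

// Interpretation (a): follow-distance syncing simply stops once the node's
// current slot reaches X, the first slot of the activation epoch; anything
// still inside the follow-distance window is only picked up once E(X-1)
// finalizes.
func keepFollowSyncingA(currentSlot, x Slot) bool {
	return currentSlot < x
}

// Interpretation (b): follow-distance syncing continues until the head block
// of slot X-1 has been processed, i.e. until the follow distance has caught
// up with the end of the last pre-fork epoch.
func keepFollowSyncingB(lastProcessedSlot, x Slot) bool {
	return lastProcessedSlot < x-1
}
```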

Comment on lines +138 to +143
- **Enhanced Security**: Transitioning to cryptographic finality for event processing inherently strengthens the node
against state corruption or inconsistencies caused by execution layer reorganizations.
- **Trust Model**: Reliance shifts more explicitly towards the finality guarantees provided by Ethereum's consensus
layer, which is a core security assumption of the PoS network.
- **Attack Surface**: This change is expected to reduce the attack surface related to manipulating node state through EL
reorgs.

This does not mention the significant disadvantage that it becomes impossible to remove validators during long periods of non-finality. See my comment below for my ideas on how to mitigate this.

Comment on lines +84 to +97
graph TB
subgraph "Proposed"
A2[New Block] --> B2[Check Finality]
B2 --> C2[Wait Finalized]
C2 --> D2[Fetch Events]
D2 --> E2[Safe Process]
end

subgraph "Current"
A1[New Block] --> B1[Wait 8 Blocks]
B1 --> C1[Fetch Events]
C1 --> D1[Process]
D1 --> E1[RISK: Reorg]
end

This flow diagram includes nodes that are not really part of the flow ("Safe Process", "RISK: Reorg").

The trigger for each diagram is "New Block", which models the new process incorrectly IMO: "Check Finality" sounds like there are two possible outcomes (final, not final), but there is only one arrow.

I'd just put something like

```mermaid
graph TB
    subgraph "Proposed"
        A2[New finalized checkpoint] --> B2[Fetch events of new finalized blocks]
    end
```

to reflect that the trigger for sync progress is a new finalized checkpoint.
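For illustration, a minimal sketch of a sync loop driven by finalized checkpoints rather than new heads; the interfaces below are hypothetical, not the SSV node's actual API:

```go
package eventsync

import (
	"context"
	"log"
)

// FinalizedCheckpoint is a hypothetical notification carrying the highest
// finalized execution-layer block.
type FinalizedCheckpoint struct {
	Epoch          uint64
	FinalizedBlock uint64
}

// FinalizedCheckpointSource yields a value whenever a new checkpoint finalizes.
type FinalizedCheckpointSource interface {
	Subscribe(ctx context.Context) <-chan FinalizedCheckpoint
}

// EventFetcher fetches and processes contract events in an inclusive block range.
type EventFetcher interface {
	FetchAndProcess(ctx context.Context, fromBlock, toBlock uint64) error
}

// SyncFinalized advances event processing only when a new finalized
// checkpoint arrives, so no processed block can be reorged.
func SyncFinalized(ctx context.Context, src FinalizedCheckpointSource, fetcher EventFetcher, lastProcessed uint64) {
	for cp := range src.Subscribe(ctx) {
		if cp.FinalizedBlock <= lastProcessed {
			continue // nothing newly finalized
		}
		if err := fetcher.FetchAndProcess(ctx, lastProcessed+1, cp.FinalizedBlock); err != nil {
			log.Printf("finality-based event sync failed: %v", err)
			continue // retry on the next finalized checkpoint
		}
		lastProcessed = cp.FinalizedBlock
	}
}
```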

Collaborator

dknopik commented Jul 15, 2025

IMO, we can introduce some mechanism to mitigate the long non-finality issue a bit. Basically, we want to be able to remove validators in order to run them somewhere else, or at least exit them from the active validator set.

Always perform exit events

Process validator exit events regardless of finality. This is safe, since it does not change validator ownership: we do not have a slashing risk, as the validator is not removed from the cluster (and therefore cannot be re-added to another). Also, it does not matter if the event is reorged, as it does not affect the state.

Pending removal flag

Another idea is to immediately check all new blocks regardless of finality. For each validator removal in such a block, the validator is marked "pending removal". For all "pending removal" validators:

  • Do not perform duties
  • Do not forward messages where one of the following is true:
    • Message ID contains the validator
    • The message is a partial signature message and one of the contained partial signatures is for the validator

The only way to remove a "pending removal" flag is by actually removing the validator (i.e. finalizing a block with a remove validator event). This should be reasonably safe: The user can observe if the transaction causes the validator to stop. Even if the event reorgs, the operators will remember the flag, and not perform duties. Great care should be taken to persist this flag! Ideally, the node should remember it even if the user decides to resync the node, e.g. by storing this information in a separate file.
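A minimal sketch of what this pending-removal bookkeeping could look like, assuming hypothetical types rather than the SSV node's actual message or storage API (persistence is only hinted at in a comment):

```go
package pendingremoval

import "sync"

// ValidatorPubKey is a placeholder identifier type.
type ValidatorPubKey [48]byte

// Registry tracks validators flagged as "pending removal" after an
// unfinalized removal event is observed. The flag is only cleared once a
// removal event for the validator is seen in a finalized block.
type Registry struct {
	mu      sync.Mutex
	pending map[ValidatorPubKey]bool
}

func NewRegistry() *Registry {
	return &Registry{pending: make(map[ValidatorPubKey]bool)}
}

// MarkPending records a removal event observed in a not-yet-finalized block.
// Per the comment above, the flag should also be persisted (e.g. to a
// separate file) so it survives a node resync.
func (r *Registry) MarkPending(v ValidatorPubKey) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pending[v] = true
}

// ClearOnFinalizedRemoval removes the flag once the removal event finalizes
// and the validator is actually removed from the node's state.
func (r *Registry) ClearOnFinalizedRemoval(v ValidatorPubKey) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.pending, v)
}

// SkipDuty reports whether duties for the validator must not be performed.
func (r *Registry) SkipDuty(v ValidatorPubKey) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.pending[v]
}

// DropMessage reports whether a message must not be forwarded, given the
// validators referenced by its message ID or by contained partial signatures.
func (r *Registry) DropMessage(referenced []ValidatorPubKey) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, v := range referenced {
		if r.pending[v] {
			return true
		}
	}
	return false
}
```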

The disadvantage of this is that the user still cannot really use the validator differently, as the user does not have a slashing protection database for that validator. However, it is not impossible to e.g. get the slashing protection data from an operator and use it in a conventional validator client.

Note that it is not possible to make this 100% safe, as operators may lose this information, which is why this is not my favoured solution.

Disable validator config

This is similar to the above solution, except it is just support for disabling a certain validator in an operator via a configuration file. The operator would configure this at the request of the validator owner. This would be very helpful for private clusters.

Note that the paragraph regarding the slashing protection data from above still applies.

Happy to hear comments on these ideas from you!

Contributor

y0sher commented Aug 3, 2025


Hi @dknopik,
All of these ideas sound good to me. We were already considering the exit/removal issue, and mitigating it this way sounds reasonable: it keeps the robustness against re-orgs without hurting the UX or the ability to remove validators. Thanks for taking the time to review and add your input. We will update this PR and the code soon to reflect the changes.

I'm not sure the disable validator config feature has to be part of this, though. I understand why it might be a desired feature, but it doesn't have much to do with syncing.
