Description
I've been looking a bit deeper into QBFT and the proposer duty recently. I might be missing some context, but I think there is a liveness issue in the proposer runner (it has probably been discussed in the past, but it's best to document such findings so they can serve as docs if nothing else). Namely,
once the proposer runner has reconstructed the partial RANDAO signatures, it immediately makes a blocking call to the Beacon node (`GetBeaconBlock`), which can take an arbitrary amount of time to finish (production logs sometimes show this request lasting 1-3s). It doesn't even matter whether a timeout is set on this call or not - there is no reason to block here unless the proposer runner is going to be the leader of the QBFT consensus that comes next (and even if it is the leader, blocking here is suboptimal, as described below). Liveness suffers because while the proposer runner is blocked it can't participate in QBFT at all, so it prevents itself from executing the next stage of the duty even though it might never need the Beacon block data it is waiting for - that is, if it never needs to lead a round, or if it uses the justified prepared value when leading round changes instead of its own.
So, instead of blocking on `GetBeaconBlock`, we want to make an asynchronous call (perhaps even with retries, but those would need to be fast) and use the data if/when it arrives. I haven't studied the relevant code in detail, but we'll need to start the QBFT Instance and block its execution only if the proposer runner turns out to be the leader for Round 1 or any of the rounds that follow (plus we'll need to be able to update `Instance.StartValue` once the Beacon data eventually arrives, obviously in a concurrency-safe manner). This way we will be able to participate in the QBFT consensus ASAP, just not as a leader. This is low-hanging fruit; a minimal sketch of the idea follows.
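A rough sketch of that async flow, assuming a hypothetical `fetchBeaconBlock` wrapper around the Beacon node call and a stripped-down `Instance` type (the names are illustrative, not the actual spec-ssv APIs):

```go
package proposer

import (
	"context"
	"sync"
	"time"
)

// Instance is a stripped-down stand-in for the QBFT instance; only the pieces
// relevant to this sketch are shown.
type Instance struct {
	mu         sync.Mutex
	startValue []byte // value this node would propose if it leads a round
}

// SetStartValue updates the proposal data in a concurrency-safe way so the
// instance can pick it up if/when it has to lead a round.
func (i *Instance) SetStartValue(v []byte) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.startValue = v
}

// startDutyAsync starts the QBFT instance right away (so we can process other
// leaders' proposals immediately) and fills in the Beacon block asynchronously.
// fetchBeaconBlock is a hypothetical wrapper around GetBeaconBlock.
func startDutyAsync(
	ctx context.Context,
	inst *Instance,
	fetchBeaconBlock func(ctx context.Context) ([]byte, error),
) {
	go func() {
		// Bound the fetch so a stuck Beacon node can't hold the goroutine forever.
		fetchCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		defer cancel()

		block, err := fetchBeaconBlock(fetchCtx)
		if err != nil {
			// No block: we can still participate as a non-leader; leading a
			// round would have to wait or yield (see the options below).
			return
		}
		inst.SetStartValue(block)
	}()

	// ... the instance keeps consuming QBFT messages here; only the act of
	// proposing has to wait for (or give up on) startValue.
}
```

The point is just that the instance is running and voting from the start, and only the proposing path depends on the Beacon data.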
We can do better than that, though. I see 2 potential options for what we can do if the proposer runner wants to be a leader but can't get the data from the Beacon node:
- leader yielding its round voluntarily: the leader, instead of starting QBFT with a proposal (since it doesn't have a block from the Beacon node), does a sort of round-change - the main difference, I think, is that every node can verify the validity of such a round change immediately upon receiving the message, just by checking the initiator's signature + that it is indeed the leader (for the target slot and current round) - hence it should be pretty fast to change leader this way, and it should also work when the proposer runner is chosen to be the leader of a round change (note: we would also need to 1) record it in `Instance.State` to make sure the yielded leader doesn't attempt to propose in the yielded round, and 2) add `yieldJustifications` to pass along with the Propose msg so that the leader(s) who took over can justify proposing in a round they weren't meant to)
- operators can broadcast blocks fetched from the Beacon node (`GetBeaconBlock`) to other operators in their cluster (those who are expected to perform the same proposer duty) as soon as they fetch them; this way an operator doesn't depend 100% on its own Beacon node to provide the block and can instead take any of the blocks sent by peers in the same cluster and propose it as its own (there are options, but the general idea is: the operator waits a certain amount of time for the Beacon request to execute, let's say 500ms, and once that time is exceeded it falls back to a block provided by peer(s)); we would also need to verify peer blocks with `ProposedValueCheckF` before proposing one as our own (to ensure a peer isn't suggesting we sign a block for a slot we've already signed a different block for, so we can't be slashed; note that we do this check for the block we get from the Beacon node too - as we should), but in terms of security nothing really changes, I think; a rough sketch of the timeout-and-fallback part follows this list. Not sure if this is worth the effort though - if we cannot fetch the block from the Beacon node, we likely won't be able to broadcast it either via that same Beacon node (since the most likely root cause is network issues, which would affect all endpoints), and this might not work for blinded blocks
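For the second option, a rough sketch of the wait-then-fallback selection (the channel names, `valueCheck` standing in for `ProposedValueCheckF`, and the 500ms figure are all illustrative):

```go
package proposer

import (
	"context"
	"time"
)

// pickProposalValue waits up to localWait (e.g. 500ms) for our own Beacon node
// to deliver a block; past that it accepts the first block - ours or a peer's -
// that passes valueCheck.
func pickProposalValue(
	ctx context.Context,
	localBlock <-chan []byte, // result of our own GetBeaconBlock call
	peerBlocks <-chan []byte, // blocks broadcast by peers in the same cluster
	valueCheck func([]byte) error,
	localWait time.Duration,
) ([]byte, error) {
	timer := time.NewTimer(localWait)
	defer timer.Stop()

	// Phase 1: prefer our own Beacon node for up to localWait.
	select {
	case b := <-localBlock:
		if err := valueCheck(b); err == nil {
			return b, nil
		}
	case <-timer.C:
		// Local fetch is taking too long, fall through to peer blocks.
	case <-ctx.Done():
		return nil, ctx.Err()
	}

	// Phase 2: accept whichever valid block arrives first, ours or a peer's,
	// applying the same slashing/validity check either way.
	for {
		select {
		case b := <-localBlock:
			if err := valueCheck(b); err == nil {
				return b, nil
			}
		case b := <-peerBlocks:
			if err := valueCheck(b); err == nil {
				return b, nil
			}
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}
```

Note that the same check guards both the local and the peer-provided blocks, which is why security-wise nothing really changes.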
We can even try to combine both (although that doesn't seem to be any better than just doing approach 2 - if the leader can't receive any peer messages with data to propose, it's unlikely it will do better at sending "yield" messages out; I don't know enough about libp2p to draw a conclusion on this). WDYT?
Btw, it seems other duty types suffer from a similar issue (although they probably have more time allocated for execution than the proposer runner does).
And to go one step further:
- (a whole other topic, but somewhat relevant) it seems duties are "fragile" in the sense that a duty terminates once it encounters any error; this means any "temporary" error (such as loss of network packets between the SSV and Beacon nodes, for example) terminates the duty for good, leaving the operator unable to participate in the QBFT consensus for that particular duty (and I don't see a mechanism for duty restart, which is one way to address it; but addressing each temporary error locally by retrying the offending action is probably the better way to do it, since we don't want to redo all the work and incur the additional unnecessary latency - a minimal retry sketch follows this list)
- (a whole other topic, but somewhat relevant) finally, in case of a Beacon node outage (not a temporary blip) the operator goes offline (it keeps restarting until the Beacon node becomes healthy); not sure if we can do better on this front, but as per my current understanding the operator could stay online and keep signing off on some but not other duty types (e.g. it can keep signing off on Ethereum blocks, while for add/remove validator events it just can't confirm or deny whether they happened), as long as it can check for itself that it won't be committing a slashable Ethereum offense
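On the retry point from the first bullet above, a minimal sketch of retrying the offending action locally instead of terminating the whole duty (the attempt count and delay are illustrative; for a proposer duty they would have to stay well inside the slot deadline):

```go
package duty

import (
	"context"
	"time"
)

// retryTransient retries fn a few times with a short fixed delay instead of
// letting a single transient error (a dropped packet, a brief Beacon node
// hiccup) terminate the whole duty.
func retryTransient(
	ctx context.Context,
	attempts int,
	delay time.Duration,
	fn func(ctx context.Context) error,
) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err // give up only after the retry budget is exhausted
}
```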