
Conversation

@rogerANZA
Contributor

No description provided.

@rogerANZA rogerANZA changed the title Simple Alpenglow Clock SIMD-0363: Simple Alpenglow Clock Sep 19, 2025
@ksn6
Contributor

ksn6 commented Sep 22, 2025

Could you articulate how these clock conditions interact with fast leader handover / a changing parent + the fact that your approach still works under these conditions?

Separately - with this SIMD, it may be worth including a "parent block (slot, hash)" field in the block footer in addition to the existing clock. This would be the "final parent" of the block, taking fast leader handover into account. Two good things happen as a result of this:

(1) given the final parent, we can skip a block with a bad clock quickly on shred ingest rather than after replay, since the first shred of the last block marker will contain both the clock and the last parent.

(2) repair in a block with an UpdateParent marker gets slightly better because now there are two shreds with the "final parent" that race each other

EDIT: fixed an issue with comment (1) above.


In Alpenglow, the current block leader includes an updated integer clock value
(Rust system time in nanoseconds) in its block *b* in slot *s* in the block
footer. This value is bounded by the clock value in the parent block.
Contributor

At what point should we get the system clock time? When it is time to create our leader slot? When we actually send out the block footer? Or do we just refer to what the current clock sysvar says?

Contributor Author

You insert your system time when you produce the slice with the clock info in the block marker.

Contributor
@ksn6 ksn6 Sep 22, 2025

I mean - what precisely do you mean by "system time when you produce the slice?"

E.g. is this:

  • The beginning of block production for a slot?
  • The end of block production for a slot, post block building?
  • When we actually create the block footer, post block building, and pre-shredding for turbine dissemination?

The specific choice isn't really enforceable, although, for default implementation purposes / consistency, it's probably worth opining on the specifics in the SIMD.

Contributor Author

All of these are possible (and enforceable). This depends on how we organize meta-data in a block (i.e., our discussion in Chicago). If we may change the parent, then the header (the first slice) is the wrong place, but the footer (the last slice) should be alright.

Contributor

For now, it looks like nothing prevents a validator from reporting a timestamp that is off by +/- 50 ms; an honest validator could equally report a timestamp from the beginning, the middle, or the end of block production.

To make the notion of a clock somewhat uniform across validator implementations, we should probably specify roughly at what point in a leader's block production the timestamp should be captured.

If this can somehow be enforced, that's even better imho.

Contributor Author

For now, it looks like nothing prevents a validator from reporting a timestamp that is off by +/- 50 ms...

Well, I would say it can even be up to 800 ms wrongly reported. Nothing we can do here.

But we established that the clock should be in the block footer, and we established that it should be captured before putting it in there, so what's still unclear?

Contributor
@ksn6 ksn6 Sep 24, 2025

All of these are possible (and enforceable).

Well, I would say it can even be up to 800 ms wrongly reported. Nothing we can do here.

This part is a bit unclear - these two statements seem a bit contradictory. I agree with the second statement you're making re. being off by up to 800 ms, though.

But we established that the clock should be in the block footer, and we established that it should be captured before putting it in there, so what's still unclear?

The part that's unclear: at what exact point in the leader's block production phase should we have leaders set the block timestamp?

Should the timestamp be associated with when the leader starts producing their block? When the leader conclusively knows what the block parent is? When the leader actually constructs the footer itself?

I'm aware that it's impossible to enforce the particular point in time, as you pointed out re. the 800 ms piece. This being said, it's worth having a sane default.

Contributor Author

"We assume that a correct leader inserts its correct local time..."

Maybe this is not clear enough? When you insert the time value in the footer, you insert your local time at that point. This way we have the least (additional) skew. Note that we will still have a systematic lag, since the slice then has to be encoded, sent through Rotor/Turbine, and decoded on the receiver side. We might consider taking that systematic lag into account.
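For illustration, here is a minimal sketch of "insert your local time at that point" in Rust, assuming a footer builder that runs immediately before the footer slice is produced (the builder function is hypothetical; the footer fields follow the BlockFooterV1 layout from SIMD 0307 discussed later in this thread):

use std::time::{SystemTime, UNIX_EPOCH};

pub struct BlockFooterV1 {
    pub block_producer_time_nanos: u64,
    pub block_user_agent: Vec<u8>,
}

// Hypothetical helper: called right before the footer goes into the final
// slice, so the captured wall-clock time carries the least additional skew.
fn build_block_footer(user_agent: Vec<u8>) -> BlockFooterV1 {
    let now_nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system time is before the UNIX epoch")
        .as_nanos() as u64;
    BlockFooterV1 {
        block_producer_time_nanos: now_nanos,
        block_user_agent: user_agent,
    }
}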

@rogerANZA
Contributor Author

Could you articulate how these clock conditions interact with fast leader handover / a changing parent + the fact that your approach still works under these conditions?

This should be mentioned in the fast leader handover SIMD in my opinion. (1) If a parent is changed and we consider everything before the parent change irrelevant, then we must include the clock again. (2) If the metadata of the block before the parent change is still relevant even after the parent change, then we don't include it again. In any case, we must have exactly 1 valid clock entry per block. But mentioning it in the fast handover SIMD is more natural because it's the same exception for all the cases.

Separately - with this SIMD, it may be worth including a "parent block (slot, hash)" field in the block footer in addition to the existing clock. This would be the "final parent" of the block, taking fast leader handover into account.

We could decide to have the clock always in the last slice (if we have other information that always goes in the last slice). If this is the way to go, then we can describe it here as well.

Thanks for these good questions.

@ksn6
Contributor

ksn6 commented Sep 22, 2025

(1) If a parent is changed and we consider everything before the parent change irrelevant, then we must include the clock again

What do you mean by "we must include the clock again?"

The block footer, which is where the clock value will reside, will appear exactly once per block, after the final entry batch (see SIMD 0307).

Could we mention SIMD 0307 + specify that it's specifically the block_producer_time_nanos: u64 field in the footer that will store the clock you're referring to in this SIMD?

We could decide to have the clock always in the last slice (if we have other information that always goes in the last slice). If this is the way to go, then we can describe it here as well.

The clock won't always be in the last slice / FEC Set, but rather in the last block marker, which will eventually span multiple FEC Sets.

This is why I'm suggesting that we place parent block information into the footer as part of this SIMD - if (1) the clock and (2) the final parent are both included within the same shred, we can run the clock check in this SIMD exactly once, directly in shred ingest, prior to replay.

@rogerANZA
Contributor Author

rogerANZA commented Sep 23, 2025

This is why I'm suggesting that we place parent block information into the footer as part of this SIMD

Sure, footer is okay. In fact, line 40 already says footer.

@ksn6
Contributor

ksn6 commented Sep 23, 2025

Sure, footer is okay. In fact, line 40 already says footer.

I think there's a bit of confusion here. Yes, we're in agreement that the clock should go into the footer; e.g., SIMD 0307 specifies this. This isn't what I'm referring to, though.

To clarify - at the moment, the only fields included in the block footer are:

  • user_agent: Vec<u8> denoting the validator user agent (e.g., Agave / Firedancer / etc., with mods)
  • block_producer_time_nanos: u64: the clock field that this SIMD aims to impose constraints on

I'm saying that, in this SIMD, we should consider proposing a new third field to the block footer of type (Slot, Hash) which denotes the (parent, block ID) of the block associated with this block footer. This third field will allow us to impose the clock check on shred ingest.
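For concreteness, a purely illustrative sketch of what such an extended footer could look like (the type and field names are hypothetical, not part of SIMD 0307, and this field was later dropped, as noted further down in the thread):

pub type Slot = u64;
pub type Hash = [u8; 32];

// Hypothetical footer with the proposed third field; the (Slot, Hash) pair
// identifies the final parent after any fast leader handover.
pub struct BlockFooterWithParent {
    pub block_producer_time_nanos: u64,
    pub block_user_agent: Vec<u8>,
    pub parent_block: (Slot, Hash),
}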

@rogerANZA
Contributor Author

SIMD 0307 specifies this. [...] I'm saying that, in this SIMD, we should consider proposing a new third field to the block footer of type (Slot, Hash) which denotes the (parent, block ID) of the block associated with this block footer. This third field will allow us to impose the clock check on shred ingest.

I don't disagree. But this should rather be in SIMD 307.

@ksn6
Contributor

ksn6 commented Sep 24, 2025

FYI - after a few conversations, looks like we'll be punting on placing a third field in the block footer denoting the parent. We plan on accomplishing this via other means (can elaborate if there's interest) in later work.

@rogerANZA
Contributor Author

I'm saying that, in this SIMD, we should consider proposing a new third field to the block footer of type (Slot, Hash) which denotes the (parent, block ID) of the block associated with this block footer. This third field will allow us to impose the clock check on shred ingest.

Okay, now I understand your argument.

Why only include the parent and the slot of the parent, and not also the actual time of that parent slot? Then our check is even easier because all the data is already there...

Would you agree that this is a slippery slope? Including the same information a second time is problematic in my opinion. Now we would additionally have to check that the second inclusion of the information is equal to the first inclusion. What if it is not? Is the block then still skipped?

@rogerANZA
Contributor Author

To include the parent (again) is reasonable for repair. But this is orthogonal to this clock SIMD, isn't it?

Comment on lines +105 to +106
## Security Considerations


Assuming my understanding is correct that:

  • If no blocks are produced, then the next leader can set the time to be the same as the last (no change) or up to 2 times the time since the last block;
  • OR, if no slots are produced at all (no skips), the clock will simply halt, with no chance for the next leader to place the correct time.

Then I think the following risk scenario is worth considering:

  • A money market (MM), let's say marginfi, uses Pyth pull oracles, where a user is able to update the oracle and then use the MM.
  • The chain halts for 1 hour, so the clock is now 1 hour behind; imagine the halt coincided with market volatility.
  • The attacker uses a 1-hour-old Pyth payload to open insolvent positions on the MM.

This is an example where the clock is not off by a few seconds but can be off by hours. In this scenario most current MMs would be vulnerable as they use the following code:

        let price = price_feed_account
            .get_price_no_older_than_with_custom_verification_level(
                clock,
                max_age,
                feed_id,
                MIN_PYTH_PUSH_VERIFICATION_LEVEL,
            )
            .map_err(|e| {
                debug!("Pyth push oracle error: {:?}", e);
                let error: MarginfiError = e.into();
                error
            })?;

Which in turn relies on the following check:

        check!(
            price
                .publish_time
                .saturating_add(maximum_age.try_into().unwrap())
                >= clock.unix_timestamp,
            GetPriceError::PriceTooOld
        );

This check will be completely broken until the clock catches up, allowing stale prices to be pushed.

Caveat

The same issue exists with the current clock, which I believe will have the price for block N use the votes from N-1, which of course will be pre-halt and thus stale. That said, it will correct within a few slots, as opposed to this clock, which will have a much longer vulnerability window.

Contributor Author

Assuming my understanding is correct that:

  • If no blocks are produced, then the next leader can set the time to be the same as the last (no change) or up to 2 times the time since the last block;

Basically, yes. However, the new value has to be strictly higher (just by 1 tick though).

... Chain halts for 1 hour ...

Wait, what? So you're saying that the whole chain is down for a full hour? None of the expected 9,000 blocks in that hour appended? I would say that in this case we have much bigger problems, don't we?

What we could do, of course, is narrow the time the leader can choose in such a case, in the most extreme case even narrow it down to basically 1 hour +/- 400 ms. (This is an oversimplification: if the chain was really down for 1 hour, our emergency protocol would kick in, and slots would eventually get exponentially longer.)

But, essentially, we could change the formula, and give the leader a narrower window of choice if the chain was down for a very long time.

Would that make it better?
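To make the narrower-window idea concrete, here is a rough sketch of such a bound check; the constants, the 2x factor, and the narrowing rule after a long gap are placeholders, not the formula proposed in this SIMD:

// Rough sketch of the bound check a validator could run on the footer clock.
// `parent_nanos` is the parent block's footer clock and `elapsed_slots` is
// the number of slots since that parent.
const SLOT_NANOS: u64 = 400_000_000; // nominal 400 ms slot duration
const LONG_GAP_SLOTS: u64 = 1_000;   // placeholder threshold for a "long outage"

fn clock_in_bounds(parent_nanos: u64, proposed_nanos: u64, elapsed_slots: u64) -> bool {
    // The proposed clock must be strictly greater than the parent's clock.
    if proposed_nanos <= parent_nanos {
        return false;
    }
    let nominal = elapsed_slots.saturating_mul(SLOT_NANOS);
    let max_delta = if elapsed_slots > LONG_GAP_SLOTS {
        // After a long gap, only allow the nominal elapsed time plus a small
        // slack instead of the full 2x factor (placeholder narrowing rule).
        nominal.saturating_add(SLOT_NANOS)
    } else {
        // Normal case: allow up to twice the nominal time since the parent.
        nominal.saturating_mul(2)
    };
    proposed_nanos - parent_nanos <= max_delta
}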


I think the core protocol is sound and desirable in 99.99% of operation. In the 0.01% where we have a chain halt it would be ideal if:

  • The protocol restarted with the correct time
  • OR; There was a way for onchain applications to detect that a halt/emergency mode had occurred

The former of the two is desirable as it protects applications that rely on semi-accurate time without those applications needing to change their logic.

So you're saying that the whole chain is down for a full hour? None of the expected 9,000 blocks in that hour appended? I would say that in this case we have much bigger problems, don't we?

It's entirely feasible that the Solana blockchain halts during a liquidation cascade that promptly reverts. Now, if money markets come back online after the recovery, they should have minimal bad debt. However, if the clock lags, then these money markets will observe all of those prices as the clock catches up. In this case an attacker will be able to open positions at the maximally depressed prices, resulting in much more bad debt and potentially blowing up all on-chain lending protocols in the worst case.

The reasons chain halts are bad are:

  1. Continuous services stop (CLOBs, payments): not scary.
  2. Protocol assumptions break down (liquidations, oracle update timeliness): very scary.

Unfortunately, the only way to address the liquidations part of 2 is to have 100% uptime. However, we can address oracle update timeliness by providing an accurate post-restart clock.

But, essentially, we could change the formula, and give the leader a narrower window of choice if the chain was down for a very long time.

Would that make it better?

This sounds perfect, how easy is this to achieve?

Contributor Author

This sounds perfect, how easy is this to achieve?

Not difficult. The question is how exactly we would be doing it.

  1. The simplest way is narrowing the window (as suggested in the previous answer).

There might be more elaborate ways (also easy to implement):

  1. For instance, after a long outage without any blocks, have the next 3 or 5 (or your favorite odd number) leaders propose a time, and take the median as the new starting time. But there are downsides to this solution; in particular, we would have 3 (or 5) blocks without a good time.

Contributor Author

RE solution 1: For a 1-hour outage, you say that choosing a new time in [epsilon, 2 hours] is too much freedom. What makes most sense? [1 hour - delta, 1 hour + delta] for what delta? What's the maximum delta that still makes sense? (A larger delta is better because it increases the chances that the actual time is in the interval.)


Hmm, I guess the question might boil down to, is it worse to:

  • Take a dependency on NTP (but only during a restart/major slot skip).
  • OR, accept that it's possible for the clock to be 30m+ out of sync while it's catching up.

If we take a dependency on NTP conditional on some critical failure having occurred prior, what do you see as the increased risk to the protocol? I would assume we now have a risk that a % of our validators have poisoned NTP and thus we cannot restart the protocol until the NTP issues are manually corrected/overridden?

Contributor Author

A dependency on NTP is a big NO from my side. It directly affects consensus, and I cannot prove correctness of consensus anymore. And the 30 minute skew you only get after a 30 minute outage (which we will never have, fingers crossed).

BUT: It's good that you raised the question, and that we are aware of it. I will (eventually) think more seriously about it. At this point, I would suggest that we eventually have a special program which can be used to forward the blockchain time. Anybody can propose a vote to jump to a future time, then all the stakers can suggest a time, and we take the median of that. If enough stake participates (sends its time vote to the program in the allowed time frame), we jump there. I don't think we need to immediately implement this, but it's good to have the thought ready if we need it.
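As a rough illustration of the final step, a stake-weighted median over the submitted time votes could be computed along these lines; weighting by stake, the quorum rule, and the function shape are all assumptions here, since only "median" and "enough stake participates" are stated above:

// Sketch: stake-weighted median of time votes submitted to the hypothetical
// clock-forwarding program. Returns None if the participating stake is below
// the (placeholder) quorum threshold.
fn stake_weighted_median(mut votes: Vec<(u64, u64)>, total_stake: u64) -> Option<u64> {
    // votes: (stake, proposed_nanos) pairs.
    let participating: u64 = votes.iter().map(|(stake, _)| *stake).sum();
    // Placeholder quorum: require more than 2/3 of total stake to participate.
    if participating * 3 <= total_stake * 2 {
        return None;
    }
    // Sort by proposed time and walk up until half of the participating
    // stake is covered; that vote is the stake-weighted median.
    votes.sort_by_key(|&(_, nanos)| nanos);
    let mut acc = 0u64;
    for (stake, nanos) in votes {
        acc += stake;
        if acc * 2 >= participating {
            return Some(nanos);
        }
    }
    None
}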


Okay, that sounds fair. The only thing to flag is that it would be ideal if the clock time were fixed before any DeFi transactions were processed. For instance, if it took 32 slots (random number I chose) to fix the clock, then ideally no user txs would be processed for those first 32 slots, to prevent the looting of vulnerable DeFi protocols.

Though agreed, ideally we don't have a 30-minute outage ever again... or at least it doesn't coincide with large market volatility.


We could try to expose this information to the protocols and have the vulnerable protocols do this logic of dismissing all transactions for some time. But I would definitely not want to halt all user transactions just because some protocols have a bug.


If there's a way to expose that the clock is/may be out of sync to the application layer then I agree that would solve this. My fear was that doing so would be roughly equivalent to reaching consensus on the clock time itself but if that's not the case then perfect.

@ksn6
Contributor

ksn6 commented Oct 29, 2025

@rogerANZA here are implementation details we'll want to include in the SIMD:

Implementation

In proposing a block, a leader will include a special marker called a "block footer," which stores a UNIX timestamp (in nanoseconds). As of the writing of this SIMD, the block_producer_time_nanos field of BlockFooterV1 stores the clock:

/// Version 1 block footer containing production metadata.
///
/// The user agent bytes are capped at 255 bytes during serialization to prevent
/// unbounded growth while maintaining reasonable metadata storage.
///
/// # Serialization Format
/// ```text
/// ┌─────────────────────────────────────────┐
/// │ Producer Time Nanos          (8 bytes)  │
/// ├─────────────────────────────────────────┤
/// │ User Agent Length            (1 byte)   │
/// ├─────────────────────────────────────────┤
/// │ User Agent Bytes          (0-255 bytes) │
/// └─────────────────────────────────────────┘
/// ```
#[derive(Clone, PartialEq, Eq, Debug)]
pub struct BlockFooterV1 {
    pub block_producer_time_nanos: u64,
    pub block_user_agent: Vec<u8>,
}
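For reference, a serializer matching the layout drawn above might look like this (a sketch only; little-endian byte order is an assumption, since the diagram does not specify it):

impl BlockFooterV1 {
    // Serialize per the layout above: 8-byte producer time, 1-byte user
    // agent length, then up to 255 user agent bytes. Little-endian byte
    // order is assumed here, not specified by the diagram.
    pub fn to_bytes(&self) -> Vec<u8> {
        let capped = self.block_user_agent.len().min(255);
        let agent = &self.block_user_agent[..capped];
        let mut out = Vec::with_capacity(8 + 1 + capped);
        out.extend_from_slice(&self.block_producer_time_nanos.to_le_bytes());
        out.push(capped as u8);
        out.extend_from_slice(agent);
        out
    }
}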

Upon receiving a block footer from a leader, non-leader validators will:

  • Locally update the clock sysvar associated with the bank
  • During replay, validate the bounds of this clock sysvar with respect to the parent bank's clock sysvar and proceed as outlined in "Detailed Design"

Separately, could you add my name to the authors list please (ksn6)? Now that we have the ability to disseminate block components within the Alpenglow prototype, I'll be working on the above items to implement the clock checks.

@ksn6
Contributor

ksn6 commented Oct 29, 2025

We could try to expose this information to the protocols and have the vulnerable protocols do this logic of dismissing all transactions for some time. But I would definitely not want to halt all user transactions just because some protocols have a bug.

If there's a way to expose that the clock is/may be out of sync to the application layer then I agree that would solve this. My fear was that doing so would be roughly equivalent to reaching consensus on the clock time itself but if that's not the case then perfect.

@OliverNChalk @qkniep - any further updates on this?

In the interest of getting this SIMD across the finish line, I'd strongly prefer coming up with a solution to this in a follow-up SIMD, while keeping this SIMD about the design for when the chain is online.

@OliverNChalk

Works for me - happy for @qkniep to propose exposing additional information to DeFi apps in a subsequent proposal
