Skip to content

chore: Add doc and rename function for flushing strategy #740

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

lym953
Copy link
Contributor

@lym953 lym953 commented Jul 11, 2025

Motivation

It took me quite some effort to understand flushing strategies. I want to make it easier to understand for me and future developers.

This PR

Tries to make flushing strategy code more readable:

  1. Add/move comments
  2. Create an enum ConcreteFlushStrategy, which doesn't contain Default because it is required to be resolved to a concrete strategy
  3. Rename should_adapt to evaluate_concrete_strategy()

To reviewers

There are still a few things I don't understand, which are marked with TODO. Appreciate explanation!
Also correct me if any comment I added is wrong.

@@ -8,13 +8,42 @@ pub struct PeriodicStrategy {

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FlushStrategy {
// Flush every 1s and at the end of the invocation
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some doc is moved from flush_control.rs

@@ -57,12 +52,15 @@ impl FlushControl {
i.set_missed_tick_behavior(Skip);
i
}
// TODO: Why is this 15 minutes?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need help here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the history/blame can be helpful here: #599

This timeout is used for the race flush block. But when users want to flush at end, we should never hit the race flush. Initially it was set to a max unsigned int but that caused tokio to integer overflow when calculating the timeout

@@ -40,22 +47,23 @@ impl InvocationTimes {
let should_adapt = (elapsed as f64 / (LOOKBACK_COUNT - 1) as f64) < ONE_TWENTY_SECONDS;
if should_adapt {
// Both units here are in seconds
// TODO: What does this mean?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need help here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment addresses an earlier misconception when the author (me) thought that the flush_timeout was in milliseconds, so the elapsed time was then in seconds and it never hit.

The logic says that if we should adapt to a periodic strategy (which is defined above on 47, if the last 20 requests arrived within 2 minutes). The next bit decides if we should use the continuous strategy or if we should use a regular periodic strategy, which from your comment above on line 37-41 seems you've got handled pretty well.

I hope this helps!

Copy link
Contributor Author

@lym953 lym953 Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I was not clear. What I don't understand is why compare elapsed with flush_timeout.

For example, if flush_timeout is 30s, and there's one invocation every 1.6s, then elapsed will be 1.6s * 19 = 30.4s, and we will choose the synchronous "periodic" strategy with 20s interval. If flush takes 29s, then this will block lots of invocations.

Should we compare the average invocation interval (elapsed / (LOOKBACK_COUNT - 1)) with flush_timeout instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that we want to make sure the periodicity is fast enough that nonblocking flushing is unlikely to be broken by lambda pausing/freezing the CPU between invocations. Because the client will timeout after the flush_timeout, we make this comparison. We're just choosing 20 invocations arbitrarily to pick "very fast functions".

It's unrelated to what happens if a periodic flush takes 29s. If we choose the periodic strategy, we'll block on the call to /next until the periodic flush is complete. If we choose the continuous flush, we don't.

Copy link
Contributor Author

@lym953 lym953 Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the overall idea that the threshold of avg interval should be positively correlated to flush_timeout, but I don't get why we directly compare elapsed (which is 19 * avg interval) with flush_timeout. I'm thinking of an alternative threshold, which more aggressively chooses the continuous strategy: (ignore int/float conversions and division errors in the code below)

let avg_interval = elapsed / (LOOKBACK_COUNT - 1);
if avg_interval * 4 < flush_timeout {
  return Continuously;
} else {
  return Periodically;
}

This is similar to the current code, except that we use 4 here instead of 19. About this threshold:

  1. Usually it's not likely that the actual interval between any two invocations is longer than 4 * avg_interval, so the freezing duration is unlikely longer than 4 * avg_interval, so the risk of flush timeout is small.
  2. On the up side, it brings the benefits of the continuous strategy for a wider range of avg_interval values, i.e. we will transition from Periodic to Continuous when it's in the range of (flush_timeout / 19, flush_timeout / 4), i.e. (1.58s, 7.5s) if flush_timeout == 30s.

What do you think?

We can talk in person tomorrow if that's easier.

@lym953 lym953 marked this pull request as ready for review July 11, 2025 20:43
@lym953 lym953 requested a review from a team as a code owner July 11, 2025 20:43
@lym953 lym953 force-pushed the yiming.luo/flush-strategy-doc branch from d280bea to 09e9c06 Compare July 14, 2025 18:58
@lym953
Copy link
Contributor Author

lym953 commented Jul 14, 2025

Need another review. I made evaluate_concrete_strategy() accept a FlushStrategy to avoid the unreachable!().

@lym953 lym953 requested a review from astuyve July 15, 2025 14:11
(elapsed as f64 / (LOOKBACK_COUNT - 1) as f64) < ONE_TWENTY_SECONDS;
if should_adapt {
// Both units here are in seconds
// TODO: What does this mean?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that answer your question (in the other thread)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants