Fix test flake related to slow inventory collection #8763

Open · smklein wants to merge 1 commit into main

Conversation

@smklein (Collaborator) commented Aug 4, 2025

This fixes #8756 by setting the NTP and CRDB admin service timeouts to five seconds.

As documented in #8756 (comment), we are hitting timeouts during inventory collection: worst-case behavior contacting the CRDB admin service and the NTP admin service on Helios, for still-unknown reasons.

We still need to work on #8762, but this PR should mitigate most of the observed flakes.
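
For context, here is a minimal sketch of the shape of this change, not the actual omicron code. It assumes Progenitor's generated `new_with_client` constructor and reqwest's builder API; the crate name `ntp_admin_client` and the helper function are hypothetical stand-ins.

```rust
use std::time::Duration;

// Hypothetical helper; `ntp_admin_client` stands in for whichever
// Progenitor-generated admin-service client crate is being constructed.
fn bounded_ntp_admin_client(baseurl: &str) -> ntp_admin_client::Client {
    let reqwest_client = reqwest::ClientBuilder::new()
        // Bound both connection setup and the total request, so a hung
        // admin service cannot stall inventory collection indefinitely.
        .connect_timeout(Duration::from_secs(5))
        .timeout(Duration::from_secs(5))
        .build()
        .expect("failed to build reqwest client");
    // Progenitor-generated clients expose `new_with_client`, which takes
    // a caller-supplied reqwest::Client instead of building a default one.
    ntp_admin_client::Client::new_with_client(baseurl, reqwest_client)
}
```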

@smklein (Collaborator, Author) commented Aug 4, 2025

@plotnick - with this PR, `cargo nt --no-fail-fast -j32 -- crucible` is reliably passing in a loop for me.

I assume it's still hitting timeouts for some services, as tracked by #8762, but this should no longer be causing the whole test suite to fail.

@plotnick (Contributor) left a comment

(Draft comment accidentally posted while GitHub was freaking out.)

@davepacheco (Collaborator) commented

It seems like this timeout applies to all inventory collection. Couldn't this be problematic for real systems?

@smklein (Collaborator, Author) commented Aug 5, 2025

> This is definitely a huge improvement, thank you for digging into it! Seems like there are still some timing-dependent or otherwise flaky tests (I'm still getting at least one failure on each full run), but this definitely fixes the large numbers of failures I was seeing previously.

#8762 is still open for this reason - it definitely merits more investigation - but IMO the source of hitting the "30 second timeout" was the long timeout on CRDB, and the longer timeout waiting for NTP.

> It seems like this timeout applies to all inventory collection. Couldn't this be problematic for real systems?

It could. But these timeouts are still five seconds long - do you think there's going to be a significant difference in prod between 5 and 15 seconds to contact one of these services?

I made this change mostly to help deflake main, while we still dig into #8762 to try to uncover the real issue. I'm also concerned about that one hitting us in prod - whatever the root cause is for our tests, it seems plausible we could also hit our worst-case timeout, which would be unbounded.

At a minimum, I think it's necessary to set a timeout at all on the Progenitor client contacting NTP. If we think the 5-vs-15-second difference is critical, we can keep using 15 seconds, but we'll keep seeing test flakes.
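
To make the "unbounded" point above concrete: a `reqwest::Client` has no overall request timeout unless one is set explicitly, so a service that accepts a connection but never responds will hang the caller forever. A sketch, assuming nothing beyond stock reqwest:

```rust
use std::time::Duration;

fn main() -> Result<(), reqwest::Error> {
    // Default client: no total-request timeout, so a peer that accepts
    // the connection but never answers blocks the request indefinitely.
    let _unbounded = reqwest::Client::new();

    // Any explicit timeout bounds the worst case; whether it should be
    // 5 or 15 seconds is a tuning question on top of that.
    let _bounded = reqwest::ClientBuilder::new()
        .timeout(Duration::from_secs(5))
        .build()?;
    Ok(())
}
```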

@plotnick (Contributor) left a comment

I have the same question as Dave (will this negatively impact production systems?), but I can report that this is a huge improvement locally. With this change, I'm back to pre-#8603 failure levels on my test system - usually one or two failures (or sometimes none) per full run - and I can run just the crucible tests consistently with no failures. I think we still have some timing issues in the tests, but this definitely fixes the primary failure we were seeing on Helios. Thanks for digging into it!

@davepacheco (Collaborator) commented

Yeah, I'm pretty worried 5 seconds wouldn't be enough in real situations and that even 15 might not be enough. I can see the need to eliminate flakes, so I guess we should prioritize #8603. But maybe we should wait for R16 to ship (or, rather, for the gates to open for R17) before landing this so we don't introduce this risk right before shipping.

@smklein (Collaborator, Author) commented Aug 6, 2025

With the discoveries in #8762, I'm not going to proceed with this PR.

As a short-term fix: if you see flakiness locally, you can run `pfexec ipadm set-prop -t -p _rst_sent_rate=100000 tcp` (or use some other high number; drop the `-t` flag to make it persist, if you're brave).

As for main, and the jobs running through buildomat - do y'all have strong opinions? Should the test suite be changing this value, or do y'all just want to live with the flake until we get optimizations for this localhost network traffic landed in Helios?

Linked issue: Many tests flaky after #8603 (#8756)