Fix test flake related to slow inventory collection #8763
base: main
Conversation
(Draft comment accidentally posted while GitHub was freaking out.)
It seems like this timeout applies to all inventory collection. Couldn't this be problematic for real systems?
#8762 is still open for this reason - it definitely merits more investigation - but IMO the source of hitting the "30 second timeout" was the long timeout on CRDB combined with the even longer timeout waiting for NTP.
It could. But these timeouts are still five seconds long - do you think there's going to be a significant difference in prod between 5 vs 15 seconds to contact one of these services? I made this change mostly to help deflake these tests. At a minimum, I think it's necessary to set some timeout on the progenitor client contacting NTP. If we think the 5 vs 15 second difference is critical, we can keep using 15 seconds, but we'll keep seeing test flake.
I have the same question as Dave (will this negatively impact production systems?), but can report this is a huge improvement locally. With this change, I'm back to pre-#8603 failure levels on my test system, which is usually one or two failures (or sometimes none) per full run; and I can run just the crucible tests consistently with no failures. I think we still have some timing issues in the tests, but this definitely fixes the primary failure we were seeing on Helios. Thanks for digging into it!
Yeah, I'm pretty worried 5 seconds wouldn't be enough in real situations and that even 15 might not be enough. I can see the need to eliminate flakes, so I guess we should prioritize #8603. But maybe we should wait for R16 to ship (or, rather, for the gates to open for R17) before landing this so we don't introduce this risk right before shipping.
With the discoveries in #8762, I'm not going to proceed with this PR. As a short-term fix: if you see flakiness locally, you can run … As for …
This fixes #8756 by setting the NTP and CRDB timeouts to five seconds.
As documented in #8756 (comment), we are hitting timeouts in inventory collection: for still-unknown reasons, we see worst-case behavior contacting the CRDB admin service and the NTP admin service on Helios.
We still need to work on #8762, but this PR should mitigate most of the observed flakes.
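For context, a minimal sketch of the kind of change being discussed, assuming the admin services are reached through progenitor-generated clients backed by `reqwest`. The function name, crate name in the usage comment, and base URL are placeholders for illustration, not taken from this PR:

```rust
use std::time::Duration;

/// Illustrative only: build a `reqwest` client with an explicit
/// five-second timeout. Progenitor-generated clients accept a
/// pre-built `reqwest::Client` via their `new_with_client`
/// constructor, so a hung admin service can't stall inventory
/// collection indefinitely.
fn short_timeout_client() -> reqwest::Client {
    reqwest::ClientBuilder::new()
        // Per-request timeout; the review discussion above is about
        // whether 5 seconds is safe for production vs. the earlier 15.
        .timeout(Duration::from_secs(5))
        .connect_timeout(Duration::from_secs(5))
        .build()
        .expect("failed to build reqwest client")
}

// Hypothetical usage (crate name and base URL are placeholders):
// let client = ntp_admin_client::Client::new_with_client(
//     "http://[::1]:12345",
//     short_timeout_client(),
// );
```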