Fix test flake related to slow inventory collection #8763

Open · smklein wants to merge 1 commit into main

Conversation

@smklein (Collaborator) commented Aug 4, 2025

This fixes #8756 by setting the NTP and CRDB admin service timeouts to five seconds.

As documented in #8756 (comment), we are hitting timeouts during inventory collection: worst-case behavior contacting the CRDB admin service and the NTP admin service on Helios, for still-unknown reasons.

We still need to work on #8762, but this PR should mitigate most of the observed flakes.
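
For context, here is a minimal sketch of the shape of this change, not the actual omicron code. It assumes Progenitor's generated `new_with_client` constructor and reqwest's builder API; the crate name `ntp_admin_client` and the helper function are hypothetical stand-ins.

```rust
use std::time::Duration;

// Hypothetical helper; `ntp_admin_client` stands in for whichever
// Progenitor-generated admin-service client crate is being constructed.
fn bounded_ntp_admin_client(baseurl: &str) -> ntp_admin_client::Client {
    let reqwest_client = reqwest::ClientBuilder::new()
        // Bound both connection setup and the total request, so a hung
        // admin service cannot stall inventory collection indefinitely.
        .connect_timeout(Duration::from_secs(5))
        .timeout(Duration::from_secs(5))
        .build()
        .expect("failed to build reqwest client");
    // Progenitor-generated clients expose `new_with_client`, which takes
    // a caller-supplied reqwest::Client instead of building a default one.
    ntp_admin_client::Client::new_with_client(baseurl, reqwest_client)
}
```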

@smklein (Collaborator, Author) commented Aug 4, 2025

@plotnick - with this PR, `cargo nt --no-fail-fast -j32 -- crucible` is reliably passing in a loop for me.

I assume it's still hitting timeouts for some services, as tracked by #8762, but this should no longer be causing the whole test suite to fail.

@plotnick (Contributor) left a comment

(Draft comment accidentally posted while GitHub was freaking out.)

@davepacheco (Collaborator) commented

It seems like this timeout applies to all inventory collection. Couldn't this be problematic for real systems?

@smklein (Collaborator, Author) commented Aug 5, 2025

> This is definitely a huge improvement, thank you for digging into it! Seems like there are still some timing-dependent or otherwise flaky tests (I'm still getting at least one failure on each full run), but this definitely fixes the large numbers of failures I was seeing previously.

#8762 is still open for this reason - it definitely merits more investigation - but IMO the source of hitting the "30 second timeout" was the long timeout on CRDB, and the longer timeout waiting for NTP.

> It seems like this timeout applies to all inventory collection. Couldn't this be problematic for real systems?

It could. But these timeouts are still five seconds long - do you think there's going to be a significant difference in prod between 5 and 15 seconds to contact one of these services?

I made this change mostly to help deflake main, while we still dig into #8762 to try to uncover the real issue. I'm also concerned about that one hitting us in prod - whatever the root cause is for our tests, it seems plausible we could also hit our worst-case timeout, which would be unbounded.

At a minimum, I think it's necessary to set a timeout at all on the Progenitor client contacting NTP. If we think the 5-vs-15-second difference is critical, we can keep using 15 seconds, but we'll keep seeing test flakes.
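
To make the "unbounded" point above concrete: a `reqwest::Client` has no overall request timeout unless one is set explicitly, so a service that accepts a connection but never responds will hang the caller forever. A sketch, assuming nothing beyond stock reqwest:

```rust
use std::time::Duration;

fn main() -> Result<(), reqwest::Error> {
    // Default client: no total-request timeout, so a peer that accepts
    // the connection but never answers blocks the request indefinitely.
    let _unbounded = reqwest::Client::new();

    // Any explicit timeout bounds the worst case; whether it should be
    // 5 or 15 seconds is a tuning question on top of that.
    let _bounded = reqwest::ClientBuilder::new()
        .timeout(Duration::from_secs(5))
        .build()?;
    Ok(())
}
```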

@plotnick (Contributor) left a comment

I have the same question as Dave (will this negatively impact production systems?), but I can report that this is a huge improvement locally. With this change, I'm back to pre-#8603 failure levels on my test system - usually one or two failures (or sometimes none) per full run - and I can run just the crucible tests consistently with no failures. I think we still have some timing issues in the tests, but this definitely fixes the primary failure we were seeing on Helios. Thanks for digging into it!

@davepacheco (Collaborator) commented

Yeah, I'm pretty worried 5 seconds wouldn't be enough in real situations and that even 15 might not be enough. I can see the need to eliminate flakes, so I guess we should prioritize #8603. But maybe we should wait for R16 to ship (or, rather, for the gates to open for R17) before landing this so we don't introduce this risk right before shipping.

@smklein (Collaborator, Author) commented Aug 6, 2025

With the discoveries in #8762, I'm not going to proceed with this PR.

As a short-term fix: if you see flakiness locally, you can run `pfexec ipadm set-prop -t -p _rst_sent_rate=100000 tcp` (or use some other high number; drop the `-t` flag to make it persist, if you're brave).

As for main, and the jobs running through buildomat - do y'all have strong opinions? Should the test suite be changing this value, or do y'all just want to live with the flake until we get optimizations for this localhost network traffic landed in Helios?

Linked issue: Many tests flaky after #8603 (#8756)