Fix timing race condition in AwaitAssertAsync causing premature timeouts#7976
Conversation
Fixed a check-then-act race condition in TestKit's AwaitAssertAsync that could cause tests to timeout prematurely with only 1 retry attempt instead of the expected number of retries within the timeout window. The bug was introduced in PR akkadotnet#4075 (Dec 2019) when async API was added to TestKit. The timeout check occurred BEFORE Task.Delay(), creating a timing window where thread scheduling delays, GC pauses, or system load could cause the actual elapsed time to exceed the timeout even though the pre-check indicated a retry should occur. Changes: - Move timeout check to AFTER Task.Delay() in both AwaitAssertAsync overloads - Changed boundary condition from >= to > to allow final retry when at exact timeout - Re-run assertion on timeout to propagate the actual exception instead of generic timeout - Added explanatory comments documenting the race condition This fixes flaky test failures in cluster tests where initialization under load would take longer than expected, causing false timeout failures. Fixes StartEntitySpec.StartEntity_while_the_entity_is_waiting_for_restart_should_restart_it_immediately
de11f30 to
5a234b6
Compare
Aaronontheweb
left a comment
There was a problem hiding this comment.
Detailed my changes
| catch(Exception) | ||
| { | ||
| var stopped = Now + t; | ||
| if (stopped >= stop) |
There was a problem hiding this comment.
The problem here is that we don't get 1 last chance to test the assertion if the first attempt failed and took too long
| await Task.Delay(t, cancellationToken); | ||
|
|
||
| // Check if we've exceeded the timeout AFTER sleeping | ||
| if (Now > stop) |
There was a problem hiding this comment.
Now, after the delay has been completed, even if we're overdue on time we still check one final time on exit.
…he_entity_is_waiting_for_restart_should_restart_it_immediately
| if (Now > stop) | ||
| { | ||
| Sys.Log.Warning("AwaitAssert failed, timeout [{0}] is over after [{1}] attempts and [{2}] elapsed time", max, attempts, Now - start); | ||
| // Re-run the assertion one final time to get the actual exception | ||
| await assertion(); | ||
| } |
There was a problem hiding this comment.
My only problem is that this code logs a fail warning regardless if the extra assertion invocation failed or not.
There was a problem hiding this comment.
Yeah we should remove that
There was a problem hiding this comment.
There were some other errors with this code too, like if the final assertion succeeded we'd end up running again
…he_entity_is_waiting_for_restart_should_restart_it_immediately
…he_entity_is_waiting_for_restart_should_restart_it_immediately
Pull request was closed
Summary
Fixed a 6-year-old check-then-act race condition in TestKit's
AwaitAssertAsyncthat could cause tests to timeout prematurely with only 1 retry attempt instead of the expected number of retries within the timeout window.Problem
The bug was introduced in PR #4075 (Dec 2019) when async API was added to TestKit. The timeout check occurred BEFORE
Task.Delay(), creating a timing window where thread scheduling delays, GC pauses, or system load could cause the actual elapsed time to exceed the timeout even though the pre-check indicated a retry should occur.This manifested as flaky test failures like:
Only 1 attempt in 4.6 seconds instead of ~30 attempts in 3 seconds.
Changes
Task.Delay()in bothAwaitAssertAsyncoverloads>=to>to allow final retry when at exact timeoutImpact
This benefits all tests across the entire Akka.NET suite that use
AwaitAssert(), preventing false timeout failures under load.Testing
Verified that
StartEntitySpec.StartEntity_while_the_entity_is_waiting_for_restart_should_restart_it_immediatelynow passes with the default 3-second timeout (previously required 10 seconds as a workaround).