Fix timing race condition in AwaitAssertAsync causing premature timeouts #7976

Closed
Aaronontheweb wants to merge 5 commits into akkadotnet:dev from Aaronontheweb:claude-wt-StartEntitySpec.StartEntity_while_the_entity_is_waiting_for_restart_should_restart_it_immediately

Conversation

@Aaronontheweb
Member

Summary

Fixed a 6-year-old check-then-act race condition in TestKit's AwaitAssertAsync that could cause tests to time out prematurely after only 1 attempt instead of the expected number of retries within the timeout window.

Problem

The bug was introduced in PR #4075 (Dec 2019) when the async API was added to TestKit. The timeout check occurred BEFORE Task.Delay(), creating a timing window where thread scheduling delays, GC pauses, or system load could push the actual elapsed time past the timeout even though the pre-check indicated that a retry should occur.

This manifested as flaky test failures like:

Expected cluster.ReadView.Members.Count(m => m.Status == MemberStatus.Up) to be 1, but found 0.
AwaitAssert failed, timeout [00:00:03] is over after [1] attempts and [00:00:04.6625130] elapsed time

Only 1 attempt in 4.6 seconds instead of ~30 attempts in 3 seconds.
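
The failure mode above can be reproduced with a small deterministic simulation. This is only a sketch under stated assumptions, not the TestKit source: `RetrySimulation`, `CheckBeforeDelay`, and `CheckAfterDelay` are hypothetical names, the clock is simulated, and the delay is always the full interval rather than being capped at the remaining time.

```csharp
using System;

// Hypothetical simulation for illustration; not the actual TestKit source.
public static class RetrySimulation
{
    // Old shape: the deadline is checked right after a failed attempt, before the delay.
    public static (bool Passed, int Attempts) CheckBeforeDelay(
        Func<int, (bool Ok, TimeSpan Cost)> attempt, TimeSpan max, TimeSpan interval)
    {
        var now = TimeSpan.Zero; // simulated clock, starts at zero
        var attempts = 0;
        while (true)
        {
            attempts++;
            var (ok, cost) = attempt(attempts);
            now += cost; // the attempt itself consumes time
            if (ok) return (true, attempts);
            if (now + interval >= max) return (false, attempts); // gives up with no further try
            now += interval; // simulated delay between attempts
        }
    }

    // Shape described in this PR: delay first, then check, then make one last attempt when overdue.
    public static (bool Passed, int Attempts) CheckAfterDelay(
        Func<int, (bool Ok, TimeSpan Cost)> attempt, TimeSpan max, TimeSpan interval)
    {
        var now = TimeSpan.Zero;
        var attempts = 0;
        while (true)
        {
            attempts++;
            var (ok, cost) = attempt(attempts);
            now += cost;
            if (ok) return (true, attempts);
            now += interval; // simulated delay
            if (now > max)
            {
                attempts++;
                var (finalOk, _) = attempt(attempts); // one last chance before giving up
                return (finalOk, attempts);
            }
        }
    }
}
```

With a first attempt that fails after 4.6 simulated seconds and later attempts that pass, a 3-second timeout, and a 100 ms interval, `CheckBeforeDelay` returns `(false, 1)` while `CheckAfterDelay` returns `(true, 2)`.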

Changes

  • Moved the timeout check to AFTER Task.Delay() in both AwaitAssertAsync overloads
  • Changed the boundary condition from >= to > to allow a final retry when exactly at the timeout
  • Re-ran the assertion on timeout to propagate the actual exception instead of a generic timeout
  • Added explanatory comments documenting the race condition

Impact

This benefits all tests across the entire Akka.NET suite that use AwaitAssert(), preventing false timeout failures under load.

Testing

Verified that StartEntitySpec.StartEntity_while_the_entity_is_waiting_for_restart_should_restart_it_immediately now passes with the default 3-second timeout (previously required 10 seconds as a workaround).

@Aaronontheweb Aaronontheweb added the akka-testkit Akka.NET Testkit issues label Dec 22, 2025
Fixed a check-then-act race condition in TestKit's AwaitAssertAsync that
could cause tests to timeout prematurely with only 1 retry attempt instead
of the expected number of retries within the timeout window.

The bug was introduced in PR akkadotnet#4075 (Dec 2019) when async API was added to
TestKit. The timeout check occurred BEFORE Task.Delay(), creating a timing
window where thread scheduling delays, GC pauses, or system load could
cause the actual elapsed time to exceed the timeout even though the
pre-check indicated a retry should occur.

Changes:
- Move timeout check to AFTER Task.Delay() in both AwaitAssertAsync overloads
- Changed boundary condition from >= to > to allow final retry when at exact timeout
- Re-run assertion on timeout to propagate the actual exception instead of generic timeout
- Added explanatory comments documenting the race condition

This fixes flaky test failures in cluster tests where initialization under
load would take longer than expected, causing false timeout failures.

Fixes StartEntitySpec.StartEntity_while_the_entity_is_waiting_for_restart_should_restart_it_immediately
@Aaronontheweb Aaronontheweb force-pushed the claude-wt-StartEntitySpec.StartEntity_while_the_entity_is_waiting_for_restart_should_restart_it_immediately branch from de11f30 to 5a234b6 Compare December 23, 2025 15:54
@Aaronontheweb Aaronontheweb enabled auto-merge (squash) December 23, 2025 15:54
Member Author

@Aaronontheweb Aaronontheweb left a comment

Detailed my changes

catch(Exception)
{
    var stopped = Now + t;
    if (stopped >= stop)
Member Author

The problem here is that we don't get 1 last chance to test the assertion if the first attempt failed and took too long

await Task.Delay(t, cancellationToken);

// Check if we've exceeded the timeout AFTER sleeping
if (Now > stop)
Member Author

Now, after the delay has been completed, even if we're overdue on time we still check one final time on exit.

…he_entity_is_waiting_for_restart_should_restart_it_immediately
Contributor

@Arkatufus Arkatufus left a comment

The code smells a bit

Comment on lines 123 to 128
if (Now > stop)
{
    Sys.Log.Warning("AwaitAssert failed, timeout [{0}] is over after [{1}] attempts and [{2}] elapsed time", max, attempts, Now - start);
    // Re-run the assertion one final time to get the actual exception
    await assertion();
}
Contributor

My only problem is that this code logs a failure warning regardless of whether the extra assertion invocation fails or not.

Member Author

Yeah we should remove that

Member Author

There were some other errors in this code too: for example, if the final assertion succeeded, we'd end up running the loop again.
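
One way the two review points could be addressed together is sketched below: warn only when the final attempt actually fails, and return as soon as it succeeds. This is an assumption-laden illustration rather than the code from this PR. `AwaitAssertSketch.RunAsync` is a hypothetical helper, `Console.Error` stands in for `Sys.Log.Warning`, a `Stopwatch` stands in for TestKit's `Now`, and the delay is not capped at the remaining time.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not the actual TestKit implementation.
public static class AwaitAssertSketch
{
    // Returns the number of attempts made before the assertion passed.
    public static async Task<int> RunAsync(
        Func<Task> assertion, TimeSpan max, TimeSpan interval,
        CancellationToken cancellationToken = default)
    {
        var clock = Stopwatch.StartNew();
        var attempts = 0;
        while (true)
        {
            attempts++;
            try
            {
                await assertion();
                return attempts; // passed: stop retrying
            }
            catch (Exception)
            {
                // Swallow the failure for now; the deadline is checked after the delay.
            }

            await Task.Delay(interval, cancellationToken);

            if (clock.Elapsed > max)
            {
                attempts++;
                try
                {
                    await assertion();
                    return attempts; // final attempt passed: no warning, no further looping
                }
                catch (Exception)
                {
                    // Warn only when the final attempt really failed.
                    Console.Error.WriteLine(
                        $"AwaitAssert failed, timeout [{max}] is over after [{attempts}] attempts and [{clock.Elapsed}] elapsed time");
                    throw; // propagate the actual assertion exception
                }
            }
        }
    }
}
```

Because the final attempt sits in its own try/catch, the caller sees the real assertion exception on failure, and a late success exits the loop instead of falling through to another iteration.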

Aaronontheweb and others added 3 commits December 29, 2025 12:57
…he_entity_is_waiting_for_restart_should_restart_it_immediately
…he_entity_is_waiting_for_restart_should_restart_it_immediately
auto-merge was automatically disabled January 9, 2026 17:23

Pull request was closed

@Aaronontheweb Aaronontheweb deleted the claude-wt-StartEntitySpec.StartEntity_while_the_entity_is_waiting_for_restart_should_restart_it_immediately branch January 9, 2026 17:23