Skip to content

Conversation

@SoloJacobs
Copy link
Contributor

@SoloJacobs SoloJacobs commented Oct 27, 2025

Repeating my commit message for convenience:

Two commits have already merged in order to address the flakiness of this test.
However, I can still reproduce the issue using:

go test -failfast -run "TestClusterJoinAndReconnect/TestJoinLeave" -count 600 ./cluster

An easy way to increase the failure rate is to increase CPU load, e.g.,

yes > /dev/null &; yes > /dev/null &; yes > /dev/null &; yes > /dev/null &

On my machine the combination of these commands fails every time.
The underlying reason for the failure is that the test only waits for p2 to be ready, but this does not reflect whether p has updated its memberlist.

Edit due to the comments I got: We can ensure that p has updated its memberlist by waiting for NotifyJoin to be called. The test is now slightly slower, 0.8 seconds on my machine. The test now retries the assertions in question using Eventually.

Side-note

I am new to the project and feedback is very appreciated. It was hard to find something that avoids spin looping and does not change the API of Peer. Also, the WaitReady and Settle calls are redundant now, since we are actually waiting for NotifyJoin. But leaving them in does not hurt either.

Fixes #3287

@SoloJacobs
Copy link
Contributor Author

@gotjosh I think you can review this best.

Copy link
Member

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find! Thanks for fixing this.

Two commits have already merged in order to address the flakiness of this test.

However, I can still reproduce the issue using:
```sh
go test -failfast -run "TestClusterJoinAndReconnect/TestJoinLeave" -count 600 ./cluster
```
An easy way to increase the failure rate is to increase CPU load, e.g.,
```sh
yes > /dev/null &; yes > /dev/null &; yes > /dev/null &; yes > /dev/null &
```
On my machine the combination of these commands fails every time.

The underlying reason for the failure is that the test only waits for
`p2` to be ready, but this does not reflect whether `p` has updated its
memberlist. The test now retries the assertions in question using
`Eventually`.

Fixes prometheus#3287

Signed-off-by: Solomon Jacobs <[email protected]>

Don't modify cluster for TestClusterJoinAndReconnect/TestTLSConnection
@SuperQ SuperQ merged commit 88b4e13 into prometheus:main Nov 5, 2025
7 checks passed
@sysadmind
Copy link
Contributor

I don't think that this resolved all of the flaky tests. https://github.com/prometheus/alertmanager/actions/runs/19104722659/job/54585342806

@SoloJacobs
Copy link
Contributor Author

@sysadmind That seems very likely, I didn't play around with TestSetPeerNames. I will follow-up with a commit to apply the change to all the tests in cluster_test.go.

@SoloJacobs SoloJacobs deleted the race-cond branch November 5, 2025 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky Test: TestClusterJoinAndReconnect/TestJoinLeave flake

5 participants