Fix flaky test TestClusterJoinAndReconnect/TestTLSConnection again (#3278) #4635

SoloJacobs · 2025-10-27T17:59:21Z

Repeating my commit message for convenience:

Two commits have already been merged to address the flakiness of this test.
However, I can still reproduce the issue using:

go test -failfast -run "TestClusterJoinAndReconnect/TestJoinLeave" -count 600 ./cluster

An easy way to increase the failure rate is to increase CPU load, e.g.,

yes > /dev/null & yes > /dev/null & yes > /dev/null & yes > /dev/null &

On my machine the combination of these commands fails every time.
The underlying reason for the failure is that the test only waits for p2 to be ready, but this does not reflect whether p has updated its memberlist.

Edit due to the comments I got: ~~We can ensure that p has updated its memberlist by waiting for NotifyJoin to be called. The test is now slightly slower, 0.8 seconds on my machine.~~ The test now retries the assertions in question using Eventually.

Side-note

I am new to the project and feedback is much appreciated. It was hard to find something that avoids spin looping and does not change the API of Peer. Also, the WaitReady and Settle calls are now redundant, since we are actually waiting for NotifyJoin. But leaving them in does not hurt either.

Fixes #3287

SoloJacobs · 2025-10-27T18:01:12Z

@gotjosh I think you can review this best.

cluster/cluster.go

cluster/cluster_test.go

SuperQ

Great find! Thanks for fixing this.

Two commits have already merged in order to address the flakiness of this test.
However, I can still reproduce the issue using:

```sh
go test -failfast -run "TestClusterJoinAndReconnect/TestJoinLeave" -count 600 ./cluster
```

An easy way to increase the failure rate is to increase CPU load, e.g.,

```sh
yes > /dev/null &; yes > /dev/null &; yes > /dev/null &; yes > /dev/null &
```

On my machine the combination of these commands fails every time.
The underlying reason for the failure is that the test only waits for `p2` to be ready, but this does not reflect whether `p` has updated its memberlist.

The test now retries the assertions in question using `Eventually`.

Fixes prometheus#3287

Signed-off-by: Solomon Jacobs <[email protected]>

Don't modify cluster for TestClusterJoinAndReconnect/TestTLSConnection

sysadmind · 2025-11-05T14:48:30Z

I don't think that this resolved all of the flaky tests. https://github.com/prometheus/alertmanager/actions/runs/19104722659/job/54585342806

SoloJacobs · 2025-11-05T15:15:40Z

@sysadmind That seems very likely; I didn't play around with TestSetPeerNames. I will follow up with a commit to apply the change to all the tests in cluster_test.go.

siavashs suggested changes Nov 5, 2025

cluster/cluster.go

cluster/cluster_test.go

SoloJacobs force-pushed the race-cond branch from 5a781aa to 2c4fea1 Compare November 5, 2025 10:31

siavashs approved these changes Nov 5, 2025

SuperQ approved these changes Nov 5, 2025

TheMeier approved these changes Nov 5, 2025

SoloJacobs force-pushed the race-cond branch from 2c4fea1 to 8e21913 Compare November 5, 2025 12:05

SuperQ merged commit 88b4e13 into prometheus:main Nov 5, 2025
7 checks passed

SoloJacobs deleted the race-cond branch November 5, 2025 15:19
