
Collection of test fixes (2025Q2, batch 2) (backport #14310) #14371


Closed
wants to merge 31 commits into from


@mergify mergify bot commented Aug 12, 2025

This pull request addresses the test flakes that appeared in the past couple of months. This is a follow-up to #14206 for failures that were not detected as part of the first pull request.


This is an automatic backport of pull request #14310 done by Mergify.

[Why]
If we use the list of reachable nodes, it includes nodes which are
currently booting. Trying to start vhost during their start can disturb
their initialization and has a great chance to fail anyway.

(cherry picked from commit d154e3d)
[Why]
In CI, it fails to fetch dependencies quite frequently, probably due to
some proxy in GitHub Actions.

(cherry picked from commit 4c8835f)
[Why]
This doesn't replicate the common_test logs layout, but it will be good
enough to let our GitHub Actions workflow upload the logs without
specific instructions in the workflow.

(cherry picked from commit 8d0f100)
…mit/1`

[Why]
It looks to be too short in CI, causing failures from time to time.

(cherry picked from commit fd4c365)
…t of connections

[Why]
In CI, we sometimes observe two tracked connections in the return value.
I don't know yet what they are. Could it be a client that reopened its
crashed connection and because stats are updated asynchronously, we get
two tracked connections for a short period of time?

(cherry picked from commit 53d0b14)
…nections

[Why]
In CI, we sometimes observe two tracked connections in the return value.
I don't know yet what they are. Could it be a client that reopened its
crashed connection and because stats are updated asynchronously, we get
two tracked connections for a short period of time?

(cherry picked from commit ed1cdb5)

# Conflicts:
#	deps/rabbit/test/per_user_connection_tracking_SUITE.erl
[Why]
In CI, we observed failures where the sender runs out of credits and
the test case doesn't expect that.

[How]
The `amqp_utils:send_messages/3` function already takes care of that.
Move this logic to a `send_message/2` function and use it in
`send_messages/3` and in the places that previously called
`amqp10_client:send_msg/2` directly.
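
As an illustration of the pattern described above (a single-message helper that waits for credit and retries, with the bulk helper built on top of it), here is a minimal Python sketch. It is not the Erlang code from `amqp_utils`; `FakeSender`, `InsufficientCredit`, `wait_for_credit` and the timeout value are hypothetical names and assumptions made up for this example.

```python
import time


class InsufficientCredit(Exception):
    """Raised by the hypothetical sender when it has no link credit left."""


class FakeSender:
    """Stand-in for a real client link (assumption: not a real client API)."""

    def __init__(self, credit):
        self.credit = credit
        self.sent = []

    def send(self, msg):
        if self.credit == 0:
            raise InsufficientCredit()
        self.credit -= 1
        self.sent.append(msg)

    def wait_for_credit(self, timeout):
        # A real client would block until the peer grants credit or the
        # timeout expires; this fake grants two credits immediately.
        self.credit += 2
        return True


def send_message(sender, msg, timeout=5.0):
    """Send one message, waiting for credit and retrying if the link has none."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            sender.send(msg)
            return
        except InsufficientCredit:
            remaining = deadline - time.monotonic()
            if remaining <= 0 or not sender.wait_for_credit(remaining):
                raise TimeoutError("no credit granted before the deadline")


def send_messages(sender, count):
    """Bulk helper built on send_message, so both paths share the retry logic."""
    for i in range(count):
        send_message(sender, f"msg-{i}")
```

Routing every send through the single-message helper means a test case that sends one message directly gets the same protection against running out of credit as the bulk path.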

(cherry picked from commit ef9f59c)

# Conflicts:
#	deps/rabbit/test/amqp_filter_sql_SUITE.erl
…n CI

[Why]
The `stream_pub_sub_metrics` test failed at least once in CI because the
`rabbitmq_stream_consumer_max_offset_lag` was 4 instead of the expected
3 on line 815.

I couldn't reproduce the problem so far.

[How]
The test case now logs the initial value of that metric at the beginning
of the test function. Hopefully this will give us some clue for the day
it fails again.

(cherry picked from commit 2bc8d11)
[Why]
I wonder if a previous test interferes with the metrics verified by this
test case. To be safer, execute it first and let's see what happens.

(cherry picked from commit 2674456)

# Conflicts:
#	deps/rabbitmq_prometheus/test/rabbit_prometheus_http_SUITE.erl
…licitly

[Why]
In CI, we observe that the channel hangs sometimes.
The implicit connection of rabbitmq_ct_client_helpers is quite fragile in
the sense that a test case can disturb the next one in some cases.

[How]
Let's use a dedicated connection and see if it fixes the problem.

(cherry picked from commit 17feaa1)
…ag that never existed

[Why]
The `rabbit_consistent_hash_exchange_raft_based_metadata_store` does not
seem to be a feature flag that ever existed according to the git
history. This causes the test case to always be skipped.

[How]
Simply remove the statement that enables this ghost feature flag.

(cherry picked from commit ea2689f)
[Why]
Maven took ages to fetch dependencies at least once in CI. The testsuite
failed because it reached the time trap limit.

[How]
Increase it from 2 to 5 minutes.

(cherry picked from commit 19ed249)
[Why]
It didn't handle them before and crashed later when it assumed the
return value was a list.

(cherry picked from commit 8bdbb0f)

# Conflicts:
#	deps/rabbitmq_mqtt/test/mqtt_shared_SUITE.erl
[Why]
The reason is the same as for commit
ffaf919. It should have been part of it
in fact, so an oversight from my end.

(cherry picked from commit 56b59c3)
… anonymous functions

[Why]
Before this change, when the `idle_time_out_on_server/1` test case was run first in the
shuffled test group, the test module was not loaded on the remote broker.
When the anonymous function was passed to meck and was executed, we got
the following crash on the broker:

    crasher:
      initial call: rabbit_heartbeat:'-heartbeater/2-fun-0-'/0
      pid: <0.704.0>
      registered_name: []
      exception error: {undef,
                           [{#Fun<amqp_client_SUITE.14.116163631>,
                             [#Port<0.45>,[recv_oct]],
                             []},
                            {rabbit_heartbeat,get_sock_stats,3,
                                [{file,"rabbit_heartbeat.erl"},{line,175}]},
                            {rabbit_heartbeat,heartbeater,3,
                                [{file,"rabbit_heartbeat.erl"},{line,155}]},
                            {proc_lib,init_p,3,
                                [{file,"proc_lib.erl"},{line,317}]},
                            {rabbit_net,getstat,[#Port<0.45>,[recv_oct]],[]}]}

This led to a failure of the test case later, when it waited for a
message from the connection.

We do the same in two other test cases where this is likely to happen
too.

[How]
Loading the module first fixes the problem.

(cherry picked from commit bd1978c)
[Why]
Relying on the return value of the queue deletion is fragile because the
policy is cleared asynchronously.

[How]
We now wait for the queues to reach the expected queue length, then we
delete them and ensure the length didn't change.

(cherry picked from commit efdec84)
[Why]
There is a frequent failure in CI and the fact that all test cases use
the same resource names does not help with debugging.

(cherry picked from commit 5936b3b)
[Why]
This should also help debug the failures we get in CI.

(cherry picked from commit fda663d)
[Why]
It failed at least once in CI. It should help us understand what went
on.

(cherry picked from commit 0a643ef)
[Why]
It didn't handle them before and crashed later when it assumed the
return value was a list.

(cherry picked from commit 5c1456b)

# Conflicts:
#	deps/rabbitmq_mqtt/test/auth_SUITE.erl
... when testing user limits

[How]
This is the same fix as the one for the vhost limits test case made in
commit 5aab965.

While here, fix a compiler warning about an unused variable.

(cherry picked from commit 02b1561)

# Conflicts:
#	deps/rabbitmq_mqtt/test/auth_SUITE.erl
…se2`

[Why]
The connection is about to be killed at the end of the test case. It's
not necessary to close it explicitly.

Moreover, on a slow environment like CI, the connection process might
have already exited when the test case tries to close it. In this case,
it fails with a `noproc` exception.

(cherry picked from commit 0601ef4)
[Why]
`gen_tcp:close/1` simply closes the connection and doesn't wait for the
broker to handle it. This sometimes causes the next test to fail
because, in addition to that test's new connection, the previous one's
process is still around, waiting for the broker to notice the close.

[How]
We now wait for the connection to be closed at the end of a test case,
and wait for the connection list to have a single element when we want
to query the connection name.
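
The "wait for the connection list to have a single element" step can be pictured with the following minimal Python sketch. It is an illustration only, not the code of the test suite; `await_single_connection`, `list_connections` and the timeout and interval defaults are hypothetical.

```python
import time


def await_single_connection(list_connections, timeout=30.0, interval=0.1):
    """Poll until exactly one connection is listed, then return it.

    `list_connections` is any zero-argument callable returning the current
    list of connections (hypothetical; supplied by the caller).
    """
    deadline = time.monotonic() + timeout
    while True:
        conns = list_connections()
        if len(conns) == 1:
            return conns[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"expected 1 connection, still seeing {len(conns)}")
        time.sleep(interval)
```

Polling up to a deadline, instead of reading the list once, tolerates the short window in which the previous test's connection is still listed.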

(cherry picked from commit eb8f631)
[Why]
It didn't handle them before and crashed later when it assumed the
return value was a list.

(cherry picked from commit 0e36184)

# Conflicts:
#	deps/rabbitmq_mqtt/test/cluster_SUITE.erl
…pic_dest`

[Why]
The `test_topic_dest` test case fails from time to time in CI. I don't
know why as there are no errors logged anywhere. Let's assume it's a
timeout a bit too short.

While here, apply the same change to `test_exchange_dest`.

(cherry picked from commit 5f520b8)
[Why]
I still don't know what causes the transient failures in this testsuite.
The AMQP connection is closed asynchronously, therefore the next test
case is already running when the connection finishes closing. I have no
idea if it causes trouble, but it makes the broker logs more difficult to read.

(cherry picked from commit 766ca19)
[Why]
I noticed the following error in a test case:

    error sending frame
    Traceback (most recent call last):
      File "/home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbitmq_stomp/test/python_SUITE_data/src/deps/stomp/transport.py", line 623, in send
        self.socket.sendall(encoded_frame)
    OSError: [Errno 9] Bad file descriptor

When the test suite succeeded, this error was not present. When it failed,
it was present. But I checked only one instance of each, which is not enough
to draw any conclusion about the relationship between this error and the
test case that fails later.

I have no idea which test case hits this error, so increase the
verbosity, in the hope we see the name of the test case running at the
time of this error.

(cherry picked from commit 5bfb7bc)
@mergify mergify bot added the conflicts label Aug 12, 2025
Author

mergify bot commented Aug 12, 2025

Cherry-pick of ed1cdb5 has failed:

On branch mergify/bp/v4.1.x/pr-14310
Your branch is ahead of 'origin/v4.1.x' by 5 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit ed1cdb598.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   deps/rabbit/test/per_user_connection_tracking_SUITE.erl

no changes added to commit (use "git add" and/or "git commit -a")

Cherry-pick of ef9f59c has failed:

On branch mergify/bp/v4.1.x/pr-14310
Your branch is ahead of 'origin/v4.1.x' by 6 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit ef9f59c58.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   deps/rabbit/test/amqp_utils.erl

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	deleted by us:   deps/rabbit/test/amqp_filter_sql_SUITE.erl

Cherry-pick of 2674456 has failed:

On branch mergify/bp/v4.1.x/pr-14310
Your branch is ahead of 'origin/v4.1.x' by 8 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit 267445680.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   deps/rabbitmq_prometheus/test/rabbit_prometheus_http_SUITE.erl

no changes added to commit (use "git add" and/or "git commit -a")

Cherry-pick of 8bdbb0f has failed:

On branch mergify/bp/v4.1.x/pr-14310
Your branch is ahead of 'origin/v4.1.x' by 14 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit 8bdbb0fc2.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   deps/rabbitmq_mqtt/test/mqtt_shared_SUITE.erl

no changes added to commit (use "git add" and/or "git commit -a")

Cherry-pick of 5c1456b has failed:

On branch mergify/bp/v4.1.x/pr-14310
Your branch is ahead of 'origin/v4.1.x' by 21 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit 5c1456b2d.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   deps/rabbitmq_mqtt/test/auth_SUITE.erl

no changes added to commit (use "git add" and/or "git commit -a")

Cherry-pick of 02b1561 has failed:

On branch mergify/bp/v4.1.x/pr-14310
Your branch is ahead of 'origin/v4.1.x' by 22 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit 02b156155.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   deps/rabbitmq_mqtt/test/auth_SUITE.erl

no changes added to commit (use "git add" and/or "git commit -a")

Cherry-pick of 0e36184 has failed:

On branch mergify/bp/v4.1.x/pr-14310
Your branch is ahead of 'origin/v4.1.x' by 25 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit 0e36184a6.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   deps/rabbitmq_mqtt/test/cluster_SUITE.erl

no changes added to commit (use "git add" and/or "git commit -a")

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mkuratczyk
Contributor

many cherry-picking failures. this will need to be a separate PR for v4.1.x

@mkuratczyk mkuratczyk closed this Aug 12, 2025