battermann
Contributor

Checklist

  • Add a new entry in an appropriate subdirectory of changelog.d
  • Read and follow the PR guidelines

@zebot added the ok-to-test label (Approved for running tests in CI, overrides not-ok-to-test if both labels exist) on Aug 28, 2025
@jschaul
Member

jschaul commented Aug 28, 2025

  1. run the docker-compose services, either on the latest commit or on a previous commit using toxiproxy
  2. start the test: TEST_INCLUDE=testRabbitMQConnection make ci-safe package=integration
  3. when the prompt appears, break the connection in some way. Use either ./toxiproxy-rabbitmq-terminate.sh or haproxy (for haproxy you first need to fix the configuration; somehow the haproxy setup isn't quite working yet on that latest commit)
  4. re-establish a connection (press enter in the toxiproxy script)
  5. press enter in the integration test so it attempts to send another message. If the test is green then, well, you could not reproduce an issue.

@jschaul force-pushed the WPB-19422-rabbit-mq-connection-loss-leads-to-backend-notification-pusher-getting-stuck-2 branch from 37a7252 to 835591a on August 28, 2025 15:49
Contributor

@lwille lwille Aug 28, 2025

In this test, we seem to be doing things in discrete steps:

sending + consuming
(kill connection + wait for reconnect)
sending + consuming

I think that's not really what happens in the background worker, or in gundeck for forwarding notifications. Those processes would be waiting for new AMQP messages all the time, and suddenly the disconnect would kick in.

What happens if the RabbitMQ connection is killed while sending/consuming messages?

Contributor Author

It's correct that we test that sending and receiving works, then kill RabbitMQ, restart it and wait for reconnect and then test sending and receiving again.

But I don't understand what you mean by "that's not what happens in background worker". The background worker is always connected to the queue, including when the outage starts. Or more precisely, it's a thread in the background worker that should constantly and indefinitely try to reconnect, or restart if killed.
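
To make the "constantly and indefinitely try to reconnect" behaviour concrete, here is a minimal sketch in Python. It is not the background worker's actual code, which this thread doesn't show. The names `backoff_delays` and `supervise`, the injected `run_consumer`, `sleep` and `should_stop` callbacks, and the capped exponential backoff are all assumptions made for illustration.

```python
from typing import Callable, Iterator


def backoff_delays(base: float = 0.5, cap: float = 30.0) -> Iterator[float]:
    """Yield delays that double each time, never exceeding `cap` seconds."""
    delay = base
    while True:
        yield delay
        delay = min(cap, delay * 2)


def supervise(run_consumer: Callable[[], None],
              sleep: Callable[[float], None],
              should_stop: Callable[[], bool]) -> int:
    """Restart `run_consumer` whenever it loses its connection.

    `run_consumer` is a hypothetical callback that connects and blocks while
    consuming; it raises ConnectionError when the broker goes away.
    Returns the number of restarts caused by connection errors.
    """
    restarts = 0
    delays = backoff_delays()
    while not should_stop():
        try:
            run_consumer()
            delays = backoff_delays()  # clean return: start backoff from scratch
        except ConnectionError:
            restarts += 1
            sleep(next(delays))  # wait before the next connection attempt
    return restarts
```

In a real worker `should_stop` would only become true on shutdown, so the loop keeps retrying for as long as the process lives. Passing `sleep` in as a parameter keeps the loop testable without real waiting.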

Gundeck is not involved: from the logs we can see that it's the notification-pusher that gets stuck while trying to establish a connection to RabbitMQ.

Trying to kill RabbitMQ while the bg worker is consuming messages is an interesting idea. What we do is kill the broker while the bg worker is connected. However, no messages are being processed when the connection is killed, because the queue is empty. Killing the broker while the queue contains unprocessed messages is technically a bit difficult, because when we take RabbitMQ down we also cannot produce messages. Still, maybe it's worthwhile to try to find a workaround and test this?
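
One possible shape for such a test, sketched in Python against an in-memory stand-in rather than a real broker: fill the queue first, then cut only the consumer's connection in the middle of the batch, so the broker keeps its messages. `FakeBroker` and `consume_all` are hypothetical names and this is not the integration test from this PR. The model assumes that messages which were delivered but not yet acknowledged are requeued when the connection drops, which is how RabbitMQ treats unacked deliveries on a closed connection.

```python
from collections import deque


class FakeBroker:
    """In-memory stand-in for a queue with acknowledgements (not RabbitMQ)."""

    def __init__(self):
        self.ready = deque()   # messages waiting for delivery
        self.unacked = {}      # delivery tag -> message delivered but not acked
        self.connected = True
        self._next_tag = 0

    def publish(self, msg):
        self.ready.append(msg)

    def deliver(self):
        """Hand out the next message as (tag, msg), or None if the queue is empty."""
        if not self.connected:
            raise ConnectionError("connection lost")
        if not self.ready:
            return None
        self._next_tag += 1
        self.unacked[self._next_tag] = self.ready.popleft()
        return self._next_tag, self.unacked[self._next_tag]

    def ack(self, tag):
        if not self.connected:
            raise ConnectionError("connection lost")
        del self.unacked[tag]

    def kill_connection(self):
        """Drop the connection and put unacked messages back at the front."""
        self.connected = False
        for tag in sorted(self.unacked, reverse=True):
            self.ready.appendleft(self.unacked.pop(tag))

    def restore_connection(self):
        self.connected = True


def consume_all(broker, handle, kill_after=None):
    """Drain the queue; optionally cut the connection once, just before an ack."""
    delivered = 0
    while True:
        try:
            item = broker.deliver()
            if item is None:
                return
            tag, msg = item
            delivered += 1
            handle(msg)
            if kill_after is not None and delivered == kill_after:
                broker.kill_connection()  # outage hits mid-processing
            broker.ack(tag)
        except ConnectionError:
            broker.restore_connection()  # stands in for the worker's reconnect
            kill_after = None            # only simulate a single outage
```

With five messages published up front and `kill_after=2`, the second message is handled twice (once before the outage, once after redelivery) and the rest exactly once. A test along these lines would therefore check for at-least-once processing rather than exactly-once.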

@battermann force-pushed the WPB-19422-rabbit-mq-connection-loss-leads-to-backend-notification-pusher-getting-stuck-2 branch from 835591a to 082947a on September 1, 2025 13:43