Replies: 1 comment
@spyroska I don't see how the fact that these messages have a TTL is relevant. If a Hutch process terminates completely abruptly, then all outstanding deliveries of consumers that use manual acknowledgements will be automatically re-queued, whether they use TTL or not. The "in-flight deliveries" problem has nothing to do with TTL, and messages with a short TTL can be deleted without being consumed in plenty of other scenarios. If that is unacceptable to you, don't use message TTL, or use a much higher message TTL and automatic acknowledgements.
Hello,
I would like to report an issue and submit a PR for your review. The issue description, a minimal working example, and a proposed solution are presented below.
Issue Description
What is the issue?
Messages that have an expiration property (https://www.rabbitmq.com/docs/ttl#per-message-ttl-in-publishers) can expire during Hutch's graceful shutdown.
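For context, a per-message TTL is set by the publisher via the expiration property. Below is a minimal Bunny sketch; the queue name, payload, and 5-second TTL are illustrative, not taken from the example repository:

```ruby
require "bunny"

conn = Bunny.new
conn.start

ch = conn.create_channel
q  = ch.queue("hutch_ttl_example", durable: true)

# expiration is a per-message TTL in milliseconds, passed as a string; if the
# message is not consumed within 5000 ms, the broker discards or dead-letters it.
ch.default_exchange.publish("payload", routing_key: q.name, expiration: "5000")

conn.close
```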
What versions have been verified to be affected?
Verified in hutch v1.3.1 running on ruby v3.2 and v3.3.
Not tested with jruby.
Not tested with older hutch versions.
What messages are affected?
Not all messages; only those that have an expiration property and that also arrive after SIGTERM is received.
Studying the graceful shutdown flow, we see that, after handling the signal, control returns to Hutch::Worker and the Hutch::Broker is stopped (https://github.com/ruby-amqp/hutch/blob/v1.3.1/lib/hutch/worker.rb#L28). In turn, ConsumerWorkPool#shutdown is called by the broker (https://github.com/ruby-amqp/hutch/blob/v1.3.1/lib/hutch/broker.rb#L229).
Why can't those messages be handled?
Because ConsumerWorkPool#shutdown removes all threads from the pool in a way that does not allow those messages to be handled by this process's consumers (https://github.com/ruby-amqp/bunny/blob/2.23.0/lib/bunny/consumer_work_pool.rb#L60). Since the internal queue is a FIFO data structure, any messages that get enqueued/submitted after those terminal messages are not going to be processed, because there will be no threads left.
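To illustrate the FIFO behaviour in isolation, here is a plain-Ruby sketch (not Bunny's actual ConsumerWorkPool) of a pool whose shutdown enqueues one terminator per thread; anything submitted after the terminators never runs:

```ruby
require "thread"

queue   = Queue.new
workers = 2.times.map do
  Thread.new do
    while (job = queue.pop)
      break if job == :terminate # conceptually, what shutdown enqueues per thread
      job.call
    end
  end
end

queue << -> { puts "delivery 1" }           # already enqueued before shutdown
workers.size.times { queue << :terminate }  # "shutdown": one terminator per worker
queue << -> { puts "delivery 2" }           # submitted after shutdown: never runs

workers.each(&:join)
# Prints "delivery 1" only; "delivery 2" stays in the queue with no threads left.
```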
Why do messages keep arriving in the ConsumerWorkPool during Hutch's graceful shutdown?
Because the Network I/O Activity thread reads AMQP framesets from the socket, handles them (https://github.com/ruby-amqp/bunny/blob/2.23.0/lib/bunny/reader_loop.rb#L90) and submits them to the ConsumerWorkPool (https://github.com/ruby-amqp/bunny/blob/2.23.0/lib/bunny/channel.rb#L1841).
Why do messages arrive on the socket during Hutch's graceful shutdown?
Because Bunny::Consumers are actively consuming messages. The RabbitMQ server keeps pushing messages to the socket because it has not been notified to stop.
How To Reproduce
See the minimal working example here: https://github.com/spyroska/hutch-rpc-timeout-example
See its README.md for detailed steps on how to both reproduce the issue and verify the proposed fix.
For future reference, an outline of steps that reproduce the issue follows:
Solution
The solution to this can be as simple as cancelling the Hutch::Broker channel consumers before shutting down the ConsumerWorkPool (i.e. before terminating the threads). By stopping the influx of messages at their source, we ensure the remaining messages are drained properly when the work pool is shut down and that no more messages are submitted during shutdown. As a matter of fact, cancelling consumers is the first step of the graceful shutdown sequence in kicks and sneakers (https://github.com/ruby-amqp/kicks/blob/3.2.0/lib/sneakers/queue.rb#L69).
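A minimal sketch of the proposed ordering, assuming a Bunny channel and its ConsumerWorkPool are at hand (the method and variable names are illustrative, not the PR itself):

```ruby
# Illustrative only: cancel consumers first, then shut the work pool down.
def graceful_stop(channel, work_pool)
  # 1. Cancel every consumer registered on the channel so RabbitMQ stops
  #    pushing new deliveries to this process.
  channel.consumers.each_value(&:cancel)

  # 2. Only then shut down the work pool: deliveries already enqueued are
  #    drained, and nothing new lands behind the pool's terminators.
  work_pool.shutdown
end
```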
Please consider the proposed PR, which applies a solution similar to the one in kicks; I hope it is applicable here as well.