Makes health checks flexible so they don't tear down connections under heavy load #328

anarthal · 2025-10-09T10:27:37Z

Adds error::write_timeout

close #104

anarthal · 2025-10-09T10:36:16Z

@mzimbres this is a very early prototype, and probably far from the implementation you were expecting. I want to have some discussion to make sure the direction I've taken doesn't have any flaws.

The idea is that the health checker task disappears, with health checks performed by the writer and reader directly.

On the writer side:

When there is data to be written, the writer writes it and does no health checks.
If more than config::health_check_interval elapses without anything to write, a PING is added to the multiplexer. The PING is represented as a multiplexer::elem but does not go through async_exec.
When writing, if we spend more than config::health_check_interval without writing a single byte, we consider the connection broken. (This actually requires a bit more work than what I've written so far, but it should be easy to do.)
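
The three writer rules above can be sketched as two small decision functions. This is only an illustration of the idea, not the code in this PR: writer_step, on_idle and on_write_progress are hypothetical names, and the real writer is a state machine driving asynchronous operations.

```cpp
#include <chrono>
#include <cstddef>

using duration = std::chrono::steady_clock::duration;

// Hypothetical names for illustration; not the library's actual types.
enum class writer_step { write_pending, enqueue_ping, keep_waiting, fail_write_timeout };

// Decision taken while the writer is idle (no write in progress).
inline writer_step on_idle(bool has_pending_data, duration idle_for, duration health_check_interval)
{
    if (has_pending_data)
        return writer_step::write_pending;  // regular traffic: no health check needed
    if (idle_for > health_check_interval)
        return writer_step::enqueue_ping;   // nothing to write for too long: add a PING
    return writer_step::keep_waiting;
}

// Decision taken while a write is in progress.
inline writer_step on_write_progress(
    std::size_t bytes_written,
    duration since_last_progress,
    duration health_check_interval)
{
    // Not a single byte written within the interval: treat the connection as broken.
    if (bytes_written == 0 && since_last_progress > health_check_interval)
        return writer_step::fail_write_timeout;
    return writer_step::write_pending;
}
```

Under this sketch, a busy connection never enqueues a PING, which is the behaviour the PR title is after.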

On the reader side:

We know that the connection may stay idle for at most config::health_check_interval. After that time, a PING is issued. If we don't receive any data within that period plus another config::health_check_interval, the connection is definitely dead. This leaves us with a read timeout of 2 * config::health_check_interval. This is a bit of a heuristic, but I think it is good enough.
If a PING containing an error is received, its adapter yields an error. This stops the reader and triggers a reconnection.
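
As a worked example of the arithmetic above, here is a minimal sketch that derives the read timeout and the resulting deadline. read_timeout and read_deadline are hypothetical helpers written for this comment, not functions in the library.

```cpp
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Hypothetical helper: the idle period before a PING is issued, plus the
// period the server gets to answer that PING.
constexpr clock_type::duration read_timeout(clock_type::duration health_check_interval)
{
    return 2 * health_check_interval;
}

// Deadline for the next read, measured from the last time data arrived.
constexpr clock_type::time_point read_deadline(
    clock_type::time_point last_data,
    clock_type::duration health_check_interval)
{
    return last_data + read_timeout(health_check_interval);
}
```

With a health_check_interval of 2 seconds, the reader would give up 4 seconds after the last byte it received.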

Let me know what you think.

anarthal · 2025-10-16T16:07:35Z

I've kept the cancel_after in the reader approach. I think it is roughly equivalent to checking a timestamp updated by the reader. If I understood your approach, we'd do the following:

We have a timestamp in the connection's state, updated by the reader every time some data is received (i.e. after an async_read_some completes).
If no data is received for a certain time (let's call this value read_timeout), the connection is considered unhealthy and re-connected.
The writer checks after every async_write_some operation, and after waiting for more data to be available, whether the timestamp is older than read_timeout. If it is, the connection should fail.
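
To make sure we are talking about the same thing, this is roughly how I picture the timestamp-based approach. It is a sketch under my own assumptions: connection_health and its members are hypothetical names, and the time is passed in explicitly to keep the example deterministic.

```cpp
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Hypothetical state shared by the reader and the writer.
struct connection_health {
    clock_type::time_point last_read;

    // Reader side: called after every read that delivers data.
    void on_data_received(clock_type::time_point now) { last_read = now; }

    // Writer side: polled after each write and after each wait for more data.
    bool is_stale(clock_type::time_point now, clock_type::duration read_timeout) const
    {
        return now - last_read > read_timeout;
    }
};
```

Note that in this sketch a stale connection is only detected when the writer happens to call is_stale, so detection depends on how often the writer wakes up.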

My line of thought is that this really imposes a timeout on reads, but implemented using polling. I think it can be less reliable than just using asio::cancel_after.

I've chosen this read_timeout to be 2 * health_check_interval. When the connection is idle, no data is read or written during health_check_interval. A PING will be issued after health_check_interval. This leaves the server another health_check_interval to respond to the PING.

I'm open to changing the implementation if there is something I've missed here, let me know.

I've also updated the reader and writer actions to be variant-like, so they take less stack space. I've used unions with a lean class interface to make it lighter at compile time.
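
For reference, a variant-like action built on a union can look like the sketch below. It is a simplified illustration, not the actual class in this PR: the alternatives, the int error code and all member names are assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical, simplified action type: a tag plus a union of payloads.
class writer_action {
public:
    enum class kind { done, write_some, wait };
    using duration = std::chrono::steady_clock::duration;

    // Named constructors keep the tag and the active union member in sync.
    static writer_action done(int error_code) { return writer_action(kind::done, error_code); }
    static writer_action write_some(duration timeout) { return writer_action(kind::write_some, timeout); }
    static writer_action wait(duration timeout) { return writer_action(kind::wait, timeout); }

    kind type() const { return kind_; }

    // Accessors assert that the requested member is the active one.
    int error() const
    {
        assert(kind_ == kind::done);
        return error_;
    }
    duration timeout() const
    {
        assert(kind_ != kind::done);
        return timeout_;
    }

private:
    writer_action(kind k, int error_code) : kind_(k), error_(error_code) {}
    writer_action(kind k, duration timeout) : kind_(k), timeout_(timeout) {}

    kind kind_;
    union {
        int error_;         // active when kind_ == kind::done
        duration timeout_;  // active otherwise
    };
};
```

The payloads share storage, so the object is about as large as the tag plus its biggest alternative, and the header avoids pulling in std::variant.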

mzimbres · 2025-10-16T20:43:38Z

include/boost/redis/config.hpp


-   /// Message used by the health-checker in @ref boost::redis::basic_connection::async_run.
+   /// Message used by `PING` commands sent by the health checker.
   std::string health_check_id = "Boost.Redis";


Sidenote: When the user does not provide one id, I think we should format this to contain the id returned by the HELLO command.

What does "does not provide one id" mean here? Leaving health_check_id at its default? Making health_check_id empty?

The change is not trivial because it requires coordination between the setup task (which may not contain a HELLO at all) and the health checker.

I mean, if the user does not provide their own health_check_id, the connection could set it to "Boost.Redis (conn-id)". At the moment we only set it to "Boost.Redis".

We'd need to detect whether the setup request contains a HELLO, and then parse the response and update the PING ID. Do you think it's worth it?
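
For the sake of discussion, the formatting part of this suggestion could be sketched as below. It assumes that "does not provide" means "left at the default value" and that the id from the HELLO response has already been parsed into an integer. make_ping_id is a hypothetical helper, and the coordination with the setup task is not covered.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper: picks the message sent with health-check PINGs.
// configured:    the value of config::health_check_id
// hello_conn_id: the id taken from the HELLO response, if a HELLO was sent
inline std::string make_ping_id(
    const std::string& configured,
    std::optional<std::int64_t> hello_conn_id)
{
    static const std::string default_id = "Boost.Redis";

    // Keep user-provided ids untouched, and fall back to the configured
    // value when no connection id is available.
    if (configured != default_id || !hello_conn_id)
        return configured;

    return default_id + " (" + std::to_string(*hello_conn_id) + ")";
}
```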

include/boost/redis/connection.hpp

mzimbres · 2025-10-16T21:26:08Z

I am still studying the code, but I think the writer offset should be moved down to the multiplexer, where the write buffer is located. The writer FSM can then call mpx.commit_write_some(bytes). Then the writer_op wouldn't need this

auto buf = asio::buffer(conn->st_.mpx.get_write_buffer().substr(act.write_offset()));

because the multiplexer knows about the offset, so conn->st_.mpx.get_write_buffer() would already be correct.
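
A minimal sketch of what I have in mind, assuming a much simplified multiplexer. Only get_write_buffer and commit_write_some are names from this discussion; set_write_buffer, the members and the bool return value are made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical, simplified multiplexer that owns both the write buffer
// and the offset of the bytes already written.
class multiplexer {
public:
    // Illustration only: replaces the pending data and resets the offset.
    void set_write_buffer(std::string data)
    {
        write_buffer_ = std::move(data);
        write_offset_ = 0;
    }

    // Returns only the part of the buffer that still has to be written.
    std::string_view get_write_buffer() const
    {
        return std::string_view(write_buffer_).substr(write_offset_);
    }

    // Records a (possibly short) write. Returns true once everything
    // has been written (the return value is an assumption of this sketch).
    bool commit_write_some(std::size_t bytes)
    {
        write_offset_ = std::min(write_offset_ + bytes, write_buffer_.size());
        return write_offset_ == write_buffer_.size();
    }

private:
    std::string write_buffer_;
    std::size_t write_offset_ = 0;
};
```

With this, the writer can pass the result of get_write_buffer() directly to asio::buffer without tracking an offset of its own.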

anarthal · 2025-10-17T10:25:28Z

> I am still studying the code, but I think the writer offset should be moved down to the multiplexer, where the write buffer is located. The writer FSM can then call mpx.commit_write_some(bytes). Then the writer_op wouldn't need this
> auto buf = asio::buffer(conn->st_.mpx.get_write_buffer().substr(act.write_offset()));
> because the multiplexer knows about the offset, so conn->st_.mpx.get_write_buffer() would already be correct.

I think this is a very good idea. I will work on it.

anarthal · 2025-10-17T11:24:36Z

Comments applied.

anarthal force-pushed the feature/flexible-health-checks branch 2 times, most recently from 7ee4d55 to c2fa491 Compare October 15, 2025 09:46

anarthal mentioned this pull request Oct 15, 2025

Moves logging into reader_fsm #332

Merged

anarthal added 26 commits October 15, 2025 17:38

Rebase initial impl on develop

17dc0a9

Use connection_state in the writer

c405188

writer timeouts in actions

18267e2

Fix possible problems with no timeouts

4f0ada4

make action a variant

3fd35fe

Make timeout part of the read action

0edc958

Initial test

884799b

refactor

60d957e

Fix reader tests

e2fd373

test health checks disabled

b06c554

test read timeout

80ddad9

Fix comment

d3da431

simplify writer

9435195

Make writer tests build

9ecbdb4

stronger writer_action interface

daa7903

Make writer_action use offset

03f470d

Fix writer tests

1d44ac4

Fix a partial success problem

235ac67

short writes test

7df92f3

add const

0a79bf1

rework writer 1

1041496

write_timeout error

75d9df9

rework logging

9c5f965

write timeout test

df7554d

ping success

678f0c2

ping error

87c4262

anarthal added 3 commits October 16, 2025 17:47

Docs

843ae9e

Remove some health check disables in tests

b9a44b8

health checks disabled

5ddcdd5

anarthal force-pushed the feature/flexible-health-checks branch from c2fa491 to 5ddcdd5 Compare October 16, 2025 15:57

anarthal marked this pull request as ready for review October 16, 2025 15:59

anarthal requested a review from mzimbres October 16, 2025 15:59

mzimbres reviewed Oct 16, 2025

include/boost/redis/connection.hpp (outdated)

mzimbres reviewed Oct 16, 2025

include/boost/redis/connection.hpp (outdated)

anarthal added 10 commits October 17, 2025 12:26

Remove unnecessary ping_resp

aaccbb4

Use cancel_at

e448731

include cleanup

b2c150d

Move the write offset to the multiplexer

2af5702

Fix multiplexer tests

a8a7a6a

short writes test

65e729d

Remove unused multiplexer::is_writing

a052493

Fix exec_fsm tests

e5ef3ff

Fix writer fsm tests

b9ee182

Fix failing multiplexer tests

3c24417

anarthal added 2 commits October 17, 2025 13:40

Remove unused function

0028adf

Missing includes

0a0d5a6

anarthal mentioned this pull request Oct 17, 2025

Moves the setup request execution to run_fsm #333

Merged

mzimbres approved these changes Oct 18, 2025

anarthal merged commit 2b09ecb into boostorg:develop Oct 20, 2025
17 checks passed

anarthal deleted the feature/flexible-health-checks branch October 20, 2025 13:29
