Skip to content

Conversation

@anarthal
Copy link
Collaborator

@anarthal anarthal commented Oct 9, 2025

Adds error::write_timeout

close #104

@anarthal
Copy link
Collaborator Author

anarthal commented Oct 9, 2025

@mzimbres this is a very early prototype, and probably far from the implementation you were expecting. I want to have some discussion to make sure the direction I've taken doesn't have any flaws.

The idea is that the health checker task disappears, with health checks performed by the writer and reader directly.

On the writer side:

  • When there is data to be written, the writer writes it and does no health checks.
  • If more than config::health_check_interval elapses without anything to write, a PING is added to the multiplexer. The PING is represented as a multiplexer::elem but does not go through async_exec.
  • When writing, if we spend more than config::health_check_interval without writing a single byte, we consider the connection as broken. (Actually this requires a little bit more effort from what I've written, but should be easy to do).

On the reader side:

  • We know that the connection may stay idle for at most config::health_check_interval. After that time, a ping is issued. If we don't receive any data within that period plus config::health_check_interval, the connection is definitely dead. This leaves us a read timeout of 2* config::health_check_interval. This is a little bit of an heuristic, but I think is good enough.
  • If a PING containing an error is received, its adapter yields an error. This stops the reader and triggers a reconnection.

Let me know what you think.

@anarthal anarthal force-pushed the feature/flexible-health-checks branch 2 times, most recently from 7ee4d55 to c2fa491 Compare October 15, 2025 09:46
@anarthal anarthal force-pushed the feature/flexible-health-checks branch from c2fa491 to 5ddcdd5 Compare October 16, 2025 15:57
@anarthal anarthal marked this pull request as ready for review October 16, 2025 15:59
@anarthal anarthal requested a review from mzimbres October 16, 2025 15:59
@anarthal
Copy link
Collaborator Author

anarthal commented Oct 16, 2025

I've kept the cancel_after in the reader approach. I think it is roughly equivalent to checking a timestamp updated by the reader. If I understood your approach, we'd do the following:

  • We have a timestamp in the connection's state, updated by the reader every time some data is received (i.e. after an async_read_some completes).
  • If no data is received for a certain time (let's call this value read_timeout), the connection is considered unhealthy and re-connected.
  • The writer checks after every async_write_some operation, and after waiting for more data to be available, whether the timestamp is older than read_timeout. If it is, the connection should fail.

My line of thought is that this is really imposing a timeout to reads, but implemented using polling. I think it can be more unreliable than just using asio::cancel_after.

I've chosen this read_timeout to be 2 * health_check_interval. When the connection is idle, no data is read or written during health_check_interval. A PING will be issued after health_check_interval. This leaves the server another health_check_interval to respond to the PING.

I'm open to changing the implementation if there is something I've missed here, let me know.

I've also updated the reader and writer actions to be variant-like, so they take less stack space. I've used unions with a lean class interface to make it lighter at compile time.


/// Message used by the health-checker in @ref boost::redis::basic_connection::async_run.
/// Message used by `PING` commands sent by the health checker.
std::string health_check_id = "Boost.Redis";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sidenote: When the user does not provide one id, I think we should format this to contain the id returned by the HELLO command.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean "does not provide one id" here? Leave health_check_id to the default? Make health_check_id empty?

The change is not trivial because it requires coordination between the setup task (which may not contain a HELLO at all) and the health checker.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, if the user does not provide its own health_check_id the connection could set it to "Boost.Redis (conn-id)". At the moment we only set to "Boost.Redis".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd need to detect whether the setup request contains a hello, and then parse the response and update the ping ID. Do you think it's worth it?

@mzimbres
Copy link
Collaborator

I am still studying the code but I think the writter-offset should be moved down to the multiplexer where the writer buffer is located. The writer-fsm can then call mpx.commit_write_some(bytes). Then the writer_op wouldn't need this

auto buf = asio::buffer(conn->st_.mpx.get_write_buffer().substr(act.write_offset()));

because the multiplexer knowns about the offset, so conn->st_.mpx.get_write_buffer() would already be correct.

@anarthal
Copy link
Collaborator Author

I am still studying the code but I think the writter-offset should be moved down to the multiplexer where the writer buffer is located. The writer-fsm can then call mpx.commit_write_some(bytes). Then the writer_op wouldn't need this

auto buf = asio::buffer(conn->st_.mpx.get_write_buffer().substr(act.write_offset()));

because the multiplexer knowns about the offset, so conn->st_.mpx.get_write_buffer() would already be correct.

I think this is a very good idea. I will work on it.

@anarthal
Copy link
Collaborator Author

Comments applied.

@anarthal anarthal merged commit 2b09ecb into boostorg:develop Oct 20, 2025
17 checks passed
@anarthal anarthal deleted the feature/flexible-health-checks branch October 20, 2025 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make the health checker more flexible

2 participants