Skip to content

Conversation

ok2c
Copy link
Member

@ok2c ok2c commented Aug 10, 2025

@rschmitt This change-set introduces a relatively cheap 'stale' connection check command that works with both HTTP/1.1 and H2 protocols and can be used instead of a more expensive Ping command.

Please take a look.

@rschmitt
Copy link
Contributor

Do you have the corresponding client changes? I tried integrating this into PoolingAsyncClientConnectionManager after the fashion of the H2 PingCommand code path, and now my requests don't finish.

I think there's might be a race condition here involving command execution and connection closure. I remember a few years ago there was an issue where enabling inactive connection validation would cause the client to hang by submitting a PingCommand on a connection that had already received a GOAWAY. My records indicate that this bug was fixed, but now I'm not so sure, or maybe it was reintroduced. In the debugger, I don't even see the IOReactor wake up when the command is submitted; readyCount comes back as 0. It doesn't look like there's anything in IOSessionImpl::enqueue that prevents a Command from being submitted against an already-closed connection.

callback.completed(false);
}
final ByteBuffer buffer = ByteBuffer.allocate(0);
final int bytesRead = ioSession.channel().read(buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this read is actually redundant. Command is modeled as a write event, and reads are processed before writes. So if the connection is closed, we'll already know by the time we get here, provided that we include the changes from #543.

The end result ends up being very similar to the "synchronization barrier" between the event loop and the connection pool that I was talking about in that PR. The internal race condition basically goes away as long as connection reuse is completed through the event loop, which then has a chance to update all the relevant bookkeeping with respect to whatever IO events are pending.

@rschmitt
Copy link
Contributor

rschmitt commented Aug 11, 2025

I made the following changes locally:

  1. I pulled in this change
  2. I added Mark HTTP/1.1 async connection as not open (non-reusable) as soon as it becomes closed by the opposite endpoint #543 (specifically dfa2cd5)
  3. I taught PoolingAsyncClientConnectionManager to submit a StaleCheckCommand as the implementation of validateAfterInactivity on HTTP/1.1 connections
  4. I set setValidateAfterInactivity(ZERO_MILLISECONDS) in TestConnectionClosureRace

When I do all of this, the results are dramatic:

Http: Sequential requests (rapid): 2,500 succeeded; 0 failed (100.00% success rate)
Http: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Http: Single large batch: 30 succeeded; 0 failed (100.00% success rate)
Http: Multiple small batches: 15 succeeded; 0 failed (100.00% success rate)

Https: Sequential requests (rapid): 2,499 succeeded; 1 failed (99.96% success rate, 0.00% retriable)
Https: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Https: Single large batch: 29 succeeded; 1 failed (96.67% success rate, 0.00% retriable)
Https: Multiple small batches: 13 succeeded; 2 failed (86.67% success rate, 0.00% retriable)

When I disable inactive connection validation, I get the same results I've been getting:

Http: Sequential requests (rapid): 2,494 succeeded; 6 failed (99.76% success rate, 0.24% retriable)
Http: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Http: Single large batch: 15 succeeded; 15 failed (50.00% success rate, 50.00% retriable)
Http: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate, 33.33% retriable)

Https: Sequential requests (rapid): 2,476 succeeded; 24 failed (99.04% success rate, 0.96% retriable)
Https: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Https: Single large batch: 15 succeeded; 15 failed (50.00% success rate, 50.00% retriable)
Https: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate, 26.67% retriable)

We're definitely on the right track. If I'm right about the IO in doStalecheck, that code can probably all be deleted, and StaleCheckCommand can be renamed ConnectionLeaseCommand or something.

@rschmitt
Copy link
Contributor

I think there's might be a race condition here involving command execution and connection closure.

I think I was mistaken about this. The actual issue might have been the no-op implementation of StaleCheckCommand::cancel. IOSessionImpl::enqueue enqueues the command, but then cancels it if isStatusClosed(), which seems like it should be okay.

@ok2c
Copy link
Member Author

ok2c commented Aug 12, 2025

We're definitely on the right track. If I'm right about the IO in doStalecheck, that code can probably all be deleted, and StaleCheckCommand can be renamed ConnectionLeaseCommand or something.

@rschmitt The results looks encouraging.

@ok2c
Copy link
Member Author

ok2c commented Aug 12, 2025

@rschmitt I will fix the problem with #cancel and see what else could be improved.

@ok2c ok2c force-pushed the stale_conn_check_async branch from e67636b to f70f0fd Compare August 12, 2025 12:48
@rschmitt
Copy link
Contributor

Copying a comment from #543:

No amount of trickery can make the protocol handler aware of the opposite endpoint's intent to not honor the protocol requirements with regards of the connection persistence, especially when using TLS to negotiate the connection closure. It can get lucky and read the end of stream over a plain connection but this is just luck.

I think here, "the protocol requirements with regards of the connection persistence" refers to my test server in TestConnectionClosureRace, which closes the connection without sending a Connection: close header. I did this deliberately in order to simulate stale connection reuse and provoke race conditions, but I was reminded today that all of the connection persistence headers are explicitly prohibited by HTTP/2 and HTTP/3.

This means that what is happening in my reproducer is actually pretty representative of what happens on an HTTP/2 connection, in the sense that recognizing the closure of the connection requires the client to read beyond the last byte of the current response. Instead of sending a connection: close header in the response itself (which would be considered malformed, per the spec), the server would send a separate GOAWAY(last_stream_id, NO_ERROR) frame. It's important to handle this case, particularly as we move towards supporting HTTP/2 multiplexing: I'd say that h2 connection reuse should always be completed through the event loop, so that we don't open a new stream until we know that an unprocessed GOAWAY frame isn't sitting in the recv buffer. (We could also check the flow control numbers for the connection and make sure the write end isn't already completely saturated.)

We may also want to turn TestConnectionClosureRace into either an integration test or an example class. I'm reluctant to write a version of it that actually asserts anything, which seems like a recipe for a flaky test, but just keeping it around as an example class would be really useful for future investigations.

@ok2c
Copy link
Member Author

ok2c commented Aug 14, 2025

but I was reminded today that all of the connection persistence headers are explicitly prohibited by HTTP/2 and HTTP/3.

@rschmitt They are and for a good reason: multiplexing connections cannot permit a single message exchange to shut down the connection potentially shared by many concurrent exchanges.

This means that what is happening in my reproducer is actually pretty representative of what happens on an HTTP/2 connection

I cannot agree with that. GOAWAY(NO_ERROR) signals initiation of connection shutdown. A well behaved endpoint should not just write out GOAWAY(NO_ERROR) and immediately drop connection. It can, however, half-close it on its end and stop processing incoming requests which can be safely retried over another connection.

Please do add TestConnectionClosureRace to the project a test with no asserts. It is certainly useful but I cannot agree it represents a protocol conformant behavior.

@ok2c ok2c requested a review from rschmitt August 14, 2025 10:38
@ok2c
Copy link
Member Author

ok2c commented Aug 14, 2025

@rschmitt I looked at various ways of improving the logic in the stale connection command but could not figure out anything useful on top of what we already have. I do not think the check can be made 100% reliable and this is probably as good as it gets for now. Do you want me to commit the change-set, keep it open or drop it?

@rschmitt
Copy link
Contributor

A well behaved endpoint should not just write out GOAWAY(NO_ERROR) and immediately drop connection.

And a well-behaved client should not attempt to reuse a connection on which it has already received a GOAWAY frame. I haven't looked at the h2 duplexer code to see how this is handled, but my point is that the StaleCheckCommand approach is actually a really good strategy for connection reuse: it allows the client to perform an additional read on the connection itself before reusing a connection, which allows connection closure (at whatever layer of the protocol stack) to be processed.

Overall, I see this as a more modern approach. The HTTP 1.x connection headers are legacy stuff from the '90s, and we shouldn't rely on them exclusively to prevent stale/closed connection reuse. I actually think that StaleCheckCommand could replace PingCommand as the connection validation strategy for HTTP/2, since it's almost as reliable but way faster since the caller doesn't have to wait for a round-trip over the network.

I looked at various ways of improving the logic in the stale connection command but could not figure out anything useful on top of what we already have. I do not think the check can be made 100% reliable and this is probably as good as it gets for now. Do you want me to commit the change-set, keep it open or drop it?

I'm happy with the version of this change I tested locally. The client-server dyad is a distributed system, so connection closure race conditions are always a possibility, but I think the changes I described earlier fix the purely internal race condition within the client. I'll submit a branch with those changes, and any further simplifications.

@rschmitt
Copy link
Contributor

@arturobernalg
Copy link
Member

arturobernalg commented Aug 15, 2025

https://github.com/rschmitt/httpcomponents-core/tree/stale_conn_check_async

H2ConnPool#validateSession() still enqueues PingCommand, consider swapping it for StaleCheckCommand otherwise LGTM

@ok2c ok2c force-pushed the stale_conn_check_async branch 2 times, most recently from cdb3ae3 to 8464d89 Compare August 15, 2025 12:16
@ok2c
Copy link
Member Author

ok2c commented Aug 15, 2025

And a well-behaved client should not attempt to reuse a connection on which it has already received a GOAWAY

@rschmitt This should be the case already. Connections should immediately go into graceful shutdown state immediately upon the receipt of GOAWAY. I will double-check. though

StaleCheckCommand could replace PingCommand as the connection validation strategy for HTTP/2

Agreed.

I'll submit a branch with those changes, and any further simplifications.

I have incorporated your changes into the change-set. In fact the change-set is now in your name as you have contributed most to it.

@ok2c
Copy link
Member Author

ok2c commented Aug 15, 2025

https://github.com/rschmitt/httpcomponents-core/tree/stale_conn_check_async

H2ConnPool#validateSession() still enqueues PingCommand, consider swapping it for StaleCheckCommand otherwise LGTM

@arturobernalg Good catch. Corrected.

@ok2c
Copy link
Member Author

ok2c commented Aug 15, 2025

@arturobernalg @rschmitt Please do one more pass.

@ok2c ok2c force-pushed the stale_conn_check_async branch from 8464d89 to 0ba75a8 Compare August 15, 2025 12:45
Copy link
Member

@arturobernalg arturobernalg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ok2c LGTM

@rschmitt
Copy link
Contributor

This should be the case already. Connections should immediately go into graceful shutdown state immediately upon the receipt of GOAWAY. I will double-check. though

It'll take me a few hours to add HTTP/2 support to my tester. I'll follow up with that later, as well as with the conversion of TestConnectionClosureRace into an example class.

@rschmitt
Copy link
Contributor

It looks like I was right about HTTP/2. The race condition statistics are exactly the same as for HTTP/1.1:

h2: Validation disabled: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
h2: Validation enabled:  Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
h2: Validation disabled: Sequential requests (rapid): 2,473 succeeded; 27 failed (98.92% success rate, 0.00% retriable)
h2: Validation enabled:  Sequential requests (rapid): 2,500 succeeded; 0 failed (100.00% success rate)
h2: Validation disabled: Single large batch: 15 succeeded; 15 failed (50.00% success rate, 0.00% retriable)
h2: Validation enabled:  Single large batch: 30 succeeded; 0 failed (100.00% success rate)
h2: Validation disabled: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate, 0.00% retriable)
h2: Validation enabled:  Multiple small batches: 15 succeeded; 0 failed (100.00% success rate)

h2c: Validation disabled: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
h2c: Validation enabled:  Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
h2c: Validation disabled: Sequential requests (rapid): 2,483 succeeded; 17 failed (99.32% success rate, 0.00% retriable)
h2c: Validation enabled:  Sequential requests (rapid): 2,500 succeeded; 0 failed (100.00% success rate)
h2c: Validation disabled: Single large batch: 15 succeeded; 15 failed (50.00% success rate, 0.00% retriable)
h2c: Validation enabled:  Single large batch: 30 succeeded; 0 failed (100.00% success rate)
h2c: Validation disabled: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate, 0.00% retriable)
h2c: Validation enabled:  Multiple small batches: 15 succeeded; 0 failed (100.00% success rate)

Additionally, there are two problems:

  1. The h2 test case for rapid sequential tests has a tendency to hang, irrespective of whether stale connection validation is enabled. I'm guessing there's a race condition here that causes the continuation (PingCommand, StaleCheckCommand, or connection lease callback) to be dropped on the floor. I'm going to submit TestConnectionClosureRace as an integration test after all, since even without any additional assertions it's catching a serious bug.
  2. This PR does not include the changes from dfa2cd5 to read all available data from the socket. Without those changes, StaleCheckCommand doesn't work properly on plaintext HTTP/1.1 connections:
http: Validation disabled: Sequential requests (rapid): 2,489 succeeded; 11 failed (99.56% success rate, 0.44% retriable)
http: Validation enabled:  Sequential requests (rapid): 2,499 succeeded; 1 failed (99.96% success rate, 0.00% retriable)
http: Validation disabled: Single large batch: 15 succeeded; 15 failed (50.00% success rate, 0.00% retriable)
http: Validation enabled:  Single large batch: 15 succeeded; 15 failed (50.00% success rate, 0.00% retriable)
http: Validation disabled: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate, 0.00% retriable)
http: Validation enabled:  Multiple small batches: 10 succeeded; 5 failed (66.67% success rate, 0.00% retriable)

Remember, the contract for StaleCheckCommand is that the callback occurs after all pending reads on the connection have been processed, so the caller has to be sure to do that.

Copy link
Contributor

@rschmitt rschmitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please incorporate the changes from dfa2cd5 (or reopen #543 and ship them separately)

@ok2c
Copy link
Member Author

ok2c commented Aug 19, 2025

Please incorporate the changes from dfa2cd5 (or reopen #543 and ship them separately)

@rschmitt I am not in favor of re-opening and committing #543 for the reasons stated already. I think the removal of the read operation from the stale connection check was a mistake.

@ok2c ok2c force-pushed the stale_conn_check_async branch from 0ba75a8 to 1bf3859 Compare August 20, 2025 09:38
@rschmitt
Copy link
Contributor

@ok2c I tested with 1bf3859 and the race condition still exists on HTTP (but is fixed on HTTPS). If you don't want to ship #543 then I think reintroducing the read operation will work.

@ok2c
Copy link
Member Author

ok2c commented Aug 21, 2025

I think reintroducing the read operation will work

@rschmitt That is unfortunate because 1bf3859 already does an extra read. It can be though the call is ignored due to the read buffer capacity being zero.

@ok2c ok2c force-pushed the stale_conn_check_async branch 3 times, most recently from bb50a17 to e1b344b Compare August 24, 2025 17:27
The `StaleCheckCommand`, like all `Command`s, is modeled as a write operation, and `InternalDataChannel::onIOEvent` processes reads before writes. Therefore, by the time the stale check command is processed, the client's view of the connection is already up-to-date; any server-initiated connection closure (FIN, RST, GOAWAY) has already been read and processed.
@ok2c ok2c force-pushed the stale_conn_check_async branch from e1b344b to 0e22b40 Compare August 26, 2025 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants