
Conversation

@arturobernalg (Member) commented Nov 14, 2025

This is the first step towards a pluggable pooled ByteBuffer allocator.
The patch adds PooledByteBufferAllocator (power-of-two size buckets, a global pool plus per-thread caches) and switches the HTTP/2 FrameFactory and the benchmark to the ByteBufferAllocator interface. Behaviour is unchanged except that small control frames now use pooled buffers.
If this direction looks reasonable, I'll follow up by threading the allocator through IOSession/SSLIOSession and the async codecs and adding minimal metrics; otherwise I'll keep it local to HTTP/2.
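
A minimal sketch of the idea, with illustrative names and sizes rather than the actual PooledByteBufferAllocator internals: requests are rounded up to a power-of-two bucket, served from a per-thread cache first, then from the shared global pool, and freshly allocated only as a last resort.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.concurrent.ConcurrentLinkedDeque;

final class BucketedAllocatorSketch {

    private static final int MIN_SHIFT = 5;  // smallest bucket: 1 << 5 = 32 bytes
    private static final int BUCKETS = 16;   // largest bucket: 1 << 20 = 1 MiB

    @SuppressWarnings("unchecked")
    private final ConcurrentLinkedDeque<ByteBuffer>[] global = new ConcurrentLinkedDeque[BUCKETS];

    private final ThreadLocal<ArrayDeque<ByteBuffer>[]> local = ThreadLocal.withInitial(() -> {
        @SuppressWarnings("unchecked")
        final ArrayDeque<ByteBuffer>[] caches = new ArrayDeque[BUCKETS];
        for (int i = 0; i < BUCKETS; i++) {
            caches[i] = new ArrayDeque<>();
        }
        return caches;
    });

    BucketedAllocatorSketch() {
        for (int i = 0; i < BUCKETS; i++) {
            global[i] = new ConcurrentLinkedDeque<>();
        }
    }

    // Smallest power-of-two bucket that can hold the requested size.
    private static int bucketOf(final int size) {
        final int needed = Math.max(size, 1 << MIN_SHIFT);
        final int shift = 32 - Integer.numberOfLeadingZeros(needed - 1); // ceil(log2)
        return shift - MIN_SHIFT;
    }

    ByteBuffer allocate(final int size) {
        final int bucket = bucketOf(size);
        if (bucket >= BUCKETS) {
            return ByteBuffer.allocate(size);            // too large to pool
        }
        ByteBuffer buf = local.get()[bucket].pollLast(); // per-thread cache first
        if (buf == null) {
            buf = global[bucket].pollLast();             // then the shared pool
        }
        if (buf == null) {
            buf = ByteBuffer.allocate(1 << (bucket + MIN_SHIFT));
        }
        buf.clear();
        buf.limit(size);
        return buf;
    }

    void release(final ByteBuffer buf) {
        final int bucket = bucketOf(buf.capacity());
        // A real implementation would cap cache sizes and spill to the global
        // pool; here returned buffers simply go back to the caller's cache.
        if (bucket < BUCKETS && buf.capacity() == 1 << (bucket + MIN_SHIFT)) {
            local.get()[bucket].offerLast(buf);
        }
    }
}
```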

ByteBuffer allocator throughput (JMH)

| Benchmark | bufferSize | iterations | Mode | Cnt | Score (ops/ms) | Error (ops/ms) |
|---|---|---|---|---|---|---|
| pooled_allocator_shared | 1024 | 100 | thrpt | 10 | 1644.982 | 14.006 |
| pooled_allocator_shared | 8192 | 100 | thrpt | 10 | 533.638 | 34.307 |
| pooled_allocator_shared | 65536 | 100 | thrpt | 10 | 59.422 | 0.937 |
| pooled_allocator_thread_local | 1024 | 100 | thrpt | 10 | 539.612 | 5.518 |
| pooled_allocator_thread_local | 8192 | 100 | thrpt | 10 | 201.345 | 4.451 |
| pooled_allocator_thread_local | 65536 | 100 | thrpt | 10 | 19.603 | 0.501 |
| simple_allocator_shared | 1024 | 100 | thrpt | 10 | 172.750 | 4.893 |
| simple_allocator_shared | 8192 | 100 | thrpt | 10 | 23.083 | 0.199 |
| simple_allocator_shared | 65536 | 100 | thrpt | 10 | 2.883 | 0.037 |
| simple_allocator_thread_local | 1024 | 100 | thrpt | 10 | 129.873 | 1.075 |
| simple_allocator_thread_local | 8192 | 100 | thrpt | 10 | 21.075 | 0.088 |
| simple_allocator_thread_local | 65536 | 100 | thrpt | 10 | 2.401 | 0.062 |
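
For context, a hedged sketch of the kind of JMH harness that could produce numbers in this shape; the actual benchmark class in the patch may differ, and this one reuses the BucketedAllocatorSketch type from above as a stand-in for the pooled allocator.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class AllocatorBenchmarkSketch {

    @Param({"1024", "8192", "65536"})
    int bufferSize;

    @Param("100")
    int iterations;

    final BucketedAllocatorSketch pooled = new BucketedAllocatorSketch();

    @Benchmark
    public void simple_allocator(final Blackhole bh) {
        // Baseline: one short-lived heap buffer per iteration, left to the GC.
        for (int i = 0; i < iterations; i++) {
            bh.consume(ByteBuffer.allocate(bufferSize));
        }
    }

    @Benchmark
    public void pooled_allocator(final Blackhole bh) {
        // Pooled variant: acquire and return, as FrameFactory would for a
        // short-lived control frame.
        for (int i = 0; i < iterations; i++) {
            final ByteBuffer buf = pooled.allocate(bufferSize);
            bh.consume(buf);
            pooled.release(buf);
        }
    }
}
```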

@ok2c WDYT?

@arturobernalg requested a review from ok2c on Nov 14, 2025
Introduce PooledByteBufferAllocator with global buckets and per-thread caches and use it in HTTP/2 FrameFactory.
…lculation while preserving existing behaviour.
@ok2c (Member) commented Nov 18, 2025

@arturobernalg I may be wrong, but I was under the impression that many (if not all) Java frameworks arrived at the same conclusion: memory pooling became counterproductive as of Java 8, given the efficiency of modern garbage-collection algorithms. I will run the micro-benchmark locally and look at the results, but it may take me a while.

Generally I see no problem with providing pluggable allocators, as long as the simple one remains the default and you are willing to maintain the more complex ones.

@rschmitt do you happen to have an opinion on this matter?

@rschmitt (Contributor) commented
@ok2c I'm going to ask one or two more qualified people for an opinion and get back to you. My understanding is that object pooling can outperform garbage collection, but it's harder to do than you'd think. (There's also the question of what "outperform" means. What are we measuring, tail latencies? CPU overhead? Heap footprint?) Pooled buffers also come with a lot of risks, like increased potential for memory leaks, or security vulnerabilities such as buffer over-reads.

The Javadoc says that the PooledByteBufferAllocator is inspired by Netty's pooled buffer allocator, but which one? In Netty 4.2, they changed the default allocator from the pooled allocator to the AdaptiveByteBufAllocator. What does that mean, exactly? ¯\_(ツ)_/¯ Evidently it may have something to do with virtual threads.

I guess the main concern I have here is the effectiveness of adding buffer pooling retroactively, compared with the cost in code churn. Typically what I see is frameworks or applications that are designed from the ground up to be garbage-free or zero-copy or what have you. I think this proposal would be more persuasive if I knew what we were measuring and what our performance target is, and what the hotspots currently are for ephemeral garbage. Can they be addressed with a minimum of API churn? (I find it's very difficult to thread new parameters deep into HttpComponents; if we implemented pooling, I'd prefer to make it a purely internal optimization, and an implementation detail. We should be more hesitant to increase our API surface area.)

Finally, I think it's a little late in the development cycle for httpcore 5.4 to be considering such a change. Any usage of pooling in the HTTP/2 or TLS or IOReactor implementation should probably be gated behind a system property and considered experimental.
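
A sketch of the system-property gate suggested here. The property name is hypothetical, and the single-method shape of ByteBufferAllocator is assumed for illustration; the point is only that the simple allocator stays the default and pooling is an experimental opt-in.

```java
import java.nio.ByteBuffer;

// Assumed single-method shape, for illustration only.
interface ByteBufferAllocator {
    ByteBuffer allocate(int size);
}

final class AllocatorConfig {

    // Hypothetical, experimental opt-in switch; not defined by the patch.
    private static final String POOLING_PROPERTY =
            "org.apache.hc.core5.experimental.pooledAllocator";

    static ByteBufferAllocator chooseAllocator(final ByteBufferAllocator pooled) {
        // Simple allocation stays the default; pooling must be explicitly enabled.
        return Boolean.getBoolean(POOLING_PROPERTY) ? pooled : ByteBuffer::allocate;
    }
}
```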

Covers mixed-route workloads with slow discard and expiry paths
Test-only change, no impact on public API or defaults
@arturobernalg (Member, Author) commented Nov 18, 2025

@ok2c @rschmitt I've added a small JMH benchmark that exercises the old and new pool under mixed routes with slow discard/expiry.
On my machine the segmented pool removes the cross-route stall and slightly improves tail latency while keeping throughput comparable.
Happy to adjust the scenario or parameters if you’d like to capture other access patterns.

To clarify the Netty reference: the allocator is conceptually closest to Netty 4.1's pooled buffer allocator, not the AdaptiveByteBufAllocator that Netty 4.2 made the default.

| Allocator | Kind | Buffer | Throughput (ops/ms) | Error |
|---|---|---|---|---|
| pooled_allocator_shared | HEAP | 1024 | 517.697 | ±7.829 |
| pooled_allocator_shared | DIRECT | 1024 | 527.269 | ±20.476 |
| pooled_allocator_shared | HEAP | 8192 | 194.948 | ±1.124 |
| pooled_allocator_shared | DIRECT | 8192 | 222.407 | ±2.573 |
| pooled_allocator_shared | HEAP | 65536 | 19.387 | ±0.297 |
| pooled_allocator_shared | DIRECT | 65536 | 18.704 | ±1.621 |
| pooled_allocator_thread_local | HEAP | 1024 | 519.383 | ±9.957 |
| pooled_allocator_thread_local | DIRECT | 1024 | 544.220 | ±11.254 |
| pooled_allocator_thread_local | HEAP | 8192 | 205.072 | ±2.435 |
| pooled_allocator_thread_local | DIRECT | 8192 | 222.178 | ±7.711 |
| pooled_allocator_thread_local | HEAP | 65536 | 18.960 | ±0.172 |
| pooled_allocator_thread_local | DIRECT | 65536 | 18.286 | ±1.217 |
| simple_allocator_shared | HEAP | 1024 | 150.141 | ±6.162 |
| simple_allocator_shared | DIRECT | 1024 | 8.553 | ±5.767 |
| simple_allocator_shared | HEAP | 8192 | 24.545 | ±0.880 |
| simple_allocator_shared | DIRECT | 8192 | 5.835 | ±2.174 |
| simple_allocator_shared | HEAP | 65536 | 2.767 | ±0.162 |
| simple_allocator_shared | DIRECT | 65536 | 2.351 | ±0.244 |
| simple_allocator_thread_local | HEAP | 1024 | 149.243 | ±5.933 |
| simple_allocator_thread_local | DIRECT | 1024 | 8.373 | ±5.096 |
| simple_allocator_thread_local | HEAP | 8192 | 25.226 | ±1.756 |
| simple_allocator_thread_local | DIRECT | 8192 | 5.665 | ±2.431 |
| simple_allocator_thread_local | HEAP | 65536 | 2.700 | ±0.248 |
| simple_allocator_thread_local | DIRECT | 65536 | 2.274 | ±0.182 |
| Allocator | Kind | Buffer | gc.alloc.rate.norm (B/op) | gc.count | gc.time (ms) |
|---|---|---|---|---|---|
| pooled_allocator_shared | HEAP | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_shared | DIRECT | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_shared | HEAP | 8192 | 0.035 | ≈0 | - |
| pooled_allocator_shared | DIRECT | 8192 | 0.031 | ≈0 | - |
| pooled_allocator_shared | HEAP | 65536 | 0.356 | ≈0 | - |
| pooled_allocator_shared | DIRECT | 65536 | 0.370 | ≈0 | - |
| pooled_allocator_thread_local | HEAP | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_thread_local | DIRECT | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_thread_local | HEAP | 8192 | 0.034 | ≈0 | - |
| pooled_allocator_thread_local | DIRECT | 8192 | 0.031 | ≈0 | - |
| pooled_allocator_thread_local | HEAP | 65536 | 0.364 | ≈0 | - |
| pooled_allocator_thread_local | DIRECT | 65536 | 0.378 | ≈0 | - |
| simple_allocator_shared | HEAP | 1024 | 104000.046 | 100.000 | 94.000 |
| simple_allocator_shared | DIRECT | 1024 | 13600.926 | 8.000 | 4147.000 |
| simple_allocator_shared | HEAP | 8192 | 820800.283 | 89.000 | 84.000 |
| simple_allocator_shared | DIRECT | 8192 | 13601.245 | 20.000 | 2020.000 |
| simple_allocator_shared | HEAP | 65536 | 6555202.508 | 81.000 | 77.000 |
| simple_allocator_shared | DIRECT | 65536 | 13602.957 | 29.000 | 252.000 |
| simple_allocator_thread_local | HEAP | 1024 | 104000.046 | 94.000 | 90.000 |
| simple_allocator_thread_local | DIRECT | 1024 | 13600.875 | 8.000 | 3920.000 |
| simple_allocator_thread_local | HEAP | 8192 | 820800.276 | 86.000 | 91.000 |
| simple_allocator_thread_local | DIRECT | 8192 | 13601.321 | 19.000 | 1827.000 |
| simple_allocator_thread_local | HEAP | 65536 | 6555202.575 | 86.000 | 81.000 |
| simple_allocator_thread_local | DIRECT | 65536 | 13603.057 | 27.000 | 272.000 |

@rschmitt (Contributor) commented Nov 19, 2025

I asked Aleksey Shipilëv for his thoughts:

Depends. In a pure allocation benchmark, allocation would likely be on par with reuse. But once you get far from that ideal, awkward things start to happen.

  1. When there is any non-trivial live set in the heap, GC would have to at least visit it every so often; that "so often" is driven by GC frequency, which is driven by allocation rate. Pure allocation speed and pure reclamation cost become much less relevant in this scenario -- what else is happening dominates hard. Generational GCs win you some, but they really only prolong the inevitable.
  2. When objects are allocated, they are nominally zeroed. Under high allocation rate, that is easily the slowest part, think ~10 GB/sec per thread. Re-use often comes with avoiding these cleanups, often at the cost of weaker security posture (leaking data between reused buffers).
  3. For smaller objects, the metadata management (headers, all that fluff) dominates the allocation path performance, and is often logically intermixed with the real work. E.g. you rarely allocate 10M objects just because; there is likely some compute in between. But allocating new byte[BUF_SIZE] (BUF_SIZE=1M defined in another file) is very easy. So hitting (1) and (2) is much easier the larger the objects in question get.
  4. For smaller objects, the pooling overheads become on par with the size of the objects themselves. The calculation for total memory footprint can push the scale in either direction.
  5. For some awkward classes like DirectByteBuffers that have separate cleanup schedule, unbounded allocation is a recipe for a meltdown.

So the answer is somewhat along the lines of: Pooling common (small) objects? Nah, too much hassle for too little gain. Pooling large buffers? Yes, that is a common perf optimization. Pooling large buffers with special lifecycle? YES, do not even think about not doing the pooling. For everything in between, the answer is somewhere in between.

Here, "special lifecycle" refers to things like finalizers, Cleaners, weak references, etc.; nothing that would apply to a simple byte buffer.

Another interesting point that came up is that if you use heap (non-direct) byte buffers, and if the pool doesn't hold on to byte buffer references while they are leased out, then there is no risk of a memory leak: returning the buffer to the pool is purely an optimization. Since HEAP and DIRECT have near-identical performance, maybe we should just hardcode a pooled heap buffer allocator into key hotspots.
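
That "leak-free by construction" property is easy to see in a sketch: the pool holds references only to buffers that have been returned, never to leased ones, so a caller that forgets to release a buffer just hands it to the GC (illustrative code, not the patch's implementation).

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

final class OptionalReleaseHeapPool {

    private final int capacity;
    private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();

    OptionalReleaseHeapPool(final int capacity) {
        this.capacity = capacity;
    }

    ByteBuffer acquire() {
        final ByteBuffer buf = free.poll();
        // No record of the lease is kept: a buffer that is never returned is
        // ordinary garbage, not a pool leak.
        return buf != null ? buf : ByteBuffer.allocate(capacity);
    }

    void release(final ByteBuffer buf) {
        // Purely an optimization, as noted above; dropping the call is safe.
        if (!buf.isDirect() && buf.capacity() == capacity) {
            buf.clear();
            free.offer(buf);
        }
    }
}
```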

@ok2c (Member) commented Nov 20, 2025

@rschmitt Thank you so much for such an informative summary. Please convey my gratitude to Aleksey.

One thing that bugs me is: how big is big? How big should byte buffers be to justify pooling? If it is a couple of MB, then memory pooling may be useful in our case.
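
To make the question concrete, one toy shape the answer could take is threshold-gated pooling; the 64 KiB cutoff below is a made-up placeholder (the right value is exactly what would need measuring), and OptionalReleaseHeapPool is the sketch type from the previous comment.

```java
import java.nio.ByteBuffer;

final class ThresholdAllocatorSketch {

    // Made-up cutoff for illustration only.
    private static final int POOLING_THRESHOLD = 64 * 1024;

    // Single 1 MiB bucket for simplicity; assumes size <= 1 MiB.
    private final OptionalReleaseHeapPool largePool = new OptionalReleaseHeapPool(1024 * 1024);

    ByteBuffer allocate(final int size) {
        if (size < POOLING_THRESHOLD) {
            // Small buffers: plain allocation is cheap for the GC (point 4 above).
            return ByteBuffer.allocate(size);
        }
        // Large buffers: reuse, where zeroing cost and GC pressure dominate.
        final ByteBuffer buf = largePool.acquire();
        buf.limit(size);
        return buf;
    }
}
```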

Here, "special lifecycle" refers to things like finalizers, Cleaners, weak references, etc.; nothing that would apply to a simple byte buffer.

I think the only objects with a "special lifecycle" are in the HttpClient caching module, but those are backed by files, not byte buffers. There is nothing else I can think of.

However, I imagine the classic-on-async facade may actually qualify as a potential beneficiary of a pooled memory allocator, so I am leaning towards approving this change-set and letting @arturobernalg proceed with further experiments.

What do you think?
