Returning per-size-class cache memory to the OS #4911
SeanTAllen started this conversation in ponyc
Replies: 0 comments
PR #4910 adds `madvise(MADV_DONTNEED)` to `pool_free_pages`, which handles allocations over 1 MB. That's the easy part. The hard part is everything else.
The problem
The pool allocator maintains per-size-class free lists (`pool_local[]` and `pool_global[]`) for allocations from 32 bytes up to 1 MB. When a size class needs more memory, it carves items out of POOL_ALIGN-sized (1 KB) blocks, which themselves come from large mmap'd regions. Once an item is carved out and placed on a size-class free list, it stays there forever. The physical pages backing those items remain committed even when the items sit idle on the free list.
Under load (like stallion's hello world under siege), the runtime spins up threads, each thread builds up its per-size-class caches, and RSS climbs. When load drops, those caches retain their pages. The memory is technically "free" from the application's perspective, but the OS doesn't know that. In testing, this accounts for roughly 38 MB of RSS difference between `pool_memalign` (which uses malloc/free and lets the C allocator handle returns) and the default pool allocator.
Why this is hard
The fundamental issue is that the pool allocator doesn't track pages — it tracks individual items within size classes. A 4 KB page might contain 128 items of 32 bytes each. You can only decommit that page when all 128 items are free. The pool has no way to answer that question today because items on the free list are just a linked list of pointers with no page-level grouping.
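To make the gap concrete, here is a minimal sketch of the address arithmetic such grouping would rely on. Everything here (the `PAGE_SIZE` constant, the `page_of` helper) is illustrative, not ponyc's actual code: if pages are naturally aligned, the page an item lives on is recoverable by masking its address, but the pool still has no record of how many items on that page are live.

```c
#include <stdint.h>

/* Illustrative only: with 4 KB pages, a 32-byte size class packs
 * 4096 / 32 = 128 items per page. */
#define PAGE_SIZE 4096

/* Recover the page base of an item by masking off the low bits.
 * Answering "are all 128 items on this page free?" would additionally
 * require a per-page live count, which the pool does not keep today. */
static uintptr_t page_of(const void* item)
{
  return (uintptr_t)item & ~((uintptr_t)PAGE_SIZE - 1);
}
```

Masking gives the item-to-page direction cheaply; it is the page-to-liveness direction that needs new bookkeeping.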
jemalloc solved this with its "extent" system. Each extent tracks a contiguous run of pages and knows how many items within it are allocated vs free. When the free count hits zero for an extent, the whole thing can be decommitted. tcmalloc has a similar mechanism with its "span" concept.
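A rough sketch of what an extent/span-style record carries, in the spirit of jemalloc's extents and tcmalloc's spans. The struct and field names below are illustrative assumptions, not either allocator's real layout; the point is that a per-run live count makes "can this whole run be decommitted?" an O(1) question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative extent/span-style record (field names are assumptions):
 * one record per contiguous run of pages handed to a size class. */
typedef struct extent_t
{
  void* base;        /* first page of the contiguous run */
  size_t pages;      /* number of pages in the run */
  size_t live_items; /* items currently allocated out of this run */
} extent_t;

/* A run is safe to return to the OS only once nothing in it is live. */
static bool extent_can_decommit(const extent_t* e)
{
  return e->live_items == 0;
}
```

Allocation increments `live_items`, free decrements it; the decommit decision falls out of the counter rather than a free-list walk.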
Possible approaches
Page-level tracking: Add a page map that tracks, for each page used by a size class, how many items are currently allocated. When a free drops the count to zero, decommit the page. This is conceptually simple but touches the hot path for every `pool_alloc` and `pool_free`: the counter increment/decrement would need to be cheap and cache-friendly.
Periodic sweeping: Rather than tracking on every alloc/free, periodically walk the free lists and identify pages where all items are free. This keeps the hot path untouched but adds a background cost and latency before memory is returned. It also requires being able to map from an item address back to its page, and from a page to all the items it contains.
Hybrid: Track at the POOL_ALIGN (1 KB block) level rather than the OS page level. The pool already allocates in POOL_ALIGN chunks for small size classes, so the block boundaries are known. When all items in a block are free, decommit it. This is coarser than page-level tracking but aligns with the existing allocation structure.
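The hybrid variant could look roughly like the sketch below. Every name and the single-entry lookup are assumptions for illustration, not ponyc code: each 1 KB block keeps a count of live items, and a free reports when its block has fully drained. Note that actually returning memory would still have to cover whole OS pages, since `madvise` operates at page granularity, so empty 1 KB blocks would need to be coalesced before decommit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_ALIGN 1024 /* block size, as in the existing allocator */

/* Hypothetical per-block bookkeeping record. */
typedef struct block_info_t
{
  uintptr_t base; /* block base address */
  uint16_t live;  /* items handed out from this block, not yet freed */
} block_info_t;

/* Toy single-entry lookup for the sketch; a real version needs a page
 * map or a header embedded in the block itself. */
static block_info_t demo_block;

static block_info_t* block_info_for(uintptr_t base)
{
  return demo_block.base == base ? &demo_block : NULL;
}

/* Decrement the owning block's live count on free. Returns true when
 * the block is now completely empty and is a decommit candidate. */
static bool on_item_free(void* item)
{
  uintptr_t base = (uintptr_t)item & ~((uintptr_t)POOL_ALIGN - 1);
  block_info_t* info = block_info_for(base);
  return info != NULL && --info->live == 0;
}
```

Because block boundaries are already fixed by the existing POOL_ALIGN carving, the masking step needs no new metadata; only the per-block counter is new state.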
Scope and constraints
Any solution needs to:
What this discussion is for
This is a research placeholder. The problem is well understood and the solutions are known from other allocators, but the implementation is non-trivial and needs careful design work. The analysis is captured here so it doesn't get lost.