Returning per-size-class cache memory to the OS #4911
SeanTAllen started this conversation in ponyc
Replies: 0 comments
PR #4910 adds `madvise(MADV_DONTNEED)` to `pool_free_pages`, which handles allocations over 1 MB. That's the easy part. The hard part is everything else.
The problem
The pool allocator maintains per-size-class free lists (`pool_local[]` and `pool_global[]`) for allocations from 32 bytes up to 1 MB. When a size class needs more memory, it carves items out of POOL_ALIGN-sized (1 KB) blocks, which themselves come from large mmap'd regions. Once an item is carved out and placed on a size-class free list, it stays there forever. The physical pages backing those items remain committed even when the items sit idle on the free list.
Under load (like stallion's hello world under siege), the runtime spins up threads, each thread builds up its per-size-class caches, and RSS climbs. When load drops, those caches retain their pages. The memory is technically "free" from the application's perspective, but the OS doesn't know that. In testing, this accounts for roughly 38 MB of RSS difference between `pool_memalign` (which uses malloc/free and lets the C allocator handle returns) and the default pool allocator.
Why this is hard
The fundamental issue is that the pool allocator doesn't track pages — it tracks individual items within size classes. A 4 KB page might contain 128 items of 32 bytes each. You can only decommit that page when all 128 items are free. The pool has no way to answer that question today because items on the free list are just a linked list of pointers with no page-level grouping.
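To make the gap concrete, here is a minimal sketch of the address arithmetic such grouping would rely on. Everything here (the `PAGE_SIZE` constant, the `page_of` helper) is illustrative, not ponyc's actual code: if pages are naturally aligned, the page an item lives on is recoverable by masking its address, but the pool still has no record of how many items on that page are live.

```c
#include <stdint.h>

/* Illustrative only: with 4 KB pages, a 32-byte size class packs
 * 4096 / 32 = 128 items per page. */
#define PAGE_SIZE 4096

/* Recover the page base of an item by masking off the low bits.
 * Answering "are all 128 items on this page free?" would additionally
 * require a per-page live count, which the pool does not keep today. */
static uintptr_t page_of(const void* item)
{
  return (uintptr_t)item & ~((uintptr_t)PAGE_SIZE - 1);
}
```

Masking gives the item-to-page direction cheaply; it is the page-to-liveness direction that needs new bookkeeping.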
jemalloc solved this with its "extent" system. Each extent tracks a contiguous run of pages and knows how many items within it are allocated vs free. When the free count hits zero for an extent, the whole thing can be decommitted. tcmalloc has a similar mechanism with its "span" concept.
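A rough sketch of what an extent/span-style record carries, in the spirit of jemalloc's extents and tcmalloc's spans. The struct and field names below are illustrative assumptions, not either allocator's real layout; the point is that a per-run live count makes "can this whole run be decommitted?" an O(1) question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative extent/span-style record (field names are assumptions):
 * one record per contiguous run of pages handed to a size class. */
typedef struct extent_t
{
  void* base;        /* first page of the contiguous run */
  size_t pages;      /* number of pages in the run */
  size_t live_items; /* items currently allocated out of this run */
} extent_t;

/* A run is safe to return to the OS only once nothing in it is live. */
static bool extent_can_decommit(const extent_t* e)
{
  return e->live_items == 0;
}
```

Allocation increments `live_items`, free decrements it; the decommit decision falls out of the counter rather than a free-list walk.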
Possible approaches
Page-level tracking: Add a page map that tracks, for each page used by a size class, how many items are currently allocated. When a free drops the count to zero, decommit the page. This is conceptually simple but touches the hot path for every `pool_alloc` and `pool_free`: the counter increment/decrement would need to be cheap and cache-friendly.
Periodic sweeping: Rather than tracking on every alloc/free, periodically walk the free lists and identify pages where all items are free. This keeps the hot path untouched but adds a background cost and latency before memory is returned. It also requires being able to map from an item address back to its page, and from a page to all the items it contains.
Hybrid: Track at the POOL_ALIGN (1 KB block) level rather than the OS page level. The pool already allocates in POOL_ALIGN chunks for small size classes, so the block boundaries are known. When all items in a block are free, decommit it. This is coarser than page-level tracking but aligns with the existing allocation structure.
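The hybrid variant could look roughly like the sketch below. Every name and the single-entry lookup are assumptions for illustration, not ponyc code: each 1 KB block keeps a count of live items, and a free reports when its block has fully drained. Note that actually returning memory would still have to cover whole OS pages, since `madvise` operates at page granularity, so empty 1 KB blocks would need to be coalesced before decommit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_ALIGN 1024 /* block size, as in the existing allocator */

/* Hypothetical per-block bookkeeping record. */
typedef struct block_info_t
{
  uintptr_t base; /* block base address */
  uint16_t live;  /* items handed out from this block, not yet freed */
} block_info_t;

/* Toy single-entry lookup for the sketch; a real version needs a page
 * map or a header embedded in the block itself. */
static block_info_t demo_block;

static block_info_t* block_info_for(uintptr_t base)
{
  return demo_block.base == base ? &demo_block : NULL;
}

/* Decrement the owning block's live count on free. Returns true when
 * the block is now completely empty and is a decommit candidate. */
static bool on_item_free(void* item)
{
  uintptr_t base = (uintptr_t)item & ~((uintptr_t)POOL_ALIGN - 1);
  block_info_t* info = block_info_for(base);
  return info != NULL && --info->live == 0;
}
```

Because block boundaries are already fixed by the existing POOL_ALIGN carving, the masking step needs no new metadata; only the per-block counter is new state.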
Scope and constraints
Any solution needs to:
What this discussion is for
This is a research placeholder. The problem is well understood and the solutions are known from other allocators, but the implementation is non-trivial and needs careful design work. The analysis is captured here so it doesn't get lost.