Add a local allocator within the runtime #8992
base: master
Conversation
```rust
/// The size of the local heap.
///
/// This should be as big as possible, but it should still leave enough space
/// under the maximum memory usage limit to allow the host allocator to service the host calls.
const LOCAL_HEAP_SIZE: usize = 64 * 1024 * 1024;
const LOCAL_HEAP_S: picoalloc::Size = picoalloc::Size::from_bytes_usize(LOCAL_HEAP_SIZE).unwrap();
```
This might make calculations of available memory even more complicated, right? We assume 128MiB of runtime memory. But half is now available for normal allocations. And the other half for host functions that return host allocated buffers?
So I need to take into account how I am allocating buffers, right?
No.
This will still use the host allocator as a fallback if you run out of memory in the local allocator, so previously you had 128MB available and now you still have 128MB available. And since this new allocator supports much more granular allocations (the old allocator will round a 16kB + 1 byte allocation up to 32kB; this allocator will actually allocate only ~16kB) and properly reuses memory (the old allocator won't let you reuse a 16kB allocation except for another 16kB allocation; this one will), it should strictly increase the effective amount of memory available to the guest. (The only exception is that this new allocator has higher overhead for very small allocations, but considering it's much better behaved it should still be a net win.)
The only thing it reduces is the amount of memory available for use by hostcalls (allocations from within the host which return data to the guest through a hostcall), but as long as the hostcall allocations don't need more than 64MB then it should work (and if they do need more then we could reduce the size of the local heap and still benefit).
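To make that concrete, here is a minimal sketch of the two-tier idea: try the in-runtime heap first and fall back to the host allocator only when it's exhausted. The `host_malloc`/`host_free` stubs and the trivial bump allocator are stand-ins I made up to keep the sketch self-contained; the actual PR wires the fallback to the real host allocation interface and uses picoalloc (which, unlike a bump allocator, can free and reuse memory).

```rust
use core::alloc::{GlobalAlloc, Layout};
use core::cell::UnsafeCell;
use core::sync::atomic::{AtomicUsize, Ordering};

const LOCAL_HEAP_SIZE: usize = 64 * 1024 * 1024;

// The static buffer backing the in-runtime heap; it is zero-filled, so its
// pages are only physically paged in once they are actually touched.
struct LocalHeap(UnsafeCell<[u8; LOCAL_HEAP_SIZE]>);
unsafe impl Sync for LocalHeap {}
static LOCAL_HEAP: LocalHeap = LocalHeap(UnsafeCell::new([0; LOCAL_HEAP_SIZE]));

// Bump offset into LOCAL_HEAP. A trivial bump allocator stands in for
// picoalloc here purely for illustration; it never reuses freed memory.
static LOCAL_OFFSET: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-ins for the host allocator (in the real runtime these
// would be the host-provided allocation functions).
unsafe fn host_malloc(_size: usize) -> *mut u8 { core::ptr::null_mut() }
unsafe fn host_free(_ptr: *mut u8) {}

struct TwoTierAllocator;

unsafe impl GlobalAlloc for TwoTierAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let base = LOCAL_HEAP.0.get() as *mut u8;
        let mut offset = LOCAL_OFFSET.load(Ordering::Relaxed);
        loop {
            // Align the current offset and check whether the request still fits.
            let start = (offset + layout.align() - 1) & !(layout.align() - 1);
            let end = match start.checked_add(layout.size()) {
                Some(end) if end <= LOCAL_HEAP_SIZE => end,
                // Local heap exhausted: fall back to the host allocator, so the
                // total memory budget stays the same as before.
                _ => return host_malloc(layout.size()),
            };
            match LOCAL_OFFSET.compare_exchange(offset, end, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(_) => return base.add(start),
                Err(current) => offset = current,
            }
        }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
        let base = LOCAL_HEAP.0.get() as usize;
        let addr = ptr as usize;
        if addr < base || addr >= base + LOCAL_HEAP_SIZE {
            // Memory that came from the host goes back to the host; local memory
            // would be returned to picoalloc in the real implementation.
            host_free(ptr);
        }
    }
}
```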
Thanks. It still leaves me wondering how exactly I can improve my memory allocation calculations based on this new memory allocator. SRLabs recommended to assume that memory allocations on the host allocator can take up to 4x their actual size in the worst case.
Currently, we dedicate 2MiB to each stack frame in pallet_revive. So I actually subtract 8MiB from the available memory per call frame, even though the 2MiB is made up of multiple smaller allocations (depending on the PolkaVM allocation patterns). We assume 64MiB as the available memory for the call stack (half of all the memory).
The majority of memory is taken up by:
- The compiled code of the interpreter (20 bytes per instruction)
- The flat map which stores the PC -> compiled offset mappings (4x PolkaVM blob size)
- Data/Stack declared by the PolkaVM blob
Host allocations are dominated by reading the raw PolkaVM blob from storage. But this pointer is freed once we have created the compiled module from it, so only one of them is in memory at the same time, and not one per call frame.
With this PR those will be allocated by the new in-runtime allocator, as there are no host functions involved. Since you know your picoalloc best: do we need a similar security factor (currently 4x)? And if yes, how high?
SRLabs recommended to assume that memory allocations on the host allocator can take up to 4x their actual size in the worst case. Since you know your picoalloc best: Do we need a similar security factor (currently 4x)? And if yes how high?

For non-small allocations the overhead of picoalloc is effectively none.

picoalloc allocates in chunks of 32 bytes, and it has 32 bytes of overhead per allocation, so tiny allocations have quite substantial overhead (although now that I think about it, I could probably reduce that overhead if necessary). On the other hand, non-tiny allocations have virtually no overhead.
Here's a table I generated; on the left you have the requested allocation size, on the right you have the amount of memory actually used (with the overhead factor in the parens):
1..=32 -> 64 (64.0000..=2.0000)
33..=64 -> 96 (2.9091..=1.5000)
65..=96 -> 128 (1.9692..=1.3333)
97..=128 -> 160 (1.6495..=1.2500)
129..=160 -> 192 (1.4884..=1.2000)
161..=192 -> 224 (1.3913..=1.1667)
193..=224 -> 256 (1.3264..=1.1429)
225..=256 -> 288 (1.2800..=1.1250)
257..=288 -> 320 (1.2451..=1.1111)
289..=320 -> 352 (1.2180..=1.1000)
321..=352 -> 384 (1.1963..=1.0909)
353..=384 -> 416 (1.1785..=1.0833)
385..=416 -> 448 (1.1636..=1.0769)
417..=448 -> 480 (1.1511..=1.0714)
449..=480 -> 512 (1.1403..=1.0667)
481..=512 -> 544 (1.1310..=1.0625)
513..=544 -> 576 (1.1228..=1.0588)
...
1025..=1056 -> 1088 (1.0615..=1.0303)
...
2049..=2080 -> 2112 (1.0307..=1.0154)
...
4097..=4128 -> 4160 (1.0154..=1.0078)
...
8129..=8160 -> 8192 (1.0078..=1.0039)
...
16609..=16640 -> 16672 (1.0038..=1.0019)
...
1048449..=1048480 -> 1048512 (1.0001..=1.0000)
1048481..=1048512 -> 1048544 (1.0001..=1.0000)
1048513..=1048544 -> 1048576 (1.0001..=1.0000)
1048545..=1048575 -> 1048608 (1.0001..=1.0000)
So as you can see the bigger the allocation is the less memory is wasted; at allocations of size ~4k the overhead is only ~1%.
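In other words, the numbers above follow a simple rule: round the requested size up to a multiple of 32 bytes and add 32 bytes of per-allocation overhead. A tiny sketch that reproduces the table (derived from the numbers above, not from picoalloc's source):

```rust
/// Bytes actually consumed by a single allocation of `size` bytes, assuming
/// "round up to a 32-byte multiple, plus 32 bytes of per-allocation overhead".
fn bytes_used(size: usize) -> usize {
    (size + 31) / 32 * 32 + 32
}

fn main() {
    assert_eq!(bytes_used(1), 64);       // 64.00x overhead
    assert_eq!(bytes_used(33), 96);      // ~2.91x
    assert_eq!(bytes_used(4128), 4160);  // ~1.01x
    for size in [1, 32, 512, 4096, 16 * 1024, 1024 * 1024] {
        let used = bytes_used(size);
        println!("{size} -> {used} ({:.4}x)", used as f64 / size as f64);
    }
}
```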
(With the caveat that Vecs are by default resized in powers of two, so you need to take that into account or call shrink_to_fit if you're allocating them, in which case picoalloc can always efficiently shrink them in place.)
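As a small illustration of that caveat (plain standard-library Vec behavior, nothing picoalloc-specific):

```rust
fn main() {
    // Growing a Vec one element at a time doubles its capacity as needed, so
    // 5000 bytes of payload typically end up in an 8192-byte buffer.
    let mut v: Vec<u8> = Vec::new();
    for i in 0..5000u32 {
        v.push(i as u8);
    }
    println!("len = {}, capacity = {}", v.len(), v.capacity());

    // Shrinking gives the slack back; with an allocator that can shrink
    // in place (as picoalloc can) this doesn't even need a copy.
    v.shrink_to_fit();
    println!("capacity after shrink_to_fit = {}", v.capacity());
}
```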
I am not really in control of the allocation patterns here. The overwhelming majority of the memory is consumed by the PolkaVM Module and Instance. If those allocate those sections as a few larger allocations, we are good. And if they don't allow guests to trigger a lot of small allocations (sbrk is banned).
If we added a shrink_to_fit to PolkaVM for those then, yes, this would cut down on their memory usage and allow them to use almost exactly the amount of memory they need, without any security factor necessary.
It's just a default value; the current on-chain value for Polkadot and Kusama is 8192 pages (512 Mb). Otherwise, it's an interesting approach and I think we should give it a try 👍
So for PVF the value is stored on-chain. For parachain nodes there is a default value. Can it still be changed via CLI?
No, it cannot be changed. Also, the value on the relay chain validators should be bigger on purpose.
It's 512 vs. 128, mainly to account for the PoV being stored in runtime memory, right?
Yes, exactly.
I will test the new allocator in a full node running a runtime that uses it.
@koute what are your thoughts, is the approach on the right path?
For (1): easiest/quickest way to do it is probably directly in the …, e.g.:

```rust
log::debug!("Compiling runtime with hash: {runtime_hash}");
if let Ok(target) = std::env::var("DEBUG_FORCE_REPLACE_RUNTIME") {
    let mut found = false;
    for chunk in target.split(",") {
        let mut xs = chunk.split("=");
        let target_hash = xs.next().unwrap();
        if runtime_hash == target_hash {
            let replacement_runtime_path = xs.next().unwrap();
            let replacement_runtime_hash = ...;
            log::warn!("Force-replacing runtime {runtime_hash} with: {replacement_runtime_path} ({replacement_runtime_hash})");
            let replacement_runtime = std::fs::read(replacement_runtime_path).unwrap();
            runtime_blob = replacement_runtime;
            found = true;
            break;
        }
    }
    if !found {
        log::info!("DEBUG_FORCE_REPLACE_RUNTIME was set, but runtime {runtime_hash} was not specified; continuing with the original runtime...");
    }
}
```

So then you'd run the node once to check what the hash is, and then restart it with DEBUG_FORCE_REPLACE_RUNTIME set accordingly. For (2): in …
You also want to add a similar log to …. Run this on the original runtime, then we can write a small script to parse these logs into e.g. JSON, and then write a small simulator that can run both allocators (the old one and the new one) at the same time using exactly the same allocation patterns, and then it'll be possible to calculate the fragmentation, how much memory it'd save, etc.
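For what it's worth, the simulator part could stay very small; a rough sketch of the idea (the trace format and trait below are placeholders I made up, not anything that exists in the repo):

```rust
// A made-up trace format: one event per allocator call, as parsed out of the
// node logs (e.g. after converting them to JSON).
enum TraceEvent {
    Alloc { id: u64, size: usize },
    Free { id: u64 },
}

// Both allocators (the current host allocator and picoalloc) would implement
// this, so the same trace can be replayed against each.
trait SimAllocator {
    fn alloc(&mut self, id: u64, size: usize);
    fn free(&mut self, id: u64);
    fn bytes_used(&self) -> usize;
}

fn replay(events: &[TraceEvent], alloc: &mut dyn SimAllocator) -> usize {
    let mut peak = 0;
    for event in events {
        match event {
            TraceEvent::Alloc { id, size } => alloc.alloc(*id, *size),
            TraceEvent::Free { id } => alloc.free(*id),
        }
        peak = peak.max(alloc.bytes_used());
    }
    // Peak memory usage is what we'd compare between the two allocators.
    peak
}
```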
Hey @koute! Sounds like a plan, thanks for the detailed input. I will follow the steps more concretely and let you know if there are any blockers.
Sure, hit me up if you have any further questions. And as I've said, you could probably actually make a PR with both (1) and (2) and permanently merge these, since this should be relatively small and unintrusive, and it could be useful for further debugging in the future.
Just a quick update about the testing of the new allocator:
- Simulating allocations like in a production network, to compare it with the host allocator
- Running a smoke test where the local allocator runs in a production network's runtime, in a full node, without catastrophic errors

The above errors are mixed with other logs like: …
@bkchr pointed out that the runtime I am trying to override with is excessively big (>70MiB). This happens for every rustc version >= 1.84. Building with < 1.84 fails because …
@koute just a random idea. If the runtime's heap is exhausted, we're falling back to using the host allocator. What if, instead of that, we were falling back to allocating one more heap for the runtime? That would add some complexity of managing multiple heaps, but if that additional heap allocation protocol were well-defined, it would be fully deterministic, and that determinism could be preserved through parametrizing that protocol and storing parameter values in …
@s0me0ne-unkn0wn Could be done, but if it requires protocol-level changes then I'm not sure whether it's worth it.
Quick update related to the AHP allocations simulation with both allocators. Polkadot-related allocations of the full node will be posted in a follow-up. AHP full node left running for 1h:
Runtime: 6e20bc52aaaafd1de82ba7d2a3c0fa39193787e240b93608489cf72a4c46a584 (this is the current on-chain runtime on AHP): …
Runtime: e260d17fcfa34f10503c91148a7bc2fd820e356295d2e18f828b5fa4190d47f7 (I am not sure which runtime this is, maybe some older version which was used for some syncing?): …
Some initial observations: …
I don't understand. I am seeing …
Yes, this is how parachains work. The PVF (Parachain Validation Function) is just the runtime of the parachain, and the relay chain executes it to verify that the state transition is valid.
I think I need a quick primer on how to interpret this data.
Why is there "(unallocated space)"? If this is requested from the allocator, how is it unallocated?
This is the key metric we want to see minimized, right?
What are the "extra bytes" here?
This is the number of bytes (assuming a completely random allocation pattern, which is unrealistic, but it is a metric) that can still be allocated after the test (so basically how many extra bytes we can still allocate after the final state).
Not necessarily. This is essentially how many physical memory pages were touched. This is not observable on-chain (except maybe timing-wise because allocating physical pages takes a little bit of extra time), and (as long as the host system doesn't run out of memory) doesn't affect how much memory can be allocated inside the runtime.
@iulianbarbu I created an "exploit" contract that uses the maximum amount of memory possible, by using the maximum amount of instructions and data sections and then recursing by filling the call stack: https://github.com/paritytech/memory_exhaustion You need to run it against the kitchensink node on top of: #9267 Can you try with the old and the new allocator and observe if the allocation patterns look safe, i.e. if enough memory is left or if this is too close to going OOM?
```rust
//
// This should be relatively cheap as long as all of this space is full of zeros,
// since none of this memory will be physically paged-in until it's actually used.
static LOCAL_HEAP: LocalHeap = LocalHeap(UnsafeCell::new([0; LOCAL_HEAP_SIZE]));
```
I didn't realize before that you're getting that heap compiled into your Wasm module. That is, literally, you get a 64 Mb sized global in your on-disk module's bss segment filled with zeroes. May pose a problem with real values 😅
Hmm, does this mean the runtime size is 64mb larger? Building AHP with the local allocator gets us to a ~70mb runtime. Building it without the local allocator commits results in a ~6mb AHP runtime. This would explain it, but I also believe the object shouldn't contain the full zero-filled heap in bss (some kind of optimization should be used to represent the area, not place actual zeros in the object). Not sure how this can be checked.
Exactly, and it doesn't get compressed because it's over the code bomb size limit, so you're putting all the 70 Mb on chain without any compression 🤦
In wasm there is no bss, just memory and initializers. I assume Rust generates those data initializers with zeroes and nothing in between seems to optimize this out. They can be removed because the memory is always zero-initialized. We are running wasm-opt, but AFAIK we disable optimizations. You can try to enable them (in wasm-builder) and see if it removes the zeroes.
Exactly, and it doesn't get compressed because it's over the code bomb size limit, so you're putting all the 70 Mb on chain without any compression 🤦
Yes, and I am not sure what kind of issues we can end up with. For example, I tried running an AHP full node based on the local allocator, with such a big runtime, and got errors like:
2025-07-28 11:55:07.149 DEBUG tokio-runtime-worker code-provider::overrides: [Parachain]using WASM override block=0x2c4f2a965d343d49b627dfa733b7857d906f3161c31d253c19c8e1645e972ca7
2025-07-28 11:55:07.162 WARN tokio-runtime-worker sync: [Parachain] 💔 Verification failed for block 0x618bbca94a98b8701ca0e1023c8aad3659a5bac2ea11cd2c91167c3e81118698 received from (12D3KooWDR9M7CjV1xdjCRbRwkFn1E7sjMaL4oYxGyDWxuLrFc2J): "Could not fetch authorities at 0x2c4f2a965d343d49b627dfa733b7857d906f3161c31d253c19c8e1645e972ca7: Current state of blockchain has invalid authorities set"
Seems unrelated to the big runtime size, but it might be. Not sure.
In wasm there is no bss. Just memory and initializers.
Exactly, the memory is initialized from segments in the data section, and Wasmtime names them .rodata, .data, and .bss respectively. I'm not sure if that naming is anything standard in the Wasm world, but the idea is clear. And if LLVM is dumb enough to initialize 64 Mb of zeroed memory from a vector compiled into the blob, I doubt wasm-opt will do any better job optimizing it out. Indeed, we do not need that; we just need dynamic heap allocation. We'll need it in the RFC-4 world anyway, as for PVFs the heap size is stored on chain and may vary from session to session. So, for now, it's just a funny incident in experimental code (I came across it trying to increase the heap to 640 Mb and ending up with the linker not being able to link the polkadot binary, as every relay chain runtime had grown over 640 Mb and all of them compiled into the polkadot embedded chainspecs at the same time made up an over 2 Tb binary).
And if LLVM is dumb enough to initialize 64 Mb zeroed memory from a vector compiled into the blob, I'd doubt if wasm-opt will do any better job optimizing it out.
I wouldn't be so sure. The precise reason for the existence of wasm-opt is that LLVM is really bad at optimizing wasm. LLVM was never meant to deal with stack machines. Having a wasm backend is a giant hack.
Indeed, we do not need that; we just need dynamic heap allocation.
We are fine with the static buffer for now if we can optimize out the zeroes. If wasm-opt doesn't help, we use uninitialized memory.
A quick update. By now I've tried:
- wasm-opt's OptimizeLevel up to Level4 (max available)
- wasm-opt's ShrinkLevel up to Level2 (max available)
- Different combinations of UnsafeCell and MaybeUninit

with no success.
This works:

```rust
.add_pass(wasm_opt::Pass::MemoryPacking)
.zero_filled_memory(true)
```
By default it does not assume that imported memory is zero initialized. We should also only add this specific pass and not enable more optimizations. Those are slow and barely help (we benchmarked a while ago).
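For reference, this is roughly how it could look when driven through the wasm-opt crate's API; the MemoryPacking pass and zero_filled_memory(true) are from the snippet above, while the constructor name, file paths, and the surrounding function are my assumptions:

```rust
use wasm_opt::{OptimizationOptions, Pass};

fn pack_runtime() -> Result<(), Box<dyn std::error::Error>> {
    // Run only the memory-packing pass and declare the imported memory as
    // zero-initialized, so the huge all-zero data segments backing the static
    // heap are dropped instead of being emitted into the blob.
    OptimizationOptions::new_opt_level_0()
        .add_pass(Pass::MemoryPacking)
        .zero_filled_memory(true)
        .run("runtime.wasm", "runtime.packed.wasm")?;
    Ok(())
}
```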
The above optimization fixes the AHP runtime size. I confirm that an AHP full node with a runtime that uses this local allocator + fallback to the host allocator is successfully importing blocks.
Another update for running an AHP full node with local allocator + host allocator fallback: …
@athei @koute @s0me0ne-unkn0wn wdyt?
Agreed. The space wasted on memory padding is fine for the worst-case contract we tested.
This is an experimental PR which adds a local allocator to the runtime.
Why?
The current allocator is known to be... not very good; it fragments memory and wastes a ton of memory (e.g. if you allocate a big vector and deallocate it then that memory cannot be reused for smaller allocations) and it doesn't respect alignment. Unfortunately, it lives on the host so we have to live with it.
There's an effort underway to remove the host allocator, but as that's a protocol-level change it's going to take some time, while we'd like to have a better allocator right now (our recently deployed smart contracts on Kusama have quite heavy limits on the size of contracts which are allowed, in big part because of our crappy allocator).
So... how about we have two allocators?
So here's what we could do: preallocate a static buffer inside of our runtime and use that to service allocations from within the runtime, bypassing the host allocator completely and only using it for allocations originating from the host (and those which overflow our local allocator).
But you may ask - won't this increase memory usage for every instantiation and slow everything down? Well, not necessarily! Here are the benchmarks:
Benchmark results...
So as you can see this does heavily regress performance... when copy-on-write pooling is not used, and we do use it by default! When copy-on-write pooling is used there's no difference in instantiation performance (even though we've "allocated" a 64MB chunk of static memory) nor in actual memory usage (the memory pages are lazily allocated, so they're not physically allocated until they're touched), and the new local allocator is up to ~20% faster than the current host allocator, so this actually improves performance!
And unlike the host allocator this new allocator (which I wrote) will properly reuse memory, supports fancy features like in-place reallocation when possible and properly handles alignment, all while having constant-time allocation and deallocation.
So what's the catch?
As far as I know this should probably work, as long as we make sure that there's still enough memory for the host allocator to service the host functions within the maximum memory limit. For PVFs as far as I can see the limit is around ~128MB so using half of that for local allocations and half of that for hostcalls might be reasonable? But of course we would have to properly test it out before pushing something like this to production.
cc @athei @TorstenStueber @paritytech/sdk-node @s0me0ne-unkn0wn @bkchr