Conversation

@koute koute commented Jun 26, 2025

This is an experimental PR which adds a local allocator to the runtime.

Why?

The current allocator is known to be... not very good; it fragments memory and wastes a ton of memory (e.g. if you allocate a big vector and deallocate it then that memory cannot be reused for smaller allocations) and it doesn't respect alignment. Unfortunately, it lives on the host so we have to live with it.

There's an effort underway to remove the host allocator, but as that's a protocol-level change it's going to take some time, while we'd like to have a better allocator right now (our recently deployed smart contract support on Kusama imposes quite strict limits on the size of contracts which are allowed, in big part because of our crappy allocator).

So... how about we have two allocators?

So here's what we could do: preallocate a static buffer inside of our runtime and use that to service allocations from within the runtime, bypassing the host allocator completely and using it only for allocations originating from the host (and those which overflow our local allocator).
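
In sketch form the idea looks something like this (a rough sketch with hypothetical names; the real PR services local allocations through a picoalloc arena over a static buffer, with the host allocator as the overflow path):

use core::alloc::{GlobalAlloc, Layout};

// Stand-ins for the real pieces (hypothetical): a picoalloc arena carved out
// of the static buffer, plus the existing host allocator bindings.
fn local_alloc(_layout: Layout) -> *mut u8 { core::ptr::null_mut() } // stub
fn local_dealloc(_ptr: *mut u8, _layout: Layout) {} // stub
fn is_local(_ptr: *mut u8) -> bool { false } // "does ptr point into the local heap?"
fn host_alloc(_size: usize) -> *mut u8 { core::ptr::null_mut() } // stub
fn host_dealloc(_ptr: *mut u8) {} // stub

struct TwoLevelAllocator;

unsafe impl GlobalAlloc for TwoLevelAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Service the allocation from the in-runtime heap first...
        let ptr = local_alloc(layout);
        if !ptr.is_null() {
            return ptr;
        }
        // ...and only overflow into the host allocator when it is exhausted.
        host_alloc(layout.size())
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Route the free back to whichever allocator owns the pointer.
        if is_local(ptr) {
            local_dealloc(ptr, layout);
        } else {
            host_dealloc(ptr);
        }
    }
}

#[global_allocator]
static ALLOCATOR: TwoLevelAllocator = TwoLevelAllocator;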

But you may ask - won't this increase memory usage for every instantiation and slow everything down? Well, not necessarily! Here are the benchmarks:

Benchmark results...
call_empty_function_from_test_runtime_with_recreate_instance_vanilla_on_1_threads
                        time:   [48.383 ms 48.701 ms 49.040 ms]
                        change: [+126138% +127908% +129601%] (p = 0.00 < 0.05)
                        Performance has regressed.

call_empty_function_from_test_runtime_with_recreate_instance_cow_fresh_on_1_threads
                        time:   [54.612 µs 55.185 µs 55.832 µs]
                        change: [+26.559% +29.659% +32.389%] (p = 0.00 < 0.05)
                        Performance has regressed.

call_empty_function_from_test_runtime_with_recreate_instance_cow_precompiled_on_1_threads
                        time:   [55.338 µs 55.763 µs 56.167 µs]
                        change: [+22.147% +28.098% +32.512%] (p = 0.00 < 0.05)
                        Performance has regressed.

call_empty_function_from_test_runtime_with_pooling_vanilla_fresh_on_1_threads
                        time:   [44.845 ms 45.297 ms 45.793 ms]
                        change: [+395945% +401661% +407435%] (p = 0.00 < 0.05)
                        Performance has regressed.

call_empty_function_from_test_runtime_with_pooling_vanilla_precompiled_on_1_threads
                        time:   [45.942 ms 46.381 ms 46.870 ms]
                        change: [+411440% +416369% +421523%] (p = 0.00 < 0.05)
                        Performance has regressed.

call_empty_function_from_test_runtime_with_pooling_cow_fresh_on_1_threads
                        time:   [10.091 µs 10.156 µs 10.226 µs]
                        change: [-4.2348% -2.5816% -0.8566%] (p = 0.00 < 0.05)
                        Change within noise threshold.

call_empty_function_from_test_runtime_with_pooling_cow_precompiled_on_1_threads
                        time:   [10.031 µs 10.165 µs 10.361 µs]
                        change: [-2.1798% -0.9575% +0.2739%] (p = 0.13 > 0.05)
                        No change in performance detected.

dirty_1mb_of_memory_from_test_runtime_with_recreate_instance_vanilla_on_1_threads
                        time:   [48.740 ms 48.968 ms 49.195 ms]
                        change: [+9155.6% +9258.5% +9360.2%] (p = 0.00 < 0.05)
                        Performance has regressed.

dirty_1mb_of_memory_from_test_runtime_with_recreate_instance_cow_fresh_on_1_threads
                        time:   [521.72 µs 523.74 µs 525.88 µs]
                        change: [-1.0385% -0.0036% +0.9264%] (p = 0.99 > 0.05)
                        No change in performance detected.

dirty_1mb_of_memory_from_test_runtime_with_recreate_instance_cow_precompiled_on_1_threads
                        time:   [516.21 µs 519.87 µs 523.83 µs]
                        change: [-3.1591% -1.9363% -0.7346%] (p = 0.00 < 0.05)
                        Change within noise threshold.

dirty_1mb_of_memory_from_test_runtime_with_pooling_vanilla_fresh_on_1_threads
                        time:   [47.068 ms 47.731 ms 48.417 ms]
                        change: [+9449.4% +9609.3% +9774.7%] (p = 0.00 < 0.05)
                        Performance has regressed.

dirty_1mb_of_memory_from_test_runtime_with_pooling_vanilla_precompiled_on_1_threads
                        time:   [46.603 ms 47.017 ms 47.466 ms]
                        change: [+9600.5% +9736.0% +9871.1%] (p = 0.00 < 0.05)
                        Performance has regressed.

dirty_1mb_of_memory_from_test_runtime_with_pooling_cow_fresh_on_1_threads
                        time:   [472.08 µs 475.39 µs 478.93 µs]
                        change: [-5.2329% -4.2232% -3.2674%] (p = 0.00 < 0.05)
                        Performance has improved.

dirty_1mb_of_memory_from_test_runtime_with_pooling_cow_precompiled_on_1_threads
                        time:   [482.74 µs 486.91 µs 491.44 µs]
                        change: [-2.6321% -1.2422% +0.0606%] (p = 0.08 > 0.05)
                        No change in performance detected.

abuse_the_allocator_from_test_runtime_with_recreate_instance_vanilla_on_1_threads
                        time:   [231.04 ms 234.60 ms 238.76 ms]
                        change: [-10.478% -8.3490% -6.2766%] (p = 0.00 < 0.05)
                        Performance has improved.

abuse_the_allocator_from_test_runtime_with_recreate_instance_cow_fresh_on_1_threads
                        time:   [203.67 ms 207.19 ms 211.25 ms]
                        change: [-20.092% -18.177% -16.244%] (p = 0.00 < 0.05)
                        Performance has improved.

abuse_the_allocator_from_test_runtime_with_recreate_instance_cow_precompiled_on_1_threads
                        time:   [194.80 ms 199.05 ms 204.21 ms]
                        change: [-23.939% -21.690% -19.465%] (p = 0.00 < 0.05)
                        Performance has improved.

abuse_the_allocator_from_test_runtime_with_pooling_vanilla_fresh_on_1_threads
                        time:   [230.91 ms 232.93 ms 235.14 ms]
                        change: [-10.807% -9.2727% -7.7802%] (p = 0.00 < 0.05)
                        Performance has improved.

abuse_the_allocator_from_test_runtime_with_pooling_vanilla_precompiled_on_1_threads
                        time:   [228.64 ms 232.46 ms 236.91 ms]
                        change: [-9.0888% -7.0559% -4.9203%] (p = 0.00 < 0.05)
                        Performance has improved.

abuse_the_allocator_from_test_runtime_with_pooling_cow_fresh_on_1_threads
                        time:   [205.45 ms 209.47 ms 214.30 ms]
                        change: [-21.317% -18.784% -16.172%] (p = 0.00 < 0.05)
                        Performance has improved.

abuse_the_allocator_from_test_runtime_with_pooling_cow_precompiled_on_1_threads
                        time:   [203.81 ms 206.61 ms 210.08 ms]
                        change: [-21.940% -19.911% -17.898%] (p = 0.00 < 0.05)
                        Performance has improved.

So as you can see this does heavily regress performance... when copy-on-write pooling is not used, and we do use it by default! When copy-on-write pooling is used there's no difference in instantiation performance (even though we've "allocated" a 64MB chunk of static memory) nor in actual memory usage (the memory pages are lazily allocated, so they're not physically allocated until they're touched), and the new local allocator is up to ~20% faster than the current host allocator, so this actually improves performance!

And unlike the host allocator this new allocator (which I wrote) will properly reuse memory, supports fancy features like in-place reallocation when possible and properly handles alignment, all while having constant-time allocation and deallocation.

So what's the catch?

As far as I know this should probably work, as long as we make sure that there's still enough memory for the host allocator to service the host functions within the maximum memory limit. For PVFs as far as I can see the limit is around ~128MB so using half of that for local allocations and half of that for hostcalls might be reasonable? But of course we would have to properly test it out before pushing something like this to production.

cc @athei @TorstenStueber @paritytech/sdk-node @s0me0ne-unkn0wn @bkchr

@koute koute added the I9-optimisation, I5-enhancement, and T17-primitives labels on Jun 26, 2025
Comment on lines +30 to +35
/// The size of the local heap.
///
/// This should be as big as possible, but it should still leave enough space
/// under the maximum memory usage limit to allow the host allocator to service the host calls.
const LOCAL_HEAP_SIZE: usize = 64 * 1024 * 1024;
const LOCAL_HEAP_S: picoalloc::Size = picoalloc::Size::from_bytes_usize(LOCAL_HEAP_SIZE).unwrap();
Member

This might make calculations of available memory even more complicated, right? We assume 128MiB of runtime memory. But half is now available for normal allocations. And the other half for host functions that return host allocated buffers?

So I need to take into account how I am allocating buffers, right?

@koute koute (Contributor Author) Jun 26, 2025

No.

This will still use the host allocator as a fallback if you run out of memory in the local allocator, so where you previously had 128MB available you still have 128MB available. On top of that, the new allocator supports much more granular allocations (the old allocator rounds a 16kB + 1 byte allocation up to 32kB; this one allocates only ~16kB) and properly reuses memory (the old allocator won't let a freed 16kB allocation be reused for anything but another 16kB allocation; this one will), so it should strictly increase the effective amount of memory available to the guest. (The only exception is that the new allocator has higher overhead for very small allocations, but considering it's much better behaved it should still be a net win.)

The only thing it reduces is the amount of memory available for use by hostcalls (allocations from within the host which return data to the guest through a hostcall), but as long as the hostcall allocations don't need more than 64MB then it should work (and if they do need more then we could reduce the size of the local heap and still benefit).

Member

Thanks. It still leaves me wondering how exactly I can improve my memory allocation calculations based on this new memory allocator. SRLabs recommended assuming that memory allocations on the host allocator can take up to 4x their actual size in the worst case.

Currently, we dedicate 2MiB to each stack frame in pallet_revive, so I actually subtract 8MiB from the available memory per call frame, even though the 2MiB is made up of multiple smaller allocations (depending on the PolkaVM allocation patterns). We assume 64MiB of memory is available to the call stack (half of all the memory).
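
(With those numbers, each 2 MiB frame is budgeted as 2 MiB × 4 = 8 MiB, so the 64 MiB call-stack budget covers 64 / 8 = 8 concurrent call frames.)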

The majority of memory is taken up by:

  1. The compiled code of the interpreter (20 bytes per instruction)
  2. The flat map which stores the PC -> compiled offset mappings (4x PolkaVM blob size)
  3. Data/Stack declared by the PolkaVM blob

Host allocations are dominated by reading the raw PolkaVM blob from storage. But that buffer is freed once we have created the compiled module from it, so only one of them is in memory at a time, not one per call frame.

With this PR those will be allocated by the new in-runtime allocator, as there are no host functions involved. Since you know your picoalloc best: do we need a similar security factor (currently 4x)? And if yes, how high?

Contributor Author

SRLabs recommended assuming that memory allocations on the host allocator can take up to 4x their actual size in the worst case. Since you know your picoalloc best: do we need a similar security factor (currently 4x)? And if yes, how high?

For non-small allocations the overhead for picoalloc is effectively none.

picoalloc allocates in chunks of 32 bytes, and it has 32 byte overhead per allocation, so tiny allocations have quite substantial overhead (although now that I think about it I could probably reduce that overhead if necessary). On the other hand non-tiny allocations have virtually no overhead.

Here's a table I generated; on the left you have the requested allocation size, on the right the amount of memory actually used (with the overhead factor in parentheses):

1..=32 -> 64 (64.0000..=2.0000)
33..=64 -> 96 (2.9091..=1.5000)
65..=96 -> 128 (1.9692..=1.3333)
97..=128 -> 160 (1.6495..=1.2500)
129..=160 -> 192 (1.4884..=1.2000)
161..=192 -> 224 (1.3913..=1.1667)
193..=224 -> 256 (1.3264..=1.1429)
225..=256 -> 288 (1.2800..=1.1250)
257..=288 -> 320 (1.2451..=1.1111)
289..=320 -> 352 (1.2180..=1.1000)
321..=352 -> 384 (1.1963..=1.0909)
353..=384 -> 416 (1.1785..=1.0833)
385..=416 -> 448 (1.1636..=1.0769)
417..=448 -> 480 (1.1511..=1.0714)
449..=480 -> 512 (1.1403..=1.0667)
481..=512 -> 544 (1.1310..=1.0625)
513..=544 -> 576 (1.1228..=1.0588)
...
1025..=1056 -> 1088 (1.0615..=1.0303)
...
2049..=2080 -> 2112 (1.0307..=1.0154)
...
4097..=4128 -> 4160 (1.0154..=1.0078)
...
8129..=8160 -> 8192 (1.0078..=1.0039)
...
16609..=16640 -> 16672 (1.0038..=1.0019)
...
...
1048449..=1048480 -> 1048512 (1.0001..=1.0000)
1048481..=1048512 -> 1048544 (1.0001..=1.0000)
1048513..=1048544 -> 1048576 (1.0001..=1.0000)
1048545..=1048575 -> 1048608 (1.0001..=1.0000)

So as you can see the bigger the allocation is the less memory is wasted; at allocations of size ~4k the overhead is only ~1%.
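
The pattern is easy to model (a hedged reconstruction from the table above, assuming the 32-byte chunks plus the 32-byte per-allocation overhead described earlier):

// Requests are rounded up to 32-byte chunks, plus a 32-byte header.
fn used_bytes(requested: usize) -> usize {
    let chunks = requested.div_ceil(32).max(1);
    chunks * 32 + 32
}

fn main() {
    assert_eq!(used_bytes(1), 64); // 1..=32 -> 64 (worst case 64x overhead)
    assert_eq!(used_bytes(33), 96); // 33..=64 -> 96
    assert_eq!(used_bytes(4128), 4160); // ~1% overhead around 4k
}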

(With the caveat that Vecs by default grow by doubling their capacity, so you need to take that into account or call shrink_to_fit if you're allocating them, in which case picoalloc can always efficiently shrink them in place.)
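
For example (illustrative only; the exact capacity depends on Vec's growth policy):

fn main() {
    let mut buf: Vec<u8> = Vec::new();
    for _ in 0..(16 * 1024 + 1) {
        buf.push(0); // repeated doubling typically leaves ~32KiB of capacity here
    }
    // Trim the excess before the buffer becomes long-lived; per the caveat
    // above, picoalloc can shrink the allocation in place.
    buf.shrink_to_fit();
    assert!(buf.capacity() >= buf.len());
}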

Member

I am not really in control of the allocation patterns here. The overwhelming majority of the memory is consumed by the PolkaVM Module and Instance. If those allocate their sections as a few larger allocations we are good, as long as they also don't allow guests to trigger a lot of small allocations (sbrk is banned).

Contributor Author

If we added a shrink_to_fit to PolkaVM for those then, yes, this would cut down on their memory usage and allow them to use almost exactly the amount of memory they need, without any security factor necessary.

@s0me0ne-unkn0wn (Contributor)

For PVFs as far as I can see the limit is around ~128MB

It's just a default value; the current on-chain value for Polkadot and Kusama is 8192 pages (512 MiB).

Otherwise, it's an interesting approach and I think we should give it a try 👍

@athei (Member) commented Jun 26, 2025

So for PVF the value is stored on-chain. For parachain nodes there is a default value. Can it still be changed via CLI?

@bkchr (Member) commented Jun 26, 2025

So for PVF the value is stored on-chain. For parachain nodes there is a default value. Can it still be changed via CLI?

No, it cannot be changed. Also, the value on the relay chain validators is bigger on purpose.

@athei (Member) commented Jun 26, 2025

It's 512 vs. 128. Mainly to account for the PoV being stored in runtime memory, right?

@bkchr (Member) commented Jun 26, 2025

Yes exactly.

@iulianbarbu (Contributor) commented Jul 11, 2025

I will test the new allocator in a full node which will run based on a runtime that uses it.

  1. For now, I know that I need to inject a new runtime wasm blob (built with the experimental allocator) into a full node that joins AH-Polkadot (maybe Polkadot/Kusama too?). I also know that the wasm blob actually used is usually not the one in the chain spec, but that one plus the runtime upgrades from blocks executed during syncing. However, that's not very relevant, since I believe I need to bypass that updated runtime and instead make calls into a wasm blob loaded from disk at the right moment, pointing at the new allocator-based runtime that must be present on the node's disk. Finding the code paths where I can hook this in is the first task.

  2. Once this node is running and I have some confidence it is using the correct runtime, I'll also need to dump some metrics about who does the allocations and what they are, and maybe the block too, because if the allocations look awful we might want to rerun them or check what that block is made of (here I might have the choice of dumping all blocks or picking them based on some criteria). I'm a bit unsure how this task will eventually look, but I'll clarify it when the time comes. Of course, the same should happen on a full node that uses the original runtime (with the host allocator), to compare the stats. Any pointers are welcome (especially because I might be overcomplicating things and a simplified version would be equally useful).

@koute what are your thoughts, is the approach on the right path?

@koute (Contributor Author) commented Jul 11, 2025

@iulianbarbu

For (1): the easiest/quickest way to do it is probably directly in sc-executor-wasmtime; in do_create_runtime you can just calculate a hash of the runtime that is to be loaded, and if it matches the hash of the runtime that's currently running on the tip of Polkadot, replace it with your blob. Probably something like this would work (pseudocode with lax error handling; you could probably also just make a PR with something like this and merge it):

log::debug!("Compiling runtime with hash: {runtime_hash}");
if let Ok(target) = std::env::var("DEBUG_FORCE_REPLACE_RUNTIME") {
    let mut found = false;
    // The variable holds comma-separated `hash=path` pairs.
    for chunk in target.split(",") {
        let mut xs = chunk.split("=");
        let target_hash = xs.next().unwrap();
        if runtime_hash == target_hash {
            let replacement_runtime_path = xs.next().unwrap();
            let replacement_runtime_hash = ...;
            log::warn!("Force-replacing runtime {runtime_hash} with: {replacement_runtime_path} ({replacement_runtime_hash})");
            let replacement_runtime = std::fs::read(replacement_runtime_path).unwrap();
            runtime_blob = replacement_runtime;
            found = true;
            break;
        }
    }
    if !found {
        log::info!("DEBUG_FORCE_REPLACE_RUNTIME was set, but runtime {runtime_hash} was not specified; continuing with the original runtime...");
    }
}

So then you'd run the node once to check what the hash is, and then restart it with DEBUG_FORCE_REPLACE_RUNTIME=hash=path environment variable set to force-replace the runtime.

For (2): in sp-io you have the malloc/free host call implementations; just put your logs there along with some unique ID so that we can extract the whole allocation history from the logs. The self in those functions is the HostContext trait, so I'd probably add two extra methods to HostContext: fn runtime_hash to get the runtime hash (to make sure we know which runtime we're measuring), and fn invocation_id, a global counter that's incremented every time a new runtime instance is instantiated (i.e. add an invocation_id: u64 field to the HostState struct). Then you could log these like this:

log::debug!(target: "runtime_allocation_trace", "Allocation trace: {runtime_hash}: {instance_id} malloc: size={size}, pointer=0x{pointer:x}");
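
A rough sketch of the counter plumbing (field and method names are hypothetical, per the description above):

use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter, bumped once per runtime instantiation, so that every
// trace line can be attributed to a concrete instance.
static NEXT_INVOCATION_ID: AtomicU64 = AtomicU64::new(0);

struct HostState {
    invocation_id: u64,
    // ...the existing fields stay as they are...
}

impl HostState {
    fn new() -> Self {
        Self { invocation_id: NEXT_INVOCATION_ID.fetch_add(1, Ordering::Relaxed) }
    }
}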

You also want to add a similar log to inject_input_data in sc-executor-wasmtime, since the allocator there is triggered by the host to inject the initial input data (even when the new local allocator is enabled this will still be done using the old allocator, but we want it in the logs for completeness' sake, since without it we won't be able to simulate the exact heap state).

Run this on the original runtime; then we can write a small script to parse these logs into e.g. JSON, and then write a small simulator that runs both allocators (the old one and the new one) at the same time using exactly the same allocation patterns. Then it'll be possible to calculate the fragmentation, how much memory it'd save, etc.
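
As a sketch of the parsing step (assuming exactly the log line format shown above; a real script would also handle the free/deallocation lines symmetrically before dumping the events as JSON):

#[derive(Debug)]
struct AllocEvent {
    runtime_hash: String,
    instance_id: u64,
    size: u64,
    pointer: u64,
}

fn parse_malloc_line(line: &str) -> Option<AllocEvent> {
    // "... Allocation trace: <hash>: <instance> malloc: size=<n>, pointer=0x<p>"
    let rest = line.split("Allocation trace: ").nth(1)?;
    let (runtime_hash, rest) = rest.split_once(": ")?;
    let (instance_id, rest) = rest.split_once(" malloc: size=")?;
    let (size, pointer_hex) = rest.split_once(", pointer=0x")?;
    Some(AllocEvent {
        runtime_hash: runtime_hash.to_string(),
        instance_id: instance_id.parse().ok()?,
        size: size.parse().ok()?,
        pointer: u64::from_str_radix(pointer_hex.trim(), 16).ok()?,
    })
}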

@iulianbarbu (Contributor)

Hey @koute! Sounds like a plan, thanks for the detailed input. I'll work through the steps and let you know if I hit any blockers.

@koute (Contributor Author) commented Jul 11, 2025

Hey @koute! Sounds like a plan, thanks for the detailed input. I'll work through the steps and let you know if I hit any blockers.

Sure, hit me up if you have any further questions. And as I've said, you could probably actually make a PR with both (1) and (2) and permanently merge these, since they should be relatively small and non-intrusive, and they could be useful for further debugging in the future.

@bkchr (Member) commented Jul 11, 2025

@iulianbarbu

pub wasm_runtime_overrides: Option<PathBuf>,
is what you are searching for.

@iulianbarbu (Contributor) commented Jul 28, 2025

Just a quick update about the testing of the new allocator.

Simulating allocations like in a production network, to compare the local allocator with the host allocator

  1. I finalized the setup for gathering allocation patterns from a running full node on AHP.
    1.a. @koute gave me a simulator where both allocators are exercised with random allocations; instead of the random allocations I hooked in the allocation patterns extracted from the full node logs and run through those. At this time I am not sure whether the end result of the simulator is interesting as is, but I will follow up on this soon after I gather more allocation patterns (see 1.b).
    1.b. One challenge here is that extracting the allocation patterns requires a lot of disk, and my current setup fills up fast. I'll have to ask devops for a new machine with 2TB that I can use solely for this testing for a while, and get back here with some results.

Running a smoke test where the local allocator runs in a production network's runtime, in a full node, without catastrophic errors

  1. I've successfully built a 1.5.1 AHP runtime with the local allocator (it seems I was previously trying to build the runtime against an incorrect polkadot-sdk commit; pointing the AHP deps to stable2412-6 did it).
    1.a. I want to run a full node with this runtime for a while as a smoke test. I'm not sure whether enabling local-allocation logs from within the runtime wasm blob is interesting, but I might do it; depending on how long we want to leave this running, I might need another dedicated machine.
    1.b. Block importing based on the 1.5.1 AHP runtime with the local allocator failed with the errors below. The uncompressed runtime is quite big (71MB with rustc 1.84 and 73MB with rustc 1.86):
2025-07-28 11:55:07.149 DEBUG tokio-runtime-worker code-provider::overrides: [Parachain]using WASM override block=0x2c4f2a965d343d49b627dfa733b7857d906f3161c31d253c19c8e1645e972ca7
2025-07-28 11:55:07.162  WARN tokio-runtime-worker sync: [Parachain] 💔 Verification failed for block 0x618bbca94a98b8701ca0e1023c8aad3659a5bac2ea11cd2c91167c3e81118698 received from (12D3KooWDR9M7CjV1xdjCRbRwkFn1E7sjMaL4oYxGyDWxuLrFc2J): "Could not fetch authorities at 0x2c4f2a965d343d49b627dfa733b7857d906f3161c31d253c19c8e1645e972ca7: Current state of blockchain has invalid authorities set"

The above errors are mixed with other logs like:

2025-07-28 11:55:38.200 DEBUG tokio-runtime-worker code-provider: [Parachain] Neither WASM override nor substitute available, using onchain code block=0x6d6268961eec7a7a3ea27d0d2986cd147086a59d81305af55e302556f8a847ec

@bkchr pointed out that the runtime I am trying to override with is excessively big (>70MiB). This happens for every rustc version >= 1.84 (building with < 1.84 fails because picoalloc uses certain unreleased features). I suspect the root cause is patching the workspace dependencies of polkadot-fellowship/runtimes (with a patch.crates-io section) to point at local paths to polkadot-sdk crates (so that I can include the local allocator commits). They seem to add bloat to the AHP 1.5.1 runtime even when building with --profile production and --features on-chain-build-release (it looks like some unoptimized build is done behind the scenes for these patched deps; I haven't confirmed yet). Fixing this is a work in progress.

  2. Follow up with implementations for other synthetic tests (@bkchr had some ideas in an internal chat; we can pick his brain for more input).

@s0me0ne-unkn0wn (Contributor)

@koute just a random idea. If the runtime's heap is exhausted, we fall back to using the host allocator. What if, instead, we fell back to allocating one more heap for the runtime? That would add some complexity of managing multiple heaps, but if that additional heap allocation protocol were well-defined, it would be fully deterministic, and that determinism could be preserved by parametrizing the protocol and storing the parameter values in ExecutorParams.

@koute (Contributor Author) commented Jul 29, 2025

just a random idea. If the runtime's heap is exhausted, we fall back to using the host allocator. What if, instead, we fell back to allocating one more heap for the runtime? That would add some complexity of managing multiple heaps, but if that additional heap allocation protocol were well-defined, it would be fully deterministic, and that determinism could be preserved by parametrizing the protocol and storing the parameter values in ExecutorParams.

@s0me0ne-unkn0wn Could be done, but if it requires protocol-level changes then I'm not sure whether it's worth it.

@iulianbarbu (Contributor) commented Jul 31, 2025

Quick update related to the AHP allocations simulation with both allocators. Polkadot-related allocations of the full node will be posted in a follow-up.

AHP full node left running for 1h

  1. I used "executor: log host alloc/dealloc #9363" to build polkadot-parachain and start an AHP full node, and extracted both relaychain & parachain allocation logs.
  2. I used a simulator from @koute (which would be great to publish in a repo as an example of how simulations should happen, and even integrate into our CI for regression testing - @koute WDYT?), updated to read through a file with the host allocations for the AHP runtime, from a full node which ran for an hour, synced to the tip at that time.
  3. Below you can see some raw results of the simulation:

Runtime: 6e20bc52aaaafd1de82ba7d2a3c0fa39193787e240b93608489cf72a4c46a584 (this is the current on-chain runtime on AHP)

Running with allocator: legacy
  Peak allocation count: 5586
  Peak requested space: 25214682
  Peak wasted space on padding: 6788230
  Final allocations (that were not yet deallocated): 4393
  Final requested space (unallocated space): 25045617
  Final wasted space on padding: 6693135
  Physical memory used: 31035392
  Bump allocator stats: AllocationStats { bytes_allocated: 31773896, bytes_allocated_peak: 32044136, bytes_allocated_sum: 571192664, address_space_used: 32248168 }
  Extra bytes allocated: 378385375

Running with allocator: new
  Peak allocation count: 5586
  Peak requested space: 25214682
  Peak wasted space on padding: 89283
  Final allocations (that were not yet deallocated): 4393
  Final requested space (unallocated space): 25045617
  Final wasted space on padding: 68911
  Physical memory used: 25538560
  Bump allocator stats: AllocationStats { bytes_allocated: 0, bytes_allocated_peak: 0, bytes_allocated_sum: 0, address_space_used: 0 }
  Extra bytes allocated: 394245905

Runtime: e260d17fcfa34f10503c91148a7bc2fd820e356295d2e18f828b5fa4190d47f7 (I am not sure which runtime this is; maybe some older version that was used for syncing?)

Running with allocator: legacy
  Peak allocation count: 7852
  Peak requested space: 32659186
  Peak wasted space on padding: 23185710
  Final allocations (that were not yet deallocated): 7801
  Final requested space (unallocated space): 32604198
  Final wasted space on padding: 23154330
  Physical memory used: 36519936
  Bump allocator stats: AllocationStats { bytes_allocated: 55820936, bytes_allocated_peak: 55907704, bytes_allocated_sum: 283346320, address_space_used: 55953104 }
  Extra bytes allocated: 360474882

Running with allocator: new
  Peak allocation count: 7852
  Peak requested space: 32659186
  Peak wasted space on padding: 147502
  Final allocations (that were not yet deallocated): 7801
  Final requested space (unallocated space): 32604198
  Final wasted space on padding: 146394
  Physical memory used: 39440384
  Bump allocator stats: AllocationStats { bytes_allocated: 0, bytes_allocated_peak: 0, bytes_allocated_sum: 0, address_space_used: 0 }
  Extra bytes allocated: 386472515

Some initial observations

  1. The space wasted on padding by the local allocator is ~1% of the space wasted by the host allocator, for both runtimes.
  2. Actual physical memory used can vary: for e260d17fcfa34f10503c91148a7bc2fd820e356295d2e18f828b5fa4190d47f7 the local allocator uses more physical memory, while for 6e20bc52aaaafd1de82ba7d2a3c0fa39193787e240b93608489cf72a4c46a584 it uses less. Not sure how to explain this. @koute, any ideas?

@iulianbarbu (Contributor)

Runtime: e260d17fcfa34f10503c91148a7bc2fd820e356295d2e18f828b5fa4190d47f7 (I am not sure which runtime this is; maybe some older version that was used for syncing?)

I don't understand. I am seeing e260d17fcfa34f10503c91148a7bc2fd820e356295d2e18f828b5fa4190d47f7 as the code hash in the relaychain allocation patterns too. Is it possible for the same runtime to be used in both parachain and relaychain contexts?

@bkchr (Member) commented Jul 31, 2025

Is it possible for the same runtime to be used in both parachain and relaychain contexts?

Yes, this is how parachains work. The PVF (parachain validation function) is just the runtime of the parachain, and the relay chain executes it to verify that the state transition is valid.

@skunert (Contributor) commented Aug 1, 2025

Running with allocator: new
Peak allocation count: 7852
Peak requested space: 32659186
Peak wasted space on padding: 147502
Final allocations (that were not yet deallocated): 7801
Final requested space (unallocated space): 32604198
Final wasted space on padding: 146394
Physical memory used: 39440384
Bump allocator stats: AllocationStats { bytes_allocated: 0, bytes_allocated_peak: 0, bytes_allocated_sum: 0, address_space_used: 0 }
Extra bytes allocated: 386472515

I think I need a quick primer on how to interpret this data.

Final requested space (unallocated space): 32604198

Why is there (unallocated space)? If this is requested from the allocator, how is it unallocated?

Physical memory used: 36519936

This is the key metric we want to see minimized right?

Extra bytes allocated: 386472515

What are extra bytes here?

@koute (Contributor Author) commented Aug 1, 2025

What are extra bytes here?

This is the number of bytes (assuming a completely random allocation pattern, which is unrealistic, but it is a metric) that can still be allocated after the test - so basically how many extra bytes we can still allocate from the final state.

Physical memory used: 36519936

This is the key metric we want to see minimized right?

Not necessarily. This is essentially how many physical memory pages were touched. It is not observable on-chain (except maybe timing-wise, because allocating physical pages takes a little bit of extra time), and (as long as the host system doesn't run out of memory) it doesn't affect how much memory can be allocated inside the runtime.

@athei (Member) commented Aug 3, 2025

@iulianbarbu I created an "exploit" contract that uses the maximum amount of memory possible, by using the maximum number of instructions and data sections and then recursing to fill the call stack: https://github.com/paritytech/memory_exhaustion

You need to run it against the kitchensink node on top of: #9267

Can you try it with the old and new allocator and observe whether the allocation patterns look safe, i.e. if enough memory is left or if this is too close to going OOM?

//
// This should be relatively cheap as long as all of this space is full of zeros,
// since none of this memory will be physically paged-in until it's actually used.
static LOCAL_HEAP: LocalHeap = LocalHeap(UnsafeCell::new([0; LOCAL_HEAP_SIZE]));
Contributor

I didn't realize before that you're getting that heap compiled into your Wasm module. That is, literally, you get a 64 MB sized global in your on-disk module's bss segment filled with zeroes. May pose a problem with real values 😅

@iulianbarbu (Contributor) Aug 4, 2025

Hmm, does this mean the runtime size is 64MB larger? Building AHP with the local allocator gets us to a ~70MB runtime; building it without the local allocator commits results in a ~6MB AHP runtime. That would explain it, but I also believed the object shouldn't contain the full zero-filled heap in bss (some kind of optimization should represent the area without placing actual zeros in the object). Not sure how this can be checked.

Contributor

Exactly, and it doesn't get compressed because it's over the code bomb size limit, so you're putting all 70 MB on chain without any compression 🤦

Member

In wasm there is no bss; just memory and initializers. I assume Rust generates those data initializers filled with zeroes, and nothing along the way seems to optimize them out. They can be removed because the memory is always zero-initialized. We are running wasm-opt, but AFAIK we disable optimizations. You can try to enable them (in wasm-builder) and see if it removes the zeroes.

@iulianbarbu (Contributor) Aug 4, 2025

Exactly, and it doesn't get compressed because it's over the code bomb size limit, so you're putting all the 70 Mb on chain without any compression 🤦

Yes, and I am not sure what kind of issues we can end up with. For example, I tried running an AHP full node based on the local allocator, with such a big runtime, and got errors like:

2025-07-28 11:55:07.149 DEBUG tokio-runtime-worker code-provider::overrides: [Parachain]using WASM override block=0x2c4f2a965d343d49b627dfa733b7857d906f3161c31d253c19c8e1645e972ca7
2025-07-28 11:55:07.162  WARN tokio-runtime-worker sync: [Parachain] 💔 Verification failed for block 0x618bbca94a98b8701ca0e1023c8aad3659a5bac2ea11cd2c91167c3e81118698 received from (12D3KooWDR9M7CjV1xdjCRbRwkFn1E7sjMaL4oYxGyDWxuLrFc2J): "Could not fetch authorities at 0x2c4f2a965d343d49b627dfa733b7857d906f3161c31d253c19c8e1645e972ca7: Current state of blockchain has invalid authorities set"

Seems unrelated to the big runtime size, but it might be. Not sure.

Contributor

In wasm there is no bss; just memory and initializers.

Exactly, the memory is initialized from segments in the data section, and Wasmtime names them .rodata, .data, and .bss respectively. I'm not sure if that naming is anything standard in the Wasm world, but the idea is clear. And if LLVM is dumb enough to initialize 64 MB of zeroed memory from a vector compiled into the blob, I doubt wasm-opt will do any better job optimizing it out. Indeed, we do not need that; we just need dynamic heap allocation. We'll need it in the RFC-4 world anyway, as for PVFs the heap size is stored on chain and may vary from session to session. So, for now, it's just a funny incident in experimental code (I came across it trying to increase the heap to 640 MB and ended up with the linker not being able to link the polkadot binary, as every relay chain runtime had grown over 640 MB and all of them, compiled into the polkadot embedded chainspecs at the same time, made up an over 2 TB binary).

@athei (Member) Aug 4, 2025

And if LLVM is dumb enough to initialize 64 MB of zeroed memory from a vector compiled into the blob, I doubt wasm-opt will do any better job optimizing it out.

I wouldn't be so sure. The precise reason for the existence of wasm-opt is that LLVM is really bad at optimizing wasm. LLVM was never meant to deal with stack machines; having a wasm backend is a giant hack.

Indeed, we do not need that; we just need dynamic heap allocation.

We are fine with the static buffer for now if we can optimize out the zeroes. If wasm-opt doesn't help, we use uninitialized memory.

Contributor

A quick update. By now I've tried:

  • wasm-opt's OptimizeLevel up to Level4 (max available)
  • wasm-opt's ShrinkLevel up to Level2 (max available)
  • Different combinations of UnsafeCell and MaybeUninit

with no success.

Member

This works:

.add_pass(wasm_opt::Pass::MemoryPacking)
.zero_filled_memory(true)

By default it does not assume that imported memory is zero-initialized. We should also only add this specific pass and not enable more optimizations; those are slow and barely help (we benchmarked a while ago).
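
For context, roughly how that slots into a full wasm-opt invocation (a sketch using the wasm-opt crate's builder API; the exact call site and file names in wasm-builder may differ):

fn main() -> Result<(), Box<dyn std::error::Error>> {
    wasm_opt::OptimizationOptions::new_opt_level_0()
        .add_pass(wasm_opt::Pass::MemoryPacking) // split data segments, dropping all-zero ones
        .zero_filled_memory(true)                // allow assuming memory starts zeroed
        .run("runtime.wasm", "runtime.packed.wasm")?;
    Ok(())
}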

Contributor

The above optimization fixes the AHP runtime size. I confirm that an AHP full node with a runtime that uses this local allocator + fallback to the host allocator is successfully importing blocks.

@iulianbarbu (Contributor) commented Aug 8, 2025

Another update for running an AHP full node with local allocator + host allocator fallback

  1. The node imports blocks successfully after @athei's runtime size optimization suggestion, while using the custom AHP runtime with the local allocator + host allocator fallback. At the moment I can't leave this node running for long: I warp sync it to the tip, but then leaving it running fills up my disk even with pruning. This is because of "Gap Synced blocks are kept even though they are out of pruning window #5119". I am not 100% familiar with this work, but a fix is in the works by @sistemd, starting with "store headers and justifications during warp sync #9424", and I'll continue with my testing afterwards.

  2. I think @s0me0ne-unkn0wn is focusing on removing the host allocator entirely in "[WIP] RFC-145 (former RFC-4) implementation #8866". I think, though, that it will take a bit more time until that is testable/usable (@s0me0ne-unkn0wn, please correct me if not).

  3. I don't expect problems around 1., and given that the allocs/deallocs simulation showed the host allocator is good enough for the kind of limits set in "pallet-revive: Raise contract size limit to one megabyte and raise call depth to 25 #9267" (documented here: https://hackmd.io/IY1K3WSjSM63u4jqZy-nng) for contracts deployment/execution, I think this PR doesn't necessarily need to be merged right now, and we can wait for 2.

@athei @koute @s0me0ne-unkn0wn wdyt?

@athei (Member) commented Aug 9, 2025

for contracts deployment/execution, I think this PR doesn't necessarily need to be merged right now, and we can wait for 2.

Agreed. The space wasted on memory padding is fine for the worst-case contract we tested.
