<Performance> fuzzbug: Repeated small bounded bulk-memory operations are much slower in Wasmtime than in Wasmer Cranelift #13272

@gaaraw

Description

Describe the bug

Repeated small bounded bulk-memory operations appear to be much slower in Wasmtime than in Wasmer Cranelift in a minimal microbenchmark family.

I first found this in generated differential tests for memory.copy, then reduced and checked it with a smaller reproducer plus several controls. The slowdown is not limited to one seed, and it is still present after varying copy length and src/dst relation.

test_cases.zip

The clearest primary reproducer I found is:

  • primary_reproducer_memory_copy_len32.wat

Useful supporting controls are:

  • supporting_control_memory_copy_len0.wat
  • supporting_memory_fill_same_shape.wat
  • supporting_memory_copy_len1.wat
  • supporting_memory_copy_len64_safe.wat
  • supporting_memory_copy_src_eq_dst_len32.wat
  • supporting_memory_copy_src_plus1024_len32_safe.wat

Test Case

Primary reproducer loop body:

(local.get $i)                 ;; dst = (i32)i & 0xFFE0
(i32.wrap_i64)
(i32.const 65504)              ;; 0xFFE0
(i32.and)
(local.get $i)                 ;; src = ((i32)i ^ 0x55555555) & 0xFFE0
(i32.wrap_i64)
(i32.const 1431655765)         ;; 0x55555555
(i32.xor)
(i32.const 65504)
(i32.and)
(i32.const 32)                 ;; len = 32
(memory.copy)

The reduced reproducer uses:

  • trip count: 2^28
  • one page of memory: (memory 1)
  • both src/dst addresses constrained to a small low-memory window
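To make the address window concrete, here is a small Python sketch of the per-iteration dst/src computation, mirroring the loop body above (the function and constant names are mine; the constants come from the reproducer):

```python
# Mirror of the reproducer's per-iteration address computation:
# dst = (i32)i & 0xFFE0, src = ((i32)i ^ 0x55555555) & 0xFFE0, len = 32.
PAGE = 65536          # one wasm page, matching (memory 1)
MASK = 65504          # 0xFFE0: keeps addresses 32-aligned and < 65536
XOR = 0x55555555      # 1431655765 from the loop body
LEN = 32

def addresses(i):
    i32 = i & 0xFFFFFFFF          # i32.wrap_i64
    dst = i32 & MASK
    src = (i32 ^ XOR) & MASK
    return dst, src

# Every copy stays inside the single page for any i, so the loop
# never traps and every iteration hits the same hot low-memory window.
assert all(max(addresses(i)) + LEN <= PAGE for i in range(100000))
print("in bounds")
```

This is why the reproducer can run 2^28 iterations against a single page without ever going out of bounds.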

The closest controls are:

  • same shape, but memory.copy length changed to 0
  • same shape, but memory.copy replaced with memory.fill
  • same shape, but copy lengths changed across 1/4/8/16/32/64
  • same shape, but src/dst relation changed to src == dst and src = dst + 1024

Steps to Reproduce

  1. Build the primary testcase:
wat2wasm primary_reproducer_memory_copy_len32.wat -o primary_reproducer_memory_copy_len32.wasm
  2. Warm up once:
wasmtime primary_reproducer_memory_copy_len32.wasm
  3. Measure runtime:
perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_memory_copy_len32.wasm
  4. For comparison, run the same flow on the supporting testcases listed above.
  5. If helpful, compare against Wasmer Cranelift with:

wasmer run primary_reproducer_memory_copy_len32.wasm
perf stat -r 3 -e 'task-clock' wasmer run primary_reproducer_memory_copy_len32.wasm

Expected and Actual Results

Primary memory.copy reproducer and close controls

| testcase | shape | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|---|
| control_drop | target removed | 0.09570 | 0.08054 | 0.84x |
| memory.copy len=0 | xor-shaped src/dst, bounded window | 0.97108 | 2.61960 | 2.70x |
| memory.copy len=32 | xor-shaped src/dst, bounded window | 0.76792 | 2.68820 | 3.50x |
| memory.fill len=32 | same bounded address shape | 0.64743 | 2.25270 | 3.48x |

Observed pattern:

  • the target-removed control is fast in both runtimes;
  • Wasmtime is already much slower for memory.copy len=0;
  • the slowdown remains for memory.copy len=32;
  • a related bulk-memory instruction (memory.fill) shows a similar gap.

This makes the anomaly look more like fixed per-call cost in the bulk-memory helper/runtime path than like payload-movement cost.

Length sweep for memory.copy

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| len=0 | 0.97080 | 2.73150 | 2.81x |
| len=1 | 0.97112 | 2.99300 | 3.08x |
| len=4 | 0.89589 | 2.82370 | 3.15x |
| len=8 | 0.89769 | 2.81460 | 3.14x |
| len=16 | 0.91569 | 2.77790 | 3.03x |
| len=32 | 0.76524 | 2.65780 | 3.47x |
| len=64 (safe window) | 0.76253 | 2.68210 | 3.52x |

Observed pattern:

  • from len=0 through len=64, the slowdown ratio stays broadly stable;
  • the main trigger does not seem to be the payload size itself.
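Since the ratio is roughly flat across lengths, dividing the measured times by the 2^28 trip count gives a per-iteration cost that is easier to compare. A back-of-envelope Python sketch (function name is mine; the seconds are copied from the rows above):

```python
TRIPS = 2 ** 28  # trip count of the reduced reproducer

def per_iter_ns(seconds):
    """Average cost of one loop iteration in nanoseconds."""
    return seconds / TRIPS * 1e9

# len=0 and len=32 rows from the sweep above.
wasmtime_len0 = per_iter_ns(2.73150)
wasmtime_len32 = per_iter_ns(2.65780)
wasmer_len0 = per_iter_ns(0.97080)

# Wasmtime spends ~10 ns per iteration even when zero bytes are copied,
# versus ~3.6 ns for Wasmer Cranelift, consistent with a fixed per-call
# overhead dominating rather than the copy itself.
print(round(wasmtime_len0, 1), round(wasmtime_len32, 1), round(wasmer_len0, 1))
```

The len=0 case is the telling one: whatever Wasmtime pays per iteration, it pays it before a single byte moves.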

Src/dst relation sweep for memory.copy len=32

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| src == dst | 0.75430 | 2.61270 | 3.46x |
| src = dst + 1024 (safe) | 0.72937 | 2.57010 | 3.52x |

Observed pattern:

  • the gap remains even when the copy is self-copy or a fixed-offset in-bounds copy;
  • this does not look specific to the original xor-shaped address relation;
  • this also does not look primarily driven by overlap semantics.

Family-level consistency

The original full-trip generated memory_copy_* seeds all ran slower under Wasmtime than under Wasmer Cranelift:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| memory_copy_1 | 12.1567 | 39.8686 | 3.28x |
| memory_copy_2 | 13.4606 | 36.9620 | 2.75x |
| memory_copy_3 | 19.7391 | 36.3320 | 1.84x |
| memory_copy_4 | 23.0015 | 36.1513 | 1.57x |
| memory_copy_5 | 9.9472 | 37.0502 | 3.72x |

Related memory_fill_* seeds also showed the same direction:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| memory_fill_1 | 9.8347 | 31.9666 | 3.25x |
| memory_fill_2 | 12.0900 | 32.1992 | 2.66x |
| memory_fill_3 | 12.9405 | 34.5910 | 2.67x |
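As a sanity check on the ratio columns, they can be recomputed from the raw seconds; a quick Python sketch (dictionary name is mine, values copied verbatim from the two family tables above):

```python
# (wasmer_cranelift seconds, wasmtime seconds) per seed, from the tables above.
family = {
    "memory_copy_1": (12.1567, 39.8686),
    "memory_copy_2": (13.4606, 36.9620),
    "memory_copy_3": (19.7391, 36.3320),
    "memory_copy_4": (23.0015, 36.1513),
    "memory_copy_5": (9.9472, 37.0502),
    "memory_fill_1": (9.8347, 31.9666),
    "memory_fill_2": (12.0900, 32.1992),
    "memory_fill_3": (12.9405, 34.5910),
}

ratios = {name: wt / wc for name, (wc, wt) in family.items()}

# Every seed in both families is slower under Wasmtime, never the reverse.
assert all(r > 1.5 for r in ratios.values())
print(round(min(ratios.values()), 2), round(max(ratios.values()), 2))
```

The spread (roughly 1.57x to 3.72x) varies with the seed, but the direction never flips.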

Versions and Environment

  • wasmtime: 41.0.0 (4898322 2025-12-18)
  • wasmer: 6.1.0
  • WAMR: iwasm 2.4.4
  • wasmedge: 0.16.1-18-gc457fe30
  • wabt: 1.0.39
  • llvm: 21.1.5
  • Host OS: Ubuntu 22.04.5 LTS x64
  • CPU: 12th Gen Intel® Core™ i7-12700 × 20

Extra Info

For the primary reduced testcase, I also inspected the Wasmtime-generated CLIF to confirm the hot loop has not been optimized away.

I generated CLIF with:

wasmtime compile -C cache=n --emit-clif out_dir primary_reproducer_memory_copy_len32.wasm

In the generated CLIF for the hot loop, the operation is still lowered through a per-iteration builtin call equivalent to:

call fn0(vmctx, 0, dst, 0, src, len)

The emitted builtin wasmtime_builtin_memory_copy still performs a deeper indirect runtime call:

v11 = call_indirect sig0, v10(v0, v1, v2, v3, v4, v5)

So this does not look like dead-code elimination or a broken benchmark scaffold.

The strongest trigger condition I can currently support is:

  • repeated small bounded bulk-memory operations;
  • one-page memory with a hot low-memory window;
  • slowdown present for both memory.copy and memory.fill;
  • largely independent of copy length (0..64 in this sweep) and src/dst relation.

I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern here.
