Conversation

@Carbocarde
Contributor

Description

I changed COUNT_CLASS_LOOKUP_16 to a static, building the 16-bit lookup table (~130 kB) at compile time instead of constructing it dynamically, and saw a ~12.4% improvement on libpng fuzzbench when testing locally on my M1 Mac mini.

This cleans up the API a bit since we can drop init_count_class_16(). I also got rid of the unused enumerate added here. cc @tokatoka since I don't have an x86 machine on hand to test with.
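
For context, the shape of the change is roughly this (a sketch, not the exact LibAFL code; classify8 here is a hypothetical stand-in for the 256-entry COUNT_CLASS_LOOKUP table):

/// AFL-style bucketing of one byte's hitcount. Hypothetical stand-in for the
/// real 256-entry COUNT_CLASS_LOOKUP table; the buckets follow the classic
/// AFL count classes.
const fn classify8(count: u8) -> u8 {
    match count {
        0 => 0,
        1 => 1,
        2 => 2,
        3 => 4,
        4..=7 => 8,
        8..=15 => 16,
        16..=31 => 32,
        32..=127 => 64,
        _ => 128,
    }
}

/// Build the 16-bit table in a const fn. The compiler evaluates this at
/// compile time, so the ~130 kB array lands in the binary's read-only data
/// and no init_count_class_16() call is needed at startup.
const fn build_lookup_16() -> [u16; 65536] {
    let mut table = [0u16; 65536];
    let mut i = 0;
    while i < 65536 {
        let hi = classify8((i >> 8) as u8) as u16;
        let lo = classify8((i & 0xff) as u8) as u16;
        table[i] = hi << 8 | lo;
        i += 1;
    }
    table
}

static COUNT_CLASS_LOOKUP_16: [u16; 65536] = build_lookup_16();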

Here's my (unscientific) benchmarking, 2-minute runs:

M1 Mac mini running fuzzbench with libpng.

Prior to any changes (a7316ffe)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 848, objectives: 0, executions: 11841399, exec/sec: 98.65k, stability: 581/581 (100%), edges: 1394/12660 (11%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 914, objectives: 0, executions: 12682198, exec/sec: 105.7k, stability: 581/581 (100%), edges: 1416/12660 (11%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 918, objectives: 0, executions: 12926190, exec/sec: 107.7k, stability: 581/581 (100%), edges: 1406/12660 (11%)

After patch 1 (e9c64d98)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 855, objectives: 0, executions: 13807222, exec/sec: 115.0k, stability: 581/581 (100%), edges: 1414/12660 (11%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 917, objectives: 0, executions: 14282689, exec/sec: 119.0k, stability: 581/581 (100%), edges: 1401/12660 (11%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 941, objectives: 0, executions: 14016957, exec/sec: 116.8k, stability: 581/581 (100%), edges: 1436/12660 (11%)

After patch 2 (9371e739)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 857, objectives: 0, executions: 13723546, exec/sec: 114.3k, stability: 581/581 (100%), edges: 1403/12660 (11%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 897, objectives: 0, executions: 14136761, exec/sec: 117.8k, stability: 581/581 (100%), edges: 1383/12660 (10%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 874, objectives: 0, executions: 14796040, exec/sec: 123.3k, stability: 581/581 (100%), edges: 1416/12660 (11%)

The only drawback here is the memory usage: we'll be using ~130 kB even if we never read COUNT_CLASS_LOOKUP_16. I thought about using a OnceLock, but that wouldn't work for no_std.
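
For reference, the rejected std-only alternative would have looked roughly like this (a sketch reusing build_lookup_16 from above; the names are illustrative):

use std::sync::OnceLock;

// Build the table lazily on first use. std::sync::OnceLock does not exist
// under no_std, which is why the plain static won out.
static LAZY_COUNT_CLASS_16: OnceLock<[u16; 65536]> = OnceLock::new();

fn classify16(idx: usize) -> u16 {
    LAZY_COUNT_CLASS_16.get_or_init(build_lookup_16)[idx]
}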

Checklist

  • I have run ./scripts/precommit.sh and addressed all comments

^ The main branch does not pass on cargo 1.92.0-nightly (367fd9f21 2025-10-15). Should I fix up the issues unrelated to my changeset or just rebase once they've been fixed on main?

Carbocarde and others added 3 commits October 18, 2025 22:34
This improves performance on fuzzbench with libpng on an m1 mac mini.
Before:
98.65k execs/sec
After:
115.0k execs/sec
@domenukk
Member

The only drawback here is the memory usage - we'll be using ~130kB even if we never read COUNT_CLASS_LOOKUP_16. I thought about using a OnceLock but that wouldn't work for no_std.

Should we hide hitcount behind a feature flag then? For more memory-constrained environments this may make a difference.

@Carbocarde
Contributor Author

I'll take a look at it tonight (~10 hours from now for me). There's a chance the linker strips the symbol if it's unused. Otherwise, yeah, a feature flag like hitcount or hitcount_map to gate the entire hitcount_map.rs file should resolve that; see the sketch below.
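
Something like this at the module level, perhaps (a hypothetical sketch; the feature name hitcounts is illustrative, not a final choice):

// Compile hitcount_map.rs (and its ~130 kB static) only when the feature is
// enabled; the matching Cargo.toml entry would be a plain `hitcounts = []`.
#[cfg(feature = "hitcounts")]
pub mod hitcount_map;

#[cfg(feature = "hitcounts")]
pub use hitcount_map::HitcountsMapObserver;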

@domenukk
Member

Ah right, there's a chance that at least with LTO it could get stripped, good point! Thanks :)


// 2022-07: Adding `enumerate` here increases execution speed/register allocation on x86_64.
#[expect(clippy::unused_enumerate_index)]
for (_i, item) in map16[0..cnt].iter_mut().enumerate() {
Member

IIRC (and see the comment above) this made quite a measurable difference on x86, although compilers may have advanced since then (and the static map may also make a difference). @tokatoka can you take another look, potentially?

@domenukk domenukk requested a review from tokatoka October 20, 2025 16:56
@wtdcode
Member

wtdcode commented Oct 21, 2025

I would vote for lazy_static, which offers a spin feature for no_std environments.

@Carbocarde
Contributor Author

@domenukk It's stripped from the binary. I tested with baby_fuzzer using cargo build --release:

nm -C ./target/release/baby_fuzzer | grep "COUNT_CLASS"

The symbol is missing by default.

After applying this diff:

diff --git a/fuzzers/baby/baby_fuzzer/src/main.rs b/fuzzers/baby/baby_fuzzer/src/main.rs
index 58a875e0..1cd98c1c 100644
--- a/fuzzers/baby/baby_fuzzer/src/main.rs
+++ b/fuzzers/baby/baby_fuzzer/src/main.rs
@@ -15,7 +15,7 @@ use libafl::{
     generators::RandPrintablesGenerator,
     inputs::{BytesInput, HasTargetBytes},
     mutators::{havoc_mutations::havoc_mutations, scheduled::HavocScheduledMutator},
-    observers::StdMapObserver,
+    observers::{HitcountsMapObserver, StdMapObserver},
     schedulers::QueueScheduler,
     stages::mutational::StdMutationalStage,
     state::StdState,
@@ -66,7 +66,9 @@ pub fn main() {
     // Create an observation channel using the signals map
     // TODO: This will break soon, fix me! See https://github.com/AFLplusplus/LibAFL/issues/2786
     #[allow(static_mut_refs)] // only a problem in nightly
-    let observer = unsafe { StdMapObserver::from_mut_ptr("signals", SIGNALS_PTR, SIGNALS.len()) };
+    let observer = HitcountsMapObserver::new(unsafe {
+        StdMapObserver::from_mut_ptr("signals", SIGNALS_PTR, SIGNALS.len())
+    });

     // Feedback to rate the interestingness of an input
     let mut feedback = MaxMapFeedback::new(&observer);

I see the symbols:

nm -C ./target/release/baby_fuzzer | grep "COUNT_CLASS"
0000000100226da3 s libafl::observers::map::hitcount_map::COUNT_CLASS_LOOKUP::h1c084a15080e9088
00000001002274e2 s libafl::observers::map::hitcount_map::COUNT_CLASS_LOOKUP_16::hceb7809fb1480794

@wtdcode I was curious, so I benchmarked lazy_static against static. It looks like the overhead of checking whether a lazy_static has been initialized is decently high:

     Running benches/lazy_vs_static.rs (target/release/deps/lazy_vs_static-de047986fb82b0f3)
lookup_read/lazy_static_lookup
                        time:   [18.537 µs 18.782 µs 19.097 µs]
Found 17 outliers among 100 measurements (17.00%)
  8 (8.00%) high mild
  9 (9.00%) high severe
lookup_read/static_lookup
                        time:   [8.7860 µs 8.8101 µs 8.8389 µs]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
Repro
use criterion::{Criterion, black_box, criterion_group, criterion_main};
use lazy_static::lazy_static;
use rand::{Rng, SeedableRng, rngs::StdRng};

const SAMPLE_COUNT: usize = 16 * 1024;

static COUNT_CLASS_LOOKUP: [u8; 256] = [
    0, 1, 2, 4, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
    32, 32, 32, 32, 32, 32, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
    64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
    64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
    64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
    64, 64, 64, 64, 64, 64, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
];

/// Expand the 8-bit count-class table into a 16-bit one, bucketing two
/// adjacent map bytes per entry; const so it can back a static.
const fn build_lookup_table() -> [u16; 256 * 256] {
    let mut seq = [0u16; 256 * 256];
    let mut lo_bits = 0;
    let mut hi_bits = 0;
    while lo_bits < 256 {
        while hi_bits < 256 {
            seq[hi_bits << 8 | lo_bits] =
                (COUNT_CLASS_LOOKUP[hi_bits] as u16) << 8 | COUNT_CLASS_LOOKUP[lo_bits] as u16;
            hi_bits += 1;
        }
        hi_bits = 0;
        lo_bits += 1;
    }
    seq
}

lazy_static! {
    static ref LAZY_LOOKUP_16: [u16; 256 * 256] = build_lookup_table();
}

static STATIC_LOOKUP_16: [u16; 256 * 256] = build_lookup_table();

fn lazy_lookup(index: usize) -> u16 {
    unsafe { *LAZY_LOOKUP_16.get_unchecked(index) }
}

fn static_lookup(index: usize) -> u16 {
    unsafe { *STATIC_LOOKUP_16.get_unchecked(index) }
}

fn random_indices() -> Vec<usize> {
    let mut rng = StdRng::seed_from_u64(0xCAFE_F00D);
    (0..SAMPLE_COUNT)
        .map(|_| rng.gen_range(0..(256 * 256)))
        .collect()
}

fn bench_reads(c: &mut Criterion) {
    let indices = random_indices();
    let mut group = c.benchmark_group("lookup_read");

    group.bench_function("lazy_static_lookup", |b| {
        b.iter(|| {
            for &idx in &indices {
                let value = lazy_lookup(idx);
                black_box(value);
            }
        })
    });

    group.bench_function("static_lookup", |b| {
        b.iter(|| {
            for &idx in &indices {
                let value = static_lookup(idx);
                black_box(value);
            }
        })
    });

    group.finish();
}

criterion_group!(benches, bench_reads);
criterion_main!(benches);

@wtdcode
Member

wtdcode commented Oct 21, 2025

lazy_static does initialisation on first access, and it seems that the initialisation is included in your benchmark section?

@Carbocarde
Contributor Author

cargo bench does 3s of warmup runs, so it should be initialized by the time measurements are taken.
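
To be extra safe, the repro could also force initialization outside the timed region; lazy_static exposes an initialize helper for exactly this:

// Tweak to bench_reads in the repro above: build the lazy table up front so
// the measured loop contains only the per-access "already initialized" check.
fn bench_reads(c: &mut Criterion) {
    lazy_static::initialize(&LAZY_LOOKUP_16);

    let indices = random_indices();
    let mut group = c.benchmark_group("lookup_read");
    // (both bench_function calls unchanged from the repro above)
    group.finish();
}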

From looking at the asm, lazy_static has an atomic load (ldapr) followed by a branch into the init code (not taken once initialized), so most of the overhead is likely coming from that atomic load.

In contrast, static only performs an adrp, add, ldrh sequence.

Relevant asm snippets:
lazy_static! (annotated and trimmed to only the path taken when data is initialized):

bench_lazystatic::lazy_lookup:
Lfunc_begin6:
	sub sp, sp, #48			
	stp x29, x30, [sp, #32]	# stash the frame ptr/return addr
	add x29, sp, #32
Lloh10:
	adrp x8, <bench_lazystatic::LAZY_LOOKUP_16 as core::ops::deref::Deref>::deref::__stability::LAZY@PAGE
Lloh11:
	add x8, x8, <bench_lazystatic::LAZY_LOOKUP_16 as core::ops::deref::Deref>::deref::__stability::LAZY@PAGEOFF
	str x8, [sp, #8]		# save the pointer to the actual data
	add x8, x8, #32, lsl #12
	ldapr x8, [x8]			# atomic load to check if initialized
	cbnz x8, LBB6_2			# jump to initialization path if needed
LBB6_1:
	ldr x8, [sp, #8]		# load the pointer that was stashed earlier
	ldrh w0, [x8, #2]		# load the data
	ldp x29, x30, [sp, #32]	# restore frame ptr/return addr
	add sp, sp, #48
	ret
...

static:

bench_lazystatic::static_lookup:
Lfunc_begin7:
	.cfi_startproc
Lloh18:
	adrp x8, bench_lazystatic::STATIC_LOOKUP_16@PAGE
Lloh19:
	add x8, x8, bench_lazystatic::STATIC_LOOKUP_16@PAGEOFF	# load the pointer to the data
	ldrh w0, [x8, x0, lsl #1]								# load the data
	ret

Makes sense that a hot loop of loads would show slowness when another load is added to the mix. I don't think the ldapr's ordering requirements would have much effect in this case since it's the only atomic op, but I haven't looked at ARM's memory model in a long time, so who knows ;)

Anyway, since the static is stripped from binaries that don't use it, I don't think there's any drawback to always using a simple static here.

@wtdcode
Member

wtdcode commented Oct 21, 2025

Thanks for detailed investigation and efforts.

Yeah for sure, if it is optimized away then it is totally fine =).

@domenukk
Member

Thank you!
Let's wait for @tokatoka for the final verdict (especially the x86 stuff)

@tokatoka
Member

size is not a problem. currently ./target already takes 30GBs.
it's good

@tokatoka tokatoka merged commit 7749cf3 into AFLplusplus:main Oct 21, 2025
109 checks passed
@domenukk
Member

size is not a problem. currently ./target already takes 30GBs. it's good

I was mainly talking about the loop optimization for x86; we may still want it(?)
#3453 (comment)

@Carbocarde Carbocarde deleted the static_count_class_lookup branch October 22, 2025 17:07