Conversation

@Carbocarde
Contributor

Description

I changed COUNT_CLASS_LOOKUP_16 to a static, building the 16-bit lookup table (~130 kB) at compile time instead of constructing it dynamically, and saw a ~12.4% improvement on libpng fuzzbench when testing locally on my M1 Mac mini.

This cleans up the API a bit since we can drop init_count_class_16(). I also got rid of the unused enumerate added here. cc @tokatoka since I don't have an x86 machine on hand to test with.
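
For context, the shape of the change is roughly this (a sketch, not the exact LibAFL code; classify8 here is a hypothetical stand-in for the 256-entry COUNT_CLASS_LOOKUP table):

/// AFL-style bucketing of one byte's hitcount. Hypothetical stand-in for the
/// real 256-entry COUNT_CLASS_LOOKUP table; the buckets follow the classic
/// AFL count classes.
const fn classify8(count: u8) -> u8 {
    match count {
        0 => 0,
        1 => 1,
        2 => 2,
        3 => 4,
        4..=7 => 8,
        8..=15 => 16,
        16..=31 => 32,
        32..=127 => 64,
        _ => 128,
    }
}

/// Build the 16-bit table in a const fn. The compiler evaluates this at
/// compile time, so the ~130 kB array lands in the binary's read-only data
/// and no init_count_class_16() call is needed at startup.
const fn build_lookup_16() -> [u16; 65536] {
    let mut table = [0u16; 65536];
    let mut i = 0;
    while i < 65536 {
        let hi = classify8((i >> 8) as u8) as u16;
        let lo = classify8((i & 0xff) as u8) as u16;
        table[i] = hi << 8 | lo;
        i += 1;
    }
    table
}

static COUNT_CLASS_LOOKUP_16: [u16; 65536] = build_lookup_16();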

Here's my (unscientific) benchmarking, 2-minute runs:

M1 Mac mini running fuzzbench with libpng.

Prior to any changes (a7316ffe)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 848, objectives: 0, executions: 11841399, exec/sec: 98.65k, stability: 581/581 (100%), edges: 1394/12660 (11%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 914, objectives: 0, executions: 12682198, exec/sec: 105.7k, stability: 581/581 (100%), edges: 1416/12660 (11%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 918, objectives: 0, executions: 12926190, exec/sec: 107.7k, stability: 581/581 (100%), edges: 1406/12660 (11%)

After patch 1 (e9c64d98)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 855, objectives: 0, executions: 13807222, exec/sec: 115.0k, stability: 581/581 (100%), edges: 1414/12660 (11%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 917, objectives: 0, executions: 14282689, exec/sec: 119.0k, stability: 581/581 (100%), edges: 1401/12660 (11%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 941, objectives: 0, executions: 14016957, exec/sec: 116.8k, stability: 581/581 (100%), edges: 1436/12660 (11%)

After patch 2 (9371e739)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 857, objectives: 0, executions: 13723546, exec/sec: 114.3k, stability: 581/581 (100%), edges: 1403/12660 (11%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 897, objectives: 0, executions: 14136761, exec/sec: 117.8k, stability: 581/581 (100%), edges: 1383/12660 (10%)
[Client Heartbeat #0] run time: 2m-0s, clients: 1, corpus: 874, objectives: 0, executions: 14796040, exec/sec: 123.3k, stability: 581/581 (100%), edges: 1416/12660 (11%)

The only drawback here is the memory usage: we'll be using ~130 kB even if we never read COUNT_CLASS_LOOKUP_16. I thought about using a OnceLock, but that wouldn't work for no_std.
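
For reference, the rejected std-only alternative would have looked roughly like this (a sketch reusing build_lookup_16 from above; the names are illustrative):

use std::sync::OnceLock;

// Build the table lazily on first use. std::sync::OnceLock does not exist
// under no_std, which is why the plain static won out.
static LAZY_COUNT_CLASS_16: OnceLock<[u16; 65536]> = OnceLock::new();

fn classify16(idx: usize) -> u16 {
    LAZY_COUNT_CLASS_16.get_or_init(build_lookup_16)[idx]
}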

Checklist

  • I have run ./scripts/precommit.sh and addressed all comments

^ The main branch does not pass on cargo 1.92.0-nightly (367fd9f21 2025-10-15). Should I fix up the issues unrelated to my changeset or just rebase once they've been fixed on main?

Carbocarde and others added 3 commits October 18, 2025 22:34
This improves performance on fuzzbench with libpng on an m1 mac mini.
Before:
98.65k execs/sec
After:
115.0k execs/sec
@domenukk
Member

The only drawback here is the memory usage - we'll be using ~130kB even if we never read COUNT_CLASS_LOOKUP_16. I thought about using a OnceLock but that wouldn't work for no_std.

Should we hide hitcount behind a feature flag then? For more memory-constrained environments this may make a difference.

@Carbocarde
Contributor Author

I'll take a look at it tonight (~10 hours from now for me). There's a chance the linker strips the symbol if it's unused. Otherwise, yeah, a feature flag like hitcount or hitcount_map to gate the entire hitcount_map.rs file should resolve that; see the sketch below.
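
Something like this at the module level, perhaps (a hypothetical sketch; the feature name hitcounts is illustrative, not a final choice):

// Compile hitcount_map.rs (and its ~130 kB static) only when the feature is
// enabled; the matching Cargo.toml entry would be a plain `hitcounts = []`.
#[cfg(feature = "hitcounts")]
pub mod hitcount_map;

#[cfg(feature = "hitcounts")]
pub use hitcount_map::HitcountsMapObserver;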

@domenukk
Member

Ah right, there's a chance that at least with LTO it could get stripped, good point! Thanks :)


// 2022-07: Adding `enumerate` here increases execution speed/register allocation on x86_64.
#[expect(clippy::unused_enumerate_index)]
for (_i, item) in map16[0..cnt].iter_mut().enumerate() {
Member

IIRC (and see the comment above) this made quite a measurable difference on x86, although compilers may have advanced since then (and the static map may also make a difference). @tokatoka can you take another look, potentially?

@domenukk domenukk requested a review from tokatoka October 20, 2025 16:56
@wtdcode
Member

wtdcode commented Oct 21, 2025

I would vote for lazy_static, which offers a spin feature for no_std environments.

@Carbocarde
Contributor Author

@domenukk It's stripped from the binary. I tested with baby_fuzzer using cargo build --release:

nm -C ./target/release/baby_fuzzer | grep "COUNT_CLASS"

The symbol is missing by default.

After applying this diff:

diff --git a/fuzzers/baby/baby_fuzzer/src/main.rs b/fuzzers/baby/baby_fuzzer/src/main.rs
index 58a875e0..1cd98c1c 100644
--- a/fuzzers/baby/baby_fuzzer/src/main.rs
+++ b/fuzzers/baby/baby_fuzzer/src/main.rs
@@ -15,7 +15,7 @@ use libafl::{
     generators::RandPrintablesGenerator,
     inputs::{BytesInput, HasTargetBytes},
     mutators::{havoc_mutations::havoc_mutations, scheduled::HavocScheduledMutator},
-    observers::StdMapObserver,
+    observers::{HitcountsMapObserver, StdMapObserver},
     schedulers::QueueScheduler,
     stages::mutational::StdMutationalStage,
     state::StdState,
@@ -66,7 +66,9 @@ pub fn main() {
     // Create an observation channel using the signals map
     // TODO: This will break soon, fix me! See https://github.com/AFLplusplus/LibAFL/issues/2786
     #[allow(static_mut_refs)] // only a problem in nightly
-    let observer = unsafe { StdMapObserver::from_mut_ptr("signals", SIGNALS_PTR, SIGNALS.len()) };
+    let observer = HitcountsMapObserver::new(unsafe {
+        StdMapObserver::from_mut_ptr("signals", SIGNALS_PTR, SIGNALS.len())
+    });

     // Feedback to rate the interestingness of an input
     let mut feedback = MaxMapFeedback::new(&observer);

I see the symbols:

nm -C ./target/release/baby_fuzzer | grep "COUNT_CLASS"
0000000100226da3 s libafl::observers::map::hitcount_map::COUNT_CLASS_LOOKUP::h1c084a15080e9088
00000001002274e2 s libafl::observers::map::hitcount_map::COUNT_CLASS_LOOKUP_16::hceb7809fb1480794

@wtdcode I was curious, so I benchmarked lazy_static against static. It looks like the overhead of checking whether a lazy_static has been initialized is decently high:

     Running benches/lazy_vs_static.rs (target/release/deps/lazy_vs_static-de047986fb82b0f3)
lookup_read/lazy_static_lookup
                        time:   [18.537 µs 18.782 µs 19.097 µs]
Found 17 outliers among 100 measurements (17.00%)
  8 (8.00%) high mild
  9 (9.00%) high severe
lookup_read/static_lookup
                        time:   [8.7860 µs 8.8101 µs 8.8389 µs]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
Repro
use criterion::{Criterion, black_box, criterion_group, criterion_main};
use lazy_static::lazy_static;
use rand::{Rng, SeedableRng, rngs::StdRng};

const SAMPLE_COUNT: usize = 16 * 1024;

static COUNT_CLASS_LOOKUP: [u8; 256] = [
    0, 1, 2, 4, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
    32, 32, 32, 32, 32, 32, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
    64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
    64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
    64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
    64, 64, 64, 64, 64, 64, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
    128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
];

/// Expand the 8-bit count-class table into a 16-bit one, bucketing two
/// adjacent map bytes per entry; const so it can back a static.
const fn build_lookup_table() -> [u16; 256 * 256] {
    let mut seq = [0u16; 256 * 256];
    let mut lo_bits = 0;
    let mut hi_bits = 0;
    while lo_bits < 256 {
        while hi_bits < 256 {
            seq[hi_bits << 8 | lo_bits] =
                (COUNT_CLASS_LOOKUP[hi_bits] as u16) << 8 | COUNT_CLASS_LOOKUP[lo_bits] as u16;
            hi_bits += 1;
        }
        hi_bits = 0;
        lo_bits += 1;
    }
    seq
}

lazy_static! {
    static ref LAZY_LOOKUP_16: [u16; 256 * 256] = build_lookup_table();
}

static STATIC_LOOKUP_16: [u16; 256 * 256] = build_lookup_table();

fn lazy_lookup(index: usize) -> u16 {
    unsafe { *LAZY_LOOKUP_16.get_unchecked(index) }
}

fn static_lookup(index: usize) -> u16 {
    unsafe { *STATIC_LOOKUP_16.get_unchecked(index) }
}

fn random_indices() -> Vec<usize> {
    let mut rng = StdRng::seed_from_u64(0xCAFE_F00D);
    (0..SAMPLE_COUNT)
        .map(|_| rng.gen_range(0..(256 * 256)))
        .collect()
}

fn bench_reads(c: &mut Criterion) {
    let indices = random_indices();
    let mut group = c.benchmark_group("lookup_read");

    group.bench_function("lazy_static_lookup", |b| {
        b.iter(|| {
            for &idx in &indices {
                let value = lazy_lookup(idx);
                black_box(value);
            }
        })
    });

    group.bench_function("static_lookup", |b| {
        b.iter(|| {
            for &idx in &indices {
                let value = static_lookup(idx);
                black_box(value);
            }
        })
    });

    group.finish();
}

criterion_group!(benches, bench_reads);
criterion_main!(benches);

@wtdcode
Member

wtdcode commented Oct 21, 2025

lazy_static does initialisation on first access, and it seems that the initialisation is included in your benchmark section?

@Carbocarde
Contributor Author

cargo bench does 3s of warmup runs, so it should be initialized by the time measurements are taken.
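
To be extra safe, the repro could also force initialization outside the timed region; lazy_static exposes an initialize helper for exactly this:

// Tweak to bench_reads in the repro above: build the lazy table up front so
// the measured loop contains only the per-access "already initialized" check.
fn bench_reads(c: &mut Criterion) {
    lazy_static::initialize(&LAZY_LOOKUP_16);

    let indices = random_indices();
    let mut group = c.benchmark_group("lookup_read");
    // (both bench_function calls unchanged from the repro above)
    group.finish();
}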

From looking at the asm, lazy_static has an atomic load (ldapr) followed by a branch into the init code (not taken once initialized), so most of the overhead is likely coming from that atomic load.

In contrast, static only performs an adrp, add, ldrh sequence.

Relevant asm snippets:
lazy_static! (annotated and trimmed to only the path taken when data is initialized):

bench_lazystatic::lazy_lookup:
Lfunc_begin6:
	sub sp, sp, #48			
	stp x29, x30, [sp, #32]	# stash the frame ptr/return addr
	add x29, sp, #32
Lloh10:
	adrp x8, <bench_lazystatic::LAZY_LOOKUP_16 as core::ops::deref::Deref>::deref::__stability::LAZY@PAGE
Lloh11:
	add x8, x8, <bench_lazystatic::LAZY_LOOKUP_16 as core::ops::deref::Deref>::deref::__stability::LAZY@PAGEOFF
	str x8, [sp, #8]		# save the pointer to the actual data
	add x8, x8, #32, lsl #12
	ldapr x8, [x8]			# atomic load to check if initialized
	cbnz x8, LBB6_2			# jump to initialization path if needed
LBB6_1:
	ldr x8, [sp, #8]		# load the pointer that was stashed earlier
	ldrh w0, [x8, #2]		# load the data
	ldp x29, x30, [sp, #32]	# restore frame ptr/return addr
	add sp, sp, #48
	ret
...

static:

bench_lazystatic::static_lookup:
Lfunc_begin7:
	.cfi_startproc
Lloh18:
	adrp x8, bench_lazystatic::STATIC_LOOKUP_16@PAGE
Lloh19:
	add x8, x8, bench_lazystatic::STATIC_LOOKUP_16@PAGEOFF	# load the pointer to the data
	ldrh w0, [x8, x0, lsl #1]								# load the data
	ret

Makes sense that a hot loop of loads would show slowness when another load is added to the mix. I don't think the ldapr's ordering requirements would have much effect in this case since it's the only atomic op, but I haven't looked at ARM's memory model in a long time, so who knows ;)

Anyway, since the static is stripped from binaries that don't use it, I don't think there's any drawback to always using a simple static here.

@wtdcode
Member

wtdcode commented Oct 21, 2025

Thanks for detailed investigation and efforts.

Yeah for sure, if it is optimized away then it is totally fine =).

@domenukk
Member

Thank you!
Let's wait for @tokatoka for the final verdict (especially the x86 stuff)

@tokatoka
Member

size is not a problem. currently ./target already takes 30GBs.
it's good

@tokatoka tokatoka merged commit 7749cf3 into AFLplusplus:main Oct 21, 2025
109 checks passed
@domenukk
Member

size is not a problem. currently ./target already takes 30GBs. it's good

I was mainly talking about the loop optimization for x86; we may still want it(?)
#3453 (comment)

@Carbocarde Carbocarde deleted the static_count_class_lookup branch October 22, 2025 17:07