- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 409
Use static for count class lookup tables #3453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use static for count class lookup tables #3453
Conversation
This improves performance on fuzzbench with libpng on an m1 mac mini. Before: 98.65k execs/sec After: 115.0k execs/sec
| 
 Should we hide hitcount behind a feature flag then? For more memory constraint environments this may make a difference | 
| I'll take a look at it tonight (~10 hours from now for me). There's a chance the linker strips the symbol if it's unused. Otherwise yeah a feature flag like  | 
| Ah right there's a chance at least with LTO it could get stripped, good point! Thanks :) | 
|  | ||
| // 2022-07: Adding `enumerate` here increases execution speed/register allocation on x86_64. | ||
| #[expect(clippy::unused_enumerate_index)] | ||
| for (_i, item) in map16[0..cnt].iter_mut().enumerate() { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC (and see comment above) this made quite a measurable difference on x86, although the compilers may have advanced since (and static map may also make a difference). @tokatoka can you take another look, potentially?
| I would vote for  | 
| @domenukk It's stripped from the binary. I tested with baby_fuzzer using  The symbol is missing by default After applying this diff: I see the symbols: @wtdcode I was curious and I benchmarked  Repro | 
| Lazy static does initialisation on first access and it seems that it is included in your benchmark section? | 
| cargo bench does 3s of warmup runs, so it should be initialized by the time measurements are taken. From looking at the asm,  In contrast, static only performs a  Relevant asm snippets: 
 Makes sense that a hot loop of loads would show slowness when another load is added to the mix. I don't think the  Anyways since the static is stripped from binaries that don't use it I don't think there's any drawback to always using a simple  | 
| Thanks for detailed investigation and efforts. Yeah for sure, if it is optimized away then it is totally fine then =). | 
| Thank you! | 
| size is not a problem. currently ./target already takes 30GBs. | 
| 
 I was mainly talking about the loop optimization for x86, we may still want it(?) | 
Description
I changed
COUNT_CLASS_LOOKUP_16to usestaticinstead of dynamically constructing the 16bit lookup table (~130 kB) and saw a ~12.4% improvement on libpng fuzzbench when testing locally on my M1 mac mini.This cleans up the API a bit since we can drop
init_count_class_16(). I also got rid of the unused enumerate added here. cc @tokatoka since I don't have an x86 machine on hand to test with.Here's my (unscientific) benchmarking, 2 minute runs:
The only drawback here is the memory usage - we'll be using ~130kB even if we never read
COUNT_CLASS_LOOKUP_16. I thought about using a OnceLock but that wouldn't work forno_std.Checklist
./scripts/precommit.shand addressed all comments^ The main branch does not pass on
cargo 1.92.0-nightly (367fd9f21 2025-10-15). Should I fix up the issues unrelated to my changeset or just rebase once they've been fixed on main?