PostingsDecodingUtil: interchange loops to enable better memory access and SIMD vectorisation #15110
Conversation
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
Makes sense to me.
@dweiss, could you kindly merge as well? Thank you!
Should we do the same in MemorySegmentPostingDecodingUtil?
Thanks @jpountz, for taking a look at this PR. Yes, MemorySegmentPostingDecodingUtil has the same strided-write pattern. I'll open a follow-up PR that flips the loops so the innermost store becomes the contiguous one.
If this is to be backported then the changes entry should go into 10.x.
Got it @dweiss. Moved the changes entry from the 11.0 section to the 10.3 section.
I ran the benchmarks on main and on this PR, plus a second comparison of main vs. this PR. [Benchmark output was attached in the original comment and is not reproduced here.]
Yeah... I'm no longer so convinced we should accept micro-benchmarking results without looking at overall performance. When you run the code in a different context it seems to compile differently. Weird.
Force-pushed "PostingsDecodingUtil: interchange loops to enable better memory access and SIMD vectorisation" from fe12ff6 to 90835c5.
In this PR, we targeted optimising the scalar path by rearranging loops so that memory accesses are sequential rather than strided, which gave a measurable performance improvement (see comment; the JMH benchmark only tests the scalar path). I applied the same optimisation to the vectorised path, but it didn't yield the expected improvement: decodeVector performance actually regressed compared to the baseline (as also verified by Adrian). I hadn't run microbenchmarks for it prior to this. I suspect this is because, in the mainline vectorised version, we load from the memory segment only for the lanes being processed; to achieve the same contiguous write pattern as the scalar version, we would need to load the entire vector from memory at once, and this may be causing the regression. I have reproduced this behaviour consistently. Please check decodeVector and decodeAndPrefixSumVector below.
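For context, here is a minimal sketch of the two loop shapes, written against the Panama Vector API (jdk.incubator.vector). This is not the actual Lucene code: the real MemorySegmentPostingDecodingUtil loads via IntVector.fromMemorySegment, while this sketch loads from plain arrays to stay self-contained, and the names splitBaseline/splitInterchanged are made up for illustration.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch only. Requires --add-modules jdk.incubator.vector.
// For simplicity, assumes count is a multiple of SPECIES.length().
class SplitShapes {
  static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

  // Baseline shape: each block of source ints is loaded once and reused for
  // every shift j; the price is that stores into b are strided by `count`.
  static void splitBaseline(int count, int[] c, int[] b, int bShift, int dec, int bMask, int maxIter) {
    for (int i = 0; i < count; i += SPECIES.length()) {
      IntVector v = IntVector.fromArray(SPECIES, c, i); // one load per block
      for (int j = 0; j <= maxIter; ++j) {
        v.lanewise(VectorOperators.LSHR, bShift - j * dec)
            .and(bMask)
            .intoArray(b, count * j + i); // strided store
      }
    }
  }

  // Interchanged shape: stores into b become contiguous, but every j pass must
  // re-load the whole source range, roughly (maxIter + 1) times more loads,
  // which is the suspected cause of the decodeVector regression described above.
  static void splitInterchanged(int count, int[] c, int[] b, int bShift, int dec, int bMask, int maxIter) {
    for (int j = 0; j <= maxIter; ++j) {
      int shift = bShift - j * dec;
      for (int i = 0; i < count; i += SPECIES.length()) {
        IntVector v = IntVector.fromArray(SPECIES, c, i); // re-loaded on every pass
        v.lanewise(VectorOperators.LSHR, shift).and(bMask).intoArray(b, count * j + i); // contiguous store
      }
    }
  }
}
```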
Summary of the perf changes:
I also ran the benchmarks across all DPVs with this PR (only PostingDecodingUtil changes; MemorySegmentDecodingUtil reverted). As expected, the scalar version shows consistent improvements, while the vectorised version remains mostly unchanged, since we didn't change its flow.
Summarising the perf results:
TL;DR (key takeaway):
PostingsDecodingUtil: interchange loops to enable better memory access and SIMD vectorisation.
Description
Rewrite splitInts() so the innermost loop walks the array contiguously instead of in a strided manner, as sketched below. This gives HotSpot C2 a chance to vectorise the shift-and-mask operations and also improves cache locality.
No functional change: the temporary buffer b is only accessed inside this method, and its layout order is irrelevant to callers.
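For illustration, a minimal sketch of the interchanged scalar loop, assuming the splitInts signature from PostingDecodingUtil on current main (a sketch of the idea, not a verbatim copy of the patch):

```java
// Sketch only: `in` is the wrapped IndexInput field of PostingDecodingUtil.
public void splitInts(
    int count, int[] b, int bShift, int dec, int bMask, int[] c, int cIndex, int cMask)
    throws IOException {
  in.readInts(c, cIndex, count);
  int maxIter = (bShift - 1) / dec;
  // The baseline iterates i outermost, so the stores b[count * j + i] are
  // strided by `count`. Hoisting j outermost makes the inner loop write b
  // contiguously, a shape C2 can auto-vectorise.
  for (int j = 0; j <= maxIter; ++j) {
    int shift = bShift - j * dec;
    for (int i = 0; i < count; ++i) {
      b[count * j + i] = (c[cIndex + i] >>> shift) & bMask; // contiguous stores
    }
  }
  for (int i = 0; i < count; ++i) {
    c[cIndex + i] &= cMask; // final mask pass, also contiguous
  }
}
```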
Performance comparison on an Intel i5-13600K
Full JMH results are in the benchmark repo.
Speed-ups:
Summarising: