RamakrishnaChilaka (Contributor)
PostingsDecodingUtil: interchange loops to enable better memory access and SIMD vectorisation.

Description

Rewrite splitInts() so the innermost loop walks the array contiguously instead of in a strided manner. This gives HotSpot C2 a chance to vectorise the shift-and-mask operations and also improves cache locality.

No functional change: the temporary buffer b is only accessed inside this method, and its layout order is irrelevant to callers.
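A minimal sketch of the interchange described above (illustrative assumptions: the real splitInts() in PostingDecodingUtil reads from an IndexInput and takes additional parameters, so the names, signature, and shift arithmetic here are simplified):

```java
import java.util.Arrays;

// Simplified sketch of the loop interchange; NOT the real Lucene splitInts().
public class SplitIntsSketch {

  // Original order: for each source int i, extract every field j.
  // Consecutive writes to b are `count` elements apart (strided), which hurts
  // cache locality and defeats C2 auto-vectorisation.
  static void splitStrided(int count, int[] c, int bShift, int dec, int bMask, int[] b) {
    int maxIter = (bShift - 1) / dec;
    for (int i = 0; i < count; ++i) {
      for (int j = 0; j <= maxIter; ++j) {
        b[count * j + i] = (c[i] >>> (bShift - j * dec)) & bMask;
      }
    }
  }

  // Interchanged order: for each field j, walk b contiguously. The inner loop
  // becomes a straight shift-and-mask over consecutive elements, a pattern C2
  // can turn into SIMD code.
  static void splitContiguous(int count, int[] c, int bShift, int dec, int bMask, int[] b) {
    int maxIter = (bShift - 1) / dec;
    for (int j = 0; j <= maxIter; ++j) {
      int shift = bShift - j * dec;
      int base = count * j;
      for (int i = 0; i < count; ++i) {
        b[base + i] = (c[i] >>> shift) & bMask;
      }
    }
  }

  public static void main(String[] args) {
    int[] c = {0x11223344, 0x55667788, 0x0899AABB, 0x0CCDDEEF};
    int[] b1 = new int[12];
    int[] b2 = new int[12];
    splitStrided(4, c, 24, 8, 0xFF, b1);
    splitContiguous(4, c, 24, 8, 0xFF, b2);
    System.out.println(Arrays.equals(b1, b2)); // both orders fill the same buffer
  }
}
```

Both variants write identical contents to b; only the order of stores differs, which is why the change is invisible to callers.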

Performance comparison on an Intel i5-13600K

Full JMH results are in the benchmark repo.

Speedups:

Benchmark                                   (count)  (dec)   Mode  Cnt      Score     Error   Units
BitUnpackerBenchmark.baselineScalar              32      1  thrpt   25   1667.831 ±   5.508  ops/ms
BitUnpackerBenchmark.baselineScalar              32      2  thrpt   25   3007.441 ±  54.965  ops/ms
BitUnpackerBenchmark.baselineScalar              32      4  thrpt   25   5837.677 ±  50.676  ops/ms
BitUnpackerBenchmark.baselineScalar              32      8  thrpt   25  10565.109 ± 133.238  ops/ms
BitUnpackerBenchmark.baselineScalar            1024      1  thrpt   25     13.448 ±   0.036  ops/ms
BitUnpackerBenchmark.baselineScalar            1024      2  thrpt   25     25.123 ±   0.024  ops/ms
BitUnpackerBenchmark.baselineScalar            1024      4  thrpt   25    210.144 ±   2.174  ops/ms
BitUnpackerBenchmark.baselineScalar            1024      8  thrpt   25    379.248 ±   9.067  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath       32      1  thrpt   25   3265.182 ±  10.690  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath       32      2  thrpt   25   6507.689 ±  19.246  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath       32      4  thrpt   25  12516.050 ±  72.828  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath       32      8  thrpt   25  25706.735 ± 105.216  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath     1024      1  thrpt   25    132.753 ±   0.223  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath     1024      2  thrpt   25    293.396 ±   0.527  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath     1024      4  thrpt   25    597.617 ±  32.484  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath     1024      8  thrpt   25   1202.047 ±  38.059  ops/ms

Summarising (ops/ms, rounded):

count  dec   baseline  optimized  factor
   32    1      1 668      3 265   1.96×
   32    2      3 007      6 508   2.16×
   32    4      5 838     12 516   2.14×
   32    8     10 565     25 707   2.43×
 1024    1       13.4      132.8    9.9×
 1024    2       25.1      293.4   11.7×
 1024    4        210      597.6   2.85×
 1024    8        379      1 202   3.17×

github-actions (bot)

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

github-actions bot added this to the 11.0.0 milestone, Aug 22, 2025
@dweiss (Contributor) left a comment:

Makes sense to me.

@RamakrishnaChilaka (Author)

@dweiss, could you please merge this as well? Thank you!

@jpountz (Contributor) commented Aug 22, 2025

Should we do the same in MemorySegmentPostingDecodingUtil?

@RamakrishnaChilaka (Author)

Thanks @jpountz, for taking a look at this PR.

Yes, MemorySegmentPostingDecodingUtil has the same strided-write pattern. I’ll open a follow-up PR that flips the loops so the innermost store becomes the contiguous b[j * count + i], matching the optimisation.
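To make the store-order difference concrete, here is a small illustrative sketch (not code from either class; `count` and the number of extracted fields are arbitrary demo values) that records the sequence of indices written into b under each loop nesting:

```java
import java.util.ArrayList;
import java.util.List;

// Records the order of indices written into b under each loop nesting.
// Purely an access-pattern illustration, not real decoding code.
public class WritePatternDemo {

  // Original nesting: consecutive writes jump by `count` (strided).
  static List<Integer> stridedWrites(int count, int fields) {
    List<Integer> idx = new ArrayList<>();
    for (int i = 0; i < count; ++i) {
      for (int j = 0; j < fields; ++j) {
        idx.add(j * count + i);
      }
    }
    return idx;
  }

  // Flipped nesting: consecutive writes land on adjacent elements.
  static List<Integer> contiguousWrites(int count, int fields) {
    List<Integer> idx = new ArrayList<>();
    for (int j = 0; j < fields; ++j) {
      for (int i = 0; i < count; ++i) {
        idx.add(j * count + i);
      }
    }
    return idx;
  }

  public static void main(String[] args) {
    System.out.println(stridedWrites(4, 3));    // [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
    System.out.println(contiguousWrites(4, 3)); // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
  }
}
```

The same set of slots gets written either way; only the contiguous order lets the hardware (and the auto-vectoriser) treat the inner loop as a sequential sweep.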

@RamakrishnaChilaka (Author) commented Aug 22, 2025

@jpountz / @rmuir / @dweiss

Update: instead of a separate follow-up PR, I’ve added the same loop-reordering for MemorySegmentPostingDecodingUtil as the last commit in this PR.
Please review when convenient; I can still extract it into its own PR if you prefer.

@dweiss (Contributor) commented Aug 22, 2025

If this is to be backported, then the CHANGES entry should go into the 10.x section.

github-actions bot modified the milestones: 11.0.0 → 10.3.0, Aug 22, 2025
@RamakrishnaChilaka (Author)

Got it @dweiss. Moved the CHANGES entry from the 11.0 section to 10.3.


@jpountz (Contributor) commented Aug 25, 2025

I ran PostingIndexInputBenchmark with the vector module enabled to check performance, but it seems to report a slowdown.

main:

Benchmark                                      (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                  7  thrpt   15  37.939 ± 0.647  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum      7  thrpt   15  18.602 ± 0.413  ops/us

this PR:

Benchmark                                      (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                  7  thrpt   15  22.105 ± 1.001  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum      7  thrpt   15  12.791 ± 0.280  ops/us

I only ran with bpv=7, which is a "hard" number of bits per value; I didn't check other numbers as I didn't have much time. I do see a speedup without the vector module enabled though:

main:
(results are very noisy on main and without the vector module enabled, not sure why)

Benchmark                                      (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                  7  thrpt   15  15.439 ± 8.188  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum      7  thrpt   15  11.709 ± 0.897  ops/us

this PR:

Benchmark                                      (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                  7  thrpt   15  30.618 ± 0.519  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum      7  thrpt   15  13.216 ± 0.232  ops/us

@dweiss (Contributor) commented Aug 26, 2025

Yeah... I'm no longer so convinced we should accept micro-benchmarking results without looking at overall performance. When you run the code in a different context it seems to compile differently. Weird.

@RamakrishnaChilaka force-pushed the postings_decoding_util_improvements branch from fe12ff6 to 90835c5 on August 26, 2025 12:12
@RamakrishnaChilaka (Author) commented Aug 26, 2025

In this PR, we targeted optimising the scalar path by rearranging loops so that memory accesses are sequential rather than strided, which gave a measurable performance improvement (see the description above; that JMH benchmark exercises only the scalar path).

I applied the same optimisation to the vectorised path, but it didn't yield the expected improvement: decodeVector performance actually regressed compared to baseline (as @jpountz also verified). I hadn't run microbenchmarks for it prior to this.

I suspect this is because the mainline vectorised version loads from the memory segment only the data needed for the current lane. To achieve the same contiguous write pattern as the scalar version, we would need to load the entire vector from memory at once, and this may be causing the regression. I have reproduced this behaviour consistently.

Please check decodeVector and decodeAndPrefixSumVector below.

contender-1

Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.397 ± 1.220  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.414 ± 0.049  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  21.056 ± 0.179  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  37.523 ± 0.666  ops/us

contender-2
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.140 ± 0.370  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.415 ± 0.042  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  21.322 ± 0.048  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  37.132 ± 1.084  ops/us


baseline

Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  39.971 ± 0.040  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  20.836 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  24.911 ± 3.529  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  62.210 ± 1.306  ops/us

contender-3

Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.313 ± 0.614  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.406 ± 0.037  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  17.476 ± 0.302  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  31.087 ± 1.201  ops/us

Summary of the perf changes (ops/us):

Benchmark                   Baseline        Contender-1        Contender-2        Contender-3
decode                        39.971    46.397 (+16.1%)    46.140 (+15.4%)    46.313 (+15.9%)
decodeAndPrefixSum            20.836     22.414 (+7.6%)     22.415 (+7.6%)     22.406 (+7.5%)
decodeAndPrefixSumVector  24.911 ± 3.529  21.056 (−15.5%)  21.322 (−14.4%)    17.476 (−29.9%)
decodeVector                  62.210    37.523 (−39.7%)    37.132 (−40.3%)    31.087 (−50.0%)

I also ran the benchmarks across all bpv values with this PR (only the PostingDecodingUtil changes; the MemorySegmentPostingDecodingUtil change reverted). As expected, the scalar version shows consistent improvements, while the vectorised version remains mostly unchanged (expected, as we didn't change its flow).

contender
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        2  thrpt   15  56.685 ± 0.596  ops/us
PostingIndexInputBenchmark.decode                        3  thrpt   15  50.723 ± 0.091  ops/us
PostingIndexInputBenchmark.decode                        4  thrpt   15  53.631 ± 2.356  ops/us
PostingIndexInputBenchmark.decode                        5  thrpt   15  45.242 ± 0.193  ops/us
PostingIndexInputBenchmark.decode                        6  thrpt   15  47.290 ± 0.061  ops/us
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.388 ± 0.637  ops/us
PostingIndexInputBenchmark.decode                        8  thrpt   15  85.133 ± 0.560  ops/us
PostingIndexInputBenchmark.decode                        9  thrpt   15  32.661 ± 0.499  ops/us
PostingIndexInputBenchmark.decode                       10  thrpt   15  37.306 ± 0.350  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            2  thrpt   15  32.170 ± 0.085  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            3  thrpt   15  30.107 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            4  thrpt   15  25.955 ± 0.058  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            5  thrpt   15  22.113 ± 0.014  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            6  thrpt   15  21.684 ± 0.006  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.410 ± 0.074  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            8  thrpt   15  29.921 ± 0.047  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            9  thrpt   15  21.974 ± 0.034  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum           10  thrpt   15  22.544 ± 0.063  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      2  thrpt   15  35.892 ± 0.182  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      3  thrpt   15  34.986 ± 0.097  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      4  thrpt   15  33.011 ± 0.312  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      5  thrpt   15  32.070 ± 0.044  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      6  thrpt   15  30.406 ± 0.118  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  29.563 ± 0.322  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      8  thrpt   15  33.238 ± 0.159  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      9  thrpt   15  22.095 ± 0.058  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector     10  thrpt   15  26.039 ± 0.085  ops/us
PostingIndexInputBenchmark.decodeVector                  2  thrpt   15  85.634 ± 5.048  ops/us
PostingIndexInputBenchmark.decodeVector                  3  thrpt   15  74.220 ± 2.586  ops/us
PostingIndexInputBenchmark.decodeVector                  4  thrpt   15  91.992 ± 0.821  ops/us
PostingIndexInputBenchmark.decodeVector                  5  thrpt   15  62.661 ± 0.321  ops/us
PostingIndexInputBenchmark.decodeVector                  6  thrpt   15  69.889 ± 1.028  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  61.817 ± 0.806  ops/us
PostingIndexInputBenchmark.decodeVector                  8  thrpt   15  85.631 ± 0.214  ops/us
PostingIndexInputBenchmark.decodeVector                  9  thrpt   15  34.308 ± 0.251  ops/us
PostingIndexInputBenchmark.decodeVector                 10  thrpt   15  49.525 ± 0.180  ops/us

baseline
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        2  thrpt   15  56.139 ± 0.686  ops/us
PostingIndexInputBenchmark.decode                        3  thrpt   15  47.809 ± 0.121  ops/us
PostingIndexInputBenchmark.decode                        4  thrpt   15  56.922 ± 0.150  ops/us
PostingIndexInputBenchmark.decode                        5  thrpt   15  44.117 ± 0.125  ops/us
PostingIndexInputBenchmark.decode                        6  thrpt   15  41.455 ± 0.139  ops/us
PostingIndexInputBenchmark.decode                        7  thrpt   15  39.891 ± 0.182  ops/us
PostingIndexInputBenchmark.decode                        8  thrpt   15  85.428 ± 0.268  ops/us
PostingIndexInputBenchmark.decode                        9  thrpt   15  29.763 ± 0.259  ops/us
PostingIndexInputBenchmark.decode                       10  thrpt   15  30.598 ± 0.064  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            2  thrpt   15  31.135 ± 0.313  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            3  thrpt   15  27.499 ± 0.056  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            4  thrpt   15  25.458 ± 0.090  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            5  thrpt   15  23.068 ± 0.054  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            6  thrpt   15  22.404 ± 0.023  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  20.858 ± 0.008  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            8  thrpt   15  24.736 ± 0.091  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            9  thrpt   15  19.946 ± 0.073  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum           10  thrpt   15  20.841 ± 0.011  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      2  thrpt   15  33.515 ± 3.208  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      3  thrpt   15  33.492 ± 2.343  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      4  thrpt   15  33.055 ± 0.338  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      5  thrpt   15  31.688 ± 0.247  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      6  thrpt   15  28.542 ± 3.090  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  29.425 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      8  thrpt   15  33.136 ± 0.037  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      9  thrpt   15  22.099 ± 0.026  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector     10  thrpt   15  25.956 ± 0.064  ops/us
PostingIndexInputBenchmark.decodeVector                  2  thrpt   15  85.568 ± 4.920  ops/us
PostingIndexInputBenchmark.decodeVector                  3  thrpt   15  74.447 ± 2.404  ops/us
PostingIndexInputBenchmark.decodeVector                  4  thrpt   15  87.720 ± 5.573  ops/us
PostingIndexInputBenchmark.decodeVector                  5  thrpt   15  62.400 ± 0.683  ops/us
PostingIndexInputBenchmark.decodeVector                  6  thrpt   15  67.609 ± 0.525  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  62.465 ± 0.409  ops/us
PostingIndexInputBenchmark.decodeVector                  8  thrpt   15  85.587 ± 0.141  ops/us
PostingIndexInputBenchmark.decodeVector                  9  thrpt   15  33.910 ± 0.332  ops/us
PostingIndexInputBenchmark.decodeVector                 10  thrpt   15  49.282 ± 0.400  ops/us

Summarising the perf results (Baseline → Contender, ops/us):

bpv   decode                      decodeAndPrefixSum          decodeAndPrefixSumVector     decodeVector
  2   56.139 → 56.685 (+1.0%)     31.135 → 32.170 (+3.3%)     33.515 → 35.892 (+7.1%)     85.568 → 85.634 (+0.08%)
  3   47.809 → 50.723 (+6.1%)     27.499 → 30.107 (+9.5%)     33.492 → 34.986 (+4.4%)     74.447 → 74.220 (−0.3%)
  4   56.922 → 53.631 (−5.8%)     25.458 → 25.955 (+2.0%)     33.055 → 33.011 (−0.1%)     87.720 → 91.992 (+4.9%)
  5   44.117 → 45.242 (+2.6%)     23.068 → 22.113 (−4.1%)     31.688 → 32.070 (+1.2%)     62.400 → 62.661 (+0.4%)
  6   41.455 → 47.290 (+14.1%)    22.404 → 21.684 (−3.2%)     28.542 → 30.406 (+6.5%)     67.609 → 69.889 (+3.4%)
  7   39.891 → 46.388 (+16.3%)    20.858 → 22.410 (+7.4%)     29.425 → 29.563 (+0.5%)     62.465 → 61.817 (−1.0%)
  8   85.428 → 85.133 (−0.3%)     24.736 → 29.921 (+21.0%)    33.136 → 33.238 (+0.3%)     85.587 → 85.631 (+0.05%)
  9   29.763 → 32.661 (+9.7%)     19.946 → 21.974 (+10.2%)    22.099 → 22.095 (−0.02%)    33.910 → 34.308 (+1.2%)
 10   30.598 → 37.306 (+21.9%)    20.841 → 22.544 (+8.2%)     25.956 → 26.039 (+0.3%)     49.282 → 49.525 (+0.5%)

TL;DR:

  • Scalar path: loop interchange → measurable speedups.
  • Vectorised path with its flow unchanged → no change, because its memory access pattern was not touched.
  • The regression in decodeVector and decodeAndPrefixSumVector when the same interchange is applied is likely due to lane-wise memory loads vs contiguous vector loads.
