RamakrishnaChilaka (Contributor)
PostingsDecodingUtil: interchange loops to enable better memory access and SIMD vectorisation.

Description

Rewrite splitInts() so the innermost loop walks the array contiguously instead of in a strided manner. This gives HotSpot C2 a chance to vectorise the shift-and-mask operations and also improves cache locality.

No functional change: the temporary buffer b is only accessed inside this method, and its layout order is irrelevant to callers.
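A minimal sketch of the interchange described above (illustrative assumptions: the real splitInts() in PostingDecodingUtil reads from an IndexInput and takes additional parameters, so the names, signature, and shift arithmetic here are simplified):

```java
import java.util.Arrays;

// Simplified sketch of the loop interchange; NOT the real Lucene splitInts().
public class SplitIntsSketch {

  // Original order: for each source int i, extract every field j.
  // Consecutive writes to b are `count` elements apart (strided), which hurts
  // cache locality and defeats C2 auto-vectorisation.
  static void splitStrided(int count, int[] c, int bShift, int dec, int bMask, int[] b) {
    int maxIter = (bShift - 1) / dec;
    for (int i = 0; i < count; ++i) {
      for (int j = 0; j <= maxIter; ++j) {
        b[count * j + i] = (c[i] >>> (bShift - j * dec)) & bMask;
      }
    }
  }

  // Interchanged order: for each field j, walk b contiguously. The inner loop
  // becomes a straight shift-and-mask over consecutive elements, a pattern C2
  // can turn into SIMD code.
  static void splitContiguous(int count, int[] c, int bShift, int dec, int bMask, int[] b) {
    int maxIter = (bShift - 1) / dec;
    for (int j = 0; j <= maxIter; ++j) {
      int shift = bShift - j * dec;
      int base = count * j;
      for (int i = 0; i < count; ++i) {
        b[base + i] = (c[i] >>> shift) & bMask;
      }
    }
  }

  public static void main(String[] args) {
    int[] c = {0x11223344, 0x55667788, 0x0899AABB, 0x0CCDDEEF};
    int[] b1 = new int[12];
    int[] b2 = new int[12];
    splitStrided(4, c, 24, 8, 0xFF, b1);
    splitContiguous(4, c, 24, 8, 0xFF, b2);
    System.out.println(Arrays.equals(b1, b2)); // both orders fill the same buffer
  }
}
```

Both variants write identical contents to b; only the order of stores differs, which is why the change is invisible to callers.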

Performance comparison on an Intel i5-13600K

Full JMH results are in the benchmark repo.

Speedups:

Benchmark                                   (count)  (dec)   Mode  Cnt      Score     Error   Units
BitUnpackerBenchmark.baselineScalar              32      1  thrpt   25   1667.831 ±   5.508  ops/ms
BitUnpackerBenchmark.baselineScalar              32      2  thrpt   25   3007.441 ±  54.965  ops/ms
BitUnpackerBenchmark.baselineScalar              32      4  thrpt   25   5837.677 ±  50.676  ops/ms
BitUnpackerBenchmark.baselineScalar              32      8  thrpt   25  10565.109 ± 133.238  ops/ms
BitUnpackerBenchmark.baselineScalar            1024      1  thrpt   25     13.448 ±   0.036  ops/ms
BitUnpackerBenchmark.baselineScalar            1024      2  thrpt   25     25.123 ±   0.024  ops/ms
BitUnpackerBenchmark.baselineScalar            1024      4  thrpt   25    210.144 ±   2.174  ops/ms
BitUnpackerBenchmark.baselineScalar            1024      8  thrpt   25    379.248 ±   9.067  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath       32      1  thrpt   25   3265.182 ±  10.690  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath       32      2  thrpt   25   6507.689 ±  19.246  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath       32      4  thrpt   25  12516.050 ±  72.828  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath       32      8  thrpt   25  25706.735 ± 105.216  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath     1024      1  thrpt   25    132.753 ±   0.223  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath     1024      2  thrpt   25    293.396 ±   0.527  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath     1024      4  thrpt   25    597.617 ±  32.484  ops/ms
BitUnpackerBenchmark.optimizedWithFastPath     1024      8  thrpt   25   1202.047 ±  38.059  ops/ms

Summarising (ops/ms, rounded):

count  dec   baseline  optimized  factor
   32    1      1 668      3 265   1.96×
   32    2      3 007      6 508   2.16×
   32    4      5 838     12 516   2.14×
   32    8     10 565     25 707   2.43×
 1024    1       13.4      132.8    9.9×
 1024    2       25.1      293.4   11.7×
 1024    4        210      597.6   2.85×
 1024    8        379      1 202   3.17×

github-actions (bot)

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

github-actions bot added this to the 11.0.0 milestone, Aug 22, 2025
@dweiss (Contributor) left a comment:

Makes sense to me.

@RamakrishnaChilaka (Author)

@dweiss, could you please merge this as well? Thank you!

@jpountz (Contributor) commented Aug 22, 2025

Should we do the same in MemorySegmentPostingDecodingUtil?

@RamakrishnaChilaka (Author)

Thanks @jpountz, for taking a look at this PR.

Yes, MemorySegmentPostingDecodingUtil has the same strided-write pattern. I’ll open a follow-up PR that flips the loops so the innermost store becomes the contiguous b[j * count + i], matching the optimisation.
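To make the store-order difference concrete, here is a small illustrative sketch (not code from either class; `count` and the number of extracted fields are arbitrary demo values) that records the sequence of indices written into b under each loop nesting:

```java
import java.util.ArrayList;
import java.util.List;

// Records the order of indices written into b under each loop nesting.
// Purely an access-pattern illustration, not real decoding code.
public class WritePatternDemo {

  // Original nesting: consecutive writes jump by `count` (strided).
  static List<Integer> stridedWrites(int count, int fields) {
    List<Integer> idx = new ArrayList<>();
    for (int i = 0; i < count; ++i) {
      for (int j = 0; j < fields; ++j) {
        idx.add(j * count + i);
      }
    }
    return idx;
  }

  // Flipped nesting: consecutive writes land on adjacent elements.
  static List<Integer> contiguousWrites(int count, int fields) {
    List<Integer> idx = new ArrayList<>();
    for (int j = 0; j < fields; ++j) {
      for (int i = 0; i < count; ++i) {
        idx.add(j * count + i);
      }
    }
    return idx;
  }

  public static void main(String[] args) {
    System.out.println(stridedWrites(4, 3));    // [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
    System.out.println(contiguousWrites(4, 3)); // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
  }
}
```

The same set of slots gets written either way; only the contiguous order lets the hardware (and the auto-vectoriser) treat the inner loop as a sequential sweep.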

@RamakrishnaChilaka (Author) commented Aug 22, 2025

@jpountz / @rmuir / @dweiss

Update: instead of a separate follow-up PR, I’ve added the same loop-reordering for MemorySegmentPostingDecodingUtil as the last commit in this PR.
Please review when convenient; I can still extract it into its own PR if you prefer.

@dweiss (Contributor) commented Aug 22, 2025

If this is to be backported, then the CHANGES entry should go into the 10.x section.

github-actions bot modified the milestones: 11.0.0 → 10.3.0, Aug 22, 2025
@RamakrishnaChilaka (Author)

Got it @dweiss. Moved the CHANGES entry from the 11.0 section to 10.3.


@jpountz (Contributor) commented Aug 25, 2025

I ran PostingIndexInputBenchmark with the vector module enabled to check performance, but it seems to report a slowdown.

main:

Benchmark                                      (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                  7  thrpt   15  37.939 ± 0.647  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum      7  thrpt   15  18.602 ± 0.413  ops/us

this PR:

Benchmark                                      (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                  7  thrpt   15  22.105 ± 1.001  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum      7  thrpt   15  12.791 ± 0.280  ops/us

I only ran with bpv=7, which is a "hard" number of bits per value; I didn't check other numbers as I didn't have much time. I do see a speedup without the vector module enabled though:

main:
(results are very noisy on main and without the vector module enabled, not sure why)

Benchmark                                      (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                  7  thrpt   15  15.439 ± 8.188  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum      7  thrpt   15  11.709 ± 0.897  ops/us

this PR:

Benchmark                                      (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                  7  thrpt   15  30.618 ± 0.519  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum      7  thrpt   15  13.216 ± 0.232  ops/us

@dweiss (Contributor) commented Aug 26, 2025

Yeah... I'm no longer so convinced we should accept micro-benchmarking results without looking at overall performance. When you run the code in a different context it seems to compile differently. Weird.

@RamakrishnaChilaka force-pushed the postings_decoding_util_improvements branch from fe12ff6 to 90835c5 on August 26, 2025 12:12
@RamakrishnaChilaka (Author) commented Aug 26, 2025

In this PR, we targeted optimising the scalar path by rearranging loops so that memory accesses are sequential rather than strided, which gave a measurable performance improvement (see the description above; that JMH benchmark exercises only the scalar path).

I applied the same optimisation to the vectorised path, but it didn't yield the expected improvement: decodeVector performance actually regressed compared to baseline (as @jpountz also verified). I hadn't run microbenchmarks for it prior to this.

I suspect this is because the mainline vectorised version loads from the memory segment only the data needed for the current lane. To achieve the same contiguous write pattern as the scalar version, we would need to load the entire vector from memory at once, and this may be causing the regression. I have reproduced this behaviour consistently.

Please check decodeVector and decodeAndPrefixSumVector below.

contender-1

Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.397 ± 1.220  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.414 ± 0.049  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  21.056 ± 0.179  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  37.523 ± 0.666  ops/us

contender-2
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.140 ± 0.370  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.415 ± 0.042  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  21.322 ± 0.048  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  37.132 ± 1.084  ops/us


baseline

Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  39.971 ± 0.040  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  20.836 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  24.911 ± 3.529  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  62.210 ± 1.306  ops/us

contender-3

Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.313 ± 0.614  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.406 ± 0.037  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  17.476 ± 0.302  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  31.087 ± 1.201  ops/us

Summary of the perf changes (ops/us):

Benchmark                   Baseline        Contender-1        Contender-2        Contender-3
decode                        39.971    46.397 (+16.1%)    46.140 (+15.4%)    46.313 (+15.9%)
decodeAndPrefixSum            20.836     22.414 (+7.6%)     22.415 (+7.6%)     22.406 (+7.5%)
decodeAndPrefixSumVector  24.911 ± 3.529  21.056 (−15.5%)  21.322 (−14.4%)    17.476 (−29.9%)
decodeVector                  62.210    37.523 (−39.7%)    37.132 (−40.3%)    31.087 (−50.0%)

I also ran the benchmarks across all bpv values with this PR (only the PostingDecodingUtil changes; the MemorySegmentPostingDecodingUtil change reverted). As expected, the scalar version shows consistent improvements, while the vectorised version remains mostly unchanged (expected, as we didn't change its flow).

contender
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        2  thrpt   15  56.685 ± 0.596  ops/us
PostingIndexInputBenchmark.decode                        3  thrpt   15  50.723 ± 0.091  ops/us
PostingIndexInputBenchmark.decode                        4  thrpt   15  53.631 ± 2.356  ops/us
PostingIndexInputBenchmark.decode                        5  thrpt   15  45.242 ± 0.193  ops/us
PostingIndexInputBenchmark.decode                        6  thrpt   15  47.290 ± 0.061  ops/us
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.388 ± 0.637  ops/us
PostingIndexInputBenchmark.decode                        8  thrpt   15  85.133 ± 0.560  ops/us
PostingIndexInputBenchmark.decode                        9  thrpt   15  32.661 ± 0.499  ops/us
PostingIndexInputBenchmark.decode                       10  thrpt   15  37.306 ± 0.350  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            2  thrpt   15  32.170 ± 0.085  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            3  thrpt   15  30.107 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            4  thrpt   15  25.955 ± 0.058  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            5  thrpt   15  22.113 ± 0.014  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            6  thrpt   15  21.684 ± 0.006  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.410 ± 0.074  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            8  thrpt   15  29.921 ± 0.047  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            9  thrpt   15  21.974 ± 0.034  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum           10  thrpt   15  22.544 ± 0.063  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      2  thrpt   15  35.892 ± 0.182  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      3  thrpt   15  34.986 ± 0.097  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      4  thrpt   15  33.011 ± 0.312  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      5  thrpt   15  32.070 ± 0.044  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      6  thrpt   15  30.406 ± 0.118  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  29.563 ± 0.322  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      8  thrpt   15  33.238 ± 0.159  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      9  thrpt   15  22.095 ± 0.058  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector     10  thrpt   15  26.039 ± 0.085  ops/us
PostingIndexInputBenchmark.decodeVector                  2  thrpt   15  85.634 ± 5.048  ops/us
PostingIndexInputBenchmark.decodeVector                  3  thrpt   15  74.220 ± 2.586  ops/us
PostingIndexInputBenchmark.decodeVector                  4  thrpt   15  91.992 ± 0.821  ops/us
PostingIndexInputBenchmark.decodeVector                  5  thrpt   15  62.661 ± 0.321  ops/us
PostingIndexInputBenchmark.decodeVector                  6  thrpt   15  69.889 ± 1.028  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  61.817 ± 0.806  ops/us
PostingIndexInputBenchmark.decodeVector                  8  thrpt   15  85.631 ± 0.214  ops/us
PostingIndexInputBenchmark.decodeVector                  9  thrpt   15  34.308 ± 0.251  ops/us
PostingIndexInputBenchmark.decodeVector                 10  thrpt   15  49.525 ± 0.180  ops/us

baseline
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        2  thrpt   15  56.139 ± 0.686  ops/us
PostingIndexInputBenchmark.decode                        3  thrpt   15  47.809 ± 0.121  ops/us
PostingIndexInputBenchmark.decode                        4  thrpt   15  56.922 ± 0.150  ops/us
PostingIndexInputBenchmark.decode                        5  thrpt   15  44.117 ± 0.125  ops/us
PostingIndexInputBenchmark.decode                        6  thrpt   15  41.455 ± 0.139  ops/us
PostingIndexInputBenchmark.decode                        7  thrpt   15  39.891 ± 0.182  ops/us
PostingIndexInputBenchmark.decode                        8  thrpt   15  85.428 ± 0.268  ops/us
PostingIndexInputBenchmark.decode                        9  thrpt   15  29.763 ± 0.259  ops/us
PostingIndexInputBenchmark.decode                       10  thrpt   15  30.598 ± 0.064  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            2  thrpt   15  31.135 ± 0.313  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            3  thrpt   15  27.499 ± 0.056  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            4  thrpt   15  25.458 ± 0.090  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            5  thrpt   15  23.068 ± 0.054  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            6  thrpt   15  22.404 ± 0.023  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  20.858 ± 0.008  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            8  thrpt   15  24.736 ± 0.091  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            9  thrpt   15  19.946 ± 0.073  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum           10  thrpt   15  20.841 ± 0.011  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      2  thrpt   15  33.515 ± 3.208  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      3  thrpt   15  33.492 ± 2.343  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      4  thrpt   15  33.055 ± 0.338  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      5  thrpt   15  31.688 ± 0.247  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      6  thrpt   15  28.542 ± 3.090  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  29.425 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      8  thrpt   15  33.136 ± 0.037  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      9  thrpt   15  22.099 ± 0.026  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector     10  thrpt   15  25.956 ± 0.064  ops/us
PostingIndexInputBenchmark.decodeVector                  2  thrpt   15  85.568 ± 4.920  ops/us
PostingIndexInputBenchmark.decodeVector                  3  thrpt   15  74.447 ± 2.404  ops/us
PostingIndexInputBenchmark.decodeVector                  4  thrpt   15  87.720 ± 5.573  ops/us
PostingIndexInputBenchmark.decodeVector                  5  thrpt   15  62.400 ± 0.683  ops/us
PostingIndexInputBenchmark.decodeVector                  6  thrpt   15  67.609 ± 0.525  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  62.465 ± 0.409  ops/us
PostingIndexInputBenchmark.decodeVector                  8  thrpt   15  85.587 ± 0.141  ops/us
PostingIndexInputBenchmark.decodeVector                  9  thrpt   15  33.910 ± 0.332  ops/us
PostingIndexInputBenchmark.decodeVector                 10  thrpt   15  49.282 ± 0.400  ops/us

Summarising the perf results (Baseline → Contender, ops/us):

bpv   decode                      decodeAndPrefixSum          decodeAndPrefixSumVector     decodeVector
  2   56.139 → 56.685 (+1.0%)     31.135 → 32.170 (+3.3%)     33.515 → 35.892 (+7.1%)     85.568 → 85.634 (+0.08%)
  3   47.809 → 50.723 (+6.1%)     27.499 → 30.107 (+9.5%)     33.492 → 34.986 (+4.4%)     74.447 → 74.220 (−0.3%)
  4   56.922 → 53.631 (−5.8%)     25.458 → 25.955 (+2.0%)     33.055 → 33.011 (−0.1%)     87.720 → 91.992 (+4.9%)
  5   44.117 → 45.242 (+2.6%)     23.068 → 22.113 (−4.1%)     31.688 → 32.070 (+1.2%)     62.400 → 62.661 (+0.4%)
  6   41.455 → 47.290 (+14.1%)    22.404 → 21.684 (−3.2%)     28.542 → 30.406 (+6.5%)     67.609 → 69.889 (+3.4%)
  7   39.891 → 46.388 (+16.3%)    20.858 → 22.410 (+7.4%)     29.425 → 29.563 (+0.5%)     62.465 → 61.817 (−1.0%)
  8   85.428 → 85.133 (−0.3%)     24.736 → 29.921 (+21.0%)    33.136 → 33.238 (+0.3%)     85.587 → 85.631 (+0.05%)
  9   29.763 → 32.661 (+9.7%)     19.946 → 21.974 (+10.2%)    22.099 → 22.095 (−0.02%)    33.910 → 34.308 (+1.2%)
 10   30.598 → 37.306 (+21.9%)    20.841 → 22.544 (+8.2%)     25.956 → 26.039 (+0.3%)     49.282 → 49.525 (+0.5%)

TL;DR:

  • Scalar path: loop interchange → measurable speedups.
  • Vectorised path with its flow unchanged → no change, because its memory access pattern was not touched.
  • The regression in decodeVector and decodeAndPrefixSumVector when the same interchange is applied is likely due to lane-wise memory loads vs contiguous vector loads.
