jpountz
Collaborator

@jpountz jpountz commented Aug 12, 2025

Because nightly benchmarks only test a small set of scenarios, the JVM may end up over-optimizing query evaluation. For instance, they only run with BM25Similarity, sorting tasks only run against a TermQuery, filtered vector search only exercises the approximate path and never the exact path, etc.

This tries to make the benchmark more realistic by running some cheap queries before the benchmarks start. The goal of these queries is to pollute call sites so that they are not all magically monomorphic.

This will translate into a drop in performance for some tasks, but hopefully we can recover some of it in the future.

Related PRs:
 - apache/lucene#14968 where we suspected the speedup
   to be due to specialization making a call site monomorphic in nightly
   benchmarks that would not be monomorphic in the real world,
 - apache/lucene#15039 where we are trying to improve
   behavior with several different similarity impls but the benchmarks only
   show a small improvement since they always run with BM25Similarity.
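
To illustrate the idea of polluting call sites, here is a minimal, self-contained sketch. It is not the `TypePolluter` from this PR: the `Scorer` interface, its three implementations, `sumScores` and the `-pollute` argument are all hypothetical names chosen for illustration. The sketch runs a cheap workload through the same call site with several receiver types before the timed run, so that the JIT's type profile for that call site is presumably no longer monomorphic.

```java
/** Hypothetical sketch of call-site pollution; none of these names come from luceneutil or Lucene. */
final class CallSitePollutionSketch {

  /** Stand-in for an abstraction with several implementations, e.g. a similarity scorer. */
  interface Scorer {
    float score(float freq);
  }

  static final class ScorerA implements Scorer {
    @Override
    public float score(float freq) {
      return freq / (freq + 1.2f);
    }
  }

  static final class ScorerB implements Scorer {
    @Override
    public float score(float freq) {
      return (float) Math.log1p(freq);
    }
  }

  static final class ScorerC implements Scorer {
    @Override
    public float score(float freq) {
      return freq;
    }
  }

  /** Hot loop: scorer.score(...) is the virtual call site whose type profile matters. */
  static float sumScores(Scorer scorer, int numDocs) {
    float sum = 0;
    for (int i = 1; i <= numDocs; i++) {
      sum += scorer.score(i);
    }
    return sum;
  }

  /** Runs cheap workloads through the same call site with every implementation. */
  static void pollute() {
    Scorer[] scorers = {new ScorerA(), new ScorerB(), new ScorerC()};
    for (int iter = 0; iter < 10_000; iter++) {
      for (Scorer scorer : scorers) {
        sumScores(scorer, 16);
      }
    }
  }

  public static void main(String[] args) {
    // Assumption: pollution is opt-in so that both modes can be compared.
    boolean pollute = args.length > 0 && args[0].equals("-pollute");
    if (pollute) {
      pollute();
    }
    long start = System.nanoTime();
    float result = sumScores(new ScorerA(), 20_000_000);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("pollute=" + pollute + " result=" + result + " elapsedMs=" + elapsedMs);
  }
}
```

Running the sketch with and without `-pollute` may or may not show a measurable difference on a given JVM; a toy loop like this is only meant to show the mechanics, not to reproduce the numbers reported in this PR.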
Collaborator Author

jpountz commented Aug 12, 2025

Here's the result of a run where pollution is disabled on the baseline and enabled on the competitor:

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                      OrHighHigh       79.20      (2.6%)       65.72      (3.8%)  -17.0% ( -22% -  -10%) 0.000
                     OrStopWords       49.64      (2.7%)       42.41      (3.6%)  -14.6% ( -20% -   -8%) 0.000
                     AndHighHigh       70.67      (2.4%)       60.57      (5.6%)  -14.3% ( -21% -   -6%) 0.000
                       OrHighMed      260.29      (2.4%)      223.68      (3.0%)  -14.1% ( -19% -   -8%) 0.000
                      AndHighMed      206.71      (2.2%)      179.94      (5.3%)  -12.9% ( -19% -   -5%) 0.000
                        Or3Terms      236.19      (2.3%)      209.14      (2.6%)  -11.5% ( -15% -   -6%) 0.000
                    AndStopWords       48.47      (2.4%)       42.98      (4.7%)  -11.3% ( -18% -   -4%) 0.000
                            Term      661.48      (4.9%)      587.73      (5.3%)  -11.2% ( -20% -   -1%) 0.000
                       And3Terms      247.78      (1.9%)      224.68      (3.4%)   -9.3% ( -14% -   -4%) 0.000
              Or2Terms2StopWords      207.03      (1.7%)      187.84      (2.1%)   -9.3% ( -12% -   -5%) 0.000
              FilteredAndHighMed      159.86      (1.3%)      146.89      (2.1%)   -8.1% ( -11% -   -4%) 0.000
                AndMedOrHighHigh       88.35      (2.5%)       81.29      (2.6%)   -8.0% ( -12% -   -2%) 0.000
             And2Terms2StopWords      208.43      (1.6%)      192.54      (2.8%)   -7.6% ( -11% -   -3%) 0.000
             FilteredAndHighHigh       81.64      (1.2%)       76.06      (1.9%)   -6.8% (  -9% -   -3%) 0.000
                      OrHighRare      299.01      (6.8%)      279.28      (7.2%)   -6.6% ( -19% -    7%) 0.045
                          OrMany       23.43      (3.1%)       21.93      (1.8%)   -6.4% ( -10% -   -1%) 0.000
            FilteredAndStopWords       67.63      (1.6%)       63.38      (1.7%)   -6.3% (  -9% -   -3%) 0.000
               CombinedOrHighMed       87.87      (1.8%)       82.37      (2.6%)   -6.3% ( -10% -   -1%) 0.000
                    CombinedTerm       39.26      (1.9%)       36.82      (2.9%)   -6.2% ( -10% -   -1%) 0.000
              CombinedOrHighHigh       23.23      (1.8%)       21.81      (2.9%)   -6.1% ( -10% -   -1%) 0.000
     FilteredAnd2Terms2StopWords      220.25      (1.3%)      208.60      (1.6%)   -5.3% (  -8% -   -2%) 0.000
               FilteredAnd3Terms      193.49      (1.5%)      183.53      (1.3%)   -5.1% (  -7% -   -2%) 0.000
                       CountTerm     9397.84      (4.0%)     9023.88      (3.2%)   -4.0% ( -10% -    3%) 0.019
                   TermTitleSort       86.62      (9.3%)       83.52      (3.8%)   -3.6% ( -15% -   10%) 0.286
                 AndHighOrMedMed       50.97      (1.3%)       49.34      (0.7%)   -3.2% (  -5% -   -1%) 0.000
                  FilteredOrMany       16.41      (2.0%)       15.94      (1.4%)   -2.9% (  -6% -    0%) 0.000
                 CountAndHighMed      308.78      (2.6%)      300.40      (1.8%)   -2.7% (  -6% -    1%) 0.011
              CombinedAndHighMed       89.12      (1.7%)       86.71      (0.8%)   -2.7% (  -5% -    0%) 0.000
              FilteredOrHighHigh       67.49      (1.7%)       65.72      (1.7%)   -2.6% (  -5% -    0%) 0.001
                  CountOrHighMed      360.58      (3.4%)      351.94      (1.6%)   -2.4% (  -7% -    2%) 0.055
             FilteredOrStopWords       45.82      (2.0%)       44.75      (1.8%)   -2.3% (  -6% -    1%) 0.011
      FilteredOr2Terms2StopWords      146.98      (0.9%)      143.83      (1.1%)   -2.1% (  -4% -    0%) 0.000
                FilteredOr3Terms      167.05      (1.3%)      163.59      (0.8%)   -2.1% (  -4% -    0%) 0.000
             CombinedAndHighHigh       23.33      (2.0%)       22.85      (0.9%)   -2.0% (  -4% -    0%) 0.005
               FilteredOrHighMed      153.17      (1.1%)      150.24      (0.9%)   -1.9% (  -3% -    0%) 0.000
                  FilteredPhrase       32.03      (2.0%)       31.45      (1.1%)   -1.8% (  -4% -    1%) 0.016
                 CountOrHighHigh      344.82      (2.1%)      339.65      (2.5%)   -1.5% (  -5% -    3%) 0.164
                 FilteredPrefix3      150.97      (1.0%)      148.70      (3.1%)   -1.5% (  -5% -    2%) 0.170
                     CountOrMany       29.38      (2.0%)       28.95      (1.8%)   -1.5% (  -5% -    2%) 0.097
                     CountPhrase        4.23      (1.8%)        4.17      (3.0%)   -1.4% (  -6% -    3%) 0.240
                    FilteredTerm      161.78      (2.2%)      159.63      (1.8%)   -1.3% (  -5% -    2%) 0.154
             CountFilteredOrMany       27.28      (1.9%)       27.03      (2.0%)   -0.9% (  -4% -    3%) 0.313
         CountFilteredOrHighHigh      137.45      (1.0%)      136.35      (1.2%)   -0.8% (  -2% -    1%) 0.124
          CountFilteredOrHighMed      149.19      (0.8%)      148.05      (1.1%)   -0.8% (  -2% -    1%) 0.091
                      TermDTSort      385.74      (4.7%)      383.98      (2.4%)   -0.5% (  -7% -    6%) 0.795
                CountAndHighHigh      359.36      (2.3%)      358.28      (2.3%)   -0.3% (  -4% -    4%) 0.781
             CountFilteredPhrase       25.14      (2.2%)       25.07      (2.6%)   -0.3% (  -4% -    4%) 0.798
                   TermMonthSort     3332.40      (2.5%)     3328.19      (2.3%)   -0.1% (  -4% -    4%) 0.910
                  FilteredIntNRQ      299.10      (1.3%)      299.17      (1.3%)    0.0% (  -2% -    2%) 0.966
               TermDayOfYearSort      279.58      (4.9%)      280.34      (1.2%)    0.3% (  -5% -    6%) 0.872

Collaborator Author

jpountz commented Aug 12, 2025

@ChrisHegarty I think I remember seeing something like that in one of your recent PRs but I can't find it anymore?

@ChrisHegarty
Contributor

> @ChrisHegarty I think I remember seeing something like that in one of your recent PRs but I can't find it anymore?

Yeah, I had something similar in the benchmark update of this PR apache/lucene#15037. I still need to make it optional, so it can be enabled or not for comparison.

Generally, I do think that this is a good idea, as it will allow us to find such potential problems so that we can fix 'em and make performance more consistent.

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

Owner

@mikemccand mikemccand left a comment

Thanks @jpountz -- this is a great idea to make benchy more real-world realistic.

@@ -0,0 +1,174 @@
package perf;
Owner

Needs ASL copyright header.

Collaborator Author

Thank you for noticing, I added one.

final TestContext testContext = TestContext.parse(args.getString("-context", ""));

if (pollute) {
TypePolluter.pollute();
Owner

Curious that the one-time pollution is enough! Doesn't HotSpot notice that things later got singular and then re-optimize?

Collaborator Author

Good question. I'm not familiar enough with HotSpot to give you an answer. I suspect that it technically could, but that it wouldn't help that much in real-world applications, so it doesn't bother. @ChrisHegarty may have more data?

Contributor

I think that what's in the PR is fine. It is possible that things change over time and that HotSpot could optimise differently in the future when profiles change, but like Adrien, I'm less worried about this in real-world scenarios.

@jpountz jpountz merged commit b2228fe into mikemccand:main Aug 13, 2025
1 check passed
@jpountz jpountz deleted the pollute branch August 13, 2025 14:50
jpountz added a commit to jpountz/lucene that referenced this pull request Aug 17, 2025
I ran experiments locally that suggest that some of the performance decrease
from type pollution (mikemccand/luceneutil#436)
can be attributed to calls to `SimScorer#score` no longer being inlinable since
they are polymorphic. This change helps `BM25Scorer` remain inlinable using
similar tricks that we are applying for `Bits#get` and
`ImpactsEnum#nextDoc`/`ImpactsEnum#advance`.

Hopefully changes such as apache#15039 will help improve performance with other
similarities as well in the future.
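
The commit message above does not spell out the trick it uses. One common pattern for keeping a frequently used implementation inlinable at a call site that sees several receiver types is an explicit exact-class check, sketched below. This is an assumption about the general technique, not the actual Lucene change: `Scorer`, `CommonScorer`, `OtherScorer` and `sum` are hypothetical names.

```java
/** Hypothetical sketch of a class-check fast path; not the actual Lucene code. */
final class InlinableFastPathSketch {

  interface Scorer {
    float score(float freq);
  }

  /** The implementation expected to be used most of the time. Final, so its methods cannot be overridden. */
  static final class CommonScorer implements Scorer {
    @Override
    public float score(float freq) {
      return freq / (freq + 1.2f);
    }
  }

  /** Any other implementation that may reach the same code path. */
  static final class OtherScorer implements Scorer {
    @Override
    public float score(float freq) {
      return freq;
    }
  }

  static float sum(Scorer scorer, int numDocs) {
    float sum = 0;
    if (scorer.getClass() == CommonScorer.class) {
      // The static type is now a final class, so the target of score(...) is
      // known without relying on the call site's type profile.
      CommonScorer common = (CommonScorer) scorer;
      for (int i = 1; i <= numDocs; i++) {
        sum += common.score(i);
      }
    } else {
      // Generic path: this call site may see many receiver types.
      for (int i = 1; i <= numDocs; i++) {
        sum += scorer.score(i);
      }
    }
    return sum;
  }
}
```

The trade-off of this pattern is duplicated loop code in exchange for a call whose target does not depend on which other implementations have been seen earlier in the JVM's lifetime.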
Collaborator Author

jpountz commented Aug 23, 2025

I pushed an annotation for this change.
