Brings back Scorer#applyAsRequiredClause
#14968
base: main
Conversation
I suspect that the fact that you first compute the intersection and then compute scores also helps, as it's more cache-friendly (less data to have in the various caches at once) and potentially enables vectorizing the computation of scores.
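For what it's worth, here is a minimal, self-contained sketch of the "intersect first, then score" shape being described. The interfaces below are simplified stand-ins, not the actual Lucene Scorer/DocIdSetIterator APIs.

```java
import java.util.function.IntPredicate;
import java.util.function.IntToDoubleFunction;

public final class IntersectThenScoreSketch {

  /**
   * First pass buffers the doc ids that survive the conjunction; second pass computes
   * scores over the contiguous buffer. Keeping the two phases separate means less data
   * lives in the caches at once and gives the JIT a better shot at vectorizing the
   * scoring loop.
   */
  public static int intersectThenScore(
      int[] leadDocs,            // doc ids of the leading required clause, in order
      IntPredicate otherMatches, // membership test for the other required clause
      IntToDoubleFunction scoreOf,
      int[] docBuffer,
      double[] scoreBuffer) {
    int size = 0;
    for (int doc : leadDocs) {
      if (otherMatches.test(doc)) { // intersection only, no scoring yet
        docBuffer[size++] = doc;
      }
    }
    for (int i = 0; i < size; i++) { // tight second loop over flat arrays
      scoreBuffer[i] = scoreOf.applyAsDouble(docBuffer[i]);
    }
    return size;
  }
}
```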
I'd suggest focusing this first PR on the ...
Yeah, I think it's a good idea. I did some experiments with some details of the current version of the code these days.
If I still use the ...
Not sure what causes the differences :(
Also, I'm wondering whether we can change the protocol of ...
I ran it on my machine and only got a tiny speedup, possibly because vectorization of score computations gives more gains on your machine?
Task    QPS baseline (StdDev)    QPS my_modified_version (StdDev)    Pct diff (range)    p-value
OrMany 23.60 (1.8%) 23.45 (3.0%) -0.7% ( -5% - 4%) 0.408
CountTerm 9274.34 (1.7%) 9215.96 (2.9%) -0.6% ( -5% - 4%) 0.401
TermMonthSort 3352.95 (2.1%) 3333.71 (1.6%) -0.6% ( -4% - 3%) 0.328
CountOrMany 28.93 (1.7%) 28.82 (3.5%) -0.4% ( -5% - 4%) 0.658
FilteredOrHighMed 153.09 (1.2%) 152.62 (1.1%) -0.3% ( -2% - 1%) 0.390
CountOrHighHigh 339.09 (2.5%) 338.47 (4.2%) -0.2% ( -6% - 6%) 0.867
FilteredOr3Terms 166.53 (1.2%) 166.40 (1.2%) -0.1% ( -2% - 2%) 0.832
FilteredOrHighHigh 67.14 (2.2%) 67.12 (1.4%) -0.0% ( -3% - 3%) 0.946
CountPhrase 4.23 (2.2%) 4.22 (2.2%) -0.0% ( -4% - 4%) 0.975
TermDTSort 387.92 (2.7%) 388.12 (2.6%) 0.1% ( -5% - 5%) 0.951
FilteredOr2Terms2StopWords 146.92 (1.1%) 146.99 (1.1%) 0.1% ( -2% - 2%) 0.880
FilteredPhrase 32.18 (1.5%) 32.20 (1.5%) 0.1% ( -2% - 3%) 0.897
FilteredOrMany 16.55 (1.0%) 16.57 (2.1%) 0.1% ( -2% - 3%) 0.826
CountFilteredOrHighMed 148.61 (0.7%) 148.81 (0.8%) 0.1% ( -1% - 1%) 0.593
CountFilteredPhrase 25.45 (1.7%) 25.49 (1.5%) 0.2% ( -2% - 3%) 0.756
FilteredOrStopWords 45.64 (2.3%) 45.73 (1.4%) 0.2% ( -3% - 4%) 0.749
CountFilteredOrMany 27.07 (1.6%) 27.12 (1.2%) 0.2% ( -2% - 2%) 0.644
CountFilteredOrHighHigh 136.63 (0.9%) 136.91 (1.1%) 0.2% ( -1% - 2%) 0.508
FilteredIntNRQ 294.74 (0.7%) 295.84 (0.8%) 0.4% ( -1% - 1%) 0.130
AndHighOrMedMed 51.29 (1.7%) 51.49 (1.2%) 0.4% ( -2% - 3%) 0.405
CombinedTerm 39.36 (1.2%) 39.52 (0.9%) 0.4% ( -1% - 2%) 0.255
CountOrHighMed 357.07 (2.2%) 358.88 (2.9%) 0.5% ( -4% - 5%) 0.535
Or2Terms2StopWords 205.70 (1.3%) 206.84 (1.5%) 0.6% ( -2% - 3%) 0.211
TermTitleSort 85.42 (3.8%) 85.89 (3.9%) 0.6% ( -6% - 8%) 0.651
FilteredPrefix3 149.80 (2.0%) 150.67 (2.4%) 0.6% ( -3% - 5%) 0.406
CountAndHighHigh 356.56 (2.2%) 358.88 (2.2%) 0.7% ( -3% - 5%) 0.347
CombinedAndHighHigh 23.45 (1.0%) 23.61 (1.0%) 0.7% ( -1% - 2%) 0.027
CountAndHighMed 306.71 (1.7%) 308.91 (2.0%) 0.7% ( -2% - 4%) 0.219
FilteredTerm 161.63 (2.5%) 162.80 (1.8%) 0.7% ( -3% - 5%) 0.295
And2Terms2StopWords 204.26 (1.6%) 205.81 (1.7%) 0.8% ( -2% - 4%) 0.147
CombinedAndHighMed 89.38 (1.0%) 90.09 (0.8%) 0.8% ( -1% - 2%) 0.006
OrHighMed 257.60 (2.2%) 259.70 (1.8%) 0.8% ( -3% - 4%) 0.203
OrHighHigh 78.29 (2.6%) 78.96 (2.5%) 0.9% ( -4% - 6%) 0.292
AndHighMed 202.14 (1.8%) 204.00 (1.9%) 0.9% ( -2% - 4%) 0.116
TermDayOfYearSort 285.19 (1.2%) 287.96 (1.4%) 1.0% ( -1% - 3%) 0.020
CombinedOrHighMed 87.78 (2.8%) 88.76 (0.9%) 1.1% ( -2% - 4%) 0.089
Or3Terms 231.64 (1.4%) 234.56 (1.9%) 1.3% ( -2% - 4%) 0.018
And3Terms 240.09 (1.5%) 243.34 (1.9%) 1.4% ( -2% - 4%) 0.013
AndMedOrHighHigh 88.00 (2.1%) 89.20 (2.3%) 1.4% ( -2% - 5%) 0.050
FilteredAnd2Terms2StopWords 215.32 (1.0%) 218.60 (1.2%) 1.5% ( 0% - 3%) 0.000
FilteredAndStopWords 65.47 (1.6%) 66.51 (1.4%) 1.6% ( -1% - 4%) 0.001
FilteredAndHighMed 156.30 (1.2%) 158.91 (1.2%) 1.7% ( 0% - 4%) 0.000
CombinedOrHighHigh 23.10 (3.7%) 23.49 (1.2%) 1.7% ( -3% - 6%) 0.052
OrStopWords 49.00 (2.4%) 49.83 (2.2%) 1.7% ( -2% - 6%) 0.019
FilteredAnd3Terms 189.73 (0.8%) 193.06 (1.1%) 1.8% ( 0% - 3%) 0.000
AndHighHigh 69.14 (2.4%) 70.36 (2.3%) 1.8% ( -2% - 6%) 0.017
FilteredAndHighHigh 79.13 (1.7%) 80.57 (1.3%) 1.8% ( -1% - 4%) 0.000
Term 655.28 (6.0%) 668.10 (4.7%) 2.0% ( -8% - 13%) 0.249
AndStopWords 47.11 (2.1%) 48.58 (2.4%) 3.1% ( -1% - 7%) 0.000
OrHighRare 292.97 (10.8%) 303.18 (6.0%) 3.5% ( -12% - 22%) 0.206
 * this {@link Scorer} to the scores.
 */
public void applyAsRequiredClause(DocAndScoreAccBuffer buffer) throws IOException {
  DocIdSetIterator iterator = iterator();
I wonder if we should throw an exception if this scorer exposes a twoPhaseIterator()
(and add javadocs about it), since running the conjunction this way would be less efficient than doing it in a "doc first" fashion.
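For concreteness, here is a hedged sketch of the guard being suggested, together with one way the rest of the required-clause loop could look. This is not the PR's actual implementation, and the DocAndScoreAccBuffer field names (docs, scores, size) are assumptions.

```java
// Sketch only: assumes DocAndScoreAccBuffer exposes parallel docs/scores arrays and a size,
// and lives inside a Scorer subclass.
public void applyAsRequiredClause(DocAndScoreAccBuffer buffer) throws IOException {
  if (twoPhaseIterator() != null) {
    // Running a two-phase scorer "clause first" would be less efficient than a
    // doc-at-a-time conjunction, so reject it up front as suggested above.
    throw new UnsupportedOperationException(
        "applyAsRequiredClause does not support two-phase iterators");
  }
  DocIdSetIterator iterator = iterator();
  int newSize = 0;
  for (int i = 0; i < buffer.size; i++) {
    int doc = buffer.docs[i];
    if (iterator.docID() < doc) {
      iterator.advance(doc);
    }
    if (iterator.docID() == doc) {
      buffer.docs[newSize] = doc;
      buffer.scores[newSize] = buffer.scores[i] + score(); // accumulate this clause's score
      newSize++;
    }
  }
  buffer.size = newSize;
}
```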
This doesn't sound clean to me. We should be careful with adding new APIs, but I'd rather have two simple APIs than a complex one.
Did you run against the latest version or the older version which contains a new API on PostingsEnum? I'm a little bit curious about the performance of the previous version (if it is not the one you ran on).
Makes sense to me too, I'm also hesitant about that, thanks for your reply!
I ran against the latest version. I'll try to run the older version soon.
Here's a run against the previous version:
Thank you for running the benchmark!!
FWIW I suspect that the speedup to tasks such as ...
If I understand correctly, removing ...
Good idea!
I ran the benchmark under an identical setup on the previous version (which adds a new API on ...).
It produced a similar speedup, but some queries like ... Maybe the speedup is not worth the complexity of the newly added APIs? I'm a little bit confused.
I'm confused too, but the current version of the change is quite simple and produces a speedup with a low p-value, so I'm keen on getting it in.
Good point!
Another idea: in order not to add new APIs, an alternative would be to implement specialized bulk scorers for the case when all scorers are term scorers on the same field (a common case, and arguably the case we're most interested in optimizing) and work directly on ...
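As a rough illustration of that routing idea (not an actual Lucene API), the check might look something like the sketch below, where the specialized bulk scorer is supplied by the caller and is assumed, not shown; verifying that all terms are on the same field is also left out.

```java
import java.util.List;
import java.util.function.Supplier;
import org.apache.lucene.search.BulkScorer;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.TermScorer;

// Sketch: route to a specialized bulk scorer only when every required clause is a
// TermScorer; otherwise keep the generic conjunction.
final class TermConjunctionRouting {
  static BulkScorer pickBulkScorer(
      List<Scorer> requiredScorers,
      Supplier<BulkScorer> specializedTermConjunction, // hypothetical specialization
      BulkScorer genericFallback) {
    for (Scorer scorer : requiredScorers) {
      if (scorer instanceof TermScorer == false) {
        return genericFallback;
      }
    }
    return specializedTermConjunction.get();
  }
}
```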
Because nightly benchmarks only test a small set of scenarios, the JVM may end up over-optimizing query evaluation. For instance, it only runs with BM25Similarity, sorting tasks only run against a TermQuery, filtered vector search only exercises the approximate path, not the exact path, etc. This tries to make the benchmark more realistic by running some cheap queries before running benchmarks, whose goal is to pollute call sites so that they are not all magically monomorphic. This will translate into a drop in performance for some tasks, but hopefully we can recover some of it in the future. Related PRs:
- apache/lucene#14968, where we suspected the speedup to be due to specialization making a call site monomorphic in nightly benchmarks that would not be monomorphic in the real world,
- apache/lucene#15039, where we are trying to improve behavior with several different similarity impls, but the benchmarks only show a small improvement since they always run with BM25Similarity.
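As a rough sketch of the "pollute call sites" idea described in that commit message (an assumed setup, not the actual luceneutil change): run a few cheap queries under different Similarity implementations before measuring, so that scoring call sites seen by the JIT are not artificially monomorphic.

```java
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.BooleanSimilarity;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.search.similarities.Similarity;

final class CallSitePollution {
  // Field name "body" and term "the" are placeholders.
  static void polluteCallSites(IndexSearcher searcher) throws IOException {
    Similarity original = searcher.getSimilarity();
    try {
      for (Similarity sim :
          new Similarity[] {
            new ClassicSimilarity(), new BooleanSimilarity(), new BM25Similarity()
          }) {
        searcher.setSimilarity(sim);
        searcher.search(new TermQuery(new Term("body", "the")), 10); // cheap throwaway query
      }
    } finally {
      searcher.setSimilarity(original); // restore the benchmark's configured similarity
    }
  }
}
```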
I'm waiting for #15039 to merge, and looking forward to digging a little bit more into this.
Since #15004 has been merged, I ran the benchmark, with results below:
I'm planning to do another round of benchmarking after mikemccand/luceneutil#436 is merged; maybe the speedup is not real?
I'm curious if this change helps more with type pollution, especially if we start using ...
I'll run another benchmark tonight.
Here is the benchmark result:
It's a bit disappointing, but I haven't seen a big speedup with the introduction of ...
If I understand correctly, the ...
Possibly, I'm not sure about what the API should look like. I started playing with replacing ...
Description
Inspired by #14690, this PR essentially tries to bring back the Scorer#applyAsRequiredClause interface. Different from #14690, I'm wondering whether we can pass the DocAndScoreAccBuffer all the way down to the postings, so maybe we can benefit from reducing the advance function calls when the buffer has a dense doc id set, e.g. utilize SIMD again. So I added a new interface, PostingsEnum#nextRequiredFreqBuffer (not stable yet); currently I only implement the default implementation and am still trying to speed up the process under BlockPostingEnum.

This is still under development. I know we should be cautious about adding new public interfaces (especially two at once!), but I want to share the current progress. Below are the luceneutil benchmark results on wikimediumall with searchConcurrency=0, taskCountPerCat=5, taskRepeatCount=50; here is the result after 20 iterations (against the latest code). I think it's promising to look into this approach more. If I understand correctly, this speedup should only come from the reduced virtual function calls?
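To make the idea in this description concrete, here is a hedged, self-contained sketch of what a default implementation in the spirit of PostingsEnum#nextRequiredFreqBuffer could look like. The buffer shape and the method signature are assumptions, since the PR describes the API as not stable yet.

```java
import java.io.IOException;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.search.DocIdSetIterator;

final class RequiredFreqBufferSketch {

  // Hypothetical stand-in for the buffer type that would be passed down to the postings.
  static final class DocAndFreqBuffer {
    int[] docs = new int[64];
    int[] freqs = new int[64];
    int size;
  }

  /**
   * Filters {@code buffer} in place, keeping only docs matched by {@code postings} and
   * recording their term frequencies, so scores can be computed later in one tight loop.
   * A block-based implementation could replace the per-doc advance() calls with bulk
   * decoding, which is where the hoped-for SIMD benefit would come from.
   */
  static void nextRequiredFreqBuffer(PostingsEnum postings, DocAndFreqBuffer buffer)
      throws IOException {
    int newSize = 0;
    for (int i = 0; i < buffer.size; i++) {
      int doc = buffer.docs[i];
      if (postings.docID() < doc) {
        postings.advance(doc);
      }
      if (postings.docID() == doc) {
        buffer.docs[newSize] = doc;
        buffer.freqs[newSize] = postings.freq();
        newSize++;
      } else if (postings.docID() == DocIdSetIterator.NO_MORE_DOCS) {
        break; // iterator exhausted, no further matches possible
      }
    }
    buffer.size = newSize;
  }
}
```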