-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Make BitSet no longer implement Bits. #14996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Lucene has call sites of `Bits#get` that are more polymorphic than they should because they sometimes receive a `FixedBitSet` as `Bits`, and sometimes a `FixedBits` (the return type of `FixedBitSet#asReadOnlyBits`). This makes some call sites bimorphic when they could be monomorphic, of polymorphic when they could be bimorphic (e.g. if `SparseFixedBitSet` also get used). Given how polymorphism affects performance, I would like to fix this performance trap. One idea would be to make `BitSet` no longer implement `Bits`, ie. you could no longer pass a `FixedBitSet` when `Bits` are expected, you would need to pass the result of `FixedBitSet#asReadOnlyBits`. This is what this PR does. I had to move/copy some APIs around between `Bits`, `BitSet` and `FixedBitSet` to make it work, but it's not as invasive as I initially worried. The API change is a 11.0 change, but we could backport the bits that reduce polymorphism to 10.x, ie. making sure to pass the result of `FixedBitSet#asReadOnlyBits` rather than the `FixedBitSet` directly when `Bits` are expected.
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
@msokolov @benwtrent You may want to take a look since vector search is probably the most performance-critical code that is touched by this change due to
With this PR, dense filters and unfiltered searches both use a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I left a few small comments/questions
private final int numDocs; | ||
private final CacheHelper readerCacheHelper; | ||
|
||
private SoftDeletesFilterLeafReader(LeafReader reader, FixedBitSet bits, int numDocs) { | ||
private SoftDeletesFilterLeafReader(LeafReader reader, Bits bits, int numDocs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious why we are changing these signatures - it seems the call sites still always use FixedBitSet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Bits
are indeed always created bia FixedBitSet#asReadOnlyBits
but I liked doing it on the call site better than in this constructor. Since it's a private constructor, it doesn't break users.
} | ||
if (length < bitSet.length() | ||
&& bitSet.nextSetBit(Math.max(0, length)) != DocIdSetIterator.NO_MORE_DOCS) { | ||
throw new IllegalArgumentException("Some bits are set beyond the end of live docs"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment refers to "live docs" but the method is generic and could apply to other kinds of bit set applications - maybe change to "beyond the end of the bit set"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I'll fix.
approximateSearch( | ||
ctx, acceptDocs.asReadOnlyBits(), cost + 1, timeLimitingKnnCollectorManager); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Within the HNSW format, we do this:
if (acceptDocs instanceof BitSet bitSet) {
// Use approximate cardinality as this is good enough, but ensure we don't exceed the graph
// size as that is illogical
filteredDocCount = Math.min(bitSet.approximateCardinality(), graph.size());
if (unfilteredVisit >= filteredDocCount) {
doHnsw = false;
}
}
We determine the filter size via casting the Bits
to BitSet
. So, I am not sure we can make this change without progressing: #15011
Or bits
should give cardinality information...
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Lucene has call sites of
Bits#get
that are more polymorphic than they should because they sometimes receive aFixedBitSet
asBits
, and sometimes aFixedBits
(the return type ofFixedBitSet#asReadOnlyBits
). This makes some call sites bimorphic when they could be monomorphic, of polymorphic when they could be bimorphic (e.g. ifSparseFixedBitSet
also gets used).Given how polymorphism affects performance, I would like to fix this performance trap. One idea would be to make
BitSet
no longer implementBits
, ie. you could no longer pass aFixedBitSet
whenBits
are expected, you would need to pass the result ofFixedBitSet#asReadOnlyBits
. This is what this PR does.I had to move/copy some APIs around between
Bits
,BitSet
andFixedBitSet
to make it work, but it's not as invasive as I initially worried.The API change is a 11.0 change, but we could backport the bits that reduce polymorphism to 10.x, ie. making sure to pass the result of
FixedBitSet#asReadOnlyBits
rather than theFixedBitSet
directly whenBits
are expected.