
Conversation

@Pulkitg64 (Contributor) commented Jul 29, 2025

Description

This is a draft PR to optimize HNSW graph merging during singleton merges. When merging a single segment with deletions, the current implementation reconstructs the entire graph with only live nodes, which is a time-consuming process. This PR avoids full graph reconstruction by dropping deleted nodes and renumbering the remaining live nodes.
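
To illustrate the renumbering idea, here is a minimal sketch (not the actual patch; `liveOrds` and the method name are hypothetical) of how live graph ordinals could be compacted:

```java
import org.apache.lucene.util.Bits;

// Hypothetical sketch: build a compact old-ordinal -> new-ordinal map that skips
// deleted nodes; -1 marks a dropped node.
static int[] buildOrdMap(Bits liveOrds, int oldGraphSize) {
  int[] oldToNewOrd = new int[oldGraphSize];
  int next = 0;
  for (int oldOrd = 0; oldOrd < oldGraphSize; oldOrd++) {
    oldToNewOrd[oldOrd] = liveOrds.get(oldOrd) ? next++ : -1;
  }
  return oldToNewOrd;
}
```

The neighbor lists of the surviving nodes are then rewritten through this map, and edges that point at deleted nodes are dropped.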

TODOs:

  • Add specific unit tests
  • Benchmarks (luceneutil)

@Pulkitg64 Pulkitg64 changed the title Avoid reconstructing HNSW graph during singleton merging Avoid reconstructing HNSW graph during singleton merges Jul 29, 2025
github-actions bot

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@jpountz (Contributor) commented Aug 1, 2025

I don't feel qualified to do the review, but I agree with the motivation. I wonder if this optimization could also be applied when there is more than one segment to merge, by first applying deletions to the biggest segment and then adding vectors from the other segments?

@msokolov (Contributor) left a comment

This seems like a promising direction! I left a bunch of comments. My main one is about whether we should do this on-heap to make it more flexible (e.g. so we could use it when merging multiple graphs, too).

@@ -69,6 +70,8 @@ public final class Lucene99HnswVectorsWriter extends KnnVectorsWriter {

private static final long SHALLOW_RAM_BYTES_USED =
RamUsageEstimator.shallowSizeOfInstance(Lucene99HnswVectorsWriter.class);
static final int DELETE_THRESHOLD_PERCENT = 30;
Contributor

I'm curious whether we have done any testing to motivate this choice? I guess as the number of gaps left in the neighborhoods by removing the deleted nodes increases, we would expect to see a drop-off in recall, or maybe performance? But I don't have a good intuition about whether there is a knee in the curve, or how strong the effect is.

* @throws IOException If an error occurs while writing to the vector index
*/
private HnswGraph deleteNodesWriteGraph(
Lucene99HnswVectorsReader.OffHeapHnswGraph graph,
Contributor

Could we change the signature to accept an HnswGraph?

// Count and collect valid nodes
int validNodeCount = 0;
for (int node : sortedNodes) {
if (docMap.get(node) != -1) {
Contributor

We might be able to pass in the size of the new graph? At least in the main case of merging we should know it (I think?).

}

// Special case for top level with no valid nodes
if (level == numLevels - 1 && validNodeCount == 0 && level > 0) {
Contributor

If level == 0 and validNodeCount == 0, the new graph should be empty. I'm not sure how that case will get handled here?

Contributor

In this case (the top level being empty), isn't it also possible that a lower level is empty?

Contributor Author

> If level == 0 and validNodeCount == 0, the new graph should be empty. I'm not sure how that case will get handled here?

That means 100% of the nodes are deleted, right? I think we will never reach this case, since the entry condition for this function checks that deletes are less than 30%.
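
For reference, a minimal sketch of that entry condition (assuming the DELETE_THRESHOLD_PERCENT = 30 constant from the diff above; the surrounding variable names are made up):

```java
// Hypothetical guard: only take the drop-and-renumber path when the delete
// percentage is below the threshold; otherwise rebuild the graph from scratch.
int deletedDocCount = segmentMaxDoc - liveDocCount; // assumed inputs
long deletePercent = 100L * deletedDocCount / segmentMaxDoc;
if (deletePercent >= DELETE_THRESHOLD_PERCENT) {
  // too many deletes: fall back to full graph reconstruction
} else {
  // safe to drop deleted nodes and renumber the survivors
}
```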

validNodeCount = 1; // We'll create one connection to lower level
}

validNodesPerLevel[level] = new int[validNodeCount];
Contributor

I wonder if we could avoid the up-front counting, allocate a full-sized array, and then use only the part of it that we fill up.
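
Something along these lines, reusing the variables from the snippet above and trimming with org.apache.lucene.util.ArrayUtil.copyOfSubArray (a sketch only):

```java
// Sketch: skip the up-front counting pass by over-allocating to the graph size
// and trimming afterwards to the portion actually filled.
int[] valid = new int[graph.size()]; // upper bound: every node survives
int validNodeCount = 0;
for (int node : sortedNodes) {
  if (docMap.get(node) != -1) {      // same liveness test as in the diff above
    valid[validNodeCount++] = node;
  }
}
validNodesPerLevel[level] = ArrayUtil.copyOfSubArray(valid, 0, validNodeCount);
```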

Contributor Author

Sure

Math.toIntExact(vectorIndex.getFilePointer() - offsetStart);
}

// Special case for empty top level
Contributor

Perhaps we should special-case the first empty level we find and make that the top level, unless it is the bottom level, in which case the whole graph is empty.


/** Writes neighbors with delta encoding to the vector index. */
private void writeNeighbors(
Lucene99HnswVectorsReader.OffHeapHnswGraph graph,
Contributor

Can we delegate to an existing method (maybe with a refactor) to ensure we write in the same format? E.g. what if we switch to GroupVarInt encoding? We want to make sure this method tracks that change.

Contributor Author

Sure.

@Override
public void mergeOneField(FieldInfo fieldInfo, MergeState mergeState) throws IOException {
CloseableRandomVectorScorerSupplier scorerSupplier =
flatVectorWriter.mergeOneFieldToIndex(fieldInfo, mergeState);
try {
long vectorIndexOffset = vectorIndex.getFilePointer();

if (mergeState.liveDocs.length == 1
Contributor

Have you seen IncrementalHnswGraphMerge and MergingHnswGraphBuilder? They select the biggest graph with no deletions and merge the other segments' graphs into it. Could we expose a utility method here for rewriting a graph (in memory) to drop deletions, and then use it there?

Here we are somewhat mixing the on-disk graph format with the logic of dropping deleted nodes, which I think we could abstract out into the util.hnsw realm?
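
To make the suggestion concrete, here is a rough sketch of what such a util.hnsw helper could look like, written against the HnswGraph read API (numLevels/getNodesOnLevel/seek/nextNeighbor) as I understand it; the plain adjacency-list output and the oldToNewOrd array are placeholders for whatever structure (e.g. an OnHeapHnswGraph) the real utility would build:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.hnsw.HnswGraph;

/** Illustrative only: rewrite a graph in memory, dropping deleted ordinals. */
final class DropDeletedNodesSketch {

  /** oldToNewOrd[oldOrd] == -1 marks a deleted node, otherwise its compacted ordinal. */
  static List<int[][]> rewrite(HnswGraph graph, int[] oldToNewOrd) throws IOException {
    List<int[][]> levels = new ArrayList<>();
    for (int level = 0; level < graph.numLevels(); level++) {
      List<int[]> nodesOnLevel = new ArrayList<>();
      HnswGraph.NodesIterator it = graph.getNodesOnLevel(level);
      while (it.hasNext()) {
        int oldOrd = it.nextInt();
        if (oldToNewOrd[oldOrd] == -1) {
          continue; // deleted node: drop it and all of its edges
        }
        // Keep only the surviving neighbors, remapped into the new ordinal space.
        graph.seek(level, oldOrd);
        List<Integer> kept = new ArrayList<>();
        for (int nbr = graph.nextNeighbor();
            nbr != DocIdSetIterator.NO_MORE_DOCS;
            nbr = graph.nextNeighbor()) {
          if (oldToNewOrd[nbr] != -1) {
            kept.add(oldToNewOrd[nbr]);
          }
        }
        nodesOnLevel.add(kept.stream().mapToInt(Integer::intValue).toArray());
      }
      // A real utility would also record which new ordinals live on each level and
      // pick a new entry node once an upper level turns out to be empty.
      levels.add(nodesOnLevel.toArray(new int[0][]));
    }
    return levels;
  }
}
```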

Contributor Author

Just saw that class. I think this is a good idea. Will do it in the next revision.

@Pulkitg64 (Contributor Author)

> I wonder if this optimization could also be applied when there is more than one segment to merge, by first applying deletions to the biggest segment and then adding vectors from the other segments?

@jpountz Yes, good idea, let me try doing that in this PR itself.

> This seems like a promising direction! I left a bunch of comments. My main one is about whether we should do this on-heap to make it more flexible (e.g. so we could use it when merging multiple graphs, too).

Thanks @msokolov. Yes, I think that would be the best way forward for this optimization. Working on it.


// Process nodes at this level
for (int node : sortedNodes) {
if (docMap.get(node) == -1) {
@Pulkitg64 (Contributor Author) Aug 5, 2025

This is incorrect. The graph does not store doc IDs; it stores ordinals, whereas docMap maps old doc IDs to new doc IDs.
The correct implementation is to create a map which maps old ordinals to new ordinals.

Will fix this in the next revision.

github-actions bot commented Aug 9, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@Pulkitg64 (Contributor Author) commented Aug 11, 2025

The failing test runs fine on my macOS desktop, and I have not changed anything in the related classes. Even with the same failing seed I am unable to reproduce the issue. Not sure why this test is failing in the CI check.

TestBPReorderingMergePolicy > testReorderOnAddIndexes FAILED
    java.lang.AssertionError: Called on the wrong instance
        at org.apache.lucene.tests.codecs.asserting.AssertingKnnVectorsFormat$AssertingKnnVectorsReader.getFloatVectorValues(AssertingKnnVectorsFormat.java:140)
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.getFloatVectorValues(PerFieldKnnVectorsFormat.java:289)
        at org.apache.lucene.index.CodecReader.getFloatVectorValues(CodecReader.java:244)
        at org.apache.lucene.index.SlowCompositeCodecReaderWrapper$SlowCompositeKnnVectorsReaderWrapper.getFloatVectorValues(SlowCompositeCodecReaderWrapper.java:842)
        at org.apache.lucene.index.CodecReader.getFloatVectorValues(CodecReader.java:244)
        at org.apache.lucene.misc.index.BpVectorReorderer.computeDocMap(BpVectorReorderer.java:590)
        at org.apache.lucene.misc.index.BPReorderingMergePolicy$1.reorder(BPReorderingMergePolicy.java:138)
        at org.apache.lucene.index.IndexWriter.addIndexesReaderMerge(IndexWriter.java:3426)
        at org.apache.lucene.index.IndexWriter$AddIndexesMergeSource.merge(IndexWriter.java:3334)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:664)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:726)

@Pulkitg64 Pulkitg64 changed the title Avoid reconstructing HNSW graph during singleton merges Avoid reconstructing HNSW graphs during segment merging. Aug 11, 2025
@Pulkitg64 (Contributor Author)

I created a pull request against my own repository, and the tests pass there: Pulkitg64#1

This issue looks transient; the next commit should fix it.

@Pulkitg64 (Contributor Author) commented Aug 13, 2025

Adding some KnnPerfTestResults, where I tried to simulate deletes while indexing docs. We are seeing a consistent improvement in indexing time and indexing rate (except one odd case where we deleted 40% of the docs) without impacting recall.

Num Docs: 1MM
Max-Conn: 32
Beam-Width: 250
Quantize Bits: 32
Topk: 100

| % Deletes | Baseline Recall | Baseline Indexing Time (s) | Baseline Indexing Rate (docs/s) | Candidate Recall | Candidate Indexing Time (s) | Candidate Indexing Rate (docs/s) | % Change Indexing Time | % Change Indexing Rate |
|---|---|---|---|---|---|---|---|---|
| 25 | 0.952 | 692 | 1443 | 0.955 | 576 | 1734 | -17% | 20% |
| 30 | 0.952 | 581 | 1719 | 0.958 | 517 | 1932 | -11% | 12% |
| 40 | 0.951 | 560 | 1782 | 0.945 | 553 | 1805 | -1% | 1% |
| 50 | 0.96 | 446 | 2241 | 0.953 | 421 | 2371 | -6% | 6% |
| 60 | 0.974 | 234 | 4265 | 0.972 | 208 | 4804 | -11% | 13% |

@msokolov (Contributor)

I am confused! This PR suddenly got so much simpler, which is great, but I feel like it dropped a few things that seemed important. E.g. we are no longer checking the largest graph to see if its delete % is below a threshold? Also, I think we are now ignoring the various edge cases around upper-level graph layers possibly becoming empty?

@Pulkitg64 (Contributor Author)

With maxConn = 16, I am seeing much better results. But in one odd case, with 25% deletes, I am seeing a regression in indexing rate. Trying maxConn = 8 in the next benchmark run.

| % Deletes | Baseline Recall | Baseline Indexing Time (s) | Baseline Indexing Rate (docs/s) | Candidate Recall | Candidate Indexing Time (s) | Candidate Indexing Rate (docs/s) | % Change Indexing Time | % Change Indexing Rate |
|---|---|---|---|---|---|---|---|---|
| 25 | 0.922 | 453 | 2205 | 0.914 | 484 | 2063 | 7% | -6% |
| 30 | 0.918 | 470 | 2125 | 0.94 | 279 | 3581 | -41% | 69% |
| 40 | 0.903 | 494 | 2022 | 0.942 | 258 | 3867 | -48% | 91% |
| 50 | 0.915 | 421 | 2372 | 0.946 | 223 | 4466 | -47% | 88% |
| 60 | 0.934 | 301 | 3303 | 0.947 | 214 | 4658 | -29% | 41% |

@benwtrent (Member)

@Pulkitg64 what exactly are you benchmarking? It seems like the latest version of this PR does nothing to actually correct the graph nodes?

We should handle:

  • If layers get completely removed (do we promote new nodes?)
  • Removing deleted nodes and reconnecting the neighbors to their nearest non-deleted nodes
  • Completely throwing away the graph if the deletion percentage is above a certain threshold (the first commit of this PR had that at 30%; I think it could maybe be as high as 50%).

@benwtrent (Member)

Ah, maybe I don't fully grok the current impl. It seems like it's doing the "largest graph" thing, but now it's more clever and doing the initialized-graph thing, and that is where the deletes are being removed?

@Pulkitg64 (Contributor Author)

> I am confused! This PR suddenly got so much simpler, which is great,

Yeah, the initGraph implementation in InitializedHnswGraphBuilder.java simplifies a lot of things for us, since it already supports creating an OnHeapHnswGraph by passing in the off-heap graph from the older segment.

> we are no longer checking the largest graph to see if its delete % is below a threshold?

Yes, in the first revision I added an arbitrary percentage without doing any testing. But this time I wanted to see the impact of the merge policy that kicks in when the delete % is higher than a certain threshold. I thought we may not need an explicit check of the largest graph's delete %, because the merge policy will automatically take care of this.

> Also, I think we are now ignoring the various edge cases around upper-level graph layers possibly becoming empty?

The initGraph implementation takes care of it. In that implementation we start with the top level, and if there is no live node in that level the new entry node is never set; when we iterate down to the next level that has some live nodes, we set the new entry node there. In this way we remove the risk of an empty upper layer in the graph. On the other hand, there is still a risk of a middle layer being completely deleted, which I believe we need to take care of.
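
Roughly speaking, the behavior described above is something like the following schematic (not the actual InitializedHnswGraphBuilder code; oldGraph and oldToNewOrd are assumed to be available):

```java
// Walk levels from the top down; the first live node encountered becomes the new
// entry node, so a fully deleted top layer is effectively skipped rather than
// being left empty in the new graph.
int newEntryNode = -1;
for (int level = oldGraph.numLevels() - 1; level >= 0 && newEntryNode == -1; level--) {
  HnswGraph.NodesIterator it = oldGraph.getNodesOnLevel(level);
  while (it.hasNext()) {
    int oldOrd = it.nextInt();
    if (oldToNewOrd[oldOrd] != -1) { // first live node seen top-down
      newEntryNode = oldToNewOrd[oldOrd];
      break;
    }
  }
}
```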

@Pulkitg64 (Contributor Author)

> Ah, maybe I don't fully grok the current impl. It seems like it's doing the "largest graph" thing, but now it's more clever and doing the initialized-graph thing, and that is where the deletes are being removed?

That's right @benwtrent, we are skipping deleted nodes from the largest graph in the initGraph implementation.

@benwtrent (Member)

@Pulkitg64 pretty damn clever ;). I gotta think through this. Intuitively, it SHOULD work, even for singleton merges

@msokolov (Contributor)

It's fascinating that we actually see recall improving in many cases! Intuitively, I think when we merge more segments in we have an opportunity to patch up the holes left by the deleted docs, and maybe we somehow end up doing that in an even better way the second time around?

I do wonder what recall will look like for graphs with high deletion rates that are singleton-merged only? I wonder if we could test that with luceneutil by creating a single-segment index (with force-merge), deleting 50% of the docs, and then force-merging again?

@Pulkitg64 (Contributor Author)

Based on @msokolov's suggestion, I ran the benchmarks simulating singleton merging. For this I indexed 1M docs, force-merged the segments, deleted documents, and then force-merged the segment again.

I am seeing a consistent improvement (about a 50x speedup) in force-merge time after deletes, but also a degradation in recall (about 10%). It's probably because of the disconnectedness issue (let me try to measure the connectedness of these graphs as well).

| Delete Pct | Baseline Recall | Baseline Force Merge Time (s) | Candidate Recall | Candidate Force Merge Time (s) | Recall Change | Force Merge Speedup |
|---|---|---|---|---|---|---|
| 50% | 0.892 | 417.52 | 0.763 | 8.43 | -14% | 50x |
| 40% | 0.887 | 505.74 | 0.799 | 9.91 | -10% | 50x |
| 30% | 0.88 | 585 | 0.822 | 10.98 | -7% | 53x |
| 20% | 0.878 | 677 | 0.802 | 12.4 | -9% | 54x |
| 10% | 0.874 | 772.42 | 0.856 | 13.5 | -2% | 59x |

@benwtrent (Member)

> It's probably because of the disconnectedness issue (let me try to measure the connectedness of these graphs as well).

I would think so. My gut is that we don't actually go through and "fixup" anything when there is just one graph. We just pick the biggest one, and since there are no more vectors to add, we just drop connections on the ground.

I would expect us to have to iterate through the graph and, for every vector that is significantly disconnected, attempt to reconnect it with NN-descent starting at its original place in the graph (initializing with its neighbors' neighbors if all of its connections were removed).
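
If we go that route, the repair pass might look roughly like this. It is purely illustrative: it works on plain adjacency lists and raw vectors rather than the real graph classes, only handles fully orphaned nodes, and does a single neighbor-of-neighbor seeding pass rather than full NN-descent:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative repair pass over level 0 for nodes orphaned by dropping deletes. */
final class ReconnectSketch {

  /**
   * @param oldNeighbors level-0 adjacency of the pre-delete graph, by old ordinal
   * @param newNeighbors level-0 adjacency after deletes were dropped, by old ordinal;
   *     null for deleted nodes
   * @param vectors the vectors, by old ordinal (assumed normalized, so dot = similarity)
   * @param maxConn maximum number of connections to restore per orphaned node
   */
  static void reconnectOrphans(
      int[][] oldNeighbors, List<List<Integer>> newNeighbors, float[][] vectors, int maxConn) {
    for (int node = 0; node < oldNeighbors.length; node++) {
      List<Integer> current = newNeighbors.get(node);
      if (current == null || !current.isEmpty()) {
        continue; // node was deleted, or still has connections
      }
      // Seed candidates from the surviving neighbors of this node's former neighbors.
      Set<Integer> candidates = new LinkedHashSet<>();
      for (int formerNbr : oldNeighbors[node]) {
        if (newNeighbors.get(formerNbr) == null) { // former neighbor was deleted
          for (int c : oldNeighbors[formerNbr]) {
            if (c != node && newNeighbors.get(c) != null) {
              candidates.add(c);
            }
          }
        } else {
          candidates.add(formerNbr); // former neighbor survived, reconnect directly
        }
      }
      // Keep the maxConn most similar candidates.
      final int target = node;
      List<Integer> ranked = new ArrayList<>(candidates);
      ranked.sort(Comparator.comparingDouble(c -> -dotProduct(vectors[target], vectors[c])));
      for (int i = 0; i < Math.min(maxConn, ranked.size()); i++) {
        current.add(ranked.get(i));
      }
    }
  }

  private static float dotProduct(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

A real implementation would presumably run more rounds (NN-descent style) and also keep the links symmetric and diverse, the way the graph builder does.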

@Pulkitg64 (Contributor Author) commented Sep 2, 2025

Thanks @benwtrent for the suggestion. For now, I am thinking that we can keep a threshold of 10% deletes, i.e. we will only reuse the graph (instead of building it from scratch) for segments whose delete % is less than or equal to 10%.

I can create a separate issue/PR for fixing up the graph (reconnecting nodes) and then try to raise the delete threshold above 10%. Please let me know your thoughts.

I re-ran the benchmark with delete percentages up to 15%, and the results are similar.

| Delete Pct | Baseline Recall | Baseline Force Merge Time (s) | Candidate Recall | Candidate Force Merge Time (s) | Recall Change | Force Merge Speedup |
|---|---|---|---|---|---|---|
| 0% | 0.872 | 0 | 0.873 | 0 | | |
| 2% | 0.871 | 831 | 0.866 | 13 | -1% | 64x |
| 5% | 0.873 | 810 | 0.863 | 13 | -1% | 62x |
| 8% | 0.874 | 783 | 0.861 | 13 | -1% | 60x |
| 10% | 0.874 | 773 | 0.857 | 13 | -2% | 60x |
| 15% | 0.876 | 730 | 0.848 | 12 | -3% | 60x |

Also ran with different maxConn values, keeping the delete % threshold at 10%:

| Max Conn | Delete Pct | Baseline Recall | Baseline Force Merge Time (s) | Candidate Recall | Candidate Force Merge Time (s) | Recall Change | Force Merge Speedup |
|---|---|---|---|---|---|---|---|
| 32 | 10% | 0.874 | 773 | 0.857 | 13 | -2% | 60x |
| 16 | 10% | 0.811 | 550 | 0.793 | 12 | -2% | 45x |
| 8 | 10% | 0.696 | 360 | 0.675 | 12 | -3% | 30x |

Raising a new revision with the threshold limit.
