Conversation

ChrisHegarty
Contributor

@ChrisHegarty ChrisHegarty commented Jul 3, 2025

Leverage optimized native float32 vector scorers.

The changes in this PR give approximately a 2x performance increase for float32 vector operations across Linux/Mac AArch64 and Linux x64 (both AVX2 and AVX 512).

The vector scorers leverage the native vector operations added by #130635.

The tests verify that the native scorers return similar values to those of the Lucene scorers.

TODO: feature flag the new format so that we only create indices with that format when enabled.
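The parity tests mentioned above compare native and Lucene scores within a tolerance. As a rough illustration of the kind of check such tests typically use (the class name, method, and delta below are illustrative sketches, not the PR's actual test code):

```java
// Illustrative sketch of a score-parity tolerance check; the name and
// delta are assumptions, not the PR's actual test code.
public class ScorerParityCheck {
    static final float DELTA = 1e-3f;

    // Returns true when the two scores agree within an absolute or
    // relative tolerance, whichever is looser.
    static boolean similar(float expected, float actual) {
        float diff = Math.abs(expected - actual);
        return diff <= DELTA
            || diff <= DELTA * Math.max(Math.abs(expected), Math.abs(actual));
    }

    public static void main(String[] args) {
        // e.g. a Lucene scorer result vs. the native scorer result
        System.out.println(similar(0.7071f, 0.7070f)); // true
        System.out.println(similar(0.5f, 0.9f));       // false
    }
}
```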

@ChrisHegarty ChrisHegarty added :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Jul 3, 2025
@ChrisHegarty ChrisHegarty added test-windows Trigger CI checks on Windows test-arm Pull Requests that should be tested against arm agents labels Jul 3, 2025
@ldematte ldematte self-requested a review July 3, 2025 15:08
@ChrisHegarty
Contributor Author

The microbenchmarks all show an approximately 2x performance improvement in scorer operations across all platforms. For example:

Apple Mac M2, AArch64
Low-level benchmark results. Compare dotProductLuceneWithCopy to dotProductNativeWithNativeSeg; smaller is better (average time per operation).

Benchmark                                                (size)  Mode  Cnt    Score    Error  Units
JDKVectorFloat32Benchmark.dotProductLucene                 1024  avgt   15   60.448 ±  4.160  ns/op
JDKVectorFloat32Benchmark.dotProductLuceneWithCopy         1024  avgt   15  115.741 ± 11.562  ns/op
JDKVectorFloat32Benchmark.dotProductNativeWithHeapSeg      1024  avgt   15   60.691 ±  4.329  ns/op
JDKVectorFloat32Benchmark.dotProductNativeWithNativeSeg    1024  avgt   15   59.111 ±  0.751  ns/op

Scorer benchmark. Compare dotProductLuceneQuery to dotProductNativeQuery, bigger is better.

Benchmark                                     (dims)   Mode  Cnt  Score   Error   Units
Float32ScorerBenchmark.dotProductLucene         1024  thrpt    5  3.522 ± 0.025  ops/us
Float32ScorerBenchmark.dotProductLuceneQuery    1024  thrpt    5  3.969 ± 0.110  ops/us
Float32ScorerBenchmark.dotProductNative         1024  thrpt    5  7.772 ± 0.060  ops/us
Float32ScorerBenchmark.dotProductNativeQuery    1024  thrpt    5  8.260 ± 0.123  ops/us
Float32ScorerBenchmark.dotProductScalar         1024  thrpt    5  0.602 ± 0.003  ops/us

@ChrisHegarty ChrisHegarty added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Jul 4, 2025
Contributor

@ldematte ldematte left a comment


I've concentrated mainly on the native (C) part and its interaction (NativeAccess/VectorLibrary), and that looks good to me.
I have looked at the benchmarks and tests and they look sensible, but I think it's better to have another pair of eyes on the other parts (VectorScorer/Lucene interaction).

@ldematte
Contributor

ldematte commented Jul 4, 2025

The benchmarks look good, and they are a testament to the quality of Panama and its usage in Lucene! Hopefully OpenJDK will get rid of the bug "soon" :)

ChrisHegarty added a commit that referenced this pull request Jul 4, 2025
…ions (#130635)

This commit adds low-level optimized Neon, AVX2, and AVX 512 float32 vector operations: cosine, dot product, and square distance.

The changes in this PR give approximately a 2x performance increase for float32 vector operations across Linux/Mac AArch64 and Linux x64 (both AVX2 and AVX 512).

The performance increase comes mostly from being able to score the vectors off-heap (rather than copying them on-heap before scoring). The low-level native scorer implementations show only an approximately 3-5% improvement over the existing Panama Vector implementation. However, the native scorers allow scoring off-heap. Using Panama Vector with MemorySegments runs into a performance bug in HotSpot, where the bounds check is not optimally hoisted out of the hot loop (this has been reported to and acknowledged by OpenJDK).

These vector ops will be used by higher-level vector scorers in #130541
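The three operations have simple scalar reference semantics; below is a hedged sketch in plain Java (the class name is illustrative). The actual native code implements the same math with Neon/AVX2/AVX 512 intrinsics rather than a scalar loop.

```java
// Illustrative scalar reference for the three float32 similarity ops the
// commit accelerates. The native implementations vectorize these loops
// with Neon/AVX intrinsics; the math itself is the same.
public class Float32Ops {
    public static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static float cosine(float[] a, float[] b) {
        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / Math.sqrt((double) normA * normB));
    }

    public static float squareDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {4f, 5f, 6f};
        System.out.println(dotProduct(a, b));     // 32.0
        System.out.println(squareDistance(a, b)); // 27.0
    }
}
```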
@ChrisHegarty ChrisHegarty changed the title Add optimized Neon, AVX2, and AVX 512 float32 vector operations. Add optimized Neon, AVX2, and AVX 512 float32 vector scorers. Jul 7, 2025
@ChrisHegarty ChrisHegarty changed the title Add optimized Neon, AVX2, and AVX 512 float32 vector scorers. Leverage optimized native float32 vector scorers. Jul 7, 2025
@ChrisHegarty ChrisHegarty marked this pull request as ready for review July 7, 2025 20:49
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)


// Minimal copy of Lucene99HnswVectorsFormat in order to provide an optimized scorer,
// which returns identical scores to those of the default flat vector scorer.
public class ES819HnswVectorsFormat extends KnnVectorsFormat {
Contributor Author


I'm considering whether it is worth backporting this to 8.19.x.

Member


IMO, this is way too complex to actually backport to the 8.19 branch.

However, "backporting" the change to allow the scorers to be used on indices created in the 8.19 series (still the majority of indices) is a very nice thing :)

Contributor Author


Yeah. Oddly, because we already have an ES813FlatVectorFormat, we can just hook the native scorer implementation in, and quite a few existing flat indices will get it. For HNSW, we just use the Lucene99 format, so we cannot do that as easily.

@benwtrent
Member

@ChrisHegarty @ldematte where are we on this PR? Do we want to do a chunk at a time and maybe bypass the new format for now?

FYI, I am seeing other places where we should do something like this; we can hook in off-heap scoring for rescoring vectors as well. #131048

@ldematte
Contributor

ldematte commented Sep 3, 2025

My 2c: I would do one step at a time and merge this PR as-is (if it is correct, i.e. if CI passes). The part that I'm familiar with (the NativeAccess stuff) LGTM.

@ChrisHegarty
Contributor Author

I deliberately paused this PR in order to switch focus onto the bulk scoring API in Lucene. I think that we should reflow this API and the native scorers on top of the upcoming Lucene bulk scoring API. That way we can use prefetch instructions on the next vector as we're scoring the current vector, or just score, say, 4 vectors per loop iteration. It will also reduce the cost of Java-to-native transitions, which are small but a little more costly on x64 since we use a shared segment for the index.
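The bulk pattern described here can be sketched in plain Java: score four candidate vectors per loop iteration against one query, so a native implementation could prefetch upcoming vectors and amortize each Java-to-native transition over four vectors' worth of work. All names below are hypothetical, and the scalar dot product stands in for the native call.

```java
// Hypothetical sketch of bulk scoring: four candidates per iteration,
// with a scalar tail loop. In a native bulk scorer, the body of the
// unrolled loop would be a single downcall scoring four off-heap vectors.
public class BulkScorer {
    static float dot(float[] q, float[] v) {
        float s = 0f;
        for (int i = 0; i < q.length; i++) s += q[i] * v[i];
        return s;
    }

    static float[] scoreBulk(float[] query, float[][] candidates) {
        float[] scores = new float[candidates.length];
        int i = 0;
        for (; i + 4 <= candidates.length; i += 4) {
            // Stand-in for one native call that scores four vectors at once.
            scores[i]     = dot(query, candidates[i]);
            scores[i + 1] = dot(query, candidates[i + 1]);
            scores[i + 2] = dot(query, candidates[i + 2]);
            scores[i + 3] = dot(query, candidates[i + 3]);
        }
        for (; i < candidates.length; i++) { // scalar tail
            scores[i] = dot(query, candidates[i]);
        }
        return scores;
    }

    public static void main(String[] args) {
        float[] q = {1f, 1f};
        float[][] cands = {{1f, 0f}, {0f, 1f}, {2f, 2f}, {3f, 0f}, {0f, 5f}};
        for (float f : scoreBulk(q, cands)) {
            System.out.println(f); // 1.0, 1.0, 4.0, 3.0, 5.0
        }
    }
}
```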

@benwtrent
Member

I deliberately paused this PR in order to switch focus onto the bulk scoring API in Lucene. I think that we should reflow this API and the native scorers on top of the upcoming Lucene bulk scoring API.

I agree, but do you mean that no native code should be invoked via single function calls?

I would expect this PR plus an additional one, where the follow-up turns on bulk scoring (e.g. 4 vectors at a time, or something similar).

Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch test-arm Pull Requests that should be tested against arm agents test-windows Trigger CI checks on Windows v9.3.0