ChrisHegarty
Contributor

This commit adds low-level optimized Neon, AVX2, and AVX 512 float32 vector operations: cosine, dot product, and square distance.
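For reference, the three operations can be written as plain scalar loops. This is only an illustrative sketch of what each operation computes, not the Neon/AVX code added by this commit; the class and method names are hypothetical.

```java
// Illustrative scalar definitions only. The names below are hypothetical
// and are not taken from the PR, which implements these ops natively.
final class Float32VectorOps {

    // Sum of element-wise products.
    public static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Sum of squared element-wise differences (squared Euclidean distance).
    public static float squareDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Dot product divided by the product of the two vector norms.
    // Returns NaN if either vector has zero norm.
    public static float cosine(float[] a, float[] b) {
        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / Math.sqrt((double) normA * normB));
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {4f, 5f, 6f};
        System.out.println(dotProduct(a, b));     // 32.0
        System.out.println(squareDistance(a, b)); // 27.0
        System.out.println(cosine(a, b));
    }
}
```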

The changes in this PR give an approximately 2x performance increase for float32 vector operations across Linux/Mac AArch64 and Linux x64 (both AVX2 and AVX 512).

The performance increase comes mostly from being able to score the vectors off-heap (rather than copying them on-heap before scoring). The low-level native scorer implementations show only an approximately 3-5% improvement over the existing Panama Vector implementation, but they make it possible to score off-heap. Using Panama Vector with MemorySegments runs into a performance bug in HotSpot, where the bounds check is not optimally hoisted out of the hot loop (this has been reported to and acknowledged by OpenJDK).
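To make the copy-versus-in-place distinction concrete, here is a minimal sketch. It is not the code in this PR: it uses a direct `ByteBuffer` as a stand-in for off-heap memory (the PR works with MemorySegments and native code), the loops are scalar, and all names are hypothetical. It only shows where the extra per-score copy comes from.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical sketch: a direct buffer stands in for off-heap vector storage.
final class OffHeapScoringSketch {

    // Allocates off-heap storage for `count` vectors of `dims` floats each.
    public static FloatBuffer allocateOffHeap(int count, int dims) {
        return ByteBuffer.allocateDirect(count * dims * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Copy-then-score: the stored vector is first copied into an on-heap
    // scratch array, and the dot product is computed over that copy.
    public static float dotProductWithCopy(float[] query, FloatBuffer vectors, int ord, float[] scratch) {
        int dims = query.length;
        FloatBuffer view = vectors.duplicate(); // independent position, same memory
        view.position(ord * dims);
        view.get(scratch, 0, dims);             // the extra copy
        float sum = 0f;
        for (int i = 0; i < dims; i++) {
            sum += query[i] * scratch[i];
        }
        return sum;
    }

    // Score in place: the stored vector is read directly from off-heap
    // memory, with no intermediate on-heap copy.
    public static float dotProductInPlace(float[] query, FloatBuffer vectors, int ord) {
        int dims = query.length;
        int base = ord * dims;
        float sum = 0f;
        for (int i = 0; i < dims; i++) {
            sum += query[i] * vectors.get(base + i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int dims = 3;
        FloatBuffer vectors = allocateOffHeap(2, dims);
        float[] stored = {1f, 2f, 3f, 4f, 5f, 6f}; // two vectors, back to back
        for (int i = 0; i < stored.length; i++) {
            vectors.put(i, stored[i]);
        }
        float[] query = {1f, 1f, 1f};
        System.out.println(dotProductWithCopy(query, vectors, 1, new float[dims])); // 15.0
        System.out.println(dotProductInPlace(query, vectors, 1));                   // 15.0
    }
}
```

Both methods return the same score; the difference is only the per-call copy into `scratch`, which is the cost the description attributes most of the speedup to avoiding.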

These vector ops will be used by higher-level vector scorers in #130541.

@ChrisHegarty ChrisHegarty requested a review from ldematte July 4, 2025 13:38
@ChrisHegarty ChrisHegarty requested a review from a team as a code owner July 4, 2025 13:38
@ChrisHegarty ChrisHegarty added the :Search Relevance/Vectors, Team:Search Relevance, and v9.2.0 labels Jul 4, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@ChrisHegarty
Contributor Author

The micro benchmarks all show an approximately 2x performance improvement in scorer operations on all platforms. For example:

Apple Mac M2, AArch64
Low-level benchmark results. Compare dotProductLuceneWithCopy to dotProductNativeWithNativeSeg; these are average times in ns/op, so smaller is better.

Benchmark                                                (size)  Mode  Cnt    Score    Error  Units
JDKVectorFloat32Benchmark.dotProductLucene                 1024  avgt   15   60.448 ±  4.160  ns/op
JDKVectorFloat32Benchmark.dotProductLuceneWithCopy         1024  avgt   15  115.741 ± 11.562  ns/op
JDKVectorFloat32Benchmark.dotProductNativeWithHeapSeg      1024  avgt   15   60.691 ±  4.329  ns/op
JDKVectorFloat32Benchmark.dotProductNativeWithNativeSeg    1024  avgt   15   59.111 ±  0.751  ns/op

Scorer benchmark. Compare dotProductLuceneQuery to dotProductNativeQuery; these are throughput figures in ops/us, so bigger is better.

Benchmark                                     (dims)   Mode  Cnt  Score   Error   Units
Float32ScorerBenchmark.dotProductLucene         1024  thrpt    5  3.522 ± 0.025  ops/us
Float32ScorerBenchmark.dotProductLuceneQuery    1024  thrpt    5  3.969 ± 0.110  ops/us
Float32ScorerBenchmark.dotProductNative         1024  thrpt    5  7.772 ± 0.060  ops/us
Float32ScorerBenchmark.dotProductNativeQuery    1024  thrpt    5  8.260 ± 0.123  ops/us
Float32ScorerBenchmark.dotProductScalar         1024  thrpt    5  0.602 ± 0.003  ops/us
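
As a rough check of the approximately 2x figure, the ratios can be worked out directly from the two tables above (plain division of the reported scores, ignoring the error margins):

```
Low-level (ns/op, lower is better):
  dotProductLuceneWithCopy / dotProductNativeWithNativeSeg = 115.741 / 59.111 ≈ 1.96x

Scorer (ops/us, higher is better):
  dotProductNativeQuery / dotProductLuceneQuery = 8.260 / 3.969 ≈ 2.08x
  dotProductNative / dotProductLucene           = 7.772 / 3.522 ≈ 2.21x
```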

@ChrisHegarty ChrisHegarty added the test-windows, test-arm, and >enhancement labels Jul 4, 2025
@elasticsearchmachine
Collaborator

Hi @ChrisHegarty, I've created a changelog YAML for you.

Contributor

@ldematte ldematte left a comment

LGTM, thanks for breaking this out in a PR!

@ChrisHegarty ChrisHegarty merged commit b486d90 into elastic:main Jul 4, 2025
38 of 44 checks passed