Add low-level optimized Neon, AVX2, and AVX 512 float32 vector operations #130635

ChrisHegarty · 2025-07-04T13:38:51Z

This commit adds low-level optimized Neon, AVX2, and AVX 512 float32 vector operations: cosine, dot product, and square distance.
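For reference, the three operations can be written as plain scalar loops. This is only a minimal sketch of what each operation computes, not the optimized native code in this PR; the class name `Float32VectorOps` and its method names are hypothetical and chosen for illustration.

```java
/** Scalar reference versions of the three float32 operations (illustrative only). */
final class Float32VectorOps {

    private static void checkSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ: " + a.length + " != " + b.length);
        }
    }

    /** Dot product: the sum of a[i] * b[i]. */
    static float dotProduct(float[] a, float[] b) {
        checkSameLength(a, b);
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Square distance: the sum of (a[i] - b[i])^2, with no square root taken. */
    static float squareDistance(float[] a, float[] b) {
        checkSameLength(a, b);
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Cosine similarity: dot(a, b) / (|a| * |b|). Yields NaN if either vector has zero norm. */
    static float cosine(float[] a, float[] b) {
        checkSameLength(a, b);
        float dot = 0f;
        float normA = 0f;
        float normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / Math.sqrt((double) normA * normB));
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {4f, 5f, 6f};
        System.out.println("dot product     = " + dotProduct(a, b));
        System.out.println("square distance = " + squareDistance(a, b));
        System.out.println("cosine          = " + cosine(a, b));
    }
}
```

A SIMD implementation computes the same sums but processes several lanes per instruction, so its floating-point results can differ slightly from this scalar loop because the additions happen in a different order.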

The changes in this PR give an approximately 2x performance increase for float32 vector operations across Linux/Mac AArch64 and Linux x64 (both AVX2 and AVX 512).

The performance increase comes mostly from being able to score the vectors off-heap (rather than copying them on-heap before scoring). The low-level native scorer implementations show only an approximately 3-5% improvement over the existing Panama Vector implementation. However, the native scorers make it possible to score off-heap. Using Panama Vector with MemorySegments runs into a performance bug in HotSpot, where the bounds check is not optimally hoisted out of the hot loop (this has been reported to and acknowledged by OpenJDK).
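The difference between the two scoring paths can be sketched as follows. This is a rough illustration under stated assumptions, not the code in this PR: a direct `java.nio.ByteBuffer` stands in for an off-heap `MemorySegment`, the loops are scalar rather than native SIMD, and the class and method names (`OffHeapScoringSketch`, `dotProductWithCopy`, `dotProductInPlace`) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Contrasts copy-then-score with scoring directly against off-heap memory (illustrative only). */
final class OffHeapScoringSketch {

    /** Places a vector in off-heap memory; a direct buffer stands in for a native MemorySegment. */
    static FloatBuffer offHeapVector(float[] values) {
        FloatBuffer buf = ByteBuffer.allocateDirect(values.length * Float.BYTES)
            .order(ByteOrder.nativeOrder())
            .asFloatBuffer();
        buf.put(values);
        buf.flip(); // position back to 0, limit at the number of floats written
        return buf;
    }

    private static void checkDims(float[] query, FloatBuffer stored) {
        if (query.length != stored.remaining()) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
    }

    /** Copy-then-score: the stored vector is first materialized as an on-heap float[]. */
    static float dotProductWithCopy(float[] query, FloatBuffer stored) {
        checkDims(query, stored);
        float[] copy = new float[stored.remaining()];
        stored.duplicate().get(copy); // duplicate() leaves the caller's position untouched
        float sum = 0f;
        for (int i = 0; i < query.length; i++) {
            sum += query[i] * copy[i];
        }
        return sum;
    }

    /** Score in place: each element is read straight from off-heap memory, with no intermediate array. */
    static float dotProductInPlace(float[] query, FloatBuffer stored) {
        checkDims(query, stored);
        float sum = 0f;
        for (int i = 0; i < query.length; i++) {
            sum += query[i] * stored.get(i); // absolute read; assumes the buffer's position is 0
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] query = {1f, 2f, 3f};
        FloatBuffer stored = offHeapVector(new float[] {4f, 5f, 6f});
        System.out.println("with copy = " + dotProductWithCopy(query, stored));
        System.out.println("in place  = " + dotProductInPlace(query, stored));
    }
}
```

Both methods return the same score; the second simply avoids allocating and filling the temporary array on every call. The sketch is not meant to reproduce the measured speedup, which depends on the native implementations and the benchmarks in this PR.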

These vector ops will be used by the higher-level vector scorers in #130541.

elasticsearchmachine · 2025-07-04T13:39:15Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

ChrisHegarty · 2025-07-04T13:39:42Z

The micro benchmarks all show an approximately 2x performance improvement in scorer operations on all platforms. For example:

Apple Mac M2, AArch64
Low-level benchmark results. Compare dotProductLuceneWithCopy to dotProductNativeWithNativeSeg; smaller is better (average time, ns/op).

Benchmark                                                (size)  Mode  Cnt    Score    Error  Units
JDKVectorFloat32Benchmark.dotProductLucene                 1024  avgt   15   60.448 ±  4.160  ns/op
JDKVectorFloat32Benchmark.dotProductLuceneWithCopy         1024  avgt   15  115.741 ± 11.562  ns/op
JDKVectorFloat32Benchmark.dotProductNativeWithHeapSeg      1024  avgt   15   60.691 ±  4.329  ns/op
JDKVectorFloat32Benchmark.dotProductNativeWithNativeSeg    1024  avgt   15   59.111 ±  0.751  ns/op

Scorer benchmark. Compare dotProductLuceneQuery to dotProductNativeQuery, bigger is better.

Benchmark                                     (dims)   Mode  Cnt  Score   Error   Units
Float32ScorerBenchmark.dotProductLucene         1024  thrpt    5  3.522 ± 0.025  ops/us
Float32ScorerBenchmark.dotProductLuceneQuery    1024  thrpt    5  3.969 ± 0.110  ops/us
Float32ScorerBenchmark.dotProductNative         1024  thrpt    5  7.772 ± 0.060  ops/us
Float32ScorerBenchmark.dotProductNativeQuery    1024  thrpt    5  8.260 ± 0.123  ops/us
Float32ScorerBenchmark.dotProductScalar         1024  thrpt    5  0.602 ± 0.003  ops/us

elasticsearchmachine · 2025-07-04T14:03:25Z

Hi @ChrisHegarty, I've created a changelog YAML for you.

ldematte

LGTM, thanks for breaking this out in a PR!

Add low-level optimized Neon, AVX2, and AVX 512 float32 vector operations.

62a8782

ChrisHegarty requested a review from ldematte July 4, 2025 13:38

ChrisHegarty requested a review from a team as a code owner July 4, 2025 13:38

ChrisHegarty added :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.2.0 labels Jul 4, 2025

ChrisHegarty added test-windows Trigger CI checks on Windows test-arm Pull Requests that should be tested against arm agents >enhancement labels Jul 4, 2025

ChrisHegarty added 2 commits July 4, 2025 15:02

Merge branch 'main' into native_float32

d8254f1

Update docs/changelog/130635.yaml

4ed416d

ldematte reviewed Jul 4, 2025

ldematte approved these changes Jul 4, 2025

ChrisHegarty merged commit b486d90 into elastic:main Jul 4, 2025
38 of 44 checks passed

ChrisHegarty mentioned this pull request Jul 7, 2025

Leverage optimized native float32 vector scorers. #130541

Open
