Bulk dot product with prefetching #1

mccullocht · 2025-07-09T20:56:55Z

Score vectors in groups of up to 8, prefetching some stride ahead of the load pointer in each vector.

This is about 35% faster than the Java implementation and about 30% faster than the earlier native implementation.

Without the prefetching code, this is about 25% faster than the Java implementation and about 20% faster than the earlier native implementation.
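
For readers unfamiliar with the technique, here is a minimal scalar sketch of grouped scoring with prefetching. It is not the code in this PR: the names `bulk_dot`, `dot_group`, `PREFETCH_STRIDE`, and `MAX_GROUP` are hypothetical, the stride value is an arbitrary assumption, and it relies on the GCC/Clang `__builtin_prefetch` builtin rather than whatever the native implementation actually uses.

```c
#include <stddef.h>

/* Assumption: prefetch 16 floats (one 64-byte cache line) ahead. */
#define PREFETCH_STRIDE 16
/* Score at most this many vectors together, as in the description. */
#define MAX_GROUP 8

/* Score one group of up to MAX_GROUP vectors against the query.
 * Hypothetical helper, not taken from the PR. */
static void dot_group(const float *query, const float *const *docs,
                      size_t group, size_t dim, float *out) {
    float acc[MAX_GROUP] = {0};

    /* Walk all vectors of the group in lockstep, one chunk at a time. */
    for (size_t i = 0; i < dim; i += PREFETCH_STRIDE) {
        size_t end = i + PREFETCH_STRIDE < dim ? i + PREFETCH_STRIDE : dim;
        for (size_t g = 0; g < group; g++) {
            /* Hint the next chunk of this vector into cache while the
             * current chunk is being multiplied (read access, keep in cache). */
            if (end < dim) {
                __builtin_prefetch(docs[g] + end, 0, 3);
            }
            for (size_t j = i; j < end; j++) {
                acc[g] += query[j] * docs[g][j];
            }
        }
    }

    for (size_t g = 0; g < group; g++) {
        out[g] = acc[g];
    }
}

/* Compute out[v] = dot(query, docs[v]) for v in [0, count). */
void bulk_dot(const float *query, const float *const *docs,
              size_t count, size_t dim, float *out) {
    for (size_t base = 0; base < count; base += MAX_GROUP) {
        size_t left = count - base;
        size_t group = left < MAX_GROUP ? left : MAX_GROUP;
        dot_group(query, docs + base, group, dim, out + base);
    }
}
```

Interleaving the vectors of a group gives each prefetch time to complete while the other vectors in the group are being processed. The best stride and group size depend on the hardware and vector dimension, so the values here should be treated as placeholders to tune.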

mccullocht added 3 commits July 9, 2025 13:11

score 4 at a time, ~25% better

98c7e8f

slightly more judicious prefetching, plus early prefetching

89afabb

score in chunks of sizes 2, 4, or 8. overall about 1/3 faster

915ba1f
