vga91
Collaborator

@vga91 vga91 commented Aug 19, 2025

Fixes #4447

  • cherry-pick in dev
  • waiting for next versions, to check if APIs will change

[WIP]

Note:
To use the vector type at the moment, we have to test against an Enterprise edition instance with this config:

internal.cypher.enable_vector_type=true

The genai.vector.cosine function doesn't exist;
I think what was intended is genai.vector.encode together with vector.similarity.cosine.

I don't think there is any benefit in creating a new APOC procedure, since this Cypher query:

MATCH (node:Similar)
WITH node, vector.similarity.cosine(node.embedding, $queryVector) AS score
WHERE score >= $threshold
RETURN node, score
ORDER BY score DESC
LIMIT $topK

is faster than:

CALL custom.search.batchedSimilarity(nodes, 'vect', VECTOR([1, 2, 3], 3, INTEGER32), 5, 0.8) YIELD node, score
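For reference, the work both variants have to do (score every candidate, keep those at or above the threshold, sort descending, take the top K) can be sketched in plain Java. This is only an illustration, not the APOC or Neo4j implementation: the names TopKCosineSketch, Scored and topK are hypothetical, and the sketch returns the raw cosine in [-1, 1], so its scaling may differ from the score returned by vector.similarity.cosine.

```java
import java.util.ArrayList;
import java.util.List;

final class TopKCosineSketch {

    /** Hypothetical holder: position of the candidate plus its similarity score. */
    record Scored(int index, double score) {}

    /** Textbook cosine similarity: dot(a, b) / (|a| * |b|). */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        // Assumption of this sketch: a zero vector scores 0 instead of failing.
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Mirrors the WHERE score >= threshold / ORDER BY score DESC / LIMIT k steps. */
    static List<Scored> topK(List<float[]> candidates, float[] query, double threshold, int k) {
        List<Scored> hits = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            double score = cosine(candidates.get(i), query);
            if (score >= threshold) {
                hits.add(new Scored(i, score));
            }
        }
        hits.sort((x, y) -> Double.compare(y.score(), x.score())); // descending
        return new ArrayList<>(hits.subList(0, Math.min(k, hits.size())));
    }

    public static void main(String[] args) {
        List<float[]> candidates = List.of(
                new float[]{1f, 0f}, new float[]{0f, 1f}, new float[]{1f, 1f});
        for (Scored hit : topK(candidates, new float[]{1f, 0f}, 0.5, 5)) {
            System.out.println(hit.index() + " " + hit.score());
        }
    }
}
```

A procedure would run this same loop and additionally pay the procedure-call and value-conversion overhead, which fits the idea that a dedicated procedure has little room to beat the built-in function.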

When using encode, we have to execute e.g.:

MATCH (node:Similar)
WITH node, genai.vector.encode('propertyToEncode', 'OpenAI', { token: "<tokenKey>" }) AS propertyVector
WITH node, vector.similarity.cosine(propertyVector, $queryVector) AS score
WHERE score >= $threshold
RETURN node, score
ORDER BY score DESC
LIMIT $topK

I tried to replicate the implementation behind vector.similarity.cosine (from here), but it is quite hard to follow and may be part of non-public source code.

At this time, using the Java Vector API / SIMD is not feasible, or rather not useful, because it creates a bottleneck: it works on arrays such as float[], double[] ... (as documented here and here).
For now we can only access a specific index and retrieve a single float/double of the vector (see here; we can't access the entire set of coordinates).
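To make the bottleneck concrete, here is a minimal sketch under stated assumptions. IndexedVector is a hypothetical stand-in for a vector value that only exposes per-index reads; it is not the actual Neo4j type. An array-based API would first need the copy done in toArray, which costs one allocation plus n reads per vector before any arithmetic starts, while the direct loop reads each coordinate once. The sketch deliberately uses plain scalar loops and not jdk.incubator.vector.

```java
final class PerIndexAccessSketch {

    /** Hypothetical stand-in for a vector value that only offers per-index access. */
    interface IndexedVector {
        int dimensions();
        float floatAt(int index);
    }

    /** Helper for this sketch only: wraps a float array behind the per-index interface. */
    static IndexedVector wrap(float... values) {
        float[] data = values.clone();
        return new IndexedVector() {
            @Override public int dimensions() { return data.length; }
            @Override public float floatAt(int index) { return data[index]; }
        };
    }

    /** The copy an array-based API would require: one allocation and n reads per vector. */
    static float[] toArray(IndexedVector v) {
        float[] out = new float[v.dimensions()];
        for (int i = 0; i < out.length; i++) {
            out[i] = v.floatAt(i);
        }
        return out;
    }

    /** Dot product after materializing both vectors as arrays. */
    static double dotViaCopy(IndexedVector a, IndexedVector b) {
        float[] x = toArray(a);
        float[] y = toArray(b);
        double dot = 0;
        for (int i = 0; i < x.length; i++) {
            dot += (double) x[i] * y[i];
        }
        return dot;
    }

    /** Dot product reading each coordinate directly, with no intermediate arrays. */
    static double dotDirect(IndexedVector a, IndexedVector b) {
        double dot = 0;
        for (int i = 0; i < a.dimensions(); i++) {
            dot += (double) a.floatAt(i) * b.floatAt(i);
        }
        return dot;
    }

    public static void main(String[] args) {
        IndexedVector a = wrap(1f, 2f, 3f);
        IndexedVector b = wrap(4f, 5f, 6f);
        System.out.println(dotViaCopy(a, b) == dotDirect(a, b));
    }
}
```

Both paths return the same value; the difference is only the extra copy, which would have to be repeated for every node in a batch.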

But in any case, the pure Cypher approach would probably still be better, since it leverages the Java Vector API as well (in fact, executing the vector.similarity.cosine function prints the following message):

WARNING: Java vector incubator module is not readable. For optimal vector performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.

With vector.similarity.cosine operating on the vector data type, the multiple conversions before the similarity computation no longer seem to be needed, since it compares vector values directly instead of converting e.g. a List coming from genai.vector.encode results:


CYPHER 25 RETURN valueType(VECTOR([1,2,3], 3, INTEGER8))
// ---> returns VECTOR<INTEGER8 NOT NULL>(3) NOT NULL

WITH genai.vector.encode('titleAndPlot', 'OpenAI', { token: "<tokenKey>" }) AS propertyVector
RETURN valueType(propertyVector)
// ---> returns LIST<FLOAT NOT NULL> NOT NULL

Times:

procedure = 567
pure cypher = 283

@vga91 vga91 marked this pull request as draft August 19, 2025 09:41
@vga91 vga91 changed the title [WIP] Issue 4447 [WIP] Issue 4447: batched cosine / euclidean procedure for more efficient computation of vector similarities Aug 21, 2025