
CuberMessenger

Summary

Fix #309.

The FAISS score (distance) computation returns the squared L2 distance instead of the L2 distance.

Also, when cosine similarity is used, FAISS should index with the inner product, because the inner product equals the cosine similarity as long as the vectors are normalized to unit length.
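
The identity behind both points can be checked numerically. The following is a minimal sketch in plain Python, assuming two made-up vectors; the helper names `dot` and `normalize` are illustrative and are not part of FAISS or langchain. It checks that, after L2 normalization, the inner product equals the cosine similarity and the squared L2 distance equals 2 - 2 * cosine.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit L2 length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Arbitrary example vectors (hypothetical data, not from the PR).
a = [1.0, 2.0, 3.0]
b = [-2.0, 0.5, 4.0]

# Cosine similarity computed directly from the unnormalized vectors.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# The same quantities computed from the unit-length vectors.
a_unit, b_unit = normalize(a), normalize(b)
inner_product = dot(a_unit, b_unit)
squared_l2 = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))

print(math.isclose(inner_product, cosine))
print(math.isclose(squared_l2, 2 - 2 * cosine))
```

Because the squared distance and the cosine similarity are tied together this way for unit vectors, taking the square root is what separates a true L2 distance from the value that is being returned as the score.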

Description

Please check out the langchain issue to get an overview.

Here's a simple example to show the difference:

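import numpy as np
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

# get_google_embedding_model() is the author's own helper (not shown here).
embedding_model = get_google_embedding_model()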

sents = [
    "Sidewinder has used JavaScript to drop and execute malware loaders.",
]

db_euc = FAISS.from_texts(
    sents,
    embedding_model,
    ids=range(len(sents)),
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
    normalize_L2=True,
)

query = "execute malware loaders"

embedding_doc = np.array(db_euc._embed_query(db_euc.get_by_ids([0])[0].page_content))
embedding_euc = np.array(db_euc._embed_query(query))

# Euclidean distance and similarity by hand
euc_distance = np.sqrt(((embedding_doc - embedding_euc) ** 2).sum())
euc_similarity = 1 - euc_distance / (2**0.5)

print(f"[By hand] Euclidean distance: {euc_distance}")
print(f"[By hand] Euclidean similarity: {euc_similarity}")
"""
[By hand] Euclidean distance: 0.6381204577868655
[By hand] Euclidean similarity: 0.5487806970850433
"""

# Score by FAISS
score = db_euc.similarity_search_with_score(query, k=1)[0][1]

print(f"[FAISS] Score: {score}")
"""
[FAISS] Score: 0.40719807147979736

The score is actually the square of the Euclidean distance:
(0.6381204577868655) ** 2 = 0.4071977186461187
"""

# Relevance score by FAISS
relevance_score = db_euc.similarity_search_with_relevance_scores(query, k=1)[0][1]
print(f"[FAISS] Relevance score: {relevance_score}")
"""
[FAISS] Relevance score: 0.7120674848556519

This relevance score is computed from the squared distance:
0.7120674848556519 = 1 - (0.40719807147979736 / (2**0.5))
"""


### Monkey patch the fix ###
...

def similarity_search_with_score_by_vector(...) -> List[Tuple[Document, float]]:
    ...
    scores, indices = self.index.search(vector, k if filter is None else fetch_k)
    scores = np.sqrt(scores)  # ADDED: convert squared L2 distances to L2 distances
    ...
    return docs[:k]


FAISS.similarity_search_with_score_by_vector = similarity_search_with_score_by_vector

### Monkey patch the fix ###

# Test the fixed score by FAISS
score = db_euc.similarity_search_with_score(query, k=1)[0][1]

print(f"[Fixed FAISS] Score: {score}")
"""
[Fixed FAISS] Score: 0.638120710849762
This matches the by-hand calculation of the Euclidean distance.
"""

# Test the fixed relevance score by FAISS
score = db_euc.similarity_search_with_relevance_scores(query, k=1)[0][1]
print(f"[Fixed FAISS] Relevance score: {score}")
"""
[Fixed FAISS] Relevance score: 0.5487805008888245
This matches the by-hand calculation of the Euclidean similarity.
"""
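
The arithmetic can also be replayed without FAISS or an embedding model. This is a small sketch that only uses the numbers printed in the example; the function name `euclidean_relevance` is a hypothetical helper that mirrors the by-hand formula `1 - d / sqrt(2)`, not a langchain API.

```python
import math

# Values copied from the output reported in the example.
faiss_raw_score = 0.40719807147979736  # score returned by FAISS before the fix
by_hand_distance = 0.6381204577868655  # Euclidean distance computed by hand


def euclidean_relevance(distance):
    """Map a distance between unit vectors to a relevance score: 1 - d / sqrt(2)."""
    return 1.0 - distance / math.sqrt(2)


# Feeding the squared distance into the mapping reproduces the inflated score.
relevance_from_squared = euclidean_relevance(faiss_raw_score)

# Taking the square root first reproduces the by-hand similarity.
relevance_from_distance = euclidean_relevance(math.sqrt(faiss_raw_score))

print(f"sqrt(raw score) vs by-hand distance: {math.sqrt(faiss_raw_score)} vs {by_hand_distance}")
print(f"relevance from squared distance: {relevance_from_squared}")
print(f"relevance from true distance:    {relevance_from_distance}")
```

Small differences in the last digits are expected, since the FAISS scores are computed in single precision.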

Successfully merging this pull request may close these issues.

drawbacks of formula of calculation euclidean similarity