fix(vector-core): harden cosine similarity with input validation, zero-magnitude handling, and edge case tests#881
Open
bhaktofmahakal wants to merge 4 commits intoHelixDB:mainfrom
Conversation
…, fix typos, add edge case tests Co-authored-by: bhaktofmahakal <113044681+bhaktofmahakal@users.noreply.github.com>
…es, add near-zero test Co-authored-by: bhaktofmahakal <113044681+bhaktofmahakal@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens the vector-core cosine similarity and HNSW entry points to avoid undefined/invalid similarity results (e.g., empty or zero-magnitude vectors) and adds regression tests for key edge cases.
Changes:
- Add input validation and NaN/Infinity guards to
cosine_similarity, and remove a staleprintln!. - Reject empty vector data at HNSW
insert()andsearch()entry points; fix a key-length boundary check and typo in error messages. - Add/adjust tests to cover cosine-similarity edge cases and updated upsert behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
helix-db/src/helix_engine/vector_core/vector_distance.rs |
Adds validation/guards to cosine similarity and removes debug output (but AVX2 path needs alignment). |
helix-db/src/helix_engine/vector_core/vector_core.rs |
Rejects empty vectors in insert/search, fixes key length check, typo fix, and makes filter handling more idiomatic. |
helix-db/src/helix_engine/tests/vector_tests.rs |
Adds cosine-similarity edge case tests and updates max-distance test to avoid zero vectors. |
helix-db/src/helix_engine/tests/traversal_tests/upsert_tests.rs |
Updates upsert test to expect an error for empty vector data. |
Comments suppressed due to low confidence (1)
helix-db/src/helix_engine/vector_core/vector_distance.rs:42
- In the AVX2 fast-path,
cosine_similarityreturnscosine_similarity_avx2(from, to)butcosine_similarity_avx2currently returnsf64, notResult<f64, VectorError>. This will fail to compile whenevertarget_feature = "avx2"is enabled, and it also bypasses the new zero/near-zero magnitude and NaN/Infinity validation added in the scalar path. Update the AVX2 implementation (and the call site) to returnResult<f64, VectorError>and apply the same validation semantics as the scalar implementation so behavior is consistent across builds.
#[cfg(target_feature = "avx2")]
{
return cosine_similarity_avx2(from, to);
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
8 tasks
…alidation with scalar path Address review feedback from Copilot and Greptile reviewers on PR HelixDB#881: - Change cosine_similarity_avx2 return type from f64 to Result<f64, VectorError> - Add zero/near-zero magnitude check (f64::EPSILON threshold) in AVX2 path - Add NaN/Infinity guard in AVX2 path - Ensures consistent behavior across scalar and SIMD code paths Co-authored-by: bhaktofmahakal <113044681+bhaktofmahakal@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hardens the vector core module with input validation, proper error handling for edge cases, and bug fixes that prevent silent data corruption and misleading results in cosine similarity calculations.
Problem
The vector core had several correctness issues:
-1.0fromcosine_similarity()— this is mathematically undefined and misleading (treated as "most dissimilar" instead of signaling an error)println!in production path —println!("mis-match in vector dimensions!")was left incosine_similarity()is_none() || unwrap()anti-pattern inselect_neighbors— logically correct but fragile and non-idiomaticget_all_vectors— checkedprefix + 16bytes but key format requiresprefix + 16 + 8"emtpy search result"in two error messagesChanges
vector_distance.rs— Cosine similarity hardeningErr(InvalidVectorData)(was: proceed to divide-by-zero)Err(InvalidVectorData)usingf64::EPSILONthreshold (was:Ok(-1.0))println!vector_core.rs— Structural improvementsinsert()andsearch()entry points"emtpy"→"empty"in 2 error messagesfilter.is_none() || filter.unwrap()...with idiomaticfilter.as_ref().map_or(true, ...)select_neighborsget_all_vectorskey length check:prefix + 16→prefix + 16 + 8Tests — 12 new edge case tests
test_hvector_distance_maxto use opposite vectors (was using zero-magnitude)test_upsert_vto expect error on empty vector dataTest Results
All 1228 tests pass (0 failures), verified with:
cargo test --lib -p helix-db -- --skip concurrency_tests --skip hnsw_tests --skip worker_pool_tests
Summary of What Changed (4 files)
vector_distance.rsErrinstead of-1.0; removedprintln!; added NaN guardvector_core.rsinsert()&search(); fixed typo "emtpy"→"empty"; fixed key length check; idiomatic filter patternvector_tests.rsupsert_tests.rsGreptile Summary
This PR hardens the vector core module by adding input validation and proper error handling to
cosine_similarityand the HNSW insert/search entry points, replacing a silent incorrect return value (Ok(-1.0)for zero-magnitude vectors) with explicit errors, and adding 12 new edge-case tests.Key changes:
vector_distance.rs: Adds empty-vector, zero-magnitude (< f64::EPSILON), and NaN/Inf guards to the scalar cosine similarity path; removes the staleprintln!vector_core.rs: Empty data validation added ininsert()andsearch(); fixes boundary check inget_all_vectors(+16→+16+8); fixes two "emtpy" typos; replacesis_none() || unwrap()anti-pattern with idiomaticmap_orvector_tests.rs: 12 new tests for edge cases;test_hvector_distance_maxupdated to use valid opposite vectorsupsert_tests.rs: Test updated to assert that empty vector data now returns an errorIssue found:
vector_distance.rsare placed after the#[cfg(target_feature = "avx2")]early return, meaning they are completely bypassed on AVX2 targets. Additionally,cosine_similarity_avx2still has a return type off64rather thanResult<f64, VectorError>, which would produce a compile error on any AVX2-capable build. The empty-vector and dimension-mismatch guards are correctly placed before the AVX2 branch, but the magnitude and NaN protections need to be replicated inside the AVX2 path (or the function signature updated) to make the fix complete.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A["cosine_similarity(from, to)"] --> B{Either input empty?} B -- Yes --> C["Err(InvalidVectorData)"] B -- No --> D{Lengths equal?} D -- No --> E["Err(InvalidVectorLength)"] D -- Yes --> F{AVX2 feature enabled?} F -- Yes --> G["cosine_similarity_avx2(from, to)\nreturns f64 ⚠️\nNo magnitude/NaN guard"] F -- No --> H["Scalar loop:\ncompute dot_product,\nmagnitude_a, magnitude_b"] H --> I{"magnitude_a < ε\nor magnitude_b < ε?"} I -- Yes --> J["Err(InvalidVectorData)"] I -- No --> K["similarity = dot / (√mag_a × √mag_b)"] K --> L{NaN or Infinite?} L -- Yes --> M["Err(InvalidVectorData)"] L -- No --> N["Ok(similarity)"] style G fill:#ffcccc,stroke:#cc0000 style C fill:#ffe0e0 style E fill:#ffe0e0 style J fill:#ffe0e0 style M fill:#ffe0e0 style N fill:#ccffcc,stroke:#00aa00Comments Outside Diff (1)
helix-db/src/helix_engine/vector_core/vector_distance.rs, line 39-42 (link)New guards bypass AVX2 path
The zero-magnitude check (line 83) and NaN/Inf guard (line 89) added by this PR are placed after the
#[cfg(target_feature = "avx2")]early return on line 42. When compiled with AVX2 support enabled,cosine_similarity_avx2is called instead, which has no magnitude check or NaN guard — a zero-magnitude vector on an AVX2 target would still produce aNaNorinfresult rather thanErr(InvalidVectorData).Additionally,
cosine_similarity_avx2returnsf64, notResult<f64, VectorError>, soreturn cosine_similarity_avx2(from, to)would be a compile error on any AVX2 target. Both the empty-vector and dimension-mismatch guards are correctly placed before this branch, but the new magnitude/NaN protections need to be applied insidecosine_similarity_avx2(or it should be changed to returnResult<f64, VectorError>) for the fix to be complete.Last reviewed commit: 02b8a11
(2/5) Greptile learns from your feedback when you react with thumbs up/down!