
fix(vector-core): harden cosine similarity with input validation, zero-magnitude handling, and edge case tests #881

Open
bhaktofmahakal wants to merge 4 commits into HelixDB:main from
bhaktofmahakal:bhaktofmahakal/opensource-to-helix

Conversation

@bhaktofmahakal bhaktofmahakal commented Mar 7, 2026

Summary

Hardens the vector core module with input validation, proper error handling for edge cases, and bug fixes that prevent silent data corruption and misleading results in cosine similarity calculations.

Problem

The vector core had several correctness issues:

  1. Zero-magnitude vectors silently returned -1.0 from cosine_similarity() — this is mathematically undefined and misleading (treated as "most dissimilar" instead of signaling an error)
  2. Empty vectors could be inserted into the HNSW index and used in search queries, causing NaN propagation in downstream distance calculations
  3. No NaN/Infinity guards on the computed similarity output
  4. Stale println! in production path: println!("mis-match in vector dimensions!") was left in cosine_similarity()
  5. is_none() || unwrap() anti-pattern in select_neighbors — logically correct but fragile and non-idiomatic
  6. Incomplete boundary check in get_all_vectors — checked prefix + 16 bytes but key format requires prefix + 16 + 8
  7. Typo: "emtpy search result" in two error messages
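
To make items 1 to 3 concrete, here is a minimal sketch of how an unguarded cosine computation turns empty or all-zero input into NaN. `naive_cosine` is a hypothetical stand-in written for this illustration; it is not the code in vector_distance.rs.

```rust
/// Hypothetical unguarded cosine similarity, for illustration only.
/// This is NOT the HelixDB implementation.
fn naive_cosine(a: &[f64], b: &[f64]) -> f64 {
    // Dot product and squared magnitudes, with no validation at all.
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let mag_a: f64 = a.iter().map(|x| x * x).sum();
    let mag_b: f64 = b.iter().map(|x| x * x).sum();
    // For empty or all-zero input this evaluates 0.0 / 0.0,
    // which is NaN under IEEE 754 rather than a panic.
    dot / (mag_a.sqrt() * mag_b.sqrt())
}
```

Calling `naive_cosine(&[], &[])` or `naive_cosine(&[0.0, 0.0], &[1.0, 2.0])` yields NaN. Every ordered comparison against NaN is false, so such a score neither passes nor fails a threshold check, which is one way it can quietly distort results downstream.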

Changes

vector_distance.rs — Cosine similarity hardening

  • Empty vectors → Err(InvalidVectorData) (was: proceed to divide-by-zero)
  • Zero/near-zero magnitude → Err(InvalidVectorData) using f64::EPSILON threshold (was: Ok(-1.0))
  • Added NaN/Infinity guard on result
  • Removed stale println!
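
The guard order described above might look roughly like the following. This is a sketch under assumptions, not the PR's diff: the `VectorError` enum here is a simplified stand-in (the real type in HelixDB may carry messages or other payloads), and the threshold is applied to the squared magnitudes, as the review flowchart on this page suggests.

```rust
/// Simplified stand-in for HelixDB's error type (assumption: the real
/// variants may carry payloads).
#[derive(Debug, PartialEq)]
enum VectorError {
    InvalidVectorData,
    InvalidVectorLength,
}

/// Sketch of a guarded scalar cosine similarity.
fn cosine_similarity(from: &[f64], to: &[f64]) -> Result<f64, VectorError> {
    // Guard 1: empty input has no direction.
    if from.is_empty() || to.is_empty() {
        return Err(VectorError::InvalidVectorData);
    }
    // Guard 2: dimensions must match.
    if from.len() != to.len() {
        return Err(VectorError::InvalidVectorLength);
    }

    let mut dot = 0.0_f64;
    let mut mag_a = 0.0_f64; // sum of squares of `from`
    let mut mag_b = 0.0_f64; // sum of squares of `to`
    for (a, b) in from.iter().zip(to) {
        dot += a * b;
        mag_a += a * a;
        mag_b += b * b;
    }

    // Guard 3: zero or near-zero magnitude makes the ratio undefined.
    if mag_a < f64::EPSILON || mag_b < f64::EPSILON {
        return Err(VectorError::InvalidVectorData);
    }

    let similarity = dot / (mag_a.sqrt() * mag_b.sqrt());

    // Guard 4: catch NaN or infinity produced by overflow or bad input.
    if similarity.is_nan() || similarity.is_infinite() {
        return Err(VectorError::InvalidVectorData);
    }
    Ok(similarity)
}
```

With this shape, `cosine_similarity(&[0.0, 0.0], &[1.0, 2.0])` returns `Err(VectorError::InvalidVectorData)` instead of a score, so callers have to handle the undefined case explicitly.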

vector_core.rs — Structural improvements

  • Added empty vector validation in both insert() and search() entry points
  • Fixed "emtpy" → "empty" in 2 error messages
  • Replaced filter.is_none() || filter.unwrap()... with idiomatic filter.as_ref().map_or(true, ...)
  • Removed dead commented-out code block in select_neighbors
  • Fixed get_all_vectors key length check: prefix + 16 → prefix + 16 + 8
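
For readers unfamiliar with the filter idiom mentioned above, here is a small sketch contrasting the two forms. The function names `keep_old` and `keep_new` and the `u64` item type are hypothetical; the actual filter signature in select_neighbors is not shown on this page.

```rust
/// Old shape: correct, but the `unwrap()` is only safe because of the
/// `is_none()` check to its left, which is easy to break when editing.
fn keep_old<F: Fn(&u64) -> bool>(filter: &Option<F>, id: &u64) -> bool {
    filter.is_none() || (filter.as_ref().unwrap())(id)
}

/// New shape: `map_or` states the default (no filter means keep the item)
/// and applies the predicate only when one is present. No `unwrap()`.
fn keep_new<F: Fn(&u64) -> bool>(filter: &Option<F>, id: &u64) -> bool {
    filter.as_ref().map_or(true, |f| f(id))
}
```

Both functions return the same result for every input; the second simply removes the implicit dependency between the two halves of the `||` expression.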

Tests — 12 new edge case tests

  • Zero-magnitude, near-zero magnitude, empty, both-empty, one-empty vectors → error
  • Dimension mismatch → error
  • Identical, opposite, orthogonal, single-element, large-dimension (1024-d) vectors → correct results
  • Updated test_hvector_distance_max to use opposite vectors (was using zero-magnitude)
  • Updated test_upsert_v to expect error on empty vector data

Test Results

All 1228 tests pass (0 failures), verified with:

cargo test --lib -p helix-db -- --skip concurrency_tests --skip hnsw_tests --skip worker_pool_tests


Summary of What Changed (4 files)

| File | What Changed |
| --- | --- |
| vector_distance.rs | Empty/zero-magnitude vectors now return Err instead of -1.0; removed println!; added NaN guard |
| vector_core.rs | Added empty data validation in insert() & search(); fixed typo "emtpy" → "empty"; fixed key length check; idiomatic filter pattern |
| vector_tests.rs | Added 12 new edge case tests for cosine similarity |
| upsert_tests.rs | Updated test to expect error on empty vector data |

Greptile Summary

This PR hardens the vector core module by adding input validation and proper error handling to cosine_similarity and the HNSW insert/search entry points, replacing a silent incorrect return value (Ok(-1.0) for zero-magnitude vectors) with explicit errors, and adding 12 new edge-case tests.

Key changes:

  • vector_distance.rs: Adds empty-vector, zero-magnitude (< f64::EPSILON), and NaN/Inf guards to the scalar cosine similarity path; removes the stale println!
  • vector_core.rs: Empty data validation added in insert() and search(); fixes boundary check in get_all_vectors (+16 → +16+8); fixes two "emtpy" typos; replaces is_none() || unwrap() anti-pattern with idiomatic map_or
  • vector_tests.rs: 12 new tests for edge cases; test_hvector_distance_max updated to use valid opposite vectors
  • upsert_tests.rs: Test updated to assert that empty vector data now returns an error

Issue found:

  • The new zero-magnitude check and NaN guard in vector_distance.rs are placed after the #[cfg(target_feature = "avx2")] early return, meaning they are completely bypassed on AVX2 targets. Additionally, cosine_similarity_avx2 still has a return type of f64 rather than Result<f64, VectorError>, which would produce a compile error on any AVX2-capable build. The empty-vector and dimension-mismatch guards are correctly placed before the AVX2 branch, but the magnitude and NaN protections need to be replicated inside the AVX2 path (or the function signature updated) to make the fix complete.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["cosine_similarity(from, to)"] --> B{Either input empty?}
    B -- Yes --> C["Err(InvalidVectorData)"]
    B -- No --> D{Lengths equal?}
    D -- No --> E["Err(InvalidVectorLength)"]
    D -- Yes --> F{AVX2 feature enabled?}
    F -- Yes --> G["cosine_similarity_avx2(from, to)\nreturns f64 ⚠️\nNo magnitude/NaN guard"]
    F -- No --> H["Scalar loop:\ncompute dot_product,\nmagnitude_a, magnitude_b"]
    H --> I{"magnitude_a < ε\nor magnitude_b < ε?"}
    I -- Yes --> J["Err(InvalidVectorData)"]
    I -- No --> K["similarity = dot / (√mag_a × √mag_b)"]
    K --> L{NaN or Infinite?}
    L -- Yes --> M["Err(InvalidVectorData)"]
    L -- No --> N["Ok(similarity)"]

    style G fill:#ffcccc,stroke:#cc0000
    style C fill:#ffe0e0
    style E fill:#ffe0e0
    style J fill:#ffe0e0
    style M fill:#ffe0e0
    style N fill:#ccffcc,stroke:#00aa00

Comments Outside Diff (1)

  1. helix-db/src/helix_engine/vector_core/vector_distance.rs, line 39-42 (link)

    New guards bypass AVX2 path

    The zero-magnitude check (line 83) and NaN/Inf guard (line 89) added by this PR are placed after the #[cfg(target_feature = "avx2")] early return on line 42. When compiled with AVX2 support enabled, cosine_similarity_avx2 is called instead, which has no magnitude check or NaN guard — a zero-magnitude vector on an AVX2 target would still produce a NaN or inf result rather than Err(InvalidVectorData).

    Additionally, cosine_similarity_avx2 returns f64, not Result<f64, VectorError>, so return cosine_similarity_avx2(from, to) would be a compile error on any AVX2 target. Both the empty-vector and dimension-mismatch guards are correctly placed before this branch, but the new magnitude/NaN protections need to be applied inside cosine_similarity_avx2 (or it should be changed to return Result<f64, VectorError>) for the fix to be complete.

    #[cfg(target_feature = "avx2")]
    {
        return cosine_similarity_avx2(from, to);
        // ^^^ Returns f64, not Result<f64, VectorError> — compile error on AVX2 targets
        //     Also: no zero-magnitude or NaN guard in cosine_similarity_avx2
    }
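
The comment above suggests either putting the guards inside cosine_similarity_avx2 or changing its return type. Another option, sketched here and not taken from the PR, is to have each kernel return only the raw sums and route both through one validating function, so the guards cannot drift apart. All names below are hypothetical, `VectorError` is a simplified stand-in, and `sums_lanes` is a plain-Rust placeholder for a SIMD kernel rather than real AVX2 intrinsics.

```rust
#[derive(Debug, PartialEq)]
enum VectorError {
    InvalidVectorData,
    InvalidVectorLength,
}

/// Raw accumulations: (dot product, sum of squares of a, sum of squares of b).
type Sums = (f64, f64, f64);

/// Scalar kernel: computes the sums only, no validation.
#[allow(dead_code)]
fn sums_scalar(a: &[f64], b: &[f64]) -> Sums {
    let (mut dot, mut ma, mut mb) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        ma += x * x;
        mb += y * y;
    }
    (dot, ma, mb)
}

/// Placeholder for a SIMD kernel: accumulates in four lanes, the way a
/// 256-bit register of f64 values would. Assumes equal-length inputs.
#[allow(dead_code)]
fn sums_lanes(a: &[f64], b: &[f64]) -> Sums {
    let (mut dot, mut ma, mut mb) = ([0.0_f64; 4], [0.0_f64; 4], [0.0_f64; 4]);
    let (ca, cb) = (a.chunks_exact(4), b.chunks_exact(4));
    let (tail_a, tail_b) = (ca.remainder(), cb.remainder());
    for (x, y) in ca.zip(cb) {
        for i in 0..4 {
            dot[i] += x[i] * y[i];
            ma[i] += x[i] * x[i];
            mb[i] += y[i] * y[i];
        }
    }
    // Fold the leftover elements (fewer than four) into lane 0.
    for (x, y) in tail_a.iter().zip(tail_b) {
        dot[0] += x * y;
        ma[0] += x * x;
        mb[0] += y * y;
    }
    (dot.iter().sum(), ma.iter().sum(), mb.iter().sum())
}

/// Shared validation: every kernel's output passes through here.
fn finish((dot, mag_a, mag_b): Sums) -> Result<f64, VectorError> {
    if mag_a < f64::EPSILON || mag_b < f64::EPSILON {
        return Err(VectorError::InvalidVectorData);
    }
    let similarity = dot / (mag_a.sqrt() * mag_b.sqrt());
    if similarity.is_nan() || similarity.is_infinite() {
        return Err(VectorError::InvalidVectorData);
    }
    Ok(similarity)
}

fn cosine_similarity(from: &[f64], to: &[f64]) -> Result<f64, VectorError> {
    if from.is_empty() || to.is_empty() {
        return Err(VectorError::InvalidVectorData);
    }
    if from.len() != to.len() {
        return Err(VectorError::InvalidVectorLength);
    }
    // Only the kernel choice depends on the build target.
    #[cfg(target_feature = "avx2")]
    let sums = sums_lanes(from, to);
    #[cfg(not(target_feature = "avx2"))]
    let sums = sums_scalar(from, to);
    finish(sums)
}
```

Because the `#[cfg]` branch selects only which kernel produces the sums, a zero-magnitude vector is rejected the same way on every target, and the type mismatch between `f64` and `Result<f64, VectorError>` does not arise.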

Last reviewed commit: 02b8a11


Copilot AI and others added 3 commits March 7, 2026 13:26
…, fix typos, add edge case tests

Co-authored-by: bhaktofmahakal <113044681+bhaktofmahakal@users.noreply.github.com>
…es, add near-zero test

Co-authored-by: bhaktofmahakal <113044681+bhaktofmahakal@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 7, 2026 14:00
Contributor

Copilot AI left a comment

Pull request overview

This PR hardens the vector-core cosine similarity and HNSW entry points to avoid undefined/invalid similarity results (e.g., empty or zero-magnitude vectors) and adds regression tests for key edge cases.

Changes:

  • Add input validation and NaN/Infinity guards to cosine_similarity, and remove a stale println!.
  • Reject empty vector data at HNSW insert() and search() entry points; fix a key-length boundary check and typo in error messages.
  • Add/adjust tests to cover cosine-similarity edge cases and updated upsert behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| helix-db/src/helix_engine/vector_core/vector_distance.rs | Adds validation/guards to cosine similarity and removes debug output (but AVX2 path needs alignment). |
| helix-db/src/helix_engine/vector_core/vector_core.rs | Rejects empty vectors in insert/search, fixes key length check, typo fix, and makes filter handling more idiomatic. |
| helix-db/src/helix_engine/tests/vector_tests.rs | Adds cosine-similarity edge case tests and updates max-distance test to avoid zero vectors. |
| helix-db/src/helix_engine/tests/traversal_tests/upsert_tests.rs | Updates upsert test to expect an error for empty vector data. |

Comments suppressed due to low confidence (1)

helix-db/src/helix_engine/vector_core/vector_distance.rs:42

  • In the AVX2 fast-path, cosine_similarity returns cosine_similarity_avx2(from, to) but cosine_similarity_avx2 currently returns f64, not Result<f64, VectorError>. This will fail to compile whenever target_feature = "avx2" is enabled, and it also bypasses the new zero/near-zero magnitude and NaN/Infinity validation added in the scalar path. Update the AVX2 implementation (and the call site) to return Result<f64, VectorError> and apply the same validation semantics as the scalar implementation so behavior is consistent across builds.
    #[cfg(target_feature = "avx2")]
    {
        return cosine_similarity_avx2(from, to);
    }


…alidation with scalar path

Address review feedback from Copilot and Greptile reviewers on PR HelixDB#881:
- Change cosine_similarity_avx2 return type from f64 to Result<f64, VectorError>
- Add zero/near-zero magnitude check (f64::EPSILON threshold) in AVX2 path
- Add NaN/Infinity guard in AVX2 path
- Ensures consistent behavior across scalar and SIMD code paths

Co-authored-by: bhaktofmahakal <113044681+bhaktofmahakal@users.noreply.github.com>

3 participants