Skip to content

Conversation

@yhmo
Copy link
Collaborator

@yhmo yhmo commented Dec 31, 2025

  • Core invariant: ConvertToBinaryVector and ConvertToBoolArray use the same bit ordering (0th bool → bit 0, …, 7th → bit 7) so binary vectors round-trip exactly; the example relies on this invariant for correct encode/decode and verification (functions at top of file).
  • Logic removed/simplified: replaced ad-hoc string primary keys and previous raw byte generation with deterministic packing and INT64 primary keys — schema now declares pk as INT64, column inserts use Int64FieldData, row-based inserts set row[pk]=i and row[vector]=ConvertToBinaryVector(bools) (insert paths around the insert blocks).
  • Why this does NOT introduce data loss or regression: every inserted binary vector (row-based path) is encoded with ConvertToBinaryVector and later decoded with ConvertToBoolArray during query result processing; the code compares decoded bools against the original bools_array by id and exits on mismatch, ensuring detection of any corruption (query/result handling and validation in the Query result loop).
  • Enhancement/refactor: improves clarity and determinism — explicit binary packing/validation, numeric IN-filter construction for queries, and consistent INT64 PK handling replace redundant string-based PKs and non-deterministic vector generation, reducing memory/alloc overhead and making round-trip behavior explicit.

@sre-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yhmo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Dec 31, 2025

Walkthrough

The PR modifies examples/src/v2/vector_binary.cpp: the ID field schema is changed from VARCHAR to INT64; column- and row-based insertions are updated to use int64 IDs and Int64FieldData; query construction and ID extraction use numeric IDs. Two anonymous-namespace helpers, ConvertToBinaryVector and ConvertToBoolArray, were added to pack/unpack vector to/from vector<uint8_t> and are used to store and validate the binary-vector field. Query result processing decodes binary vectors back to booleans and compares them to the original per-row booleans, emitting diagnostic output on mismatches.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Rewrite binary vector example' clearly and concisely summarizes the main change—rewriting the binary vector example file. It is specific, directly related to the changeset, and accurately reflects the primary objective of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/src/v2/vector_binary.cpp (1)

176-190: Consider using the 4-parameter version of std::equal for safer comparison.

The verification logic works correctly for the current dimension=128. However, the 3-parameter std::equal call on line 182 assumes original_bools is at least as long as bools. If dimension were ever changed to a non-multiple of 8, ConvertToBoolArray would return extra padding bits, making bools longer than original_bools, which could lead to undefined behavior.

🔎 Proposed fix for more robust comparison
-            if (!std::equal(bools.begin(), bools.end(), original_bools.begin())) {
+            // Only compare the first 'dimension' elements to ignore any padding bits
+            if (bools.size() < dimension || 
+                !std::equal(bools.begin(), bools.begin() + dimension, original_bools.begin(), original_bools.end())) {
                 std::cout << "Output vector is not equal to the original!" << std::endl;

Alternatively, truncate bools to dimension before comparison:

             auto bools = ConvertToBoolArray(binary);
+            bools.resize(dimension);  // Truncate to dimension to remove padding bits
             auto id = row[field_id].get<int64_t>();
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6f74ef and 9c48d7e.

📒 Files selected for processing (1)
  • examples/src/v2/vector_binary.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and test AMD64 Fedora 39
  • GitHub Check: Test all with coverage (22.04)
  • GitHub Check: Build and test windows
  • GitHub Check: Summary
🔇 Additional comments (5)
examples/src/v2/vector_binary.cpp (5)

82-82: LGTM!

Changing the primary key from VARCHAR to INT64 simplifies the example and aligns with typical numeric ID use cases.


105-111: LGTM!

The column-based insertion correctly uses INT64 field data and int64_t IDs, consistent with the schema change.


121-126: LGTM!

Good approach to store the original boolean vectors for verification. The code works correctly since dimension=128 is a multiple of 8.


133-135: LGTM!

Row-based insertion correctly uses int64_t IDs and the ConvertToBinaryVector utility to encode boolean vectors.


153-155: LGTM!

Query filter correctly retrieves and uses int64_t IDs with numeric literal formatting.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
examples/src/v2/vector_binary.cpp (1)

24-62: Fix typos in comments (previously flagged).

The typos mentioned in the previous review are still present:

  • Line 30: "euqal" → "equal"
  • Line 31: "8X of dimension" is misleading (should clarify it's dimension/8 rounded up)
  • Line 48: "fromt" → "from"
🧹 Nitpick comments (1)
examples/src/v2/vector_binary.cpp (1)

50-60: Consider adding a dimension parameter for robustness.

ConvertToBoolArray always returns binary.size() * 8 booleans. If the vector dimension is not a multiple of 8, the returned vector will include padding bits that weren't part of the original data. While this works correctly with dimension=128, adding a dimension parameter to trim the result would make the function more robust for other use cases.

🔎 Suggested enhancement
 std::vector<bool>
-ConvertToBoolArray(const std::vector<uint8_t>& binary) {
+ConvertToBoolArray(const std::vector<uint8_t>& binary, size_t dimension) {
     std::vector<bool> bits;
-    bits.reserve(binary.size() * 8);
+    bits.reserve(dimension);
     for (uint8_t byte : binary) {
         for (int i = 0; i < 8; i++) {
+            if (bits.size() >= dimension) break;
             bool bit_is_set = (byte >> i) & 1;
             bits.push_back(bit_is_set);
         }
+        if (bits.size() >= dimension) break;
     }
     return bits;
 }

Then update the call site at line 180:

-auto bools = ConvertToBoolArray(binary);
+auto bools = ConvertToBoolArray(binary, dimension);
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c48d7e and 9866171.

📒 Files selected for processing (1)
  • examples/src/v2/vector_binary.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Summary
🔇 Additional comments (5)
examples/src/v2/vector_binary.cpp (5)

82-82: LGTM: ID field type change to INT64.

Changing the primary key from VARCHAR to INT64 is a good improvement. Numeric IDs are more efficient and semantically appropriate for this use case.


105-111: LGTM: Column-based insertion updated correctly.

The changes to use int64_t IDs and Int64FieldData are correct and consistent with the schema change.


121-139: LGTM: Row-based insertion and validation preparation.

The pre-generation of original boolean vectors in bools_array for later validation is a good practice. The row data construction correctly uses int64_t IDs and the ConvertToBinaryVector helper.


154-156: LGTM: Query filter construction updated correctly.

The query filter correctly uses int64_t IDs extracted from the row data and constructs a numeric IN clause.


178-190: LGTM: Validation logic is well-designed.

The validation logic correctly:

  • Decodes binary vectors back to booleans using ConvertToBoolArray
  • Retrieves the row ID as int64_t
  • Compares decoded vectors against the original stored vectors
  • Provides clear diagnostic output on mismatch

The code correctly assumes that query results contain only row-based insert data (IDs 0–9), which matches the constructed filter.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9866171 and 3bff387.

📒 Files selected for processing (1)
  • examples/src/v2/vector_binary.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Summary
🔇 Additional comments (4)
examples/src/v2/vector_binary.cpp (4)

82-82: LGTM! Schema change is consistent.

The ID field type change from VARCHAR to INT64 is intentional and all related code has been updated accordingly.


105-111: LGTM! Column-based insertion updated correctly.

The ID data type and field data type have been updated consistently with the schema change.


121-128: LGTM! Ground truth storage added.

Storing the original boolean vectors for later validation is a good practice for this example.


130-149: LGTM! Row-based insertion updated correctly.

The ID type change to int64_t and the use of ConvertToBinaryVector are both correct and consistent with the schema.

Comment on lines +179 to 192
std::cout << "\tRow: " << row << std::endl;
auto binary = row[field_vector].get<std::vector<uint8_t>>();
auto bools = ConvertToBoolArray(binary);
auto id = row[field_id].get<int64_t>();
auto original_bools = bools_array.at(id);
if (!std::equal(bools.begin(), bools.end(), original_bools.begin())) {
std::cout << "Output vector is not equal to the original!" << std::endl;
std::cout << "\tOutput vector: ";
util::PrintList(bools);
std::cout << "\tOriginal vector: ";
util::PrintList(original_bools);
exit(1);
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Consider defensive validation for robustness.

The validation logic has two minor issues that could cause crashes if the code is modified or unexpected data is returned:

  1. Line 183: bools_array.at(id) assumes id is in range [0, 9]. While the query filter ensures only row-based IDs are returned, an unexpected ID would throw std::out_of_range. Consider adding a bounds check.

  2. Line 184: The 2-parameter std::equal doesn't verify that both ranges have the same size. If sizes differ (e.g., due to unexpected binary vector size from server), this could cause undefined behavior or incorrect validation. Use the 4-parameter version to compare sizes: std::equal(bools.begin(), bools.end(), original_bools.begin(), original_bools.end()).

🔎 Proposed defensive validation
         for (const auto& row : output_rows) {
             std::cout << "\tRow: " << row << std::endl;
             auto binary = row[field_vector].get<std::vector<uint8_t>>();
             auto bools = ConvertToBoolArray(binary);
             auto id = row[field_id].get<int64_t>();
+            if (id < 0 || id >= static_cast<int64_t>(bools_array.size())) {
+                std::cout << "Unexpected ID " << id << " outside valid range [0, " << bools_array.size() << ")" << std::endl;
+                exit(1);
+            }
             auto original_bools = bools_array.at(id);
-            if (!std::equal(bools.begin(), bools.end(), original_bools.begin())) {
+            if (!std::equal(bools.begin(), bools.end(), original_bools.begin(), original_bools.end())) {
                 std::cout << "Output vector is not equal to the original!" << std::endl;
                 std::cout << "\tOutput vector: ";
                 util::PrintList(bools);
                 std::cout << "\tOriginal vector: ";
                 util::PrintList(original_bools);
                 exit(1);
             }
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
std::cout << "\tRow: " << row << std::endl;
auto binary = row[field_vector].get<std::vector<uint8_t>>();
auto bools = ConvertToBoolArray(binary);
auto id = row[field_id].get<int64_t>();
auto original_bools = bools_array.at(id);
if (!std::equal(bools.begin(), bools.end(), original_bools.begin())) {
std::cout << "Output vector is not equal to the original!" << std::endl;
std::cout << "\tOutput vector: ";
util::PrintList(bools);
std::cout << "\tOriginal vector: ";
util::PrintList(original_bools);
exit(1);
}
}
std::cout << "\tRow: " << row << std::endl;
auto binary = row[field_vector].get<std::vector<uint8_t>>();
auto bools = ConvertToBoolArray(binary);
auto id = row[field_id].get<int64_t>();
if (id < 0 || id >= static_cast<int64_t>(bools_array.size())) {
std::cout << "Unexpected ID " << id << " outside valid range [0, " << bools_array.size() << ")" << std::endl;
exit(1);
}
auto original_bools = bools_array.at(id);
if (!std::equal(bools.begin(), bools.end(), original_bools.begin(), original_bools.end())) {
std::cout << "Output vector is not equal to the original!" << std::endl;
std::cout << "\tOutput vector: ";
util::PrintList(bools);
std::cout << "\tOriginal vector: ";
util::PrintList(original_bools);
exit(1);
}
}
🤖 Prompt for AI Agents
In examples/src/v2/vector_binary.cpp around lines 179 to 192, the validation
assumes id is in-range and that the two vectors are the same length; add a
defensive bounds check before accessing bools_array.at(id) (verify 0 <= id <
bools_array.size() and log/exit on failure), and ensure you compare sizes before
equality (check bools.size() == original_bools.size() and handle/log a size
mismatch), otherwise use the 4-parameter std::equal that considers both ranges;
include clear diagnostic logs when either check fails and exit with nonzero
status.

@codecov
Copy link

codecov bot commented Dec 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.12%. Comparing base (a0592aa) to head (92f99cb).
⚠️ Report is 70 commits behind head on master.

❌ Your project status has failed because the head coverage (45.12%) is below the target coverage (90.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #422      +/-   ##
==========================================
- Coverage   53.47%   45.12%   -8.36%     
==========================================
  Files          52      244     +192     
  Lines        4432    12491    +8059     
  Branches        0     1219    +1219     
==========================================
+ Hits         2370     5636    +3266     
- Misses       2062     6762    +4700     
- Partials        0       93      +93     

see 261 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mergify mergify bot added the ci-passed label Dec 31, 2025
@yhmo yhmo added the lgtm label Jan 4, 2026
@sre-ci-robot sre-ci-robot merged commit 2e3e7bf into milvus-io:master Jan 4, 2026
10 checks passed
@yhmo yhmo deleted the ma branch January 4, 2026 02:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants