-
Notifications
You must be signed in to change notification settings - Fork 33
Rewrite binary vector example #422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: yhmo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughThe PR modifies examples/src/v2/vector_binary.cpp: the ID field schema is changed from VARCHAR to INT64; column- and row-based insertions are updated to use int64 IDs and Int64FieldData; query construction and ID extraction use numeric IDs. Two anonymous-namespace helpers, ConvertToBinaryVector and ConvertToBoolArray, were added to pack/unpack vector to/from vector<uint8_t> and are used to store and validate the binary-vector field. Query result processing decodes binary vectors back to booleans and compares them to the original per-row booleans, emitting diagnostic output on mismatches. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/src/v2/vector_binary.cpp (1)
176-190: Consider using the 4-parameter version of std::equal for safer comparison.The verification logic works correctly for the current dimension=128. However, the 3-parameter
std::equalcall on line 182 assumesoriginal_boolsis at least as long asbools. If dimension were ever changed to a non-multiple of 8,ConvertToBoolArraywould return extra padding bits, makingboolslonger thanoriginal_bools, which could lead to undefined behavior.🔎 Proposed fix for more robust comparison
- if (!std::equal(bools.begin(), bools.end(), original_bools.begin())) { + // Only compare the first 'dimension' elements to ignore any padding bits + if (bools.size() < dimension || + !std::equal(bools.begin(), bools.begin() + dimension, original_bools.begin(), original_bools.end())) { std::cout << "Output vector is not equal to the original!" << std::endl;Alternatively, truncate
boolsto dimension before comparison:auto bools = ConvertToBoolArray(binary); + bools.resize(dimension); // Truncate to dimension to remove padding bits auto id = row[field_id].get<int64_t>();
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/src/v2/vector_binary.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and test AMD64 Fedora 39
- GitHub Check: Test all with coverage (22.04)
- GitHub Check: Build and test windows
- GitHub Check: Summary
🔇 Additional comments (5)
examples/src/v2/vector_binary.cpp (5)
82-82: LGTM!Changing the primary key from VARCHAR to INT64 simplifies the example and aligns with typical numeric ID use cases.
105-111: LGTM!The column-based insertion correctly uses INT64 field data and int64_t IDs, consistent with the schema change.
121-126: LGTM!Good approach to store the original boolean vectors for verification. The code works correctly since dimension=128 is a multiple of 8.
133-135: LGTM!Row-based insertion correctly uses int64_t IDs and the ConvertToBinaryVector utility to encode boolean vectors.
153-155: LGTM!Query filter correctly retrieves and uses int64_t IDs with numeric literal formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
examples/src/v2/vector_binary.cpp (1)
24-62: Fix typos in comments (previously flagged).The typos mentioned in the previous review are still present:
- Line 30: "euqal" → "equal"
- Line 31: "8X of dimension" is misleading (should clarify it's dimension/8 rounded up)
- Line 48: "fromt" → "from"
🧹 Nitpick comments (1)
examples/src/v2/vector_binary.cpp (1)
50-60: Consider adding a dimension parameter for robustness.
ConvertToBoolArrayalways returnsbinary.size() * 8booleans. If the vector dimension is not a multiple of 8, the returned vector will include padding bits that weren't part of the original data. While this works correctly withdimension=128, adding adimensionparameter to trim the result would make the function more robust for other use cases.🔎 Suggested enhancement
std::vector<bool> -ConvertToBoolArray(const std::vector<uint8_t>& binary) { +ConvertToBoolArray(const std::vector<uint8_t>& binary, size_t dimension) { std::vector<bool> bits; - bits.reserve(binary.size() * 8); + bits.reserve(dimension); for (uint8_t byte : binary) { for (int i = 0; i < 8; i++) { + if (bits.size() >= dimension) break; bool bit_is_set = (byte >> i) & 1; bits.push_back(bit_is_set); } + if (bits.size() >= dimension) break; } return bits; }Then update the call site at line 180:
-auto bools = ConvertToBoolArray(binary); +auto bools = ConvertToBoolArray(binary, dimension);
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/src/v2/vector_binary.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Summary
🔇 Additional comments (5)
examples/src/v2/vector_binary.cpp (5)
82-82: LGTM: ID field type change to INT64.Changing the primary key from
VARCHARtoINT64is a good improvement. Numeric IDs are more efficient and semantically appropriate for this use case.
105-111: LGTM: Column-based insertion updated correctly.The changes to use
int64_tIDs andInt64FieldDataare correct and consistent with the schema change.
121-139: LGTM: Row-based insertion and validation preparation.The pre-generation of original boolean vectors in
bools_arrayfor later validation is a good practice. The row data construction correctly usesint64_tIDs and theConvertToBinaryVectorhelper.
154-156: LGTM: Query filter construction updated correctly.The query filter correctly uses
int64_tIDs extracted from the row data and constructs a numeric IN clause.
178-190: LGTM: Validation logic is well-designed.The validation logic correctly:
- Decodes binary vectors back to booleans using
ConvertToBoolArray- Retrieves the row ID as
int64_t- Compares decoded vectors against the original stored vectors
- Provides clear diagnostic output on mismatch
The code correctly assumes that query results contain only row-based insert data (IDs 0–9), which matches the constructed filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/src/v2/vector_binary.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Summary
🔇 Additional comments (4)
examples/src/v2/vector_binary.cpp (4)
82-82: LGTM! Schema change is consistent.The ID field type change from VARCHAR to INT64 is intentional and all related code has been updated accordingly.
105-111: LGTM! Column-based insertion updated correctly.The ID data type and field data type have been updated consistently with the schema change.
121-128: LGTM! Ground truth storage added.Storing the original boolean vectors for later validation is a good practice for this example.
130-149: LGTM! Row-based insertion updated correctly.The ID type change to int64_t and the use of ConvertToBinaryVector are both correct and consistent with the schema.
| std::cout << "\tRow: " << row << std::endl; | ||
| auto binary = row[field_vector].get<std::vector<uint8_t>>(); | ||
| auto bools = ConvertToBoolArray(binary); | ||
| auto id = row[field_id].get<int64_t>(); | ||
| auto original_bools = bools_array.at(id); | ||
| if (!std::equal(bools.begin(), bools.end(), original_bools.begin())) { | ||
| std::cout << "Output vector is not equal to the original!" << std::endl; | ||
| std::cout << "\tOutput vector: "; | ||
| util::PrintList(bools); | ||
| std::cout << "\tOriginal vector: "; | ||
| util::PrintList(original_bools); | ||
| exit(1); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider defensive validation for robustness.
The validation logic has two minor issues that could cause crashes if the code is modified or unexpected data is returned:
-
Line 183:
bools_array.at(id)assumesidis in range [0, 9]. While the query filter ensures only row-based IDs are returned, an unexpected ID would throwstd::out_of_range. Consider adding a bounds check. -
Line 184: The 2-parameter
std::equaldoesn't verify that both ranges have the same size. If sizes differ (e.g., due to unexpected binary vector size from server), this could cause undefined behavior or incorrect validation. Use the 4-parameter version to compare sizes:std::equal(bools.begin(), bools.end(), original_bools.begin(), original_bools.end()).
🔎 Proposed defensive validation
for (const auto& row : output_rows) {
std::cout << "\tRow: " << row << std::endl;
auto binary = row[field_vector].get<std::vector<uint8_t>>();
auto bools = ConvertToBoolArray(binary);
auto id = row[field_id].get<int64_t>();
+ if (id < 0 || id >= static_cast<int64_t>(bools_array.size())) {
+ std::cout << "Unexpected ID " << id << " outside valid range [0, " << bools_array.size() << ")" << std::endl;
+ exit(1);
+ }
auto original_bools = bools_array.at(id);
- if (!std::equal(bools.begin(), bools.end(), original_bools.begin())) {
+ if (!std::equal(bools.begin(), bools.end(), original_bools.begin(), original_bools.end())) {
std::cout << "Output vector is not equal to the original!" << std::endl;
std::cout << "\tOutput vector: ";
util::PrintList(bools);
std::cout << "\tOriginal vector: ";
util::PrintList(original_bools);
exit(1);
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| std::cout << "\tRow: " << row << std::endl; | |
| auto binary = row[field_vector].get<std::vector<uint8_t>>(); | |
| auto bools = ConvertToBoolArray(binary); | |
| auto id = row[field_id].get<int64_t>(); | |
| auto original_bools = bools_array.at(id); | |
| if (!std::equal(bools.begin(), bools.end(), original_bools.begin())) { | |
| std::cout << "Output vector is not equal to the original!" << std::endl; | |
| std::cout << "\tOutput vector: "; | |
| util::PrintList(bools); | |
| std::cout << "\tOriginal vector: "; | |
| util::PrintList(original_bools); | |
| exit(1); | |
| } | |
| } | |
| std::cout << "\tRow: " << row << std::endl; | |
| auto binary = row[field_vector].get<std::vector<uint8_t>>(); | |
| auto bools = ConvertToBoolArray(binary); | |
| auto id = row[field_id].get<int64_t>(); | |
| if (id < 0 || id >= static_cast<int64_t>(bools_array.size())) { | |
| std::cout << "Unexpected ID " << id << " outside valid range [0, " << bools_array.size() << ")" << std::endl; | |
| exit(1); | |
| } | |
| auto original_bools = bools_array.at(id); | |
| if (!std::equal(bools.begin(), bools.end(), original_bools.begin(), original_bools.end())) { | |
| std::cout << "Output vector is not equal to the original!" << std::endl; | |
| std::cout << "\tOutput vector: "; | |
| util::PrintList(bools); | |
| std::cout << "\tOriginal vector: "; | |
| util::PrintList(original_bools); | |
| exit(1); | |
| } | |
| } |
🤖 Prompt for AI Agents
In examples/src/v2/vector_binary.cpp around lines 179 to 192, the validation
assumes id is in-range and that the two vectors are the same length; add a
defensive bounds check before accessing bools_array.at(id) (verify 0 <= id <
bools_array.size() and log/exit on failure), and ensure you compare sizes before
equality (check bools.size() == original_bools.size() and handle/log a size
mismatch), otherwise use the 4-parameter std::equal that considers both ranges;
include clear diagnostic logs when either check fails and exit with nonzero
status.
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project status has failed because the head coverage (45.12%) is below the target coverage (90.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## master #422 +/- ##
==========================================
- Coverage 53.47% 45.12% -8.36%
==========================================
Files 52 244 +192
Lines 4432 12491 +8059
Branches 0 1219 +1219
==========================================
+ Hits 2370 5636 +3266
- Misses 2062 6762 +4700
- Partials 0 93 +93 🚀 New features to boost your workflow:
|
Signed-off-by: yhmo <[email protected]>
Uh oh!
There was an error while loading. Please reload this page.