-
Notifications
You must be signed in to change notification settings - Fork 7
Invalid boolean values #15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…oblem with the Writer.
…er also requires continuous bitwriter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @Ericvf.
I went through your code and, although the change will fix the boolean writer, this seems to be a broader issue that may affect other data types as well. The root cause are the fixed stream positions.
As an example, consider a boolean stream with a stride of 10. If we add 15 values to it, it will result in three bytes being written to the stream, where the second byte will have only two bits that represent valid data. The remaining 6 bits are expected to be skipped by the next row, but since the positions have fixed values, that information is not available for the reader.
We have this implemented in the optimized reader version, but not for the writer.
Thanks for bringing this up.
@@ -14,10 +14,11 @@ public class BooleanColumn_Test | |||
[Fact] | |||
public void RoundTrip_BooleanColumn() | |||
{ | |||
RoundTripSingleBool(70000); | |||
RoundTripSingleBool(70000, 1000); | |||
RoundTripSingleBool(70000, 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the original test, and add a new one, for the problematic use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I've restored the original test case with a rowIndexStride of 10000.
Thats good to hear. Since this bug seems solved for at least the writer, does that mean you would like to merge it? We are using this package to write data into Glad to be of help. Thanks for the contributions on this package! |
Hello, @Ericvf Sorry for the delay to reply.. The PR was not merged because the problem is actually more broad than only the boolean writer. The optimal solution for that would be updating all writers to properly track stream positions. That would reflect the proper number of values that should be offset/skipped by readers. If you are willing to submit a new PR with that, I would gladly review those again, since it would perfectly pair with the optimized reader from PR #16. Thanks again for shedding some light in this issue. |
This PR fixes a problem in the BooleanWriter that happens when the RowIndexStride is not divisible by 8 or when using a nullable Boolean.
What can happen is that the BooleanWriter can receive a number of Boolean rows that does not fit within a stream of bytes. When that happens unpopulated bits in that byte are incorrectly written to the underlying data buffer.
To demonstrate the problem a unit test was changed. The first commit has no changes to the code other than the conditions of the unit test. The unit test will fail and the rows read from the stream are not equal to the rows previously written to the stream.
The solution is to only write a byte to the underlying data buffer when its fully populated by 8 bits. The second commit provides an implementation for that.
The problem also happens when the RowIndexStride is divisible by 8, but a nullable Boolean column is used. In this case the number of rows in the data buffer is smaller than the number of rows in the presenter buffer. Therefore the data buffer will also suffer from the symptom above, where not fully populated bytes can be written to the data buffer.