-
Notifications
You must be signed in to change notification settings - Fork 916
GODRIVER-3587 Use raw bytes in valueReader #2120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
API Change ReportNo changes found! |
@matthewdale , @qingyang-hu This is a high priority ticket. It could be helpful to start with the description and determine if we should accept / reject the design. |
if b.offset >= int64(len(b.buf)) { | ||
return 0, io.EOF | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are your thoughts about returning the remaining buf as well as the EOF for ReadByte()
, peek()
and discard()
when there are fewer bytes than required. That's also how bufio
acts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concept makes sense and the API for wrapping the data slice improves the logic. There are a few improvements and rough edges that need to be worked out.
bson/value_reader.go
Outdated
func newBufferedDocumentReader(r io.Reader) *valueReader { | ||
buf, err := io.ReadAll(r) | ||
if err != nil { | ||
panic("bson: could not read document: " + err.Error()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to only be called with byte slices that are wrapped by a bytes.Reader
. Can we make this accept the byte slice directly instead?
func newBufferedDocumentReader(r io.Reader) *valueReader { | |
buf, err := io.ReadAll(r) | |
if err != nil { | |
panic("bson: could not read document: " + err.Error()) | |
} | |
func newBufferedDocumentReader(b []byte) *valueReader { | |
buf := b |
bson/value_reader.go
Outdated
buf, err := io.ReadAll(r) | ||
if err != nil { | ||
panic("bson: could not read value: " + err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a way to return this error, not panic. We should defer reading the data from the reader until someone calls one of the Read*
methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call
27e063b
to
620eb5a
Compare
bson/value_reader_test.go
Outdated
if !errors.Is(err, io.EOF) { | ||
t.Errorf("Expected io.ErrUnexpectedEOF with document length too small. got %v; want %v", err, io.EOF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Read*
methods should return io.ErrUnexpectedEOF
if the data ends before the requested value can be read. From the io.EOF
docs:
... Functions should return EOF only to signal a graceful end of input. If the EOF occurs unexpectedly in a structured data stream, the appropriate error is either [ErrUnexpectedEOF] or some other error giving more detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the buffered path, EOF is returned from peek
, not from attempting a read. Additionally, using bufio's discard would just return io.EOF
if a user tries to discard more data than is on the buffer. Since there is no precedence for buffered peek
, we just adopted the discard
solution from bufio. This is also inline with the v1 implementation.
func TestDiscard_EOF(t *testing.T) {
testData := bytes.Repeat([]byte("a"), 1024)
reader := bufio.NewReader(bytes.NewReader(testData))
n, err := reader.Discard(10000)
assert.Error(t, err)
assert.Equal(t, err, io.EOF)
assert.NotEqual(t, err, io.ErrUnexpectedEOF)
assert.Equal(t, 1024-100, n, "Expected to discard all bytes before EOF")
}
TL;DR we don't read anything in the buffered path, so it's unclear why we would use io. ErrUnexpectedEOF
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was talking about the structured Read*
methods, like ReadDocument
or ReadBinary
. I see now that many of the existing tests expect io.EOF
from those methods (example), but it's inconsistent between the different methods. We should either preserve the existing error behavior or standardize on returning io.UnexpectedEOF
. The unstructured stream readers (streamingValueReader
and bufferedValueReader
) should return io.EOF
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matthewdale I've created GODRIVER-3613 to investigate this further, per our conversation.
Co-authored-by: Matt Dale <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some optional suggestions, but over all looks good! 👍
{ | ||
"Array/invalid length", | ||
TypeArray, | ||
[]byte{0x01, 0x02, 0x03}, | ||
io.EOF, 0, 0, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: Keyed struct literals are way easier to read than unkeyed because you don't need to remember the struct format. I recommend adding the field names for these test cases.
t.Errorf("Offset not set at correct position; got %d; want %d", vr.src.pos(), tc.offset) | ||
} | ||
}) | ||
t.Run("ReadBytes", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: Lots of levels of subtests makes the code difficult to read because of the extra indentation and very long functions. Consider making these subtests separate test functions instead.
}) | ||
|
||
// This test is too complicated to try to abstract using subtests. | ||
t.Run("ReadBytes & Skip (streaming)", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: Consider making this a separate test function to make the test code easier to read and run.
(*valueReader).ReadString, | ||
"", | ||
io.ErrUnexpectedEOF, | ||
io.EOF, | ||
TypeString, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the only case where the error differs between the streaming and buffered solutions?
Optional: Consider updating the logic to always return io.EOF
. That should allow removing streamingErr
and bufferedError
from the test cases.
// bufferedValueReader implements the low-level byteSrc interface by reading | ||
// directly from an in-memory byte slice. It provides efficient, zero-copy | ||
// access for parsing BSON when the entire document is buffered in memory. | ||
type bufferedValueReader struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: This isn't a ValueReader
, so the name is a bit confusing. Consider a name like bufferedByteSrc
.
// | ||
// Note: this approach trades memory usage for extra buffering and reader calls, | ||
// so it is less performanted than the in-memory bufferedValueReader. | ||
type streamingValueReader struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: This isn't a ValueReader
, so the name is a bit confusing. Consider a name like streamingByteSrc
.
@@ -16,6 +16,38 @@ import ( | |||
"sync" | |||
) | |||
|
|||
type valueReaderByteSrc interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: Consider a more concise name like byteSrc
.
vr.src = &bufferedValueReader{} | ||
vr.src.(*bufferedValueReader).buf = b | ||
vr.src.(*bufferedValueReader).offset = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: Suggested simplification.
vr.src = &bufferedValueReader{} | |
vr.src.(*bufferedValueReader).buf = b | |
vr.src.(*bufferedValueReader).offset = 0 | |
vr.src = &bufferedValueReader{ | |
buf: b, | |
offset: 0, | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
GODRIVER-3587
Summary
In v2 we provided users with the ability to stream data into the
ValueReader
(see the temporarily removed "stream" test for more information). Because of this we can't simply revert to the v1 solution, i.e. using raw bytes instead of bufio. This PR suggests that we addbufferedValueReader
andstreamingValueReader
to optimize workflow and avoid backwards breaking changes, respectively. Both must implement the minimal set of functions required byvalueReader
:And then
valueReader
will be updated to maintain avalueReaderByteSrc
implementation object:NewDocumentReader
will usestreamingValueReader
, since that is the current expectation. AndnewDocumentReader
will usebufferedValueReader
to optimize in the non-default case.If accepted,
streamingValueReader
will be added in a subsequent PR.Benchmarking
Benchmark: https://gist.github.com/prestonvasquez/77eecc18699736b3d7d62a7a30c5769e
streamingValueReader
indicates the same regression noted in the analysis:bufferedValueReader
is still worse than V1 but largely improved:Background & Motivation
The v2 solution is less optimized because of the allocation requirements for streaming.