GODRIVER-3587 Use raw bytes in valueReader #2120

Status: Open (42 commits to be merged into base branch master)

Member

@prestonvasquez prestonvasquez commented Jul 7, 2025

GODRIVER-3587

Summary

In v2 we gave users the ability to stream data into the ValueReader (see the temporarily removed "stream" test for more information). Because of this, we can't simply revert to the v1 solution, i.e. using raw bytes instead of bufio. This PR proposes adding bufferedValueReader to optimize the in-memory path and streamingValueReader to avoid backward-incompatible changes. Both must implement the minimal set of functions required by valueReader:

type valueReaderByteSrc interface {
	io.Reader
	io.ByteReader

	// peek returns the next n bytes without advancing the cursor. It must return
	// exactly n bytes or an error if fewer are available.
	peek(n int) ([]byte, error)

	// discard advances the cursor by n bytes, returning the actual number
	// discarded or an error if fewer were available.
	discard(n int) (int, error)

	// readSlice reads until (and including) the first occurrence of delim,
	// returning the entire slice [start...delimiter] and advancing the cursor
	// past it. It returns an error if delim is not found.
	readSlice(delim byte) ([]byte, error)

	// pos returns the number of bytes consumed so far.
	pos() int64

	// regexLength returns the total byte length of a BSON regex value (two
	// C-strings including their terminating NULs) in buffered mode.
	regexLength() (int32, error)

	// streamable returns true if this source can be used in a streaming context.
	streamable() bool

	// reset resets the source to its initial state.
	reset()
}

valueReader will then be updated to hold a valueReaderByteSrc implementation:

type valueReader struct {
	src   valueReaderByteSrc
	stack []vrState // stack of vrState, one for each frame
	frame int64
}

NewDocumentReader will use streamingValueReader, since that is the current expectation. The unexported newDocumentReader will use bufferedValueReader to optimize the non-default case.

If accepted, streamingValueReader will be added in a subsequent PR.

Benchmarking

Benchmark: https://gist.github.com/prestonvasquez/77eecc18699736b3d7d62a7a30c5769e

streamingValueReader shows the same regression noted in the analysis:

pkg: github.com/prestonvasquez/dev/mongo-go-driver/noreplace/inline/bson
cpu: Apple M1 Pro
                               │   v1.txt    │       v2.txt       │
                               │   sec/op    │   sec/op     vs base   │
BSONv1vs2Comparison/BSON_v1-10   3.375µ ± 2%
BSONv1vs2Comparison/BSON_v2-10                 4.862µ ± 3%
geomean                          3.375µ        4.862µ       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

                               │    v1.txt    │       v2.txt        │
                               │     B/op     │     B/op      vs base   │
BSONv1vs2Comparison/BSON_v1-10   1.328Ki ± 0%
BSONv1vs2Comparison/BSON_v2-10                  5.502Ki ± 0%
geomean                          1.328Ki        5.502Ki       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

                               │   v1.txt   │      v2.txt       │
                               │ allocs/op  │ allocs/op   vs base   │
BSONv1vs2Comparison/BSON_v1-10   47.00 ± 0%
BSONv1vs2Comparison/BSON_v2-10                51.00 ± 0%
geomean                          47.00        51.00       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

bufferedValueReader is still slower than v1 (3.699µs vs. 3.339µs per op, roughly 11%) but much improved over the streaming path:

pkg: github.com/prestonvasquez/dev/mongo-go-driver/noreplace/inline/bson
cpu: Apple M1 Pro
                               │   v1.txt    │       v2.txt       │
                               │   sec/op    │   sec/op     vs base   │
BSONv1vs2Comparison/BSON_v1-10   3.339µ ± 1%
BSONv1vs2Comparison/BSON_v2-10                 3.699µ ± 1%
geomean                          3.339µ        3.699µ       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

                               │    v1.txt    │       v2.txt        │
                               │     B/op     │     B/op      vs base   │
BSONv1vs2Comparison/BSON_v1-10   1.328Ki ± 0%
BSONv1vs2Comparison/BSON_v2-10                  1.721Ki ± 0%
geomean                          1.328Ki        1.721Ki       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

                               │   v1.txt   │      v2.txt       │
                               │ allocs/op  │ allocs/op   vs base   │
BSONv1vs2Comparison/BSON_v1-10   47.00 ± 0%
BSONv1vs2Comparison/BSON_v2-10                48.00 ± 0%
geomean                          47.00        48.00       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean
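
Because benchstat reports no ratio when the benchmark names differ, the relative overhead has to be worked out by hand. The small helper below is a sketch of that arithmetic using the sec/op and B/op figures copied from the two runs above; the function name `overhead` is made up for this example.

```go
package main

import "fmt"

// overhead returns how much larger v2 is than v1, as a percentage.
func overhead(v1, v2 float64) float64 {
	return (v2/v1 - 1) * 100
}

func main() {
	// sec/op figures (µs) copied from the two benchstat runs above.
	fmt.Printf("streaming time overhead: %.1f%%\n", overhead(3.375, 4.862))
	fmt.Printf("buffered time overhead:  %.1f%%\n", overhead(3.339, 3.699))

	// B/op figures (KiB) copied from the same runs.
	fmt.Printf("streaming memory overhead: %.1f%%\n", overhead(1.328, 5.502))
	fmt.Printf("buffered memory overhead:  %.1f%%\n", overhead(1.328, 1.721))
}
```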

Background & Motivation

The v2 solution is less optimized because of the allocation requirements for streaming.

@mongodb-drivers-pr-bot mongodb-drivers-pr-bot bot added the priority-3-low Low Priority PR for Review label Jul 7, 2025
Contributor

API Change Report

No changes found!

@prestonvasquez prestonvasquez added priority-1-high High Priority PR for Review and removed priority-3-low Low Priority PR for Review labels Jul 8, 2025
@prestonvasquez prestonvasquez marked this pull request as ready for review July 8, 2025 01:56
@prestonvasquez prestonvasquez requested a review from a team as a code owner July 8, 2025 01:56
@prestonvasquez
Member Author

@matthewdale, @qingyang-hu This is a high-priority ticket. It could be helpful to start with the description and decide whether we should accept or reject the design.

Comment on lines +37 to +39
if b.offset >= int64(len(b.buf)) {
	return 0, io.EOF
}
Collaborator

What are your thoughts on returning the remaining buf along with the EOF from ReadByte(), peek(), and discard() when fewer bytes are available than requested? That's also how bufio behaves.

Collaborator

@matthewdale matthewdale left a comment

The concept makes sense and the API for wrapping the data slice improves the logic. There are a few improvements and rough edges that need to be worked out.

Comment on lines 91 to 95
func newBufferedDocumentReader(r io.Reader) *valueReader {
	buf, err := io.ReadAll(r)
	if err != nil {
		panic("bson: could not read document: " + err.Error())
	}
Collaborator

This appears to only be called with byte slices that are wrapped by a bytes.Reader. Can we make this accept the byte slice directly instead?

Suggested change
- func newBufferedDocumentReader(r io.Reader) *valueReader {
- 	buf, err := io.ReadAll(r)
- 	if err != nil {
- 		panic("bson: could not read document: " + err.Error())
- 	}
+ func newBufferedDocumentReader(b []byte) *valueReader {
+ 	buf := b

Comment on lines 148 to 150
buf, err := io.ReadAll(r)
if err != nil {
	panic("bson: could not read value: " + err.Error())
Collaborator

We need a way to return this error, not panic. We should defer reading the data from the reader until someone calls one of the Read* methods.

Member Author

Good call

Comment on lines 180 to 181
if !errors.Is(err, io.EOF) {
	t.Errorf("Expected io.ErrUnexpectedEOF with document length too small. got %v; want %v", err, io.EOF)
Collaborator

Read* methods should return io.ErrUnexpectedEOF if the data ends before the requested value can be read. From the io.EOF docs:

... Functions should return EOF only to signal a graceful end of input. If the EOF occurs unexpectedly in a structured data stream, the appropriate error is either [ErrUnexpectedEOF] or some other error giving more detail.

Member Author

@prestonvasquez prestonvasquez Jul 14, 2025

In the buffered path, EOF is returned from peek, not from attempting a read. Additionally, bufio's Discard just returns io.EOF if a user tries to discard more data than is in the buffer. Since there is no precedent for a buffered peek, we adopted the Discard behavior from bufio. This is also in line with the v1 implementation.

func TestDiscard_EOF(t *testing.T) {
	testData := bytes.Repeat([]byte("a"), 1024)
	reader := bufio.NewReader(bytes.NewReader(testData))

	n, err := reader.Discard(10000)
	assert.Error(t, err)
	assert.Equal(t, io.EOF, err)
	assert.NotEqual(t, io.ErrUnexpectedEOF, err)
	assert.Equal(t, 1024, n, "Expected to discard all bytes before EOF")
}

TL;DR: we don't read anything in the buffered path, so it's unclear why we would use io.ErrUnexpectedEOF.

Collaborator

Sorry, I was talking about the structured Read* methods, like ReadDocument or ReadBinary. I see now that many of the existing tests expect io.EOF from those methods (example), but it's inconsistent between the different methods. We should either preserve the existing error behavior or standardize on returning io.ErrUnexpectedEOF. The unstructured stream readers (streamingValueReader and bufferedValueReader) should return io.EOF.

Member Author

@matthewdale I've created GODRIVER-3613 to investigate this further, per our conversation.

Collaborator

@matthewdale matthewdale left a comment

Some optional suggestions, but overall looks good! 👍

Comment on lines +1595 to +1600
{
	"Array/invalid length",
	TypeArray,
	[]byte{0x01, 0x02, 0x03},
	io.EOF, 0, 0,
},
Collaborator

Optional: Keyed struct literals are way easier to read than unkeyed because you don't need to remember the struct format. I recommend adding the field names for these test cases.

t.Errorf("Offset not set at correct position; got %d; want %d", vr.src.pos(), tc.offset)
}
})
t.Run("ReadBytes", func(t *testing.T) {
Collaborator

Optional: Many levels of subtests make the code difficult to read because of the extra indentation and very long functions. Consider making these subtests separate test functions instead.

})

// This test is too complicated to try to abstract using subtests.
t.Run("ReadBytes & Skip (streaming)", func(t *testing.T) {
Collaborator

Optional: Consider making this a separate test function to make the test code easier to read and run.

Comment on lines 870 to 874
(*valueReader).ReadString,
"",
io.ErrUnexpectedEOF,
io.EOF,
TypeString,
Collaborator

Is this the only case where the error differs between the streaming and buffered solutions?

Optional: Consider updating the logic to always return io.EOF. That should allow removing streamingErr and bufferedError from the test cases.

// bufferedValueReader implements the low-level byteSrc interface by reading
// directly from an in-memory byte slice. It provides efficient, zero-copy
// access for parsing BSON when the entire document is buffered in memory.
type bufferedValueReader struct {
Collaborator

Optional: This isn't a ValueReader, so the name is a bit confusing. Consider a name like bufferedByteSrc.

//
// Note: this approach trades memory usage for extra buffering and reader calls,
// so it is less performant than the in-memory bufferedValueReader.
type streamingValueReader struct {
Collaborator

Optional: This isn't a ValueReader, so the name is a bit confusing. Consider a name like streamingByteSrc.

@@ -16,6 +16,38 @@ import (
"sync"
)

type valueReaderByteSrc interface {
Collaborator

Optional: Consider a more concise name like byteSrc.

Comment on lines +130 to +132
vr.src = &bufferedValueReader{}
vr.src.(*bufferedValueReader).buf = b
vr.src.(*bufferedValueReader).offset = 0
Collaborator

Optional: Suggested simplification.

Suggested change
- vr.src = &bufferedValueReader{}
- vr.src.(*bufferedValueReader).buf = b
- vr.src.(*bufferedValueReader).offset = 0
+ vr.src = &bufferedValueReader{
+ 	buf:    b,
+ 	offset: 0,
+ }

Collaborator

@qingyang-hu qingyang-hu left a comment

👍

Labels
priority-1-high High Priority PR for Review
3 participants