GODRIVER-3587 Use raw bytes in valueReader #2120

prestonvasquez · 2025-07-07T22:00:28Z

Summary

In v2 we gave users the ability to stream data into the ValueReader (see the temporarily removed "stream" test for more information). Because of this, we can't simply revert to the v1 solution, i.e. reading from raw bytes instead of bufio. This PR proposes adding bufferedValueReader and streamingValueReader: the former to optimize the in-memory path, the latter to avoid backward-incompatible changes. Both must implement the minimal set of functions required by valueReader:

type valueReaderByteSrc interface {
	io.Reader
	io.ByteReader

	// peek returns the next n bytes without advancing the cursor. It must return
	// exactly n bytes or an error if fewer are available.
	peek(n int) ([]byte, error)

	// discard advances the cursor by n bytes, returning the actual number
	// discarded or an error if fewer were available.
	discard(n int) (int, error)

	// readSlice reads until (and including) the first occurrence of delim,
	// returning the entire slice [start...delimiter] and advancing the cursor
	// past it. Returns an error if delim is not found.
	readSlice(delim byte) ([]byte, error)

	// pos returns the number of bytes consumed so far.
	pos() int64

	// regexLength returns the total byte length of a BSON regex value (two
	// C-strings including their terminating NULs) in buffered mode.
	regexLength() (int32, error)

	// streamable returns true if this source can be used in a streaming context.
	streamable() bool

	// reset resets the source to its initial state.
	reset()
}

valueReader will then be updated to hold a valueReaderByteSrc implementation:

type valueReader struct {
	src   valueReaderByteSrc
	stack []vrState // stack of vrState, one for each frame
	frame int64
}

The exported NewDocumentReader will use streamingValueReader, since streaming is the current expectation. The unexported newDocumentReader will use bufferedValueReader to optimize the non-default case.

If accepted, streamingValueReader will be added in a subsequent PR.

Benchmarking

Benchmark: https://gist.github.com/prestonvasquez/77eecc18699736b3d7d62a7a30c5769e

streamingValueReader shows the same regression noted in the analysis (about 44% slower than v1 in sec/op and roughly 4x the bytes allocated per op):

pkg: github.com/prestonvasquez/dev/mongo-go-driver/noreplace/inline/bson
cpu: Apple M1 Pro
                               │   v1.txt    │       v2.txt       │
                               │   sec/op    │   sec/op     vs base   │
BSONv1vs2Comparison/BSON_v1-10   3.375µ ± 2%
BSONv1vs2Comparison/BSON_v2-10                 4.862µ ± 3%
geomean                          3.375µ        4.862µ       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

                               │    v1.txt    │       v2.txt        │
                               │     B/op     │     B/op      vs base   │
BSONv1vs2Comparison/BSON_v1-10   1.328Ki ± 0%
BSONv1vs2Comparison/BSON_v2-10                  5.502Ki ± 0%
geomean                          1.328Ki        5.502Ki       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

                               │   v1.txt   │      v2.txt       │
                               │ allocs/op  │ allocs/op   vs base   │
BSONv1vs2Comparison/BSON_v1-10   47.00 ± 0%
BSONv1vs2Comparison/BSON_v2-10                51.00 ± 0%
geomean                          47.00        51.00       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

bufferedValueReader is still slower than v1 (about 11% in sec/op) but much improved over the streaming path:

pkg: github.com/prestonvasquez/dev/mongo-go-driver/noreplace/inline/bson
cpu: Apple M1 Pro
                               │   v1.txt    │       v2.txt       │
                               │   sec/op    │   sec/op     vs base   │
BSONv1vs2Comparison/BSON_v1-10   3.339µ ± 1%
BSONv1vs2Comparison/BSON_v2-10                 3.699µ ± 1%
geomean                          3.339µ        3.699µ       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

                               │    v1.txt    │       v2.txt        │
                               │     B/op     │     B/op      vs base   │
BSONv1vs2Comparison/BSON_v1-10   1.328Ki ± 0%
BSONv1vs2Comparison/BSON_v2-10                  1.721Ki ± 0%
geomean                          1.328Ki        1.721Ki       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

                               │   v1.txt   │      v2.txt       │
                               │ allocs/op  │ allocs/op   vs base   │
BSONv1vs2Comparison/BSON_v1-10   47.00 ± 0%
BSONv1vs2Comparison/BSON_v2-10                48.00 ± 0%
geomean                          47.00        48.00       ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean
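
benchstat could not compute ratios here because the v1 and v2 benchmark names differ, so the relative overhead has to be worked out by hand. The following sketch does that arithmetic from the sec/op and B/op figures in the tables above; the helper name overhead is hypothetical.

```go
package main

import "fmt"

// overhead returns how much larger v2 is than v1, in percent.
func overhead(v1, v2 float64) float64 {
	return (v2/v1 - 1) * 100
}

func main() {
	// sec/op in microseconds, taken from the tables above.
	fmt.Printf("streaming sec/op: +%.1f%%\n", overhead(3.375, 4.862))
	fmt.Printf("buffered  sec/op: +%.1f%%\n", overhead(3.339, 3.699))

	// B/op in KiB, taken from the tables above.
	fmt.Printf("streaming B/op:   +%.1f%%\n", overhead(1.328, 5.502))
	fmt.Printf("buffered  B/op:   +%.1f%%\n", overhead(1.328, 1.721))
}
```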

Background & Motivation

The v2 solution is less optimized because of the allocation requirements for streaming.

mongodb-drivers-pr-bot · 2025-07-07T22:18:53Z

API Change Report

No changes found!

prestonvasquez · 2025-07-08T01:57:05Z

@matthewdale, @qingyang-hu This is a high-priority ticket. It could be helpful to start with the description and decide whether we should accept or reject the design.

qingyang-hu · 2025-07-10T19:12:59Z

bson/buffered_value_reader.go

+	if b.offset >= int64(len(b.buf)) {
+		return 0, io.EOF
+	}


What are your thoughts on returning the remaining buf together with io.EOF from ReadByte(), peek() and discard() when fewer bytes are available than requested? That is also how bufio behaves.

matthewdale

The concept makes sense and the API for wrapping the data slice improves the logic. There are a few improvements and rough edges that need to be worked out.

matthewdale · 2025-07-11T06:01:05Z

bson/value_reader.go

+func newBufferedDocumentReader(r io.Reader) *valueReader {
+	buf, err := io.ReadAll(r)
+	if err != nil {
+		panic("bson: could not read document: " + err.Error())
+	}


This appears to only be called with byte slices that are wrapped by a bytes.Reader. Can we make this accept the byte slice directly instead?

Suggested change

-func newBufferedDocumentReader(r io.Reader) *valueReader {
-	buf, err := io.ReadAll(r)
-	if err != nil {
-		panic("bson: could not read document: " + err.Error())
-	}
+func newBufferedDocumentReader(b []byte) *valueReader {
+	buf := b

matthewdale · 2025-07-11T06:01:48Z

bson/value_reader.go

+	buf, err := io.ReadAll(r)
+	if err != nil {
+		panic("bson: could not read value: " + err.Error())


We need a way to return this error, not panic. We should defer reading the data from the reader until someone calls one of the Read* methods.

matthewdale · 2025-07-11T21:57:41Z

bson/value_reader_test.go

+			if !errors.Is(err, io.EOF) {
 				t.Errorf("Expected io.ErrUnexpectedEOF with document length too small. got %v; want %v", err, io.EOF)


Read* methods should return io.ErrUnexpectedEOF if the data ends before the requested value can be read. From the io.EOF docs:

... Functions should return EOF only to signal a graceful end of input. If the EOF occurs unexpectedly in a structured data stream, the appropriate error is either [ErrUnexpectedEOF] or some other error giving more detail.

In the buffered path, EOF is returned from peek, not from attempting a read. Additionally, bufio's Discard just returns io.EOF if a user tries to discard more data than is in the buffer. Since there is no precedent for a buffered peek, we adopted the Discard behavior from bufio. This is also in line with the v1 implementation.

func TestDiscard_EOF(t *testing.T) {
	testData := bytes.Repeat([]byte("a"), 1024)
	reader := bufio.NewReader(bytes.NewReader(testData))

	// Ask to discard more bytes than the reader holds.
	n, err := reader.Discard(10000)
	assert.Error(t, err)
	assert.Equal(t, err, io.EOF)
	assert.NotEqual(t, err, io.ErrUnexpectedEOF)
	assert.Equal(t, 1024, n, "Expected to discard all bytes before EOF")
}

TL;DR: we don't read anything in the buffered path, so it's unclear why we would use io.ErrUnexpectedEOF.

Sorry, I was talking about the structured Read* methods, like ReadDocument or ReadBinary. I see now that many of the existing tests expect io.EOF from those methods (example), but the behavior is inconsistent across methods. We should either preserve the existing error behavior or standardize on returning io.ErrUnexpectedEOF. The unstructured stream readers (streamingValueReader and bufferedValueReader) should return io.EOF.

@matthewdale I've created GODRIVER-3613 to investigate this further, per our conversation.

Co-authored-by: Matt Dale <[email protected]>

matthewdale

Some optional suggestions, but overall looks good! 👍

matthewdale · 2025-07-16T18:31:53Z

bson/value_reader_test.go

+			{
+				"Array/invalid length",
+				TypeArray,
+				[]byte{0x01, 0x02, 0x03},
+				io.EOF, 0, 0,
+			},


Optional: Keyed struct literals are much easier to read than unkeyed ones because you don't need to remember the struct's field order. I recommend adding the field names for these test cases.

matthewdale · 2025-07-16T18:33:36Z

bson/value_reader_test.go

+						t.Errorf("Offset not set at correct position; got %d; want %d", vr.src.pos(), tc.offset)
+					}
+				})
+				t.Run("ReadBytes", func(t *testing.T) {


Optional: Many levels of subtests make the code difficult to read because of the extra indentation and very long functions. Consider making these subtests separate test functions instead.

matthewdale · 2025-07-16T18:34:18Z

bson/value_reader_test.go

+	})
+
+	// This test is too complicated to try to abstract using subtests.
+	t.Run("ReadBytes & Skip (streaming)", func(t *testing.T) {


Optional: Consider making this a separate test function to make the test code easier to read and run.

matthewdale · 2025-07-16T18:39:33Z

bson/value_reader_test.go

 				(*valueReader).ReadString,
 				"",
 				io.ErrUnexpectedEOF,
+				io.EOF,
 				TypeString,


Is this the only case where the error differs between the streaming and buffered solutions?

Optional: Consider updating the logic to always return io.EOF. That should allow removing streamingErr and bufferedError from the test cases.

matthewdale · 2025-07-16T18:42:53Z

bson/buffered_value_reader.go

+// bufferedValueReader implements the low-level byteSrc interface by reading
+// directly from an in-memory byte slice. It provides efficient, zero-copy
+// access for parsing BSON when the entire document is buffered in memory.
+type bufferedValueReader struct {


Optional: This isn't a ValueReader, so the name is a bit confusing. Consider a name like bufferedByteSrc.

matthewdale · 2025-07-16T18:43:55Z

bson/streaming_value_reader.go

+//
+// Note: this approach trades memory usage for extra buffering and reader calls,
+// so it is less performant than the in-memory bufferedValueReader.
+type streamingValueReader struct {


Optional: This isn't a ValueReader, so the name is a bit confusing. Consider a name like streamingByteSrc.

matthewdale · 2025-07-16T18:44:37Z

bson/value_reader.go

@@ -16,6 +16,38 @@ import (
 	"sync"
 )

+type valueReaderByteSrc interface {


Optional: Consider a more concise name like byteSrc.

matthewdale · 2025-07-16T18:46:45Z

bson/value_reader.go

+	vr.src = &bufferedValueReader{}
+	vr.src.(*bufferedValueReader).buf = b
+	vr.src.(*bufferedValueReader).offset = 0


Optional: Suggested simplification.

Suggested change

-	vr.src = &bufferedValueReader{}
-	vr.src.(*bufferedValueReader).buf = b
-	vr.src.(*bufferedValueReader).offset = 0
+	vr.src = &bufferedValueReader{
+		buf:    b,
+		offset: 0,
+	}

qingyang-hu

👍

mongodb-drivers-pr-bot bot added the priority-3-low Low Priority PR for Review label Jul 7, 2025

prestonvasquez added priority-1-high High Priority PR for Review and removed priority-3-low Low Priority PR for Review labels Jul 8, 2025

prestonvasquez requested review from matthewdale and qingyang-hu July 8, 2025 01:56

prestonvasquez marked this pull request as ready for review July 8, 2025 01:56

prestonvasquez requested a review from a team as a code owner July 8, 2025 01:56

qingyang-hu reviewed Jul 10, 2025

matthewdale reviewed Jul 11, 2025

prestonvasquez added 19 commits July 11, 2025 11:26

Add valueReaderByteSrc interface

2ce17b5

Add bufferedValueReader valueReaderByteSrc implementation

2adb7cc

Add streamingValueReader valueReaderBytSrc implementation

1a436da

Rename newDocumentReader to newBufferedDocumentReader

5e1fad9

Reorganize newBufferedDocumentReader to use bufferedValueReader

8f960b0

Reorganize NewDocumentReader to use streamingValueReader

36d3c18

Update (*valueReader).pop() to support bVR + streaming

031bf94

Update (*valueReader).readValueBytes() to support bVR + streaming

54d08c4

Update (*valueReader).Skip() to support bVR + streaming

26765bd

Update (*valueReader).ReadArray() to support bVR + streaming

8167d47

Update (*valueReader).ReadBinary() to support bVR + streaming

02d38df

Add comment to (*valueReader).ReadBoolean()

50a265a

Update (*valueReader).ReadDocument() to support bVR + streaming

4bf9f61

Update (*valueReader).ReadCodeWithScope() to support bVR + streaming

828593c

Update (*valueReader).ReadDBPointer() to support bVR + streaming

96b5eb8

Update (*valueReader).ReadDateTime() to support bVR + streaming

0861119

Update (*valueReader).ReadDecimal128() to support bVR + streaming

443fb81

Add comment to (*valueReader).ReadDouble()

942e52e

Update (*valueReader).ReadInt32() to support bVR + streaming

f934c70

prestonvasquez added 21 commits July 11, 2025 12:34

Update (*valueReader).ReadInt64() to support bVR + streaming

c3ba283

Update (*valueReader).ReadJavascript() to support bVR + streaming

1a3d33d

Update (*valueReader).ReadObjectID() to support bVR + streaming

b7d0042

Add comment to (*valueReader).Read(MinKey|MaxKey|Null)()

0cbacc1

Add comment to (*valueReader).ReadRegex()

7d16cf8

Update (*valueReader).ReadString() to support bVR + streaming

a497774

Update (*valueReader).ReadSymbol() to support bVR + streaming

dfe5502

Add comment to (*valueReader).ReadTimestamp()

062adeb

Add comment to (*valueReader).ReadUndefined()

83a47d5

Update (*valueReader).ReadTimestamp() to support bVR + streaming

5b570de

Update (*valueReader).ReadElement() to support bVR + streaming

403445f

Update (*valueReader).ReadValue() to support bVR + streaming

dd709d1

Update (*valueReader).readValue() to support bVR + streaming

dea88a6

Update (*valueReader).readCString() to support bVR + streaming

2e8be2a

Update (*valueReader).readString() to support bVR + streaming

0036af6

Update (*valueReader).peekLength() to support bVR + streaming

f62a9de

Update (*valueReader).readLength() to support bVR + streaming

0ba5775

Update (*valueReader).read(i32|u32|i64|u64)() to support bVR + streaming

296ff25

Remove read and appendBytes

5a12f54

Update newValueReader to use bVR

ef6b024

Update tests to support bRV + streaming

620eb5a

prestonvasquez force-pushed the GODRIVER-3587 branch from 27e063b to 620eb5a Compare July 11, 2025 20:37

prestonvasquez requested review from matthewdale and qingyang-hu July 11, 2025 20:39

matthewdale requested changes Jul 11, 2025

Update bson/value_reader.go

7e44186

Co-authored-by: Matt Dale <[email protected]>

prestonvasquez requested a review from matthewdale July 14, 2025 23:03

Extend valueReader tests for both streaming and buffered

383f64d

matthewdale approved these changes Jul 16, 2025

qingyang-hu approved these changes Jul 17, 2025
