feat: Batch records before writing to sinks #20821

Open

salvacorts wants to merge 22 commits into main from salvacorts/buffered-sinks

Conversation

@salvacorts
Contributor

What this PR does / why we need it:

We noticed that in many queries where the physical planning phase was taking long, most of the time was spent waiting to write to sinks.

Schedulers create one task per pointer scan. Before this PR, each task would write N times to sinks.

We are adding record_batch_size to control how many responses are merged into a single response before writing to a sink, thus reducing the number of round trips. Note that this also applies to other tasks besides the metastore scan ones.
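
A minimal Go sketch of the idea as described here. All type and field names (Record, Sink, bufferedSink, maxBatchBytes) are illustrative stand-ins, not the engine's actual types:

```go
package main

import "fmt"

// Record stands in for an engine result record; Size is its encoded
// size in bytes.
type Record struct{ data []byte }

func (r Record) Size() int { return len(r.data) }

// Sink stands in for whatever a task writes its results to.
type Sink interface {
	Write(batch []Record) error
}

// bufferedSink merges writes: records accumulate in memory and are
// written to the inner sink in one call once the buffered size
// crosses maxBatchBytes (the proposed record_batch_size).
type bufferedSink struct {
	inner         Sink
	maxBatchBytes int
	buf           []Record
	bufBytes      int
}

func (b *bufferedSink) Write(r Record) error {
	b.buf = append(b.buf, r)
	b.bufBytes += r.Size()
	if b.bufBytes >= b.maxBatchBytes {
		return b.Flush()
	}
	return nil
}

// Flush writes all buffered records as a single batch and resets the buffer.
func (b *bufferedSink) Flush() error {
	if len(b.buf) == 0 {
		return nil
	}
	err := b.inner.Write(b.buf)
	b.buf, b.bufBytes = b.buf[:0], 0
	return err
}

// printSink counts round trips, standing in for a remote sink.
type printSink struct{ writes int }

func (s *printSink) Write(batch []Record) error {
	s.writes++
	fmt.Printf("write %d: %d records\n", s.writes, len(batch))
	return nil
}

func main() {
	sink := &printSink{}
	buffered := &bufferedSink{inner: sink, maxBatchBytes: 1 << 10}
	// 100 records of 64 bytes each against a 1 KiB threshold:
	// 7 round trips instead of 100.
	for i := 0; i < 100; i++ {
		_ = buffered.Write(Record{data: make([]byte, 64)})
	}
	_ = buffered.Flush() // flush the tail
}
```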

@github-actions
Contributor

github-actions bot commented Feb 16, 2026

💻 Deploy preview available (feat: Batch records before writing to sinks):

salvacorts force-pushed the salvacorts/buffered-sinks branch from 762962f to 95025fa (February 16, 2026 09:29)
salvacorts force-pushed the salvacorts/buffered-sinks branch from 95025fa to 6bd1ddb (February 16, 2026 09:33)
salvacorts force-pushed the salvacorts/buffered-sinks branch from 6bd1ddb to 3506e9a (February 16, 2026 09:35)
salvacorts marked this pull request as ready for review (February 16, 2026 09:55)
salvacorts requested a review from a team as a code owner (February 16, 2026 09:55)
Member

I think there's a fair bit of complexity being introduced here by batching by number of records. We already have batch_size as a configuration option in the engine; have you tried using that instead of adding a new limit?

Contributor Author

I should have copied over my comment from the original PR: #20763 (comment)

by batching by number of records

We don't batch by the number of records/rows, but by their size (bytes).

My hesitation around using the already existing batch_size (row count) is that (e.g.) 6k rows of metric records will be fairly small, whereas 6k rows of log records can be a lot and cause OOMs.

In my opinion, byte-size-based batching fits all use cases; ideally, the write-to-sinks logic should also be agnostic of what's actually written.

Happy to change my mind, though.

Member

My hesitation around using the already existing batch_size (row count) is that (e.g.) 6k rows of metric records will be fairly small, whereas 6k rows of log records can be a lot and cause OOMs.

If 6k rows of log records can cause OOMs, our batch size shouldn't be 6k :) batch_size determines the largest record that a pipeline can produce, so it needs to be able to fit in memory. (Batching records by row count also helps performance as you'll have more contiguous regions of memory)

Based on that intent, do you think it makes sense to consistently use batch_size throughout?
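
To make the trade-off in this exchange concrete, here is a minimal sketch of the two flush criteria being debated. The helper names are hypothetical; the batch_size and record_batch_size semantics are as described in this thread:

```go
// Row-count batching (the existing batch_size): flush after a fixed
// number of rows, however wide each row is. A high limit is cheap for
// narrow metric rows but can buffer many bytes for wide log rows.
func shouldFlushByRows(bufferedRows, batchSize int) bool {
	return bufferedRows >= batchSize
}

// Byte-size batching (the proposed record_batch_size): flush once the
// buffered bytes cross a threshold, so wide log rows flush earlier
// than narrow metric rows and memory use is bounded either way.
func shouldFlushByBytes(bufferedBytes, maxBatchBytes int) bool {
	return bufferedBytes >= maxBatchBytes
}
```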

Contributor Author

our batch size shouldn't be 6k

Fair, my point is that 6k may be fine (and beneficial) for some record types and too much for others. So reducing it reduces the benefit for the smaller record types.

Still, I'm all for simplicity here, and this is something we can revisit in the future if we need to. I changed it so we use the already existing BatchSize.

Contributor

A single arrowagg instance can be used in drainPipeline instead of batch []arrow.RecordBatch, I think. arrowagg has a Reset method, so it can be reset after flushing.
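
A sketch of what this suggestion could look like. Only the Reset method is confirmed by this thread; RecordBatch, Pipeline, Sink, and the aggregator's other methods are stand-ins for the real arrowagg API:

```go
package sketch

import (
	"errors"
	"io"
)

// RecordBatch stands in for arrow.RecordBatch.
type RecordBatch struct{ rows int }

// Pipeline stands in for the engine pipeline being drained.
type Pipeline interface {
	Read() (RecordBatch, error) // returns io.EOF when exhausted
}

// Aggregator stands in for the arrowagg type. Only Reset is confirmed
// by the discussion; Append, Aggregate, and Rows are illustrative.
type Aggregator interface {
	Append(RecordBatch)
	Aggregate() (RecordBatch, error)
	Reset()
	Rows() int
}

type Sink interface {
	Write(RecordBatch) error
}

// drainPipeline accumulates records in a single reusable aggregator
// and flushes once batchSize rows are buffered, resetting the
// aggregator after each flush instead of reallocating a slice.
func drainPipeline(p Pipeline, agg Aggregator, sink Sink, batchSize int) error {
	flush := func() error {
		if agg.Rows() == 0 {
			return nil
		}
		merged, err := agg.Aggregate()
		if err != nil {
			return err
		}
		agg.Reset()
		return sink.Write(merged)
	}

	for {
		rec, err := p.Read()
		if errors.Is(err, io.EOF) {
			return flush() // flush whatever is left
		}
		if err != nil {
			return err
		}
		agg.Append(rec)
		if agg.Rows() >= batchSize {
			if err := flush(); err != nil {
				return err
			}
		}
	}
}
```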

Contributor Author

Good call. Changed.
