Validate the memory consumption in SPM created by multi level merge #17029
base: main
Conversation
List of failing tests
Possible cause? There are two problems discovered while working on this PR. First, we don't reserve memory for in-memory streams (since we don't know the batch size without polling them) when we build. Second, it seems like the actual memory usage reported by the local memory pool exceeds the size of the precomputed. There are 6 spill files to merge, and we reserved
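The first problem can be sketched with a toy reservation function (illustrative names only, not DataFusion's actual code): in-memory streams contribute nothing to the pre-merge reservation, because their batch sizes are unknown before the first poll.

```rust
// Toy model of the pre-merge reservation described above. Only spill files
// are accounted for; in-memory streams add nothing because their batch
// sizes are unknown until the stream is polled.
fn pre_merge_reservation(
    spill_file_max_batch_sizes: &[usize],
    _num_in_mem_streams: usize, // ignored: nothing is reserved for these
) -> usize {
    spill_file_max_batch_sizes.iter().sum()
}

fn main() {
    // 6 spill files with a 1 MiB largest batch each, plus 2 in-memory
    // streams: the reservation only covers the spill files.
    let reserved = pre_merge_reservation(&[1024 * 1024; 6], 2);
    assert_eq!(reserved, 6 * 1024 * 1024);
    println!("reserved {reserved} bytes");
}
```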
@2010YOUY01 @rluvaton
Note: sort spills the remaining in-memory batch while row_hash does not; sort also accounts for the memory used by the SortPreservingMergeStream, which row_hash does not. How much is the difference for the fuzz tests I added that check memory-constrained environments? They only test a couple of simple columns that are easier to reason about. The kernels used in the sort stream might overestimate, and note that even if you request X capacity you might get more than that.
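On the last point (requesting X capacity may yield more), here is a std-only Rust demo, unrelated to DataFusion's allocator but showing the same effect with `Vec`:

```rust
// Demonstrates that growth strategies can hand back more capacity than was
// requested: one push past the initial capacity triggers an amortized grow.
fn grown_capacity(requested: usize) -> usize {
    let mut v: Vec<u64> = Vec::with_capacity(requested);
    for i in 0..=(requested as u64) {
        v.push(i); // requested + 1 pushes in total
    }
    v.capacity()
}

fn main() {
    let cap = grown_capacity(10);
    // The vector now holds 11 elements, so capacity exceeded the request.
    assert!(cap > 10);
    println!("requested 10, capacity is now {cap}");
}
```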
I saw that
I think this is it
If it fails, I think this approach will make debugging very painful. I have an alternative idea to make this validation more fine-grained. Though this approach is less comprehensive and can be a bit hacky to implement (directly extending the operator for this check), it can make troubleshooting much easier.
Update: I took an alternative approach, similar to what @2010YOUY01 suggested.
I switched back to using. There is a slight discrepancy due to minor vector allocations, so I added a margin to the check. Fortunately, in most cases the validation passes. However, for external sorting with string views, the validation currently fails, so further investigation is needed.
I also did some additional debugging to understand why SortPreservingMergeStream ends up using more memory than the pre-reserved amount. The root cause I identified is as follows: Except for slight discrepancies due to vector allocation overhead, I found several key sources of memory underestimation:
When performing SPM (SortPreservingMerge) over both spill files and in-memory streams, we only reserve memory using
The estimation was based on a fixed 2× multiplier, without considering differences between sort key and sort payload columns, or data types. In reality this varies significantly, and the current logic often under- (or over-) estimates. This is a known issue (#14748) and I'm actively working on it.
Looking at the implementation, SPM can buffer both the previous and current (cursor, batch) for each stream simultaneously.
That means, in the worst case, SPM can use up to 2 ×
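Assuming the elided term above is the per-stream maximum batch size, the worst case can be sketched as follows (illustrative helper, not DataFusion's actual code):

```rust
// Worst-case sketch: SPM may buffer both the previous and the current
// (cursor, batch) pair for every stream at the same time, hence the 2x.
fn worst_case_spm_memory(num_streams: usize, max_batch_memory: usize) -> usize {
    2 * num_streams * max_batch_memory
}

fn main() {
    // e.g. merging 6 streams whose largest batch is 1 MiB each
    let worst = worst_case_spm_memory(6, 1024 * 1024);
    assert_eq!(worst, 12 * 1024 * 1024);
    println!("worst case: {worst} bytes");
}
```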
Thank you, this approach is a good idea.
I think this PR is ready, however I think we should work on fixing the failures first, then come back to this PR.
@@ -54,8 +54,13 @@ use futures::{FutureExt as _, Stream};
struct SpillReaderStream {
    schema: SchemaRef,
    state: SpillReaderStreamState,
    /// how much memory the largest memory batch is taking
I recommend also explaining that this is used for validation, and linking to the multi-level merge doc for background.
> max_record_batch_memory + MEMORY_MARGIN
{
    return Poll::Ready(Some(Err(
        DataFusionError::ResourcesExhausted(
We can use the resources_err!(...) macro here.
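A std-only sketch of the check being discussed (resources_err! is DataFusion's macro; a plain error type stands in here, and MEMORY_MARGIN is the margin mentioned earlier, with an illustrative value):

```rust
const MEMORY_MARGIN: usize = 8 * 1024; // illustrative margin for small allocations

#[derive(Debug)]
struct ResourcesExhausted(String);

// Per-poll validation sketch: fail fast when a batch read back from a spill
// file is larger than the amount that was reserved for it, plus the margin.
fn check_batch_memory(
    batch_memory: usize,
    max_record_batch_memory: usize,
) -> Result<(), ResourcesExhausted> {
    if batch_memory > max_record_batch_memory + MEMORY_MARGIN {
        return Err(ResourcesExhausted(format!(
            "batch uses {batch_memory} bytes but only {max_record_batch_memory} were reserved"
        )));
    }
    Ok(())
}

fn main() {
    assert!(check_batch_memory(1_000, 10_000).is_ok());
    assert!(check_batch_memory(20_000, 10_000).is_err());
}
```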
For point 1: I vaguely remember that in multi-level merge there is logic to re-spill in-memory batches before the final merge, so that we don't need special handling for the mixed in-mem + spills case 🤔 If I'm not remembering it correctly, or we have missed some edge cases, we should do it (spill all in-mems before the final merge) for simplicity now. For point 2: I was expecting this would be better done after #15380, but it seems that optimization got stuck; I'll look into this issue in the next few days.
After taking another look, it seems that the in-mem + spill case only happens in the first-round merge. After that, everything gets spilled. So while it's true that this case may use more memory than the reservation, it doesn't seem to be the major case, and I’ll hold off on addressing it for now.
I’ve opened a new PR to address it. Would appreciate it if you could take a look :) Besides that, just as a side note: I’m currently looking into a failing test case in this PR (memory validation). It’s related to
Which issue does this PR close?
SortPreservingMergeStream
#16909.

Rationale for this change
In multi-level merge, we first reserve the estimated memory needed for merging sorted spill files, and bypass the global memory pool when creating SortPreservingMergeStream (SPM for short). The purpose is to ensure that we can finish the SPM step without running out of memory, by keeping a worst-case memory reservation until SPM ends.

grow merge_reservation based on max batch memory per spill file
datafusion/datafusion/physical-plan/src/sorts/multi_level_merge.rs
Lines 256 to 268 in 66d6995
bypass global buffer pool (use unbounded memory pool)
datafusion/datafusion/physical-plan/src/sorts/multi_level_merge.rs
Lines 326 to 336 in 66d6995
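To illustrate why bypassing the pool matters, here is a std-only sketch of "greedy" bounded-pool semantics versus the unbounded trick: a bounded pool rejects growth past its capacity, while an unbounded pool always succeeds and hides over-reservation. Names are illustrative; DataFusion's actual MemoryPool trait differs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal bounded pool: try_grow fails once usage would exceed capacity.
struct BoundedPool {
    capacity: usize,
    used: AtomicUsize,
}

impl BoundedPool {
    fn try_grow(&self, bytes: usize) -> Result<(), String> {
        let prev = self.used.fetch_add(bytes, Ordering::SeqCst);
        if prev + bytes > self.capacity {
            // roll back the speculative addition before reporting the error
            self.used.fetch_sub(bytes, Ordering::SeqCst);
            return Err(format!(
                "cannot grow by {bytes}: pool capacity {} exceeded",
                self.capacity
            ));
        }
        Ok(())
    }
}

fn main() {
    let pool = BoundedPool { capacity: 100, used: AtomicUsize::new(0) };
    assert!(pool.try_grow(60).is_ok());
    assert!(pool.try_grow(60).is_err()); // 120 > 100: caught, not hidden
    assert!(pool.try_grow(40).is_ok());  // 60 + 40 == 100 fits exactly
}
```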
Since we use UnboundedMemoryPool as a trick, we don't validate whether this memory_reservation is the actual upper limit during the SPM step of multi-level merge. Therefore, we need to validate that the memory consumption in SPM does not exceed the size of memory_reservation.

What changes are included in this PR?
This PR adds a check to SpillReaderStream so that whenever a spill stream is polled, the memory size of the batch being read must not exceed max_record_batch_memory + margin. This allows us to detect cases where we made an incorrect (underestimated) memory reservation, for example when a batch consumes more memory after the write-read cycle than originally expected.

This PR also creates a separate GreedyMemoryPool with the size of memory_reservation, instead of using UnboundedMemoryPool, when merging spill files (and in-memory streams) in multi-level merge.

Are these changes tested?
Yes, and the following tests related to spilling fail 😢
Maybe our previous worst-case memory estimation was wrong, but we don't understand why at this point. We need more investigation here; I'll put more details in the comments.
Are there any user-facing changes?