Docs: Add Tuning Guide for small data / short queries #17040
Great explanation! Thanks.
This configuration is important to tune, but it is also really challenging to find the optimal value. For some workloads on small datasets, the optimal number of partitions may lie anywhere between 1 and the number of physical cores.
I've been thinking about something similar for memory-limited scenarios: if the memory budget is very tight, using a smaller `target_partitions` and `batch_size` can actually lead to faster query completion.
I'm looking forward to seeing some ML magic help us find the optimal tuning.
That is a good point. I think the classic industrial approach is to use cost models to predict the size of intermediates, and thus memory consumption. However, that comes with all the problems of cardinality estimation. I think the "state of the art" approach these days is to do something dynamic: for example, when the plan starts hitting memory pressure, reconfigure the plan and partitioning at that point and redistribute memory. However, I don't know of any real-world system that does this, at least not to great effect, and I think it would be very complicated to implement. Maybe we can start by adding a note to the tuning guide that when the memory budget is very tight, using fewer target partitions can be helpful.
```sql
SET datafusion.execution.target_partitions = '1';
```
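For the memory-constrained case raised earlier in this thread, a minimal sketch might combine the setting above with a smaller batch size. The value `1024` is only an illustrative assumption, not a measured recommendation. Whether it helps depends on the workload, so benchmark before adopting it.

```sql
-- Run on a single partition, as in the guide's example
SET datafusion.execution.target_partitions = '1';
-- Illustrative value (assumption): smaller batches can reduce memory held per operator
SET datafusion.execution.batch_size = '1024';
-- Check that the settings took effect
SHOW datafusion.execution.target_partitions;
SHOW datafusion.execution.batch_size;
```

With a less restrictive memory budget, raising `target_partitions` toward the number of physical cores and comparing query times is a reasonable way to look for the best value.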
We have some cases like this and will give it a try.
Thank you @alamb
Which issue does this PR close?
Rationale for this change
I wrote up some guidance for running DataFusion with "small" data in #17025 (comment) that I felt was worth capturing in the documentation.
What changes are included in this PR?
Are these changes tested?
By CI (and I tested them locally)
Are there any user-facing changes?