alamb
Contributor

@alamb alamb commented Aug 4, 2025

Which issue does this PR close?

Rationale for this change

I wrote up some guidance for running DataFusion with "small" data in #17025 (comment) that I felt was worth capturing in the documentation.

What changes are included in this PR?

  • Add information about tuning datafusion for small data / short queries to the configuration page

Are these changes tested?

By CI (and I tested them locally)

Are there any user-facing changes?

@github-actions github-actions bot added documentation Improvements or additions to documentation development-process Related to development process of DataFusion labels Aug 4, 2025
@alamb alamb changed the title Add Tuning Guide for small data / short queries Docs: Add Tuning Guide for small data / short queries Aug 4, 2025
@alamb alamb marked this pull request as ready for review August 4, 2025 19:19
Contributor

@2010YOUY01 2010YOUY01 left a comment

Great explanation! Thanks.

This configuration is important to tune, but at the same time it's really challenging to find the optimal value. For some workloads on small datasets, the optimal number of partitions may also lie somewhere between 1 and the number of physical cores.

I've been thinking about something similar for memory-limited scenarios: if the memory budget is very tight, using smaller target_partitions and batch_size values can actually lead to faster query completion.

I'm looking forward to seeing some ML magic help us find the optimal tuning.

@alamb
Contributor Author

alamb commented Aug 5, 2025

> Great explanation! Thanks.
>
> This configuration is important to tune, but at the same time it's really challenging to find the optimal value. For some workloads on small datasets, the optimal number of partitions may also lie somewhere between 1 and the number of physical cores.
>
> I've been thinking about something similar for memory-limited scenarios: if the memory budget is very tight, using smaller target_partitions and batch_size values can actually lead to faster query completion.
>
> I'm looking forward to seeing some ML magic help us find the optimal tuning.

That is a good point

I think the classic industrial approach is to use cost models to predict the size of intermediates, and thus memory consumption. However, that comes with all the problems of cardinality estimation.

I think the "state of the art" approach these days is to do something dynamic -- like when the plan starts hitting memory pressure to reconfigure the plan / partitioning at that time and redistribute memory.

However, I don't know of any real-world system that does this, at least not to great effect, and I think it would be very complicated to implement.

Maybe we can start by adding a note to the tuning guide that when the memory budget is very tight, using fewer target partitions can be helpful.
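
As a rough sketch of what such a note could show, assuming a session that accepts `SET` statements (for example `datafusion-cli`): the values below are illustrative assumptions rather than measured recommendations, and the right numbers depend on the workload and the memory budget.

```sql
-- Illustrative values only (assumptions, not tuned recommendations).
-- Fewer partitions means fewer operators holding buffers at the same time.
SET datafusion.execution.target_partitions = '2';

-- Smaller batches reduce the memory held per in-flight batch
-- (the default batch size is 8192 rows).
SET datafusion.execution.batch_size = '1024';
```

Both settings trade some parallelism and per-batch efficiency for lower peak memory use, so it is worth benchmarking a few values instead of assuming a single best choice.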

Comment on lines +119 to +121
```sql
SET datafusion.execution.target_partitions = '1';
```
Member

We have some cases like this, so I will give it a try.

Member

@xudong963 xudong963 left a comment

Thank you @alamb

@xudong963 xudong963 merged commit f10deb6 into apache:main Aug 5, 2025
29 checks passed
@alamb alamb deleted the alamb/tuning_guide branch August 5, 2025 20:50
hknlof pushed a commit to hknlof/datafusion that referenced this pull request Aug 20, 2025