Docs: Add Tuning Guide for small data / short queries #17040

alamb · 2025-08-04T19:18:41Z

Which issue does this PR close?

part of [Epic] Better / Improved Documentation, Tutorials and Examples #7013

Rationale for this change

I wrote up some guidance for running DataFusion with "small" data in #17025 (comment) that I felt was worth capturing in the documentation.

What changes are included in this PR?

Add information about tuning DataFusion for small data / short queries to the configuration page.

Are these changes tested?

By CI (and I tested them locally)

Are there any user-facing changes?

2010YOUY01

Great explanation! Thanks.

This configuration is important to tune, but at the same time it's really challenging to find the optimal value. In some workloads on small datasets, the optimal number of partitions may also lie between 1 and the number of physical cores.
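As a rough sketch of trying an intermediate setting, the statements below use the same `SET` form as the snippet in this PR's diff. The value `4` is purely illustrative and not a recommendation; the useful range depends on the machine and the workload. The `SHOW` statement is assumed to run in `datafusion-cli` or in another session with the information schema enabled.

```sql
-- Illustrative value only: something between 1 and the number of physical cores
SET datafusion.execution.target_partitions = '4';

-- Inspect the value currently in effect (assumes the information schema is enabled)
SHOW datafusion.execution.target_partitions;
```

Timing a representative query at a few different values is probably the most reliable way to choose one.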

I've been thinking about something similar for memory-limited scenarios: if the memory budget is very tight, using smaller `target_partitions` and `batch_size` values can actually lead to faster query completion.
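For the memory-limited case, a minimal sketch might lower both settings together. The specific numbers are assumptions chosen for illustration (the default `batch_size` is 8192), and whether they help depends on the query and the memory budget.

```sql
-- Hypothetical values for a tight memory budget; measure before adopting
SET datafusion.execution.target_partitions = '2';
SET datafusion.execution.batch_size = '1024';
```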

I'm looking forward to seeing some ML magic help us find the optimal tuning.

alamb · 2025-08-05T10:15:44Z

> Great explanation! Thanks.
>
> This configuration is important to tune, but at the same time it's really challenging to find the optimal value. In some workloads on small datasets, the optimal number of partitions may also lie between 1 and the number of physical cores.
>
> I've been thinking about something similar for memory-limited scenarios: if the memory budget is very tight, using smaller `target_partitions` and `batch_size` values can actually lead to faster query completion.
>
> I'm looking forward to seeing some ML magic help us find the optimal tuning.

That is a good point

I think the classic industrial approach is to use cost models to predict the size of intermediates, and thus memory consumption. However, that comes with all the problems of cardinality estimation.

I think the "state of the art" approach these days is to do something dynamic -- like when the plan starts hitting memory pressure to reconfigure the plan / partitioning at that time and redistribute memory.

However, I don't know of any real-world system that does this, at least not to great effect, and I think it would be very complicated to implement.

Maybe we can start by adding a note to the tuning guide that when the memory budget is very tight, using fewer target partitions can be helpful.

xudong963 · 2025-08-05T11:28:20Z

dev/update_config_docs.sh

+```sql
+SET datafusion.execution.target_partitions = '1';
+```


We have some cases like this, so I will give it a try.

xudong963

Thank you @alamb

Add Tuning Guide for small data / short queries

e6c0edc

github-actions bot added the documentation and development-process labels Aug 4, 2025

This was referenced Aug 4, 2025

Docs: Add Examples to Config Options page #17039

Merged

Docs: Update the crate configuration / build settings page #17038

Merged

alamb changed the title ~~Add Tuning Guide for small data / short queries~~ Docs: Add Tuning Guide for small data / short queries Aug 4, 2025

alamb mentioned this pull request Aug 4, 2025

[Performance] Performance degrade when query for tons of small query #17025

Closed

alamb marked this pull request as ready for review August 4, 2025 19:19

2010YOUY01 approved these changes Aug 5, 2025

xudong963 reviewed Aug 5, 2025

xudong963 approved these changes Aug 5, 2025

xudong963 merged commit f10deb6 into apache:main Aug 5, 2025
29 checks passed

alamb deleted the alamb/tuning_guide branch August 5, 2025 20:50

2010YOUY01 mentioned this pull request Aug 7, 2025

Docs: Add Tuning Guide for larger-than-memory queries #17069

Merged

hknlof pushed a commit to hknlof/datafusion that referenced this pull request Aug 20, 2025

Add Tuning Guide for small data / short queries (apache#17040)

75e9192
