Docs: Add Tuning Guide for larger-than-memory queries (#17069)

2010YOUY01 · alamb · web-flow · commit 173989cc2fb5 · 2025-08-08T07:33:07.000-04:00
* tuning guide for out-of-core execution

* Update dev/update_config_docs.sh

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update dev/update_config_docs.sh

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update dev/update_config_docs.sh

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* fix ci

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
diff --git a/dev/update_config_docs.sh b/dev/update_config_docs.sh
@@ -149,6 +149,32 @@ SET datafusion.execution.target_partitions = '1';
 
 [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
 
+## Memory-limited Queries
+
+When executing a memory-consuming query under a tight memory limit, DataFusion 
+will spill intermediate results to disk.
+
+When the [`FairSpillPool`] is used, memory is divided evenly among partitions. 
+The higher the value of `datafusion.execution.target_partitions`, the less memory 
+is allocated to each partition, and the out-of-core execution path may trigger 
+more frequently, possibly slowing down execution.
+
+Additionally, spilled data is read back in batches of `datafusion.execution.batch_size` rows.
+The larger this value, the fewer spilled sorted runs can be merged at once. Decreasing this setting
+can help reduce the number of subsequent spills required.
+
+In conclusion, for queries under a very tight memory limit, it's recommended to
+set `target_partitions` and `batch_size` to smaller values.
+
+```sql
+-- The query is still parallelized, but each partition has more memory to use
+SET datafusion.execution.target_partitions = 4;
+-- Smaller than the default '8192', while still keeping the benefit of vectorized execution
+SET datafusion.execution.batch_size = 1024;
+```
+
+[`FairSpillPool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html
+
 EOF
 
 
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
@@ -220,3 +220,29 @@ SET datafusion.execution.target_partitions = '1';
 ```
 
 [`listingtable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
+
+## Memory-limited Queries
+
+When executing a memory-consuming query under a tight memory limit, DataFusion
+will spill intermediate results to disk.
+
+When the [`FairSpillPool`] is used, memory is divided evenly among partitions.
+The higher the value of `datafusion.execution.target_partitions`, the less memory
+is allocated to each partition, and the out-of-core execution path may trigger
+more frequently, possibly slowing down execution.
+
+Additionally, spilled data is read back in batches of `datafusion.execution.batch_size` rows.
+The larger this value, the fewer spilled sorted runs can be merged at once. Decreasing this setting
+can help reduce the number of subsequent spills required.
+
+In conclusion, for queries under a very tight memory limit, it's recommended to
+set `target_partitions` and `batch_size` to smaller values.
+
+```sql
+-- The query is still parallelized, but each partition has more memory to use
+SET datafusion.execution.target_partitions = 4;
+-- Smaller than the default '8192', while still keeping the benefit of vectorized execution
+SET datafusion.execution.batch_size = 1024;
+```
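+
+As a rough illustration of the `target_partitions` trade-off described above, the
+sketch below assumes a 1 GiB memory limit (a made-up figure) and treats the
+`FairSpillPool` split as a plain even division, which is a simplification:
+
+```sql
+-- Assumed memory limit: 1 GiB (1024 MiB), divided evenly among partitions
+-- target_partitions = 16  ->  roughly 1024 / 16 =  64 MiB per partition
+-- target_partitions = 4   ->  roughly 1024 / 4  = 256 MiB per partition
+SET datafusion.execution.target_partitions = 4;
+```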
+
+[`fairspillpool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html