diff --git a/dev/update_config_docs.sh b/dev/update_config_docs.sh index 3052f9b80327..af8ab04f3c4f 100755 --- a/dev/update_config_docs.sh +++ b/dev/update_config_docs.sh @@ -149,6 +149,32 @@ SET datafusion.execution.target_partitions = '1'; [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html +## Memory-limited Queries + +When executing a memory-consuming query under a tight memory limit, DataFusion +will spill intermediate results to disk. + +When the [`FairSpillPool`] is used, memory is divided evenly among partitions. +The higher the value of `datafusion.execution.target_partitions`, the less memory +is allocated to each partition, and the out-of-core execution path may trigger +more frequently, possibly slowing down execution. + +Additionally, while spilling, data is read back in `datafusion.execution.batch_size` size batches. +The larger this value, the fewer spilled sorted runs can be merged. Decreasing this setting +can help reduce the number of subsequent spills required. + +In conclusion, for queries under a very tight memory limit, it's recommended to +set `target_partitions` and `batch_size` to smaller values. + +```sql +-- Query still gets paralleized, but each partition will have more memory to use +SET datafusion.execution.target_partitions = 4; +-- Smaller than the default '8192', while still keep the benefit of vectorized execution +SET datafusion.execution.batch_size = 1024; +``` + +[`FairSpillPool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html + EOF diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md index 9190452c0aab..504d62829fb7 100644 --- a/docs/source/user-guide/configs.md +++ b/docs/source/user-guide/configs.md @@ -221,3 +221,29 @@ SET datafusion.execution.target_partitions = '1'; ``` [`listingtable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html + +## Memory-limited Queries + +When executing a memory-consuming query under a tight memory limit, DataFusion +will spill intermediate results to disk. + +When the [`FairSpillPool`] is used, memory is divided evenly among partitions. +The higher the value of `datafusion.execution.target_partitions`, the less memory +is allocated to each partition, and the out-of-core execution path may trigger +more frequently, possibly slowing down execution. + +Additionally, while spilling, data is read back in `datafusion.execution.batch_size` size batches. +The larger this value, the fewer spilled sorted runs can be merged. Decreasing this setting +can help reduce the number of subsequent spills required. + +In conclusion, for queries under a very tight memory limit, it's recommended to +set `target_partitions` and `batch_size` to smaller values. + +```sql +-- Query still gets paralleized, but each partition will have more memory to use +SET datafusion.execution.target_partitions = 4; +-- Smaller than the default '8192', while still keep the benefit of vectorized execution +SET datafusion.execution.batch_size = 1024; +``` + +[`fairspillpool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html