Commit 173989c

2010YOUY01 and alamb authored
Docs: Add Tuning Guide for larger-than-memory queries (#17069)
* tuning guide for out-of-core execution
* Update dev/update_config_docs.sh

  Co-authored-by: Andrew Lamb <[email protected]>
* Update dev/update_config_docs.sh

  Co-authored-by: Andrew Lamb <[email protected]>
* Update dev/update_config_docs.sh

  Co-authored-by: Andrew Lamb <[email protected]>
* fix ci

---------

Co-authored-by: Andrew Lamb <[email protected]>
1 parent f0630fb commit 173989c

2 files changed: +52, -0 lines changed

dev/update_config_docs.sh

Lines changed: 26 additions & 0 deletions
@@ -149,6 +149,32 @@ SET datafusion.execution.target_partitions = '1';

 [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html

+## Memory-limited Queries
+
+When executing a memory-consuming query under a tight memory limit, DataFusion
+will spill intermediate results to disk.
+
+When the [`FairSpillPool`] is used, memory is divided evenly among partitions.
+The higher the value of `datafusion.execution.target_partitions`, the less memory
+is allocated to each partition, and the out-of-core execution path may trigger
+more frequently, possibly slowing down execution.
+
+Additionally, while spilling, data is read back in batches of `datafusion.execution.batch_size` rows.
+The larger this value, the fewer spilled sorted runs can be merged. Decreasing this setting
+can help reduce the number of subsequent spills required.
+
+In conclusion, for queries under a very tight memory limit, it's recommended to
+set `target_partitions` and `batch_size` to smaller values.
+
+```sql
+-- Query still gets parallelized, but each partition will have more memory to use
+SET datafusion.execution.target_partitions = 4;
+-- Smaller than the default '8192', while still keeping the benefit of vectorized execution
+SET datafusion.execution.batch_size = 1024;
+```
+
+[`FairSpillPool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html
+
 EOF


docs/source/user-guide/configs.md

Lines changed: 26 additions & 0 deletions
@@ -220,3 +220,29 @@ SET datafusion.execution.target_partitions = '1';
 ```

 [`listingtable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
+
+## Memory-limited Queries
+
+When executing a memory-consuming query under a tight memory limit, DataFusion
+will spill intermediate results to disk.
+
+When the [`FairSpillPool`] is used, memory is divided evenly among partitions.
+The higher the value of `datafusion.execution.target_partitions`, the less memory
+is allocated to each partition, and the out-of-core execution path may trigger
+more frequently, possibly slowing down execution.
+
+Additionally, while spilling, data is read back in batches of `datafusion.execution.batch_size` rows.
+The larger this value, the fewer spilled sorted runs can be merged. Decreasing this setting
+can help reduce the number of subsequent spills required.
+
+In conclusion, for queries under a very tight memory limit, it's recommended to
+set `target_partitions` and `batch_size` to smaller values.
+
+```sql
+-- Query still gets parallelized, but each partition will have more memory to use
+SET datafusion.execution.target_partitions = 4;
+-- Smaller than the default '8192', while still keeping the benefit of vectorized execution
+SET datafusion.execution.batch_size = 1024;
+```
+
+[`fairspillpool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html
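
The reasoning in the added section can be checked with rough arithmetic. The sketch below is a simplified model, not DataFusion's actual memory accounting. It assumes the pool is split evenly across `target_partitions`, that merging keeps one batch per spilled run in memory, and that rows have a fixed width. The 64 MiB pool size, the 100-byte row width, and both function names are hypothetical values chosen for illustration.

```python
def per_partition_budget(pool_bytes: int, target_partitions: int) -> int:
    """Even split of the memory pool across partitions (simplifying assumption)."""
    return pool_bytes // target_partitions


def max_merge_fan_in(budget_bytes: int, batch_size: int, row_bytes: int) -> int:
    """How many spilled runs can be merged at once if each run
    buffers exactly one batch of `batch_size` rows (simplifying assumption)."""
    return budget_bytes // (batch_size * row_bytes)


POOL_BYTES = 64 * 1024 * 1024  # hypothetical 64 MiB memory limit
ROW_BYTES = 100                # hypothetical average row width

for partitions, batch_size in [(16, 8192), (4, 8192), (4, 1024)]:
    budget = per_partition_budget(POOL_BYTES, partitions)
    fan_in = max_merge_fan_in(budget, batch_size, ROW_BYTES)
    print(f"partitions={partitions} batch_size={batch_size} "
          f"budget={budget} bytes, mergeable runs={fan_in}")
```

Under these assumptions, going from 16 to 4 partitions raises the per-partition budget fourfold, and shrinking the batch size from 8192 to 1024 lets many more spilled runs be merged in a single pass, which is the effect the guide describes.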
