Skip to content

Commit f10deb6

Browse files
authored
Add Tuning Guide for small data / short queries (#17040)
1 parent 4c36226 commit f10deb6

File tree

2 files changed

+60
-0
lines changed

2 files changed

+60
-0
lines changed

dev/update_config_docs.sh

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -119,6 +119,38 @@ EOF
119119
echo "Running CLI and inserting runtime config docs table"
120120
$PRINT_RUNTIME_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"
121121

122+
cat <<'EOF' >> "$TARGET_FILE"
123+
124+
# Tuning Guide
125+
126+
## Short Queries
127+
128+
By default DataFusion will attempt to maximize parallelism and use all cores --
129+
For example, if you have 32 cores, each plan will split the data into 32
130+
partitions. However, if your data is small, the overhead of splitting the data
131+
to enable parallelization can dominate the actual computation.
132+
133+
You can find out how many cores are being used via the [`EXPLAIN`] command and look
134+
at the number of partitions in the plan.
135+
136+
[`EXPLAIN`]: sql/explain.md
137+
138+
The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
139+
[`ListingTable`] provider will attempt to repartition. However, this
140+
does not apply to user defined data sources and only works when DataFusion has accurate statistics.
141+
142+
If you know your data is small, you can set the `datafusion.execution.target_partitions`
143+
option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
144+
than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.
145+
146+
```sql
147+
SET datafusion.execution.target_partitions = '1';
148+
```
149+
150+
[`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
151+
152+
EOF
153+
122154

123155
echo "Running prettier"
124156
npx [email protected] --write "$TARGET_FILE"

docs/source/user-guide/configs.md

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -192,3 +192,31 @@ The following runtime configuration settings are available:
192192
| datafusion.runtime.max_temp_directory_size | 100G | Maximum temporary file directory size. Supports suffixes K (kilobytes), M (megabytes), and G (gigabytes). Example: '2G' for 2 gigabytes. |
193193
| datafusion.runtime.memory_limit | NULL | Maximum memory limit for query execution. Supports suffixes K (kilobytes), M (megabytes), and G (gigabytes). Example: '2G' for 2 gigabytes. |
194194
| datafusion.runtime.temp_directory | NULL | The path to the temporary file directory. |
195+
196+
# Tuning Guide
197+
198+
## Short Queries
199+
200+
By default DataFusion will attempt to maximize parallelism and use all cores --
201+
For example, if you have 32 cores, each plan will split the data into 32
202+
partitions. However, if your data is small, the overhead of splitting the data
203+
to enable parallelization can dominate the actual computation.
204+
205+
You can find out how many cores are being used via the [`EXPLAIN`] command and look
206+
at the number of partitions in the plan.
207+
208+
[`explain`]: sql/explain.md
209+
210+
The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
211+
[`ListingTable`] provider will attempt to repartition. However, this
212+
does not apply to user defined data sources and only works when DataFusion has accurate statistics.
213+
214+
If you know your data is small, you can set the `datafusion.execution.target_partitions`
215+
option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
216+
than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.
217+
218+
```sql
219+
SET datafusion.execution.target_partitions = '1';
220+
```
221+
222+
[`listingtable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html

0 commit comments

Comments
 (0)