Add Tuning Guide for small data / short queries (#17040)

alamb · web-flow · commit f10deb67a3d3 · 2025-08-05T19:29:02.000+08:00
diff --git a/dev/update_config_docs.sh b/dev/update_config_docs.sh
@@ -119,6 +119,38 @@ EOF
 echo "Running CLI and inserting runtime config docs table"
 $PRINT_RUNTIME_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"
 
+cat <<'EOF' >> "$TARGET_FILE"
+
+# Tuning Guide
+
+## Short Queries
+
+By default DataFusion will attempt to maximize parallelism and use all cores --
+For example, if you have 32 cores, each plan will split the data into 32
+partitions. However, if your data is small, the overhead of splitting the data
+to enable parallelization can dominate the actual computation.
+
+You can find out how many cores are being used via the [`EXPLAIN`] command and look
+at the number of partitions in the plan.
+
+[`EXPLAIN`]: sql/explain.md
+
+The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
+[`ListingTable`] provider will attempt to repartition. However, this
+does not apply to user defined data sources and only works when DataFusion has accurate statistics.
+
+If you know your data is small, you can set the `datafusion.execution.target_partitions`
+option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
+than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.
+
+```sql
+SET datafusion.execution.target_partitions = '1';
+```
+
+[`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
+
+EOF
+
 
 echo "Running prettier"
 npx prettier@2.3.2 --write "$TARGET_FILE"
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
@@ -192,3 +192,31 @@ The following runtime configuration settings are available:
 | datafusion.runtime.max_temp_directory_size | 100G    | Maximum temporary file directory size. Supports suffixes K (kilobytes), M (megabytes), and G (gigabytes). Example: '2G' for 2 gigabytes.    |
 | datafusion.runtime.memory_limit            | NULL    | Maximum memory limit for query execution. Supports suffixes K (kilobytes), M (megabytes), and G (gigabytes). Example: '2G' for 2 gigabytes. |
 | datafusion.runtime.temp_directory          | NULL    | The path to the temporary file directory.                                                                                                   |
+
+# Tuning Guide
+
+## Short Queries
+
+By default DataFusion will attempt to maximize parallelism and use all cores --
+For example, if you have 32 cores, each plan will split the data into 32
+partitions. However, if your data is small, the overhead of splitting the data
+to enable parallelization can dominate the actual computation.
+
+You can find out how many cores are being used via the [`EXPLAIN`] command and look
+at the number of partitions in the plan.
+
+[`explain`]: sql/explain.md
+
+The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
+[`ListingTable`] provider will attempt to repartition. However, this
+does not apply to user defined data sources and only works when DataFusion has accurate statistics.
+
+If you know your data is small, you can set the `datafusion.execution.target_partitions`
+option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
+than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.
+
+```sql
+SET datafusion.execution.target_partitions = '1';
+```
+
+[`listingtable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html