Description
DataFusion Performance Degradation with Small Batches
We've identified a significant performance degradation in DataFusion when querying small data batches or data sources that return no data. Our investigation has pinpointed two primary causes for this issue:
- Excessive Metrics Overhead: DataFusion's extensive metrics collection adds considerable overhead, particularly when the query processing time is minimal. The time spent recording and managing these metrics becomes a dominant factor, disproportionately impacting performance on small tasks.
- Fragmented Data Blocks: Processing numerous small, fragmented data blocks (even empty ones) leads to inefficiencies. The overhead of managing these individual fragments, rather than the data itself, consumes valuable processing time, exacerbating the performance bottleneck.
Proposed Solution
To address this problem, I propose adding a new configuration option that allows users to disable metrics collection. By setting an environment variable or a configuration flag, users could bypass the metrics system entirely. This should significantly reduce the overhead associated with metrics and improve performance for workloads involving small data batches or empty data sources.
I believe this solution offers a practical way to balance the need for performance with the utility of having detailed metrics, giving users the flexibility to optimize DataFusion for their specific use cases.
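As a rough illustration of the opt-out idea, here is a minimal, standard-library-only sketch. It is not DataFusion's API: `MetricsRecorder`, `OperatorMetrics`, and the `EXAMPLE_DISABLE_METRICS` environment variable are hypothetical names invented for this example. The point is only that, when metrics are disabled, the per-batch path skips the timer and counter updates entirely.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-operator metrics (not a DataFusion type).
#[derive(Debug, Default)]
struct OperatorMetrics {
    output_rows: usize,
    elapsed: Duration,
}

/// Hypothetical recorder: `inner` is `None` when collection is disabled.
struct MetricsRecorder {
    inner: Option<OperatorMetrics>,
}

impl MetricsRecorder {
    fn new(enabled: bool) -> Self {
        Self { inner: enabled.then(OperatorMetrics::default) }
    }

    /// Reads a made-up environment variable; metrics stay on unless it is "1" or "true".
    fn from_env() -> Self {
        let disabled = std::env::var("EXAMPLE_DISABLE_METRICS")
            .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
            .unwrap_or(false);
        Self::new(!disabled)
    }

    fn is_enabled(&self) -> bool {
        self.inner.is_some()
    }

    /// Runs `work` for one batch and returns its row count.
    /// Timing and counting happen only when metrics are enabled.
    fn record<F: FnOnce() -> usize>(&mut self, work: F) -> usize {
        match self.inner.as_mut() {
            Some(m) => {
                let start = Instant::now();
                let rows = work();
                m.elapsed += start.elapsed();
                m.output_rows += rows;
                rows
            }
            // Disabled: no clock reads, no counter updates.
            None => work(),
        }
    }

    fn output_rows(&self) -> Option<usize> {
        self.inner.as_ref().map(|m| m.output_rows)
    }

    fn elapsed(&self) -> Option<Duration> {
        self.inner.as_ref().map(|m| m.elapsed)
    }
}

fn main() {
    let mut recorder = MetricsRecorder::from_env();
    // Mostly empty batches, mimicking the workload described in this issue.
    let batches: Vec<Vec<i64>> = vec![vec![], vec![1, 2], vec![]];
    let total: usize = batches.iter().map(|b| recorder.record(|| b.len())).sum();
    println!("rows processed: {total}");
    println!("metrics enabled: {}", recorder.is_enabled());
    println!("recorded rows: {:?}, elapsed: {:?}", recorder.output_rows(), recorder.elapsed());
}
```

Whether an `Option`-style check like this is cheap enough in practice, and where such a flag would live in DataFusion's configuration, would need to be decided and measured in the actual codebase.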
To Reproduce
Write a SQL query containing a few aggregations and run it against an empty dataset.
Expected behavior
No response
Additional context
No response