Skip to content

[Performance] Performance degrade when query for tons of small query #17025

@Ma1oneZhang

Description

@Ma1oneZhang

DataFusion Performance Degradation with Small Batches

We've identified a significant performance degradation in DataFusion when querying small data batches or data sources that return no data. Our investigation has pinpointed two primary causes for this issue:

  1. Excessive Metrics Overhead: DataFusion's extensive metrics collection adds considerable overhead, particularly when the query processing time is minimal. The time spent recording and managing these metrics becomes a dominant factor, disproportionately impacting performance on small tasks.

  2. Fragmented Data Blocks: Processing numerous small, fragmented data blocks (even empty ones) leads to inefficiencies. The overhead of managing these individual fragments, rather than the data itself, consumes valuable processing time, exacerbating the performance bottleneck.

Proposed Solution

To address this problem, i believe that add a new configuration option that allows users to disable metrics collection. By setting an environment variable or a configuration flag, users can choose to bypass the metrics system entirely. This change will significantly reduce the overhead associated with metrics, leading to improved performance for workloads involving small data batches or empty data sources.

I believe this solution offers a practical way to balance the need for performance with the utility of having detailed metrics, giving users the flexibility to optimize DataFusion for their specific use cases.

To Reproduce

Create some aggregations SQL, and just run it in an empty dataset.

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions