Skip to content

Potential performance regression with parquet 56.1.0 / data ranges #17575

@alamb

Description

@alamb

DataFusion 56.1.0 includes a new predicate cache

We tried hard to include a switch to disable the cache to prevent regressions, but apparently it doesn't always work in all cases.

@nuno-faria reports:

I found a potential performance regression with parquet 56.1.0. Now more data pages will be fetched if their size is less than the execution batch size. For example:

use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let config = SessionConfig::new().with_target_partitions(1);
    let ctx = SessionContext::new_with_config(config);
    ctx.sql("set datafusion.execution.parquet.pushdown_filters = true")
        .await?
        .collect()
        .await?;

    ctx.sql(
        "
        copy (
            select i as k
            from generate_series(1, 1000000) as t(i)
            order by k
        ) to 't.parquet'
        options (MAX_ROW_GROUP_SIZE 100000, DATA_PAGE_ROW_COUNT_LIMIT 1000, WRITE_BATCH_SIZE 1000, DICTIONARY_ENABLED FALSE);",
    )
    .await?
    .collect()
    .await?;

    ctx.register_parquet("t", "t.parquet", ParquetReadOptions::new())
        .await?;

    ctx.sql("explain analyze select k from t where k = 123456")
        .await?
        .show()
        .await?;

    Ok(())
}

With parquet 56.0.0:

metrics=[..., bytes_scanned=1273, ...]

# some debug info showing that a single page is retrieved
total=1273
ranges=[132974..134247]

With parquet 56.1.0:

metrics=[..., bytes_scanned=9929, ...]

# some debug info showing that multiple pages are retrieved
total=9929
ranges=[125400..126482, 126482..127564, 127564..128646, 128646..129728, 129728..130810, 130810..131892, 131892..132974, 132974..134247, 134247..135329]

I think this is a consequence of apache/arrow-rs#7850, more specifically https://github.com/apache/arrow-rs/blame/0c7cb2ac3f3132216a08fd557f9b1edc7f90060f/parquet/src/arrow/arrow_reader/selection.rs#L445.

Originally posted by @nuno-faria in #17275 (comment)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions