DataFusion 56.1.0 includes a new predicate cache.
We tried hard to include a switch that disables the cache to prevent regressions, but apparently it does not work in all cases.
@nuno-faria reports:
I found a potential performance regression with parquet 56.1.0. Now more data pages will be fetched if their size is less than the execution batch size. For example:
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let config = SessionConfig::new().with_target_partitions(1);
    let ctx = SessionContext::new_with_config(config);

    ctx.sql("set datafusion.execution.parquet.pushdown_filters = true")
        .await?
        .collect()
        .await?;

    ctx.sql(
        "
        copy (
            select i as k
            from generate_series(1, 1000000) as t(i)
            order by k
        ) to 't.parquet'
        options (MAX_ROW_GROUP_SIZE 100000, DATA_PAGE_ROW_COUNT_LIMIT 1000, WRITE_BATCH_SIZE 1000, DICTIONARY_ENABLED FALSE);",
    )
    .await?
    .collect()
    .await?;

    ctx.register_parquet("t", "t.parquet", ParquetReadOptions::new())
        .await?;

    ctx.sql("explain analyze select k from t where k = 123456")
        .await?
        .show()
        .await?;

    Ok(())
}
With parquet 56.0.0:
metrics=[..., bytes_scanned=1273, ...]
# some debug info showing that a single page is retrieved
total=1273
ranges=[132974..134247]
With parquet 56.1.0:
metrics=[..., bytes_scanned=9929, ...]
# some debug info showing that multiple pages are retrieved
total=9929
ranges=[125400..126482, 126482..127564, 127564..128646, 128646..129728, 129728..130810, 130810..131892, 131892..132974, 132974..134247, 134247..135329]
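As a quick sanity check, the reported totals can be reproduced by summing the lengths of the byte ranges. This is a minimal sketch; `total_bytes` is a hypothetical helper written for illustration and is not part of parquet or DataFusion.

```rust
use std::ops::Range;

/// Hypothetical helper: sums the lengths of the fetched byte ranges,
/// mirroring the `total=` value in the debug output.
fn total_bytes(ranges: &[Range<u64>]) -> u64 {
    ranges.iter().map(|r| r.end - r.start).sum()
}

/// Byte ranges reported for parquet 56.0.0 (a single page).
fn ranges_56_0_0() -> Vec<Range<u64>> {
    vec![132974..134247]
}

/// Byte ranges reported for parquet 56.1.0 (nine pages).
fn ranges_56_1_0() -> Vec<Range<u64>> {
    vec![
        125400..126482,
        126482..127564,
        127564..128646,
        128646..129728,
        129728..130810,
        130810..131892,
        131892..132974,
        132974..134247,
        134247..135329,
    ]
}
```

Calling `total_bytes(&ranges_56_0_0())` gives 1273 and `total_bytes(&ranges_56_1_0())` gives 9929, matching `bytes_scanned` in both runs. The 56.1.0 fetch is the same 1273-byte page plus eight neighbouring pages of 1082 bytes each.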
I think this is a consequence of apache/arrow-rs#7850, more specifically https://github.com/apache/arrow-rs/blame/0c7cb2ac3f3132216a08fd557f9b1edc7f90060f/parquet/src/arrow/arrow_reader/selection.rs#L445.
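One possible reading of the numbers, offered as an assumption rather than a confirmed description of the reader: if the row selection is widened to the whole batch that contains the matching row, then with the default `datafusion.execution.batch_size` of 8192 and 1000-row pages, a single match pulls in every page that overlaps that batch. The sketch below models only that arithmetic. `pages_touched` is a hypothetical function, and it assumes batches are aligned to the start of the row group.

```rust
use std::ops::RangeInclusive;

/// Hypothetical model (not parquet's actual code): returns the indices of the
/// data pages, within one row group, that overlap the batch containing
/// `row_in_group`. Assumes batches start at row 0 of the row group and that
/// every page holds exactly `rows_per_page` rows.
fn pages_touched(
    row_in_group: u64,
    batch_size: u64,
    rows_per_page: u64,
) -> RangeInclusive<u64> {
    // First and last (inclusive) row of the enclosing batch.
    let batch_start = (row_in_group / batch_size) * batch_size;
    let batch_end = batch_start + batch_size - 1;
    // Pages that contain those two rows, and everything in between.
    (batch_start / rows_per_page)..=(batch_end / rows_per_page)
}
```

In the example, `k = 123456` is the row at zero-based offset 23455 in the second row group, so `pages_touched(23455, 8192, 1000)` yields pages 16 through 24. That is nine pages, with the matching page (23) in eighth position, which is the same shape as the nine ranges fetched with parquet 56.1.0.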
Originally posted by @nuno-faria in #17275 (comment)