
WIP: [Parquet] Add tests for IO/CPU access in parquet reader #7971


Draft
wants to merge 3 commits into
base: main
from

Conversation

alamb
Contributor

@alamb alamb commented Jul 21, 2025

Which issue does this PR close?

Rationale for this change

There is quite a bit of cleverness in the Parquet reader related to IO patterns. To ensure we don't introduce regressions in the existing code, I would like to add tests that cover the IO patterns of the Parquet reader.

Eventually I would like to revisit the "minimize IO at all costs" design of the Parquet reader (for use cases where the file is local, for example), but to do that I think we first need to better understand what the current reader does.

What changes are included in this PR?

Add a new test that:

  1. Creates a temporary parquet file with a known row group structure
  2. Reads data from that file using the Arrow Parquet Reader, recording the IO operations
  3. Asserts the expected IO patterns based on the read operations

This is done for both the sync and async readers.
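As a rough illustration of step 2, here is a minimal sketch of one way to record IO on a sync reader using only the standard library. It is not the code in this PR: `IoOp`, `RecordingReader`, `read_tail`, and `example_log` are hypothetical names, and an in-memory `Cursor` stands in for the temporary Parquet file. The wrapper assumes the inner reader starts at position 0.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};
use std::ops::Range;
use std::sync::{Arc, Mutex};

/// One recorded IO operation (hypothetical type, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum IoOp {
    /// Absolute position after a seek.
    Seek(u64),
    /// Byte range returned by a single `read` call.
    Read(Range<u64>),
}

/// Wraps any reader and appends every seek and read to a shared log.
/// Assumes `inner` starts at position 0.
struct RecordingReader<R> {
    inner: R,
    pos: u64,
    log: Arc<Mutex<Vec<IoOp>>>,
}

impl<R> RecordingReader<R> {
    /// Returns the wrapper plus a handle to the log for later assertions.
    fn new(inner: R) -> (Self, Arc<Mutex<Vec<IoOp>>>) {
        let log = Arc::new(Mutex::new(Vec::new()));
        let reader = Self { inner, pos: 0, log: Arc::clone(&log) };
        (reader, log)
    }
}

impl<R: Read> Read for RecordingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        if n > 0 {
            let range = self.pos..self.pos + n as u64;
            self.log.lock().unwrap().push(IoOp::Read(range));
        }
        self.pos += n as u64;
        Ok(n)
    }
}

impl<R: Seek> Seek for RecordingReader<R> {
    fn seek(&mut self, from: SeekFrom) -> io::Result<u64> {
        let new_pos = self.inner.seek(from)?;
        self.pos = new_pos;
        self.log.lock().unwrap().push(IoOp::Seek(new_pos));
        Ok(new_pos)
    }
}

/// Reads the last `n` bytes, similar in spirit to fetching a file footer.
fn read_tail<R: Read + Seek>(r: &mut R, n: usize) -> io::Result<Vec<u8>> {
    r.seek(SeekFrom::End(-(n as i64)))?;
    let mut buf = vec![0u8; n];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

/// Runs `read_tail` over 100 in-memory bytes and returns the recorded IO.
fn example_log() -> Vec<IoOp> {
    let data: Vec<u8> = (0..100u8).collect();
    let (mut reader, log) = RecordingReader::new(Cursor::new(data));
    read_tail(&mut reader, 8).expect("in-memory read should not fail");
    let ops = log.lock().unwrap().clone();
    ops
}
```

A test can then compare the recorded log against the expected sequence of operations, which corresponds to step 3. The same idea would need a different wrapper for the async reader, since that path does not go through `Read` and `Seek`.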

Are these changes tested?

This PR only adds tests.

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 21, 2025
@alamb alamb force-pushed the alamb/parquet_io_test branch from 2c8b561 to c2535a3 Compare July 22, 2025 16:28
@alamb alamb force-pushed the alamb/parquet_io_test branch from ba073f0 to 741c0d2 Compare July 22, 2025 20:54
@alamb
Contributor Author

alamb commented Jul 22, 2025

Update: I am quite pleased with how the sync reader looks. I am now working out how to test the async reader.
