Conversation

@123789456ye commented Sep 10, 2025

Which issue does this PR close?

Rationale for this change

We have long considered introducing a page-level cache, but previously we could only read whole row groups.
Now we can implement it: the predicate part has already been done, and the output part is left for this PR.

What changes are included in this PR?

The core change introduces a page-level cache in decode_page in impl RowGroupReader for SerializedRowGroupReader.
It is only effective for async readers, with nearly zero overhead for sync readers.
The cache mechanism uses the moka crate; this part is pluggable if we need to change it (the sketch below illustrates the idea).
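
For illustration, here is a minimal sketch of what a moka-backed page cache along these lines could look like. The PageKey layout, capacity, and the get_or_decode helper are assumptions for the example, not the PR's actual code:

```rust
use bytes::Bytes;
use moka::sync::Cache;

/// Illustrative cache key: one entry per decompressed page, identified by
/// its position in the file (the exact fields are an assumption).
#[derive(Clone, Hash, PartialEq, Eq)]
struct PageKey {
    row_group_idx: usize,
    column_idx: usize,
    page_offset: u64,
}

/// Build the global cache; `Bytes` clones are cheap reference-counted
/// handles, so cache hits avoid both I/O and decompression.
fn new_page_cache() -> Cache<PageKey, Bytes> {
    // 100 entries matches the default capacity discussed below
    // (~100 pages, slightly over 100 MB).
    Cache::builder().max_capacity(100).build()
}

/// Hypothetical decode path: return the cached decompressed page if present,
/// otherwise decompress once and cache the result for later readers.
fn get_or_decode(
    cache: &Cache<PageKey, Bytes>,
    key: PageKey,
    decompress: impl FnOnce() -> Bytes,
) -> Bytes {
    cache.get_with(key, decompress)
}
```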

Are these changes tested?

I ran cargo test and cargo test --features=arrow,async.
All tests pass.

Are there any user-facing changes?

No.

@github-actions bot added the parquet (Changes to the parquet crate) label Sep 10, 2025
@123789456ye (Author) commented Sep 10, 2025

I set the default cache capacity to 100 entries, which means we can cache 100 pages, using slightly more than 100 MB of memory.

I ran cargo bench --bench arrow_reader_clickbench --features "arrow async" "async" -- --nocapture --measurement-time 10 --save-baseline baseline to record the baseline,
and cargo bench --bench arrow_reader_clickbench --features "arrow async" "async" -- --nocapture --measurement-time 10 --baseline baseline for the comparison run.

Benchmarks were run locally on Ubuntu 22.04 LTS under WSL2.

Results are as follows; all times in the table are medians.

We think this is a very encouraging result.

| Query | Baseline Time | Current Time | Change |
|-------|---------------|--------------|--------|
| Q1    | 1.8 ms        | 1.9 ms       | +1.0%  |
| Q10   | 17.5 ms       | 10.5 ms      | -40.1% |
| Q11   | 20.0 ms       | 13.6 ms      | -31.9% |
| Q12   | 27.8 ms       | 16.8 ms      | -39.4% |
| Q13   | 39.4 ms       | 28.0 ms      | -28.9% |
| Q14   | 36.3 ms       | 23.8 ms      | -34.3% |
| Q19   | 5.1 ms        | 4.7 ms       | -7.6%  |
| Q20   | 95.8 ms       | 43.4 ms      | -54.7% |
| Q21   | 112.8 ms      | 52.8 ms      | -53.2% |
| Q22   | 191.3 ms      | 127.2 ms     | -33.5% |
| Q23   | 327.1 ms      | 259.7 ms     | -20.6% |
| Q24   | 34.7 ms       | 28.3 ms      | -18.4% |
| Q27   | 73.5 ms       | 35.6 ms      | -51.5% |
| Q28   | 77.5 ms       | 34.4 ms      | -55.6% |
| Q30   | 50.9 ms       | 40.3 ms      | -20.9% |
| Q36   | 96.6 ms       | 50.2 ms      | -48.0% |
| Q37   | 75.8 ms       | 46.0 ms      | -39.3% |
| Q38   | 30.8 ms       | 24.5 ms      | -20.7% |
| Q39   | 41.2 ms       | 26.2 ms      | -36.4% |
| Q40   | 45.2 ms       | 36.2 ms      | -20.1% |
| Q41   | 33.2 ms       | 29.7 ms      | -10.5% |
| Q42   | 11.6 ms       | 11.4 ms      | -2.2%  |

@123789456ye (Author) commented Sep 10, 2025

Basically, we are trading memory for skipping decompression and decoding.

The large improvement partly comes from the concurrency of the benchmark, which reads one file across multiple readers.

I also tested reader-level caching, but unfortunately performance regressed across the board, so we can only maintain a global cache.

@alamb (Contributor) commented Sep 10, 2025

Thank you for this @123789456ye -- I have started the CI checks on this PR

Perhaps @XiangpengHao also has some time to review this

@@ -52,6 +52,7 @@ parquet-variant-compute = { workspace = true, optional = true }
object_store = { version = "0.12.0", default-features = false, optional = true }

bytes = { version = "1.1", default-features = false, features = ["std"] }
moka = { version = "0.12", default-features = false, features = ["sync"] }
Contributor commented on the diff:
In general, we have tried to keep the dependency tree relatively small for parquet -- this one seems to be a significant addition: https://crates.io/crates/moka/0.12.10/dependencies

@alamb (Contributor) commented:
Rather than implement the cache directly in the parquet crate, I wonder if we could add a trait in the parquet crate and then users would provide implementations 🤔

@123789456ye (Author) replied:
I totally agree that we should not introduce it, and this part is easy to change.
But I hadn't considered not implementing a cache at all. If we don't provide a default implementation, isn't it tedious to write a custom implementation every time we want to use it?

@123789456ye (Author) replied:
> Rather than implement the cache directly in the parquet crate, I wonder if we could add a trait in the parquet crate and then users would provide implementations

For this part, you may review the trait PageCacheStrategy in page_cache.rs and see if it meets your needs.
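
For readers without the diff handy, the trait presumably looks something like the following; the key type, method names, and signatures here are guesses for illustration, not the actual contents of page_cache.rs:

```rust
use bytes::Bytes;

/// Illustrative key type (same assumption as the earlier moka sketch).
#[derive(Clone, Hash, PartialEq, Eq)]
pub struct PageKey {
    pub row_group_idx: usize,
    pub column_idx: usize,
    pub page_offset: u64,
}

/// A guess at the shape of the trait: implementors own the storage and
/// eviction policy; the reader only asks for a page, or hands one over
/// after decompressing it.
pub trait PageCacheStrategy: Send + Sync {
    /// Return the cached decompressed page, if any.
    fn get(&self, key: &PageKey) -> Option<Bytes>;

    /// Offer a freshly decompressed page to the cache.
    fn put(&self, key: PageKey, page: Bytes);
}
```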


@XiangpengHao (Contributor) commented:
Hi @123789456ye -- just to clarify, is the idea that this cache is mainly a predicate cache, or more of a general-purpose Parquet page cache?

If it's for the predicate cache, we already have a fully decoded Arrow data cache in #7850, which should take care of avoiding extra IO and decoding.

If it's a cache across different queries, my sense is that the OS page cache usually handles that pretty well. Storing decompressed Parquet pages feels like a pretty specific design, so it might be worth discussing the trade-offs, e.g., overhead, complexity, whether users can do it themselves, etc.

@123789456ye (Author) commented:
Thank you @XiangpengHao for your review.
The original motivation for this is reading from remote sources (e.g., object storage), where we need some cache.

> If it's for the predicate cache, we already have a fully decoded Arrow data cache in #7850, which should take care of avoiding extra IO and decoding.

This is designed for general-purpose reading. And yes, we have noticed that work. I should have remembered to split the output phase out from the pushdown phase, but I somehow missed it among the levels of readers.
When writing tests, I found that the current implementation also affects the pushdown phase. I am thinking about how to split them. (Or maybe there is no need to split?)

> If it's a cache across different queries, my sense is that the OS page cache usually handles that pretty well. Storing decompressed Parquet pages feels like a pretty specific design, so it might be worth discussing the trade-offs, e.g., overhead, complexity, whether users can do it themselves, etc.

Of course these trade-offs should be carefully discussed, though I think the OS page cache serves a different layer. IMO, the OS page cache caches raw bytes (i.e., compressed pages), while this cache caches decompressed pages.

@XiangpengHao (Contributor) commented:
Got it, thank you for clarifying, @123789456ye!

In a Parquet → Arrow pipeline, I usually think of it in four steps:

device -> raw parquet bytes in memory  -> uncompressed bytes in memory -> Arrow

Each of these steps can take a significant amount of time, and may warrant a cache. Personally, I'd lean towards keeping things flexible so that users can plug in the caching they need, rather than baking a specific policy directly into the parquet crate.

For example, @alamb is working on a push decoder, which will make step 1 very easy -- any end user can decide how/where to feed the required bytes.

Step 3 is a bit tricky because Parquet to Arrow is very non-trivial (but probably still doable).

Step 2 is what this PR tackles.

So my hope is that we can evolve the API in a direction where downstream users have the hooks (maybe traits?) to implement their own cache strategies, instead of locking in a particular approach inside the crate.

Hope this helps!

@ethe (Contributor) commented Sep 13, 2025

> Each of these steps can take a significant amount of time, and may warrant a cache.

This work is sponsored by Tonbo. Given the immutability of Parquet/Arrow, it would be very helpful in real-world projects if users could use caching to avoid as much computation (decompression, deserialization, etc.) and I/O as possible. That’s why we are looking for this feature. Unfortunately, for external users, the only cache level currently available is the raw bytes of a Parquet file.

I agree that the current implementation of arrow-rs lacks APIs for users to hook into caching. Do you think it would be meaningful to push forward a discussion or draft proposal for such an API?

@XiangpengHao (Contributor) commented:
> it would be very helpful in real-world projects if users could use caching to avoid as much computation (decompression, deserialization, etc.) and I/O as possible.

💯, I totally agree; almost everyone wants a cache for these computations.

> Do you think it would be meaningful to push forward a discussion or draft proposal for such an API?

Yea, that's my hope. While everyone wants a cache, they demand different policies depending on their data and query patterns; I think it's valuable if we can have a set of APIs that easily allow users to plug in their own policies/caching mechanisms.

maybe @alamb also has some opinions on this

@123789456ye (Author) commented Sep 16, 2025

I removed the concrete implementations and added some tests based on the I/O visualization.
The split between the predicate and output phases is still undecided, so the tests show the results of both.

Currently, the page-level cache is used as follows (see the sketch after this list):

  • First, have a page cache strategy that implements PageCacheStrategy.
  • Then, before any reader is built, call ParquetContext::set_cache(page_cache: Option<Arc<dyn PageCacheStrategy>>) to set up the global cache.
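
A minimal sketch of that flow, assuming the trait shape guessed at earlier; ParquetContext::set_cache is the API named in this PR, while MokaPageCache and the PageKey fields are illustrative:

```rust
use std::sync::Arc;

use bytes::Bytes;
use moka::sync::Cache;

// PageKey and PageCacheStrategy as sketched earlier in this thread.

/// One possible strategy: a global, bounded moka cache.
struct MokaPageCache {
    inner: Cache<PageKey, Bytes>,
}

impl PageCacheStrategy for MokaPageCache {
    fn get(&self, key: &PageKey) -> Option<Bytes> {
        self.inner.get(key)
    }

    fn put(&self, key: PageKey, page: Bytes) {
        self.inner.insert(key, page);
    }
}

fn install_global_cache() {
    let strategy: Arc<dyn PageCacheStrategy> = Arc::new(MokaPageCache {
        inner: Cache::builder().max_capacity(100).build(),
    });
    // Per the steps above, this must happen before any reader is built.
    // ParquetContext::set_cache is the API introduced in this PR.
    ParquetContext::set_cache(Some(strategy));
}
```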

Maybe we can expand this cache to other stages and levels, but I think we can push this forward first.

@123789456ye requested a review from alamb September 16, 2025 11:22
@123789456ye (Author) commented:
Re-requesting review. What do you think about the API design, or anything else?

Successfully merging this pull request may close these issues.

[Parquet] Support page level cache for reading