Change default prefetch_hint to 512Kb to reduce number of object store requests when reading parquet files #18160
…default (set metadata_size_hint)
Approving on my end but let's wait to see what one more reviewer thinks, I may not be representing all viewpoints on this.
- pub metadata_size_hint: Option<usize>, default = None
+ /// Default setting to 512 KB, which should be sufficient for most parquet files,
+ /// it can reduce one I/O operation per parquet file. If the metadata is larger than
+ /// the hint, two reads will still be performed.
+ pub metadata_size_hint: Option<usize>, default = Some(512 * 1024)
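To make the doc comment concrete, here is a minimal sketch of the request count it describes. It is not DataFusion's or the parquet crate's code: `metadata_requests` is a hypothetical helper, and it assumes the reader first fetches the last `hint` bytes of the file (or only the 8-byte footer when no hint is set), then issues a second request only if the metadata did not fit in that first fetch.

```rust
/// Fixed Parquet footer: 4-byte little-endian metadata length + 4-byte magic "PAR1".
const FOOTER_SIZE: u64 = 8;

/// Hypothetical helper (not a DataFusion API): number of range requests
/// needed to load a file's metadata for a given size hint.
fn metadata_requests(file_size: u64, metadata_len: u64, hint: Option<u64>) -> u32 {
    // Assumption: the first request covers the tail of the file, at least
    // the 8-byte footer and at most the whole file.
    let first_fetch = hint.unwrap_or(FOOTER_SIZE).max(FOOTER_SIZE).min(file_size);
    if metadata_len + FOOTER_SIZE <= first_fetch {
        1 // footer and metadata both arrived in the first request
    } else {
        2 // a second, strictly sequential request is needed for the metadata
    }
}
```

Under these assumptions, a 10 MB file with 40 KB of metadata needs two requests with `None` and one with `Some(512 * 1024)`; a file whose metadata exceeds the hint still needs two.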
FWIW having some prefetch on as a default makes a ton of sense to me. I'd like to run the benchmarks to make sure it doesn't have a big impact. I'd guess there will be no positive or negative impact, since the benchmarks run against local disk.
Thank you @adriangb for the review. I agree, it should not affect local disk performance.
cc @alamb for double check.
will kick off benchmarks.
I think the potential downside of this approach is that it will make larger requests to objectstore / local disk by default and use slightly more memory for small files (it will always fetch / buffer 512K even if the actual footer is much smaller)
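The trade-off described here can be put into numbers. The sketch below is illustrative only: `first_fetch_bytes` and `overfetch_bytes` are hypothetical helpers, and they assume the first fetch is clamped to the file size, so a file smaller than the hint is simply read whole.

```rust
/// Fixed Parquet footer: 4-byte metadata length + 4-byte magic "PAR1".
const FOOTER_SIZE: u64 = 8;

/// Hypothetical helper: bytes fetched (and buffered) by the first request.
/// Assumption: the fetch is clamped to the file size.
fn first_fetch_bytes(file_size: u64, hint: u64) -> u64 {
    hint.max(FOOTER_SIZE).min(file_size)
}

/// Hypothetical helper: bytes fetched beyond what the footer and metadata
/// actually required.
fn overfetch_bytes(file_size: u64, metadata_len: u64, hint: u64) -> u64 {
    let needed = (metadata_len + FOOTER_SIZE).min(file_size);
    first_fetch_bytes(file_size, hint).saturating_sub(needed)
}
```

For a 100 MB file with 4 KB of metadata, `overfetch_bytes(100 * 1024 * 1024, 4096, 512 * 1024)` is 520,184 bytes, roughly half a megabyte of extra transfer and buffering per file. That is small for one file but adds up across many small files.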
Thanks, I agree that this is a concern if we have many small files.
I think this change makes sense but that we should do a few more things (I can help with this)
- Add a note in the upgrade guide saying the behavior changed
- Add some end to end tests somewhere that show the actual object store calls when reading parquet (aka something similar to here #18112 (comment)) -- and then test with the defaults as well as when we change the prefetch_hint
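As a rough illustration of the second bullet, a test could wrap the store, record every range request, and assert on the recorded calls with and without a hint. Everything below is a hypothetical in-memory stand-in (`CountingStore`, `fake_parquet`, `read_metadata`), not the `object_store` or DataFusion API, and the read logic is a simplified assumption about how a reader uses the hint.

```rust
use std::ops::Range;

/// Hypothetical in-memory stand-in for an object store that logs each range read.
struct CountingStore {
    data: Vec<u8>,
    requests: Vec<Range<usize>>,
}

impl CountingStore {
    fn new(data: Vec<u8>) -> Self {
        Self { data, requests: Vec::new() }
    }

    fn get_range(&mut self, range: Range<usize>) -> Vec<u8> {
        self.requests.push(range.clone());
        self.data[range].to_vec()
    }
}

/// Fake file layout: body, metadata bytes, 4-byte LE metadata length, "PAR1".
fn fake_parquet(body_len: usize, metadata_len: usize) -> Vec<u8> {
    let mut file = vec![0u8; body_len];
    file.extend(std::iter::repeat(7u8).take(metadata_len));
    file.extend_from_slice(&(metadata_len as u32).to_le_bytes());
    file.extend_from_slice(b"PAR1");
    file
}

/// Simplified metadata read: fetch the tail (sized by the hint), then issue a
/// second request only if the metadata was not fully covered by the tail.
fn read_metadata(store: &mut CountingStore, hint: Option<usize>) -> Vec<u8> {
    let file_size = store.data.len();
    let prefetch = hint.unwrap_or(8).max(8).min(file_size);
    let tail_start = file_size - prefetch;
    let tail = store.get_range(tail_start..file_size);
    let n = tail.len();
    assert_eq!(&tail[n - 4..], &b"PAR1"[..]);
    let metadata_len =
        u32::from_le_bytes([tail[n - 8], tail[n - 7], tail[n - 6], tail[n - 5]]) as usize;
    let metadata_start = file_size - 8 - metadata_len;
    if metadata_start >= tail_start {
        // Metadata is already in the prefetched tail: no further request.
        tail[metadata_start - tail_start..n - 8].to_vec()
    } else {
        // Assumption: re-fetch the whole metadata range in a second request.
        store.get_range(metadata_start..file_size - 8)
    }
}
```

A test built this way would assert that `store.requests.len()` is 2 with `None` and 1 with `Some(512 * 1024)` for a file whose metadata fits in the hint, which is the kind of regression check the bullet asks for.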
🤖: Benchmark completed
Interesting, it improved a lot in some cases, even on local disk.
Yeah, that is unexpected -- let me rerun the benchmark and see if I can reproduce it locally.
🤖: Benchmark completed
🤔 looks like a rough day on the benchmark farm. The second time the results for Q22 are the same.
FYI @BlakeOrth -- I think this should also have a significant effect on the number of object store requests made by default (should reduce the number of requests by one for each file)
@alamb Yes, agreed this should be a positive performance improvement on most datasets when using high latency storage, especially since fetching the parquet footer followed by the parquet metadata is a strictly sequential operation for each file. The benchmark results here are a bit curious and look inconsistent (perhaps due to reasons out of everyone's control). However, I wouldn't be too surprised to see minor performance improvements from some local disk backed queries. The 8B fetch for the parquet footer is below pretty much any reasonable storage device's and file system's block size, so the local disk and filesystem are probably doing the same amount of work in either case, and this PR eliminates one extra call to disk and any internal runtime scheduling around managing that call.
Co-authored-by: Daniël Heres <[email protected]>
Which issue does this PR close?
Rationale for this change
Reduce number of object store requests when reading parquet files by default (set metadata_size_hint)
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
No