Replies: 1 comment
I was able to reduce parallelism by tuning |
Hello,
I am writing a program that takes N Parquet files (N = 40). Each source file is about 6 to 8 MB and is Zstd-compressed. The files are compacted into a single larger Parquet file (about 220 to 250 MB). It appears that we need as much as ~24 GB of memory for the compaction to succeed. The interesting bit is export_with_datafusion in this gist: https://gist.github.com/ndchandar/3900558ff719cefeb8b058e36a18f8be#file-parquet_rewriter-rs-L32-L138. It lists all files in a directory, takes N of them, and compacts them. I tried hinting to the optimizer that the sources are already sorted, but it doesn't seem to help. The row group size is set to 1M. With less memory (e.g. 12 or 16 GB), I run into the issue below.
I am trying to understand why spilling is not happening efficiently (I am relatively new to DataFusion). I'm looking for any help or hints on reducing memory usage.
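To illustrate why already-sorted sources could in principle be combined with little memory, here is a minimal k-way merge sketch using only the Rust standard library. It is not DataFusion's code and not the code from the gist; `merge_sorted`, `sources`, and `emit` are hypothetical names, and plain `i64` values stand in for rows. The heap holds at most one pending value per source, so memory grows with the number of sources rather than with the total data size. A real Parquet merge buffers decoded batches rather than single values, so the per-source cost would be much larger than in this sketch.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted sources into one sorted stream.
/// Hypothetical helper: only one pending value per source is held in the heap.
fn merge_sorted<I>(mut sources: Vec<I>, mut emit: impl FnMut(i64))
where
    I: Iterator<Item = i64>,
{
    // Min-heap of (value, source index); `Reverse` turns the max-heap into a min-heap.
    let mut heap = BinaryHeap::new();

    // Seed the heap with the first value of each non-empty source.
    for (idx, src) in sources.iter_mut().enumerate() {
        if let Some(v) = src.next() {
            heap.push(Reverse((v, idx)));
        }
    }

    // Pop the smallest pending value, emit it, then refill from the same source.
    while let Some(Reverse((v, idx))) = heap.pop() {
        emit(v);
        if let Some(next) = sources[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
}
```

Calling `merge_sorted` with a vector of sorted iterators and a closure that appends to a `Vec` (or writes to a file) yields the values in globally sorted order without ever collecting all inputs at once.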