Feedback on high memory usage when merging N parquet files #18833

ndchandar · 2025-11-20T02:29:40Z


Hello,
I am writing a program that takes N parquet files (where N = 40). Each source parquet file is about 6 to 8 MB in size and is Zstd compressed. They are compacted/combined to produce one larger parquet file (~220 to ~250 MB). It appears that we need as much as ~24 GB of memory for the compaction to succeed. The interesting bit is export_with_datafusion in this gist: https://gist.github.com/ndchandar/3900558ff719cefeb8b058e36a18f8be#file-parquet_rewriter-rs-L32-L138. It lists all files in a directory, takes N files and compacts them. I tried giving hints to the optimizer that the sources are already sorted, but it doesn't seem to help. The row group size is set to 1M.

With less memory (e.g. 12 or 16 GB), I run into the following error:

Caused by:
    Resources exhausted: Failed to allocate additional 2.0 MB for ExternalSorterMerge[4] with 49.8 MB already allocated for this reservation - 1826.2 KB remain available for the total pool

I am trying to understand why the spill is not happening efficiently (I am relatively new to DataFusion). I am looking for any help or hints to reduce memory utilization.
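To make the numbers in the error message easier to read, here is a small arithmetic sketch. It assumes the message uses binary units (1 MB = 1024 KB); the function name `shortfall_kb` is made up for illustration and is not part of DataFusion.

```rust
/// Kilobytes per megabyte, assuming the error message uses binary units.
const KB_PER_MB: f64 = 1024.0;

/// Returns how many KB the pool is short for a request,
/// or 0.0 if the request fits in what remains.
fn shortfall_kb(requested_mb: f64, remaining_kb: f64) -> f64 {
    let requested_kb = requested_mb * KB_PER_MB;
    (requested_kb - remaining_kb).max(0.0)
}
```

With the values from the message, `shortfall_kb(2.0, 1826.2)` comes to roughly 221.8 KB. The failing reservation itself holds only 49.8 MB, so if the pool size matches the 12 or 16 GB limit, almost all of the pool would be held by other reservations when this request fails.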

ndchandar · 2025-11-24T23:44:49Z


I was able to reduce parallelism by tuning datafusion.execution.target_partitions for our workloads. This resulted in lower memory and CPU usage. I also bumped datafusion.execution.parquet.write_batch_size to a much higher number (from the default 8192 to 65536). Are there other parameters that I could tune? I am trying to find the balance between memory/CPU usage and reasonably quick compaction/merging.
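For reference, the two settings above can be written as SQL `SET` statements. This is a minimal sketch, not the code from the gist: the helper name `tuning_statements` is hypothetical, and the partition count used in the note below is only an example value.

```rust
/// Builds SQL `SET` statements for the two options mentioned above.
/// The values passed in are illustrative, not recommendations.
fn tuning_statements(target_partitions: usize, write_batch_size: usize) -> Vec<String> {
    vec![
        format!("SET datafusion.execution.target_partitions = {target_partitions}"),
        format!("SET datafusion.execution.parquet.write_batch_size = {write_batch_size}"),
    ]
}
```

For example, `tuning_statements(4, 65536)` yields one statement per option. Each string could presumably be run through `SessionContext::sql` before the compaction query, so the values can be changed per workload without recompiling.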
