Conversation

@adamreeve (Contributor) commented Oct 21, 2025

Which issue does this PR close?

Rationale for this change

Makes the metadata heap size calculation more accurate when reading encrypted Parquet files, which helps to better manage caches of Parquet metadata.

What changes are included in this PR?

  • Accounts for heap allocations related to the FileDecryptor in ParquetMetaData
  • Does not account for any user-provided KeyRetriever

Are these changes tested?

Yes, there's a new unit test added that computes the heap size with a decryptor.

I also did a manual test that created a test Parquet file with 100 columns using per-column encryption keys, and loaded 10,000 copies of the ParquetMetaData into a vector. heaptrack reported 1.1 GB of heap memory allocated by this test program. Prior to this change, the summed memory size of the metadata was reported as 879.2 MB; afterwards it was 961.7 MB.
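
For reference, a rough sketch of the shape of that manual test; the file name and copy count are placeholders, and the decryption setup needed to open the encrypted file is elided for brevity:

use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical input; the real test used a file with 100 columns encrypted
    // with per-column keys, which also needs decryption properties to open.
    let file = File::open("test.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let metadata = reader.metadata().clone();

    // Hold many copies so heaptrack's measurement dwarfs any fixed overhead,
    // then compare it against the size the metadata reports for itself.
    let copies: Vec<_> = (0..10_000).map(|_| metadata.clone()).collect();
    let reported: usize = copies.iter().map(|m| m.memory_size()).sum();
    println!("reported heap size: {reported} bytes");
    Ok(())
}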

I'm not sure if there's any better way to test the accuracy of this calculation?

Are there any user-facing changes?

No

This was co-authored by @etseidl. I haven't changed their original implementation much beyond adding a test and some comments, and updating the HeapSize implementation for HashMap.

@github-actions bot added the parquet (Changes to the parquet crate) label on Oct 21, 2025
}
}

impl<K: HeapSize, V: HeapSize> HeapSize for HashMap<K, V> {
@adamreeve (Contributor Author) commented:

This is likely to be an underestimate of the HashMap heap size, as @etseidl mentioned in #8472 (comment). Internally, std::collections::HashMap uses a hashbrown::HashMap, which holds a block of (K, V) pairs that could have a different size than size_of::<K>() + size_of::<V>() due to alignment, although it looks like the sizes do match for the (String, Vec<u8>) pair used for column keys. The number of allocated buckets is also derived from the capacity via a load factor, so the capacity underestimates the bucket count, and there isn't a way to get the number of internal buckets through the public API.

I'm not sure how much we want to depend on internal implementation details of HashMap to improve the accuracy of this, or whether it's better to under- or overestimate the memory used. Maybe it would be better for this to be an overestimate?
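
For context, a minimal self-contained sketch of the kind of per-entry accounting being discussed; the HeapSize trait below is a stand-in for illustration, not the crate's actual trait:

use std::collections::HashMap;
use std::mem::size_of;

// Stand-in trait, just for illustration.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl<K: HeapSize, V: HeapSize> HeapSize for HashMap<K, V> {
    fn heap_size(&self) -> usize {
        // One (K, V) slot per unit of capacity, plus the entries' own heap data.
        // This ignores hashbrown's control bytes, (K, V) alignment padding and
        // the 7/8 load factor, so it tends to underestimate.
        self.capacity() * (size_of::<K>() + size_of::<V>())
            + self
                .iter()
                .map(|(k, v)| k.heap_size() + v.heap_size())
                .sum::<usize>()
    }
}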

@etseidl (Contributor) commented:

Yeah, if the point is to not overrun available memory, it's probably safer to overestimate.

Comment on lines +56 to +57
// Ring's LessSafeKey doesn't allocate on the heap
0
@adamreeve (Contributor Author) commented Oct 21, 2025:

I looked into changing FileDecryptor to hold a Vec<u8> for the footer key instead of an Arc<dyn BlockDecryptor> to simplify the heap size calculation, as mentioned in #8472 (comment). But this decreased read speed by about 10% in a small test case, and also increased the memory usage.

After looking more closely at the LessSafeKey implementation, I don't think it holds any heap-allocated memory.

I think it's fine to assume the heap size is going to stay zero. The ring crate has an alloc feature that isn't required for the aead module, so it would be a big change for this to start allocating.

Comment on lines +305 to +306
// The retriever is a user-defined type we don't control,
// so we can't determine the heap size.
@adamreeve (Contributor Author) commented:

As discussed in #8472, we could potentially add a new trait method to allow a key retriever to provide a heap size later.

@adamreeve requested a review from etseidl on October 21, 2025 02:47
@etseidl (Contributor) left a comment:

Thanks for running this to ground @adamreeve! I think we can punt on the retriever for now. We just need to decide what to do with hash map. 🤔

Comment on lines +575 to +581
/// Estimate the size in bytes required for the file decryptor.
/// This is important to track the memory usage of cached Parquet meta data,
/// and is used via [`crate::file::metadata::ParquetMetaData::memory_size`].
/// Note that when a [`KeyRetriever`] is used, its heap size won't be included
/// and the result will be an underestimate.
/// If the [`FileDecryptionProperties`] are shared between multiple files then the
/// heap size may also be an overestimate.
@etseidl (Contributor) commented:

❤️

@adamreeve (Contributor Author) commented:

> We just need to decide what to do with hash map

I've updated this to be more accurate and tried to match the actual hashmap implementation more closely without replicating all the details exactly. E.g. it doesn't account for some alignment calculations, and the group size is architecture-dependent, so this might be an overestimate.

The calculation of the number of buckets could maybe be simplified further, but I felt like small hash maps would be quite common so I didn't want to overestimate this too much.

This does feel a bit too complex, but the memory characteristics of the standard HashMap type seem unlikely to change often, so maybe this is OK...
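
A rough sketch of the bucket estimate described above; the 4/8-bucket small-map cases and the 7/8 load factor mirror hashbrown internals and could drift, and the function name is illustrative only:

use std::mem::size_of;

// Estimate the allocation behind a std HashMap from its capacity, roughly
// following hashbrown's sizing: small maps use 4 or 8 buckets, larger maps
// round capacity * 8/7 up to the next power of two.
fn estimated_hashmap_alloc_size<K, V>(capacity: usize) -> usize {
    let num_buckets = if capacity == 0 {
        0
    } else if capacity < 4 {
        4
    } else if capacity < 8 {
        8
    } else {
        (capacity * 8 / 7).next_power_of_two()
    };
    // One (K, V) slot plus one control byte per bucket; the real layout adds
    // group-sized padding and alignment that this estimate ignores.
    num_buckets * (size_of::<(K, V)>() + 1)
}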

@adamreeve (Contributor Author) commented:

Changing this back to a draft as I realised the handling of FileDecryptor::footer_decryptor isn't correct, and I'm not sure yet exactly how to handle it.

The implementation of HeapSize for Arc<T> looks wrong; it should match the implementation for Box, where the size of the contained item is included. But even if that's fixed, the Arc impl isn't used for an Arc<dyn BlockDecryptor>: instead the Arc is dereferenced, and only the HeapSize implementation of the contained type is used.
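
A hypothetical sketch of an Arc impl shaped like the Box one described above, again with a stand-in trait; the two extra usizes are the strong and weak counts stored in the Arc allocation:

use std::sync::Arc;

// Stand-in trait, just for illustration.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl<T: HeapSize + ?Sized> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // Size of the pointed-to value (size_of_val also handles trait objects
        // such as dyn BlockDecryptor), the two usize reference counts that live
        // in the same allocation, and whatever the value itself owns on the heap.
        std::mem::size_of_val(self.as_ref())
            + 2 * std::mem::size_of::<usize>()
            + self.as_ref().heap_size()
    }
}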

@adamreeve marked this pull request as draft on October 22, 2025 06:38
@etseidl (Contributor) commented Oct 22, 2025:

Hmm, this opens quite the can of worms. Now I'm looking at HeapSize for the schema, and we may be overcounting there. SchemaDescriptor is already counting the heap size of the tree of Type pointers, but then each ColumnDescriptor is also counting the same objects. Perhaps the impl for ColumnDescriptor should be more like self.path.heap_size() + 2 * std::mem::size_of::<usize>() 🤷

And what about Vec<Arc<T>>? Does sizeof for Arc include the pointers and ref counts as well?

Development

Successfully merging this pull request may close these issues.

ParquetMetaData memory size is not reported accurately when encryption is enabled

2 participants