[Parquet] Account for FileDecryptor in ParquetMetaData heap size calculation #8671
Conversation
impl<K: HeapSize, V: HeapSize> HeapSize for HashMap<K, V> {
This is likely to be an underestimate of the HashMap heap size, as @etseidl mentioned in #8472 (comment). Internally std::collections::HashMap uses a hashbrown::HashMap, which holds a block of (K, V) pairs that could have a different size than size_of::<K>() + size_of::<V>() due to alignment, although it looks like the size does match for the (String, Vec<u8>) pair used for column keys. The number of allocated buckets is also based on a load factor applied to the capacity, so the capacity will be an underestimate of the number of buckets, and there isn't a way to get the number of internal buckets from the public API.

I'm not sure how much we want to depend on internal implementation details of HashMap to improve the accuracy of this, or whether it's better to under- or overestimate the memory used. Maybe it would be better for this to be an overestimate?
Yeah, if the point is to not overrun available memory, it's probably safer to overestimate.
// Ring's LessSafeKey doesn't allocate on the heap
0
I looked into changing FileDecryptor to hold a Vec<u8> for the footer key instead of an Arc<dyn BlockDecryptor> to simplify the heap size calculation, as mentioned in #8472 (comment). But this decreased read speed by about 10% in a small test case, and also increased the memory usage.

After looking more closely at the LessSafeKey implementation, it doesn't appear to hold any heap-allocated memory. I think it's fine to assume the heap size is going to stay zero. The ring crate has an alloc feature that isn't required for the aead module, so it would be a big change for this to start allocating.
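As a rough illustration of that conclusion, here is a sketch of estimating the decryptor's heap size when the ring-backed block decryptor owns no heap memory. The field names are assumptions for the example, not the crate's actual FileDecryptor layout:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for the crate's BlockDecryptor trait, for illustration only.
trait BlockDecryptor: Send + Sync {}

// Hypothetical shape of the decryptor state; the real struct's fields differ.
struct FileDecryptor {
    footer_decryptor: Arc<dyn BlockDecryptor>,
    column_keys: HashMap<String, Vec<u8>>,
    file_aad: Vec<u8>,
}

impl FileDecryptor {
    fn heap_size(&self) -> usize {
        // Ring's LessSafeKey stores its key material inline, so the block
        // decryptor behind the Arc contributes no owned heap memory here.
        let footer_decryptor_size = 0;
        // Count the buffers owned by the per-column keys and the file AAD.
        let column_keys_size: usize = self
            .column_keys
            .iter()
            .map(|(name, key)| name.capacity() + key.capacity())
            .sum();
        footer_decryptor_size + column_keys_size + self.file_aad.capacity()
    }
}
```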
// The retriever is a user-defined type we don't control,
// so we can't determine the heap size.
As discussed in #8472, we could potentially add a new trait method to allow a key retriever to provide a heap size later.
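A hypothetical sketch of that idea; the trait shape, method names, and error type below are simplified assumptions rather than the crate's actual API:

```rust
use std::collections::HashMap;

// Hypothetical retriever trait with a default-implemented heap_size method,
// so existing implementations keep compiling while being able to opt in to
// reporting their memory usage.
trait KeyRetriever: Send + Sync {
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String>;

    /// Heap memory owned by the retriever, in bytes.
    fn heap_size(&self) -> usize {
        0
    }
}

// Example retriever that holds keys in memory and reports their size.
struct InMemoryKeyRetriever {
    keys: HashMap<String, Vec<u8>>,
}

impl KeyRetriever for InMemoryKeyRetriever {
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String> {
        let name = String::from_utf8_lossy(key_metadata).into_owned();
        self.keys
            .get(&name)
            .cloned()
            .ok_or_else(|| format!("unknown key: {name}"))
    }

    fn heap_size(&self) -> usize {
        self.keys
            .iter()
            .map(|(name, key)| name.capacity() + key.capacity())
            .sum()
    }
}
```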
Thanks for running this to ground @adamreeve! I think we can punt on the retriever for now. We just need to decide what to do with hash map. 🤔
/// Estimate the size in bytes required for the file decryptor.
/// This is important to track the memory usage of cached Parquet metadata,
/// and is used via [`crate::file::metadata::ParquetMetaData::memory_size`].
/// Note that when a [`KeyRetriever`] is used, its heap size won't be included
/// and the result will be an underestimate.
/// If the [`FileDecryptionProperties`] are shared between multiple files then the
/// heap size may also be an overestimate.
❤️
> Yeah, if the point is to not overrun available memory, it's probably safer to overestimate.

I've updated this to be more accurate and tried to match the actual hashmap implementation more closely without replicating all the details exactly. E.g. it doesn't account for some alignment calculations, and the group size is architecture dependent, so this might be an overestimate. The calculation of the number of buckets could maybe be simplified further, but I felt that small hash maps would be quite common, so I didn't want to overestimate those too much. This does feel a bit too complex, but changing the memory characteristics of the standard HashMap type seems like something that shouldn't happen often, so maybe this is OK.
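As a rough sketch of that direction (assuming hashbrown's 7/8 load factor and one control byte per bucket; this is not the PR's exact code, and the minimum table size used for small maps is ignored):

```rust
use std::collections::HashMap;
use std::mem::size_of;

// Stand-in for the crate's HeapSize trait, for illustration only.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl<K: HeapSize, V: HeapSize> HeapSize for HashMap<K, V> {
    fn heap_size(&self) -> usize {
        if self.capacity() == 0 {
            return 0;
        }
        // hashbrown keeps the table at most 7/8 full and rounds the bucket
        // count up to a power of two.
        let buckets = (self.capacity() * 8 / 7).next_power_of_two();
        // Each bucket stores a (K, V) pair plus a one-byte control tag;
        // group-size padding and alignment are not modelled here.
        let table_size = buckets * (size_of::<(K, V)>() + 1);
        // Add the heap memory owned by the keys and values themselves.
        let entries: usize = self
            .iter()
            .map(|(k, v)| k.heap_size() + v.heap_size())
            .sum();
        table_size + entries
    }
}
```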
Changing this back to draft as I realised the handling of … The implementation of …
Hmm, this opens quite the can of worms. Now I'm looking at … And what about …?
Which issue does this PR close?
Closes … encryption is enabled #8472.

Rationale for this change
Makes the metadata heap size calculation more accurate when reading encrypted Parquet files, which helps to better manage caches of Parquet metadata.
What changes are included in this PR?
- Account for the heap size of the FileDecryptor in the ParquetMetaData heap size calculation
- Document that the heap size of a user-provided KeyRetriever is not included
Are these changes tested?
Yes, there's a new unit test added that computes the heap size with a decryptor.
I also did a manual test that created a test Parquet file with 100 columns using per-column encryption keys, and loaded 10,000 copies of the ParquetMetaData into a vector. heaptrack reported 1.1 GB of heap memory allocated in this test program. Prior to this change, the summed metadata size was reported as 879.2 MB, and afterwards it was 961.7 MB.

I'm not sure if there's any better way to test the accuracy of this calculation?
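For reference, the measurement can be reproduced along these lines. This is only a sketch: the file path is a placeholder, and the decryption-properties setup the real test needed for the per-column keys is omitted, so as written it applies to an unencrypted file:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; the real test generated a file with 100 columns and
    // per-column encryption keys, which additionally requires configuring
    // file decryption properties when opening the reader.
    let file = File::open("metadata_test.parquet")?;
    let reader = SerializedFileReader::new(file)?;

    // Keep many copies of the metadata alive so a heap profiler's report can
    // be compared against the sum of the self-reported sizes.
    let copies: Vec<_> = (0..10_000).map(|_| reader.metadata().clone()).collect();
    let reported: usize = copies.iter().map(|m| m.memory_size()).sum();
    println!("sum of ParquetMetaData::memory_size: {reported} bytes");
    Ok(())
}
```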
Are there any user-facing changes?
No
This was co-authored by @etseidl. I haven't changed their original implementation much beyond adding a test and some comments, and updating the HeapSize implementation for HashMap.