GH-3356: Add buffers allocated by vectored IO for releasing #3357
Open. annimesh2809 wants to merge 1 commit into apache:master from annimesh2809:releasing_allocator
I'm not sure if this is the right direction. Is it better to make it a contract for ByteBufferAllocator implementations to take this responsibility?
WDYT? @gszadovszky
I'm not sure I get the concept of this PR.
ByteBufferAllocator actually has the contract of releasing the ByteBuffer allocated by it. The only thing we need to do is to invoke this method at the right time, when the related buffer is not needed anymore.
The ByteBufferReleaser concept came into scope only to easily postpone the release invocation to the time we really can release the related ByteBuffers. (By using BytesInput we may pass the related buffers around and it is not always clear when to release them.)
@annimesh2809, I would suggest implementing a unit test to reproduce the issue first. You may use TrackingByteBufferAllocator to fail if any allocated buffer is not released during the execution. You may find examples of its usage among the unit tests. If you find the issue, you'll need to ensure that the related allocated buffers get back to their allocator to be released. You may use the existing patterns we already have or invent new ones if necessary.
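To make the suggested leak check concrete, here is a minimal sketch of the idea behind a tracking allocator. It does not use Parquet's TrackingByteBufferAllocator; SimpleAllocator, LeakCheckingAllocator and LeakCheckDemo are hypothetical names invented for illustration, and the real class may behave differently in detail.

```java
import java.nio.ByteBuffer;
import java.util.IdentityHashMap;
import java.util.Map;

class LeakCheckDemo {
    // Hypothetical allocator contract, assumed for this sketch only.
    interface SimpleAllocator {
        ByteBuffer allocate(int size);
        void release(ByteBuffer buffer);
    }

    // Hypothetical tracking wrapper: remembers every buffer it hands out
    // and fails on close() if any of them was never released.
    static final class LeakCheckingAllocator implements SimpleAllocator, AutoCloseable {
        // Identity-based map: ByteBuffer.equals compares content, not object identity.
        private final Map<ByteBuffer, Throwable> live = new IdentityHashMap<>();

        @Override
        public ByteBuffer allocate(int size) {
            ByteBuffer buffer = ByteBuffer.allocate(size);
            live.put(buffer, new Throwable("allocation site"));
            return buffer;
        }

        @Override
        public void release(ByteBuffer buffer) {
            if (live.remove(buffer) == null) {
                throw new IllegalStateException("buffer not allocated here or already released");
            }
        }

        @Override
        public void close() {
            if (!live.isEmpty()) {
                // Attach one allocation site as the cause to help locate the leak.
                throw new IllegalStateException(
                        live.size() + " buffer(s) leaked", live.values().iterator().next());
            }
        }
    }

    // Returns true if close() reports a leak.
    static boolean leaks(boolean releaseBuffer) {
        LeakCheckingAllocator allocator = new LeakCheckingAllocator();
        ByteBuffer buffer = allocator.allocate(64);
        if (releaseBuffer) {
            allocator.release(buffer);
        }
        try {
            allocator.close();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("released:  leak=" + leaks(true));
        System.out.println("forgotten: leak=" + leaks(false));
    }
}
```

A reproducing unit test would follow the same shape: run the read path with the tracking allocator and let close() fail the test when a buffer is left behind.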
When I try to build parquet-mr with Hadoop 3.4.2 without any additional changes, I see the testRangeFiltering test case (and some others) of the TestParquetReader suite fail. The TrackingByteBufferAllocator reveals that the unreleased allocation happens in:
The root cause here seems to be that ChecksumFileSystem (coming from Hadoop) has started supporting readVectored: https://github.com/apache/hadoop/blob/branch-3.4.2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L460-L513.
ChecksumFileSystem.readVectored internally does more allocations like:
which are not marked for release by ByteBufferReleaser.
Also, with vectored reads it is not sufficient to mark the buffers returned by the allocator for release, as they are sliced internally and the returned buffer object is different even though the underlying memory remains the same.
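The slicing point can be illustrated with plain java.nio, independent of Parquet or Hadoop. This is a sketch only: SliceIdentityDemo and the identity-based set are assumptions standing in for whatever bookkeeping a tracker uses.

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class SliceIdentityDemo {
    // Returns {sharesMemory, sliceIsTracked}.
    static boolean[] run() {
        // Identity-based set, assumed here as the tracker's bookkeeping.
        Set<ByteBuffer> tracked = Collections.newSetFromMap(new IdentityHashMap<>());

        ByteBuffer original = ByteBuffer.allocate(16);
        tracked.add(original);

        // slice() creates a new ByteBuffer object over the same backing memory.
        ByteBuffer slice = original.slice();
        slice.put(0, (byte) 42);

        boolean sharesMemory = original.get(0) == 42;   // write through the slice is visible
        boolean sliceIsTracked = tracked.contains(slice); // but the object itself is unknown
        return new boolean[] {sharesMemory, sliceIsTracked};
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("shares memory: " + result[0]);  // prints "shares memory: true"
        System.out.println("slice tracked: " + result[1]);  // prints "slice tracked: false"
    }
}
```

So if only the slice comes back to the caller, releasing it cannot be matched to the original allocation by object identity.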
Thanks for the context, @annimesh2809.
Why do you need to track the allocated buffers to be released later instead of simply giving the allocate and release methods of the ByteBufferAllocator instance to the related Hadoop API via the implementations of SeekableInputStream.readVectored? I assume the Hadoop code would release the allocated buffers as soon as they are not needed anymore.
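A rough sketch of what handing both methods over could look like. Everything here is hypothetical: the readVectored method below is a stand-in written for this example, not Hadoop's actual signature, and SimpleAllocator and CountingAllocator are invented names. It assumes the reader accepts an allocate function and a release callback and releases its own internal buffers.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.IntFunction;

class DelegatedReleaseDemo {
    // Hypothetical allocator contract, assumed for this sketch only.
    interface SimpleAllocator {
        ByteBuffer allocate(int size);
        void release(ByteBuffer buffer);
    }

    // Counts calls so the example can show that allocations and releases balance.
    static final class CountingAllocator implements SimpleAllocator {
        int allocated;
        int released;

        @Override
        public ByteBuffer allocate(int size) {
            allocated++;
            return ByteBuffer.allocate(size);
        }

        @Override
        public void release(ByteBuffer buffer) {
            released++;
        }
    }

    // Hypothetical stand-in for a vectored read that needs a scratch buffer internally.
    static List<ByteBuffer> readVectored(
            int[] lengths, IntFunction<ByteBuffer> allocate, Consumer<ByteBuffer> release) {
        ByteBuffer scratch = allocate.apply(8); // internal allocation the caller never sees
        List<ByteBuffer> results = new ArrayList<>();
        for (int length : lengths) {
            results.add(allocate.apply(length));
        }
        release.accept(scratch); // the reader releases what only it knows about
        return results;
    }

    // Returns {allocated, released}.
    static int[] run() {
        CountingAllocator allocator = new CountingAllocator();
        List<ByteBuffer> results =
                readVectored(new int[] {16, 32}, allocator::allocate, allocator::release);
        // The caller still releases the result buffers once it is done with them.
        results.forEach(allocator::release);
        return new int[] {allocator.allocated, allocator.released};
    }

    public static void main(String[] args) {
        int[] counts = run();
        System.out.println("allocated=" + counts[0] + ", released=" + counts[1]);
    }
}
```

With this shape the caller does not need to know about the scratch buffer at all, which is the appeal of delegating release. Whether the Hadoop code actually releases every internal buffer through such a callback is the open question in this thread.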