forked from martinetd/mooshika
Crash and hang fixes #2
Open: bthery wants to merge 2 commits into cea-hpc:master from bthery:hang-and-crash-fixes
Hm, I realize the old code (the other place where we check `comp_channel` before calling `msk_cq_delfd`) has the same problem, but I'm wondering if this is safe race-wise...
If the application calls `msk_destroy_trans` itself for one reason or another, I believe this could happen; I'd call `msk_destroy_qp` with `trans->cm_lock` taken and move the other `msk_cq_delfd` within the locked section as well. What do you think?
(In the first place, we probably don't need a different comp_channel for each child trans, now that I'm reading this code again; if we call ibv_create_cq with the "parent's" comp_channel it should just work. We'd need to shuffle code a bit and call ibv_get_cq_event to grab the cq and get the trans from its `cq_context`, though.)
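A minimal sketch of the locking idea above, assuming a simplified stand-in for the transport. `struct demo_trans`, `demo_teardown` and the two counters are hypothetical names used only for illustration; the counters mark the spots where `msk_cq_delfd` and `msk_destroy_qp` would be called, and nothing here is mooshika's actual code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the transport; not mooshika's real struct. */
struct demo_trans {
    pthread_mutex_t cm_lock;  /* plays the role of trans->cm_lock */
    void *comp_channel;       /* non-NULL while the completion channel is live */
    void *qp;                 /* non-NULL while the queue pair is live */
    int delfd_calls;          /* counts where msk_cq_delfd() would run */
    int destroy_qp_calls;     /* counts where msk_destroy_qp() would run */
};

/*
 * Check and release both resources inside one locked section, so that a
 * second caller (another thread, or a repeated destroy) finds the pointers
 * already cleared and does nothing.
 */
static void demo_teardown(struct demo_trans *t)
{
    assert(t != NULL);

    pthread_mutex_lock(&t->cm_lock);
    if (t->comp_channel != NULL) {
        t->delfd_calls++;         /* msk_cq_delfd() would go here */
        t->comp_channel = NULL;
    }
    if (t->qp != NULL) {
        t->destroy_qp_calls++;    /* msk_destroy_qp() would go here */
        t->qp = NULL;
    }
    pthread_mutex_unlock(&t->cm_lock);
}
```

The point of the sketch is only that the check and the release happen under the same lock; whether that is enough for the real code depends on who else reads `comp_channel` without holding it.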
Hi, I've just realized I left this PR on its own for a long time.
I don't remember exactly how the hang occurred. I think it was related to processes that fork.
(I have some code that adds some kind of fork support to mooshika. This is something we need to support in our app, even if fork should be avoided when using verbs.)
Concerning the crash, I don't know the internals of mooshika very well, but I tend to agree with you about calling msk_destroy_qp() and msk_cq_delfd() with the cm_lock held.
Hmm, I wonder how ib deals with forking when memory is locked and already registered... I have to admit I never tried.
Anyway, how do you want to proceed? Can you still reproduce this somehow, to test whether holding the cm_lock also fixes the problem?
Since I don't have the reproducer, I can't really add the lock and close this without knowing whether it really helps; but I'd rather not merge this one if the race gets fixed with a lock.
About fork() + ib, the general advice is: don't fork!
Anyway, there is the ibv_fork_init() interface that's supposed to help with fork() support. My understanding is that it keeps track of registered memory regions and marks them as "don't fork".
I have a branch in my mooshika repo that adds a helper to expose ibv_fork_init() (via msk_fork_init(), clever name, isn't it?) and adds a new interface, msk_lib_reset(), similar to the rdma_lib_reset() proposed by the Mellanox version of librdmacm, which can be called to reset the library's global state in a child process. I will create a PR for it sometime.
About the reproducer for the hang, I think it happened while I was doing tests around fork(), and msk_destroy_trans() was called twice for the same transport (probably once in the parent and once in the child). It was in some faulty code.
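To illustrate the "don't fork" marking mentioned above without needing any RDMA hardware, here is a small Linux-only sketch using plain madvise(MADV_DONTFORK), which is the mechanism the ibv_fork_init() man page refers to. The helper name `child_faults_on_page` is hypothetical and the code does not touch libibverbs or mooshika; it only shows what a child process sees when a page has been marked this way.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map one anonymous page, optionally mark it MADV_DONTFORK, fork, and let
 * the child read the page.
 * Returns 1 if the child was killed by SIGSEGV, 0 if it was not,
 * and -1 on a setup error.
 */
static int child_faults_on_page(int mark_dontfork)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    char *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset(buf, 0x5a, (size_t)page);

    if (mark_dontfork && madvise(buf, (size_t)page, MADV_DONTFORK) != 0) {
        munmap(buf, (size_t)page);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        /* Child: make sure SIGSEGV has its default action, then read. */
        signal(SIGSEGV, SIG_DFL);
        volatile char *p = buf;
        char c = *p;  /* faults if the mapping was not inherited */
        _exit(c == 0x5a ? 0 : 1);
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(buf, (size_t)page);
    if (waited < 0)
        return -1;
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) ? 1 : 0;
}
```

Calling `child_faults_on_page(1)` and `child_faults_on_page(0)` side by side shows the difference between a marked and an unmarked page. It also hints at why a child that keeps using the parent's transport state needs something like the proposed reset helper.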