Skip to content

Conversation

@bthery
Copy link

@bthery bthery commented Aug 23, 2018

This pull request fixes one crash and one hang observed while using Mooshika.

  • One crash in msk_cq_thread() which calls msk_cq_delfd() without checking if comp_channel is still there
  • One hang in msk_trans_destroy() in the case rdma_disconnect() failed. The trans state won't change (no cm event issued), then msk_trans_destroy() may wait forever on the msk_cond_wait() following the rdma_disconnect() call.

See commit logs for a bit more details.

bthery added 2 commits August 23, 2018 09:54
…sk_cq_delfd()

Similarly to what is done in msk_cma_event_handler(), check that
trans->comp_channel is still there before calling msk_cq_del_fd().
I had a crah where comp_channel was already freed when msk_cq_thread()
was still trying to access it.
msk_destroy_trans() can hang forever if rdma_disconnect() failed.
Indeed, if it failed for some reason, no cm disconnect event will
be received, and trans->state will not reach MSK_CLOSED.
Thread calling msk_destroy_trans() will then block on the msk_cond_wait()
desperatly waiting for a state change.

Today, msk_destroy_trans() doesn't return an error code. We only log errors.
Maybe it could be interesting in the future to change the interface
to be able to notify the caller when an error occured during
its execution.
Copy link
Collaborator

@martinetd martinetd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what caused rdma_disconnect to fail in the first place? some race with destroy? I'm worried that if there is such a race then we could get much worse than a hang, but e.g. issue the disconnect a reused FD instead... It would be good to understand the root cause and ideally fix that as well - we should both do this check and fix the reason it happens as well.

(Style-wise, please make sure the first line of your commit messages are <= 72 characters; github is particularily annoying with long subjects but it is good practice anyway.)

// closing trans, skip this, will be done on flush
msk_cq_delfd(trans);
if (trans->comp_channel)
msk_cq_delfd(trans);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I realize the old code (the other place where we check comp_channel before calling msk_cq_delfd) has the same problem, but I'm wondering if this is safe race-wise...
If the application calls msk_destroy_trans themselves for a reason or another I believe this could happen; I'd call msk_destroy_qp with trans->cm_lock taken and move the other msk_cq_delfd within the locked section as well. What do you think?

(In the first place we probably don't need a different comp_channel for each child trans now that I'm reading this code again, if we call ibv_create_cq with the "parent's" comp_channel it should just work.. We'd need to shuffle code a bit and call ibv_get_cq_event to grab the cq and get the trans from its cq_context though)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I've just realized I left this PR on its own for a long time.

I don't remember exactly how the hang occurs. I think it was related with processes that fork.
(I've some code for adding some kind of support for fork in mooshika (this is something we need to support in our app even if fork should be avoided when using verbs))

Concerning the crash, I don't know the internals of mooshika very well, but I tend to agree with you about doing the msk_destroy_qp() and msk_cq_delfd() with the cm_lock held.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I wonder how ib deals with forking when memory is locked and already registered... I'll have to admit I never tried.

Anyway, how do you want to proceed? Can you still reproduce somehow to test if holding the cm_lock also fixes the problem?
Given I don't have the reproducer I can't really go about adding the lock and closing this if I don't know if it really helps; but I'd rather not merge this one if the race gets fixed with a lock.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the fork() + ib, the general advice is: don't fork!
Anyway there is the interface ibv_fork_init() that's supposed to help with fork() support. My understanding is it keeps track of registered memory regions and mark them as "don't fork".

I've a branch in my mooshika repo which adds an helper to expose ibv_fork_init() (via msk_fork_init(), clever name, isn't it) and adds a new interface msk_lib_reset() similar to what is proposed by the Mellanox version of librdmacm: rdma_lib_reset(), which can be called to reset the library global state in a child process. I will create a PR for it sometime.

About the reproducer for the hang, I think it happens while I was doing tests around the fork(), and msk_destroy_trans() was called twice for the same transport (probably once in the parent and once in the child). It was in some faulty code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants