Crash and hang fixes #2

bthery · 2018-08-23T08:30:31Z

This pull request fixes one crash and one hang observed while using Mooshika.

One crash in msk_cq_thread() which calls msk_cq_delfd() without checking if comp_channel is still there
One hang in msk_trans_destroy() in the case rdma_disconnect() failed. The trans state won't change (no cm event issued), then msk_trans_destroy() may wait forever on the msk_cond_wait() following the rdma_disconnect() call.

See commit logs for a bit more details.

…sk_cq_delfd() Similarly to what is done in msk_cma_event_handler(), check that trans->comp_channel is still there before calling msk_cq_del_fd(). I had a crah where comp_channel was already freed when msk_cq_thread() was still trying to access it.

msk_destroy_trans() can hang forever if rdma_disconnect() failed. Indeed, if it failed for some reason, no cm disconnect event will be received, and trans->state will not reach MSK_CLOSED. Thread calling msk_destroy_trans() will then block on the msk_cond_wait() desperatly waiting for a state change. Today, msk_destroy_trans() doesn't return an error code. We only log errors. Maybe it could be interesting in the future to change the interface to be able to notify the caller when an error occured during its execution.

martinetd

Do you know what caused rdma_disconnect to fail in the first place? some race with destroy? I'm worried that if there is such a race then we could get much worse than a hang, but e.g. issue the disconnect a reused FD instead... It would be good to understand the root cause and ideally fix that as well - we should both do this check and fix the reason it happens as well.

(Style-wise, please make sure the first line of your commit messages are <= 72 characters; github is particularily annoying with long subjects but it is good practice anyway.)

martinetd · 2018-08-27T02:14:36Z

src/trans_rdma.c

 				// closing trans, skip this, will be done on flush
-				msk_cq_delfd(trans);
+				if (trans->comp_channel)
+					msk_cq_delfd(trans);


hm, I realize the old code (the other place where we check comp_channel before calling msk_cq_delfd) has the same problem, but I'm wondering if this is safe race-wise...
If the application calls msk_destroy_trans themselves for a reason or another I believe this could happen; I'd call msk_destroy_qp with trans->cm_lock taken and move the other msk_cq_delfd within the locked section as well. What do you think?

(In the first place we probably don't need a different comp_channel for each child trans now that I'm reading this code again, if we call ibv_create_cq with the "parent's" comp_channel it should just work.. We'd need to shuffle code a bit and call ibv_get_cq_event to grab the cq and get the trans from its cq_context though)

Hi, I've just realized I left this PR on its own for a long time.

I don't remember exactly how the hang occurs. I think it was related with processes that fork.
(I've some code for adding some kind of support for fork in mooshika (this is something we need to support in our app even if fork should be avoided when using verbs))

Concerning the crash, I don't know the internals of mooshika very well, but I tend to agree with you about doing the msk_destroy_qp() and msk_cq_delfd() with the cm_lock held.

Hmm, I wonder how ib deals with forking when memory is locked and already registered... I'll have to admit I never tried.

Anyway, how do you want to proceed? Can you still reproduce somehow to test if holding the cm_lock also fixes the problem?
Given I don't have the reproducer I can't really go about adding the lock and closing this if I don't know if it really helps; but I'd rather not merge this one if the race gets fixed with a lock.

About the fork() + ib, the general advice is: don't fork!
Anyway there is the interface ibv_fork_init() that's supposed to help with fork() support. My understanding is it keeps track of registered memory regions and mark them as "don't fork".

I've a branch in my mooshika repo which adds an helper to expose ibv_fork_init() (via msk_fork_init(), clever name, isn't it) and adds a new interface msk_lib_reset() similar to what is proposed by the Mellanox version of librdmacm: rdma_lib_reset(), which can be called to reset the library global state in a child process. I will create a PR for it sometime.

About the reproducer for the hang, I think it happens while I was doing tests around the fork(), and msk_destroy_trans() was called twice for the same transport (probably once in the parent and once in the child). It was in some faulty code.

bthery added 2 commits August 23, 2018 09:54

martinetd reviewed Aug 27, 2018

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Crash and hang fixes #2

Crash and hang fixes #2

Uh oh!

bthery commented Aug 23, 2018

Uh oh!

martinetd left a comment

Uh oh!

martinetd Aug 27, 2018

Uh oh!

bthery May 9, 2019

Uh oh!

martinetd May 13, 2019

Uh oh!

bthery May 17, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Crash and hang fixes #2

Are you sure you want to change the base?

Crash and hang fixes #2

Uh oh!

Conversation

bthery commented Aug 23, 2018

Uh oh!

martinetd left a comment

Choose a reason for hiding this comment

Uh oh!

martinetd Aug 27, 2018

Choose a reason for hiding this comment

Uh oh!

bthery May 9, 2019

Choose a reason for hiding this comment

Uh oh!

martinetd May 13, 2019

Choose a reason for hiding this comment

Uh oh!

bthery May 17, 2019

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants