-
Notifications
You must be signed in to change notification settings - Fork 61
Open
Labels
bugNewly identified bugNewly identified bug
Description
What's happening now can be reproduced on main by adding Process.sleep(10_000) right after line 39 of worker_channel.ex then running any run. Here's how the bug works:
- The
ws-workerjoins the worker channel and requests work via the "claim" API. - Every once in a while, the DB poops itself and the query takes a very long time. (This is what we replicate with the
Process.sleep) - The
ws-workergoes away, after telling us✘ TIMEOUT on claim. Runs may be lost. - The DB query then finishes, setting the run to
:claimedand replying to Nobody At All that it's got a run ready to be executed. - Nobody At All does anything (the worker is long gone at this point) and the run sits there as
:claimedfor N minutes—controlled by the Janitor's grace period. - The Janitor (after N minutes) comes by and marks this run as
:lost, secretly giving it the status "LostAfterClaim". - We tell the user that we don't know what happened, apologizing profusely for losing the run and saying, "Sorry, we have no more info and there's nothing you can do."
Really, it seems like there is something (or there are a couple of somethings) we can do here:
- In the
worker_channel, Make the socket timeout GREATER OR EQUAL to the query timeout so that we don't shut the door before the positive reply comes. - In the
worker_channelCheck if the connection to the worker is still alive before we{:reply, {:ok, %{runs: runs}}, socket}. If it's not still alive, then roll back the transaction and/or (more simply) set the run state back to:availableso another worker can come along and claim it. - In the
janitor.ex, don't set runs to:lost(withLostAfterClaim), set them back to:availableso they get picked up. If we're certain that the run hasn't been started (:claimedis a different state than:started) then simply put it back in the queue... it's not lost, it's still waiting for some worker to work on it.
@stuartc , @josephjclark , @rorymckinley , a penny for your thoughts here? Of the last 230 lost runs we know of, 212 were LostAfterClaim and it seems that implementing any/all of these three enhancements might put a real dent in that.
I've opened an illustrative fix for concept 2 here: #3566 - if it has legs I'll write tests. so far just testing manually and it works a treat.
Metadata
Metadata
Assignees
Labels
bugNewly identified bugNewly identified bug
Type
Projects
Status
In review