Skip to content

Runs are marked as lost when they should be re-enqueued #3565

@taylordowns2000

Description

@taylordowns2000

What's happening now can be reproduced on main by adding Process.sleep(10_000) right after line 39 of worker_channel.ex then running any run. Here's how the bug works:

  1. The ws-worker joins the worker channel and requests work via the "claim" API.
  2. Every once in a while, the DB poops itself and the query takes a very long time. (This is what we replicate with the Process.sleep)
  3. The ws-worker goes away, after telling us ✘ TIMEOUT on claim. Runs may be lost.
  4. The DB query then finishes, setting the run to :claimed and replying to Nobody At All that it's got a run ready to be executed.
  5. Nobody At All does anything (the worker is long gone at this point) and the run sits there as :claimed for N minutes—controlled by the Janitor's grace period.
  6. The Janitor (after N minutes) comes by and marks this run as :lost, secretly giving it the status "LostAfterClaim".
  7. We tell the user that we don't know what happened, apologizing profusely for losing the run and saying, "Sorry, we have no more info and there's nothing you can do."

Really, it seems like there is something (or there are a couple of somethings) we can do here:

  1. In the worker_channel, Make the socket timeout GREATER OR EQUAL to the query timeout so that we don't shut the door before the positive reply comes.
  2. In the worker_channel Check if the connection to the worker is still alive before we {:reply, {:ok, %{runs: runs}}, socket}. If it's not still alive, then roll back the transaction and/or (more simply) set the run state back to :available so another worker can come along and claim it.
  3. In the janitor.ex, don't set runs to :lost (with LostAfterClaim), set them back to :available so they get picked up. If we're certain that the run hasn't been started (:claimed is a different state than :started) then simply put it back in the queue... it's not lost, it's still waiting for some worker to work on it.

@stuartc , @josephjclark , @rorymckinley , a penny for your thoughts here? Of the last 230 lost runs we know of, 212 were LostAfterClaim and it seems that implementing any/all of these three enhancements might put a real dent in that.

I've opened an illustrative fix for concept 2 here: #3566 - if it has legs I'll write tests. so far just testing manually and it works a treat.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugNewly identified bug

    Type

    No type

    Projects

    Status

    In review

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions