Runs are marked as lost when they should be re-enqueued #3565

The bug can be reproduced on `main` by adding `Process.sleep(10_000)` right after line 39 of `worker_channel.ex` and then executing any run. Here's how the bug works:

1. The `ws-worker` joins the worker channel and requests work via the "claim" API.
2. Every once in a while, the DB poops itself and the query takes a very long time. (This is what we replicate with the `Process.sleep`.)
3. The `ws-worker` goes away, after telling us [`✘ TIMEOUT on claim. Runs may be lost.`](https://github.com/OpenFn/kit/blob/main/packages/ws-worker/src/api/claim.ts#L119-L122)
4. The DB query _then_ finishes, setting the run to `:claimed` and replying to **_Nobody At All_** that it's got a run ready to be executed.
5. **_Nobody At All_** does anything (the worker is long gone at this point) and the run sits there as `:claimed` for N minutes—controlled by the Janitor's grace period.
6. The Janitor (after N minutes) comes by and marks this run as `:lost`, secretly giving it the status "LostAfterClaim".
7. We tell the user that we don't know what happened, apologizing profusely for losing the run and saying, "Sorry, we have no more info and there's nothing you can do."
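
To make the timing concrete, here is a rough model of the sequence above, written as a plain TypeScript function. It is only a sketch: `simulateClaim`, its parameters, and the numbers in the usage note are made up for illustration and are not taken from Lightning or `ws-worker`. The happy path beyond the claim (the run actually starting) is not modeled.

```typescript
// Hypothetical model of the claim race; none of these names exist in Lightning or ws-worker.
type RunState = "available" | "claimed" | "lost";

interface Outcome {
  workerSaw: "run" | "timeout";
  runState: RunState;
}

// The claim request is sent at t = 0. Returns what the worker saw and the
// state of the run when we look at it at t = observeAtMs.
function simulateClaim(
  queryMs: number, // how long the claim query takes
  workerTimeoutMs: number, // how long the worker waits for a reply
  graceMs: number, // the Janitor's grace period (the "N minutes" above)
  observeAtMs: number
): Outcome {
  // Steps 1-3: the worker gives up if the reply takes longer than its timeout.
  const workerSaw = queryMs <= workerTimeoutMs ? "run" : "timeout";

  let runState: RunState = "available";

  // Step 4: the query finishes and claims the run whether or not anyone is listening.
  if (observeAtMs >= queryMs) {
    runState = "claimed";
  }

  // Steps 5-6: nobody starts the run, so the Janitor marks it lost after the grace period.
  if (workerSaw === "timeout" && observeAtMs >= queryMs + graceMs) {
    runState = "lost";
  }

  return { workerSaw, runState };
}
```

With an assumed 5-second worker timeout and a 10-minute grace period, `simulateClaim(10_000, 5_000, 600_000, 10_000)` gives a worker that saw `"timeout"` and a run that is `"claimed"`. Observing the same run at `610_000` gives `"lost"`, even though no worker ever touched it.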

Really, it seems like there _is_ something (or there are a couple of somethings) we can do here:

1. In the `worker_channel`, make the socket timeout GREATER THAN OR EQUAL TO the query timeout so that we don't shut the door before the positive reply comes.
2. In the `worker_channel`, check if the connection to the worker is still alive before we `{:reply, {:ok, %{runs: runs}}, socket}`. If it's not still alive, then roll back the transaction and/or (more simply) set the run state back to `:available` so another worker can come along and claim it.
3. In `janitor.ex`, don't set runs to `:lost` (with `LostAfterClaim`); set them back to `:available` so they get picked up. If we're **_certain_** that the run hasn't been started (`:claimed` is a different state than `:started`) then simply put it back in the queue... it's not lost, it's still waiting for some worker to work on it.
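
Options 2 and 3 can be sketched the same way. This is a language-agnostic model in TypeScript, not the Elixir that would actually live in `worker_channel.ex` or `janitor.ex`; `Run`, `replyOrRelease`, `janitorSweep`, and `claimedAtMs` are all hypothetical names, and the real change would also need to handle the transaction and any locking.

```typescript
// Hypothetical model of options 2 and 3; these names are illustrative only.
type RunState = "available" | "claimed" | "started" | "lost";

interface Run {
  id: string;
  state: RunState;
  claimedAtMs: number | null;
}

// Option 2: the claim query has already set the run to "claimed". Reply only
// if the worker is still connected; otherwise release the run so another
// worker can claim it.
function replyOrRelease(run: Run, workerAlive: boolean): Run | null {
  if (workerAlive) {
    return run; // this is the run we'd send back in the reply
  }
  run.state = "available";
  run.claimedAtMs = null;
  return null; // nobody to reply to
}

// Option 3: instead of marking stale claimed runs as lost, the Janitor puts
// them back in the queue. Started runs are left alone here, because we can't
// be certain they did no work. Returns how many runs were re-enqueued.
function janitorSweep(runs: Run[], nowMs: number, graceMs: number): number {
  let requeued = 0;
  for (const run of runs) {
    const stale =
      run.state === "claimed" &&
      run.claimedAtMs !== null &&
      nowMs - run.claimedAtMs > graceMs;
    if (stale) {
      run.state = "available";
      run.claimedAtMs = null;
      requeued += 1;
    }
  }
  return requeued;
}
```

The two are complementary in this model: `replyOrRelease` catches the orphaned claim right away, while `janitorSweep` is the backstop for any claimed run that slips through (for example, if the worker dies just after the reply is sent).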

@stuartc, @josephjclark, @rorymckinley, a penny for your thoughts here? Of the last 230 lost runs we know of, 212 were `LostAfterClaim`, and it seems that implementing any/all of these three enhancements might put a real dent in that.

I've opened an illustrative fix for concept 2 here: #3566. If it has legs, I'll write tests. So far I've only tested it manually, and it works a treat.
