Handle Dask worker-loss task failures #21673
harsh21234i wants to merge 3 commits into PrefectHQ:main
Conversation
hi @harsh21234i - the enthusiasm is much appreciated! but i think taking time to reproduce and empirically iterate on these issues will be the most productive use of time. we at prefect also make heavy use of agent harnesses, but they often fail to reckon with the nuance that these issues require. you seem to be hoisting details of the engine into the task runner implementation, which seems like a bit of a design leak here. piping my response here directly back into claude et al will probably not yield a desirable outcome. let me know if you think this is unfair, i.e. if this was all very intentionally designed. happy to iterate with you on this
some suggestions on how you might iterate on this: reproduce the failure and pin down which branch we're actually in - (a) the wrapped future ends in an exception, (b) it returns something that isn't a State, or (c) it hangs.
mikicz's reply in #21638 - "no terminal exception in any log, just OOM restarts" - doesn't rule out (b) or (c). pinning which branch we're in should determine the shape of the fix and the layer it belongs at
I agree the right next step is to reproduce this more faithfully and pin down which branch we're actually in before changing the fix shape or layer. I had been assuming the wrapped future was ending in an exception path, but you're right that the reporter's logs don't rule out a non-State return or a hang. I'll reproduce it against a constrained Dask cluster and check what the wrapped future actually does in that scenario.
sounds great! thanks
@zzstoatzz I reproduced it more faithfully on main with a constrained Dask cluster and confirmed we were in the path where the wrapped future ends in a scheduler-side exception (KilledWorker) rather than returning a Prefect State. The updated fix keeps prefect-dask narrow: the Dask future wrapper now just records unexpected wrapped-future exceptions as a terminal Crashed state instead of swallowing them. I also replaced the earlier mock-heavy tests with a more faithful regression that exhausts worker failures under a constrained cluster and asserts the Prefect future resolves to CRASHED.
hey @zzstoatzz, have you gone through this? is it good now?
Hey @desertaxle
Summary
This fixes `prefect-dask` so task runs are not left in `RUNNING` when Dask ultimately fails a task at the scheduler level after repeated worker loss. When the wrapped Dask future raises a scheduler-side exception such as `KilledWorker`, Prefect now converts that into a terminal task-run state instead of swallowing the exception and falling back to a stale API state.
Fixes #21638.
What changed
- Updated `PrefectDaskFuture.wait()` to treat unexpected Dask future exceptions as terminal task failures instead of returning without a final state
- Updated `PrefectDaskFuture.result()` to do the same when the wrapped future raises during result retrieval
- Both paths now convert the exception into a `Crashed` task-run state using Prefect’s existing exception-to-state logic (see the sketch below)
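For illustration only, here is a rough sketch of the shape this change could take. It is not the actual diff: the attribute names `wrapped_future` and `_final_state` are taken from the description above, `exception_to_crashed_state` is the existing Prefect helper that the "exception-to-state logic" refers to, and the `asyncio.run(...)` call stands in for whatever sync/async bridging the real code uses.

```python
import asyncio
from typing import Optional

import distributed

from prefect.states import State, exception_to_crashed_state


class PrefectDaskFuture:
    # Partial sketch: only the wait() path is shown. The real class wraps a
    # distributed.Future in `self.wrapped_future` and stores the terminal
    # Prefect state in `self._final_state` (names assumed from the description).

    def wait(self, timeout: Optional[float] = None) -> None:
        try:
            result = self.wrapped_future.result(timeout=timeout)
        except distributed.TimeoutError:
            # A timeout just means the future is not done yet; leave state unset.
            return
        except Exception as exc:
            # Scheduler-side failures such as distributed.KilledWorker were
            # previously swallowed here; convert them into a terminal Crashed
            # state instead so downstream work is not blocked forever.
            self._final_state = asyncio.run(exception_to_crashed_state(exc))
            return
        if isinstance(result, State):
            self._final_state = result
```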
Why

The issue report describes a Dask cluster where workers are repeatedly killed under memory pressure until Dask gives up. In that case, the wrapped `distributed` future can end in a scheduler exception like `KilledWorker` instead of returning a Prefect `State`.

Previously, the `prefect-dask` future wrapper would swallow that exception in `.wait()`, leave `_final_state` unset, and later read the task run state back from the API. If that API row was still `RUNNING`, Prefect would treat the task as non-terminal and downstream dependencies could remain blocked indefinitely.
With this change, scheduler-level Dask failures are translated into a terminal Prefect state immediately.
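As a concrete illustration of the scenario (a hedged sketch, not taken from the issue report; the memory limit, worker count, and task body are invented for the example), a flow like the following on a memory-constrained local cluster could previously leave the task stuck in `RUNNING`, and should now finish with a `Crashed` task run once the scheduler gives up with `KilledWorker`:

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner


@task
def hog_memory() -> int:
    # Allocate well past the 256 MiB worker limit; the nanny kills the worker,
    # Dask retries on a fresh worker, and the scheduler eventually raises
    # KilledWorker once its allowed-failures budget is exhausted.
    data = bytearray(1024 * 1024 * 1024)
    return len(data)


@flow(
    task_runner=DaskTaskRunner(
        cluster_kwargs={
            "n_workers": 1,
            "threads_per_worker": 1,
            "memory_limit": "256MiB",
        }
    )
)
def constrained_flow():
    future = hog_memory.submit()
    # Before this change the task run could stay RUNNING indefinitely here;
    # now the future resolves to a terminal Crashed state.
    return future.result(raise_on_failure=False)


if __name__ == "__main__":
    constrained_flow()
```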
Tests
Added regression coverage in `src/integrations/prefect-dask/tests/test_task_runners.py` for:

- a task where `KeyboardInterrupt` results in Dask worker loss and the Prefect future still resolves to `CRASHED` (a rough sketch of a test with this shape follows below)
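For reference, a regression test of this shape might look roughly like the following. This is a sketch only, not the test added by this PR: it forces worker loss by killing the worker process from inside the task (POSIX-only) rather than via `KeyboardInterrupt`, drops the scheduler's allowed-failures budget to zero so the scheduler gives up quickly, and assumes the usual `PrefectFuture.state` accessor.

```python
import os
import signal

import dask

from prefect import flow, task
from prefect_dask import DaskTaskRunner


@task
def die_with_worker():
    # Kill the hosting Dask worker process outright (POSIX-only); the scheduler
    # gives up on the task once its allowed-failures budget is exhausted.
    os.kill(os.getpid(), signal.SIGKILL)


def test_worker_loss_resolves_to_crashed():
    with dask.config.set({"distributed.scheduler.allowed-failures": 0}):

        @flow(
            task_runner=DaskTaskRunner(
                cluster_kwargs={"n_workers": 1, "processes": True}
            )
        )
        def doomed_flow() -> bool:
            future = die_with_worker.submit()
            future.wait()
            # Return a plain bool so Prefect does not treat the returned State
            # as the flow's own final state.
            return future.state.is_crashed()

        assert doomed_flow() is True
```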
Validation

- `uv run --project src/integrations/prefect-dask ruff check src/integrations/prefect-dask/prefect_dask/task_runners.py src/integrations/prefect-dask/tests/test_task_runners.py`
- `uv run --project src/integrations/prefect-dask ruff format --check src/integrations/prefect-dask/prefect_dask/task_runners.py src/integrations/prefect-dask/tests/test_task_runners.py`
- `uv run --project src/integrations/prefect-dask pytest src/integrations/prefect-dask/tests/test_task_runners.py -k "PrefectDaskFuture or wait_captures_exceptions_as_crashed_state"`

Results: