Director-v2 restarts a computational job if it loses and regains its connection to a private cluster for a short time #6793

@sanderegg

Scenario

  1. A computational pipeline is started on a private cluster.
  2. The pipeline is scheduled on the private cluster, e.g. the task status is set to STARTED.
  3. For a short time, the dask-scheduler on the private cluster is not reachable.
  4. During that time, dv-2 checks the task status via its dask client, fails to connect, and gets UNKNOWN back.
  5. dv-2 sets the task back to WAITING_FOR_CLUSTER.
  6. On the next iteration of the scheduler, the dask-scheduler is reachable again.

--> dv-2 does not check whether the task is already running and starts it again, because it does not handle this use case.
--> The private cluster runs the task twice, potentially running it longer than needed and wasting time and money.
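
One possible guard, as a minimal sketch only: when the cluster becomes reachable again, a task in WAITING_FOR_CLUSTER that already carries a job id from an earlier submission could be re-attached instead of resubmitted. Everything below (`Task`, `TaskState`, `reconcile_on_reconnect`, `jobs_known_to_cluster`) is hypothetical and written for illustration. It is not dv-2's actual code and it does not use the dask API. How the set of job ids known to the cluster is obtained is left open.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class TaskState(Enum):
    # Only the two states needed for this illustration (hypothetical enum).
    WAITING_FOR_CLUSTER = "WAITING_FOR_CLUSTER"
    STARTED = "STARTED"


@dataclass
class Task:
    state: TaskState
    # Hypothetical field: job id recorded when the task was first submitted.
    job_id: Optional[str] = None


def reconcile_on_reconnect(task: Task, jobs_known_to_cluster: Set[str]) -> str:
    """Decide what to do with a task once the cluster is reachable again.

    Returns "reattach" if the cluster already knows the task's job,
    "submit" if the task really needs to be started,
    and "noop" if the task is not waiting for the cluster.
    """
    if task.state is not TaskState.WAITING_FOR_CLUSTER:
        return "noop"
    if task.job_id is not None and task.job_id in jobs_known_to_cluster:
        # The job survived the connection loss: restore the state, do not resubmit.
        task.state = TaskState.STARTED
        return "reattach"
    return "submit"
```

With this kind of check, step 6 of the scenario would end in "reattach" for a task whose job is still present on the cluster, and only tasks the cluster does not know about would be submitted.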

Labels

High Priority: a totally crucial bug/feature to be fixed asap
bug: buggy, it does not work as expected