Bug
When using `errorStrategy 'finish'` with the Azure Batch executor, Nextflow hangs indefinitely if worker nodes become unhealthy (e.g., `NodeNotReady`) while tasks are running. The JVM never exits. The only recovery is `kill -9` or Ctrl+C (which triggers `abort()` via signal handler, bypassing the bug).
This likely also affects Google Batch. AWS Batch is partially protected by its service-level node management. Kubernetes is immune due to explicit `NodeTerminationException` handling.
Root Cause
The hang is caused by a design gap in `Session.cancel()` — the shutdown path used by `errorStrategy 'finish'`.
When a task fails and exhausts `maxRetries`, `Session.fault()` routes to `cancel()` (not `abort()`) for the FINISH error strategy. `cancel()` sets `cancelled=true` and forces `processesBarrier`, but critically:
- Does not force `monitorsBarrier`
- Does not call `shutdown0()` (which runs cleanup hooks, including `TaskPollingMonitor.cleanup()`)
This creates a deadlock:
- `Session.await()` blocks on `monitorsBarrier.awaitCompletion()` — a `while(true)` loop with no timeout
- `monitorsBarrier` waits for `TaskPollingMonitor.pollLoop()` to exit and call `arrive()`
- `pollLoop()`'s break condition when cancelled is `session.isCancelled() && runningQueue.size() == 0`
- Tasks on dead nodes remain in ACTIVE/RUNNING state (Azure hasn't marked them COMPLETED)
- `runningQueue` never drains → `pollLoop()` never exits → `monitorsBarrier` never completes
- `Session.destroy()` calls `shutdown0()`, but only after `await()` returns — which it never does
- `TaskPollingMonitor.cleanup()` (which kills remaining tasks) is registered as a shutdown hook and never runs
By contrast, `Session.abort()` forces both barriers and calls `shutdown0()`, so it never hangs.
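The deadlock can be condensed into a small, self-contained Groovy simulation (illustrative only — a `CountDownLatch` stands in for Nextflow's `Barrier`, and the loop mirrors the `pollLoop()` break condition described above, not the actual source):

```groovy
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.CountDownLatch

// Simulation of the hang — NOT Nextflow source code. Names mirror the
// classes involved; a CountDownLatch stands in for Nextflow's Barrier.
def monitorsBarrier = new CountDownLatch(1)
def runningQueue    = new ConcurrentLinkedQueue<String>()
def cancelled       = false

// A task stuck on an unhealthy node: the service never marks it
// COMPLETED, so it is never evicted from the running queue
runningQueue << 'task-on-dead-node'

// pollLoop(): arrives at the barrier only once the queue drains
Thread.start {
    while( true ) {
        if( cancelled && runningQueue.isEmpty() )
            break
        Thread.sleep(100)
    }
    monitorsBarrier.countDown()   // arrive() — never reached
}

// errorStrategy 'finish' → cancel(): flags the session as cancelled but,
// unlike abort(), does not force the barrier or run shutdown hooks
cancelled = true

// await(): blocks forever — this is where the real run hangs
println 'waiting on monitorsBarrier...'
monitorsBarrier.await()
println 'never printed'
```

Running this script hangs at `monitorsBarrier.await()`, which is the shape of the real hang; forcing the barrier regardless of the queue, as `abort()` does, releases it.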
Clarification on 409 Errors
The `Unable to cleanup batch task` warnings (HTTP 409 / `NodeNotReady`) visible in logs are a cosmetic issue, not the hang cause. These come from `deleteTask()` for tasks that did complete but whose cleanup failed. Those tasks are correctly evicted from the queue. The hang is caused by tasks that never reach COMPLETED state.
Steps to Reproduce
- Run a Nextflow pipeline with `errorStrategy 'finish'` and `maxRetries` on Azure Batch (a minimal sketch follows this list)
- Have one task fail (triggering the FINISH error strategy)
- While other tasks are still running, cause the Azure Batch node to become unhealthy (e.g., node preemption, hardware failure)
- Observe: Nextflow logs show the failed task but never exits. The process hangs indefinitely.
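A minimal pipeline for the first two steps, as a sketch (process names and commands are illustrative, not taken from a real failing pipeline; it assumes the Azure Batch executor is configured separately in `nextflow.config`):

```groovy
// main.nf — minimal reproduction sketch. Requires Azure Batch configured
// in nextflow.config, e.g. process.executor = 'azurebatch'

process failingTask {
    errorStrategy 'finish'
    maxRetries 1

    script:
    '''
    exit 1    # fails, exhausts retries, triggers the FINISH shutdown path
    '''
}

process longRunningTask {
    errorStrategy 'finish'

    script:
    '''
    sleep 3600    # still RUNNING when the node is made unhealthy (step 3)
    '''
}

workflow {
    failingTask()
    longRunningTask()
}
```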
Expected Behavior
Nextflow should eventually exit after a reasonable timeout, killing remaining tasks that cannot complete.
Suggested Fixes
Primary fix — add a cancel timeout to the framework (fixes all executors):
Add a configurable timeout to the `cancel()` shutdown path. If `monitorsBarrier` doesn't complete within the timeout, escalate to `abort()` or `forceTermination()`. This could be implemented as a timeout parameter on `Barrier.awaitCompletion()` or a watchdog thread in `Session.await()`. One possible shape is sketched below.
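A sketch of the watchdog variant, under stated assumptions: `finishTimeout` and `escalateToAbort` are hypothetical names, and a `CountDownLatch` stands in for Nextflow's `Barrier` class — this is not the actual `Session` code:

```groovy
// Watchdog sketch for the FINISH shutdown path — illustrative only.
import java.time.Duration
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

void awaitWithEscalation(CountDownLatch monitorsBarrier,
                         Duration finishTimeout,
                         Closure escalateToAbort) {
    // Bounded wait instead of the unbounded awaitCompletion() loop
    boolean completed = monitorsBarrier.await(
            finishTimeout.toMillis(), TimeUnit.MILLISECONDS )
    if( !completed ) {
        // Remaining tasks can never complete (e.g. stuck on an unhealthy
        // node) — escalate so shutdown0() and the cleanup hooks,
        // including TaskPollingMonitor.cleanup(), still run
        escalateToAbort.call()
    }
}

// Hypothetical wiring: escalate after 30 minutes without completion
// awaitWithEscalation(barrier, Duration.ofMinutes(30)) { session.abort(cause) }
```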
Defense-in-depth — detect unhealthy nodes in cloud handlers:
Azure and Google Batch handlers should detect tasks stuck on unhealthy nodes and mark them as failed, similar to how `K8sTaskHandler` handles `NodeTerminationException`. In `AzBatchTaskHandler.taskState0()`, check node health when a task remains ACTIVE/RUNNING beyond a threshold and treat unhealthy-node tasks as failed. A rough sketch of that check follows.
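In this sketch, `fetchTaskState()`, `isNodeUnhealthy()`, and the 5-minute threshold are hypothetical placeholders, not real Nextflow or Azure SDK APIs:

```groovy
// Defense-in-depth sketch — NOT real Nextflow or Azure SDK code
import java.time.Duration
import java.time.Instant

enum TaskState { ACTIVE, RUNNING, COMPLETED, FAILED }

// Stubs: a real fix would query the Azure Batch service here
TaskState fetchTaskState(String taskId) { TaskState.RUNNING }
boolean isNodeUnhealthy(String taskId)  { true }

TaskState taskState0(String taskId, Instant lastStateChange) {
    def state    = fetchTaskState(taskId)
    def stuckFor = Duration.between(lastStateChange, Instant.now())
    def stuck    = state in [TaskState.ACTIVE, TaskState.RUNNING] &&
                   stuckFor > Duration.ofMinutes(5)   // illustrative threshold
    if( stuck && isNodeUnhealthy(taskId) ) {
        // Treat the task as failed so it leaves the running queue and
        // the poll loop can terminate under errorStrategy 'finish'
        return TaskState.FAILED
    }
    return state
}

// e.g. taskState0('task-1', Instant.now().minus(Duration.ofMinutes(10)))
// returns FAILED when the node is unhealthy
```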
Affected Versions
Believed to affect all current versions. The `cancel()` / `abort()` asymmetry has been present since the FINISH error strategy was introduced.
Environment
- Executor: Azure Batch (confirmed), Google Batch (likely), AWS Batch (less likely due to service-level node management)
- Kubernetes: not affected (has explicit node failure handling)