Airflow tasks hang in 'Running' state #19587
Replies: 8 comments 8 replies
-
Thanks for opening your first issue here! Be sure to follow the issue template!
-
I think this is something with your tasks. It looks like they are simply hanging while doing whatever they are doing - so they are not "stuck", they are simply "running". I think you need to take a close look at your tasks and see whether they are doing something you do not expect them to do - for example, not closing a connection and waiting, etc. Or maybe they are not starting at all but are waiting for a lock that is never granted to them. You will need to share more details with us about what your tasks are doing and check some details about them. I will convert this to a discussion until we get more info about it.
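One way to see where a task that is "running" but silent is actually blocked is to dump the Python stacks of the stuck process. A minimal sketch using only the standard library - registering the handler inside your own task/operator code is an assumption here, it is not an Airflow feature:

```python
import faulthandler
import signal

# Somewhere in your own task/operator code (an assumption, not an Airflow hook):
# register a handler that dumps every thread's Python stack on SIGUSR1.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Then, on the worker host, find the PID of the stuck task process (e.g. via
# `ps ax`) and send it the signal:
#     kill -USR1 <pid>
# The stack traces are written to stderr and should end up in the task log,
# showing whether the task is waiting on a connection, a lock, etc.
```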
-
OK, we'll try adding further logging to our tasks to see if there's something strange happening that causes them to hang, but based on the extra logging we've added so far it seems that our code is not being executed before the task gets stuck. Quick question - is it normal to have two task supervisors for a single worker? Our first thought when we saw that was that maybe one of the supervisors was waiting for notification from the task, while the task was trying to inform (or had informed) the other supervisor that it was ready to proceed. I'll update the discussion with any further logging that we can obtain. Thanks.
-
After running a few more DAGs today we managed to catch one where the task that had hung was an EmrStepSensor. The logged output appears to show that the task stopped at the same point, or at least somewhere after the same log statement. The process list again shows a single worker and two supervisors, but no runner for the task. I don't think this task calls any of our code, as it should just be checking on the state of a step in an EMR job flow - is that correct?
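For reference, the sensor in question is normally wired up roughly like this (a sketch based on the amazon provider around 2.3.0; the DAG id, task ids and the XCom references are made up for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor

# Hypothetical DAG wiring for illustration only; job_flow_id and step_id would
# usually come from XCom values pushed by the cluster/step-creating tasks.
with DAG("emr_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull('add_step', key='return_value')[0] }}",
        aws_conn_id="aws_default",
    )
```

If the sensor really only polls EMR between poke intervals, a hang at this point would have to come from the provider/boto3/logging layers rather than from the DAG author's code.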
-
After another day of running DAGs to try and figure out how to reproduce the issue we've drawn a few conclusions:
So I have three questions:
Thanks.
ps output
-
Our latest investigations appear to point towards this being caused by enabling remote logging to CloudWatch. We've run the DAG with remote logging disabled entirely and also configured it to log to S3, and haven't seen a task hang with either of those configurations so far today. We've seen well over 50 successful runs with remote logging disabled and something like 17 with S3 logging enabled, which is reasonably positive given that we managed to build a DAG that would fail roughly 1 in 3 times. Any thoughts on what to do next?
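For anyone trying to reproduce the comparison, the three setups being contrasted differ only in the logging configuration, roughly as below (a sketch following the Airflow 2.2 / Amazon provider docs; the bucket name, log group ARN and connection id are placeholders):

```ini
[logging]
# (a) remote logging disabled entirely
remote_logging = False

# (b) remote logging to S3 -- use instead of (a)
# remote_logging = True
# remote_log_conn_id = aws_default
# remote_base_log_folder = s3://my-airflow-logs/airflow/logs

# (c) remote logging to CloudWatch -- the configuration that showed the hangs
# remote_logging = True
# remote_log_conn_id = aws_default
# remote_base_log_folder = cloudwatch://arn:aws:logs:eu-west-1:111122223333:log-group:airflow-task-logs
```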
-
Same issue happening here with the same config (remote logging to CloudWatch enabled). @potiuk what would be the steps to reopen issue #19535?
-
Sorry to necro an old thread here, but this may help others: in version 3.19.0 (released 2025-11-21) of dd-trace-py, our pods were getting stuck with a deadlock in forked sub-processes caused by the Datadog trace library. Reverting to 3.18.1 and pinning to that version solved the issue.
This is the line where we saw the pods getting stuck, which is a method from the NativeWriter class. The 3.19 release bumped libdatadog to a new major version (24.0.0) and made NativeWriter the default in the Nov 13th commits. NativeWriter is implemented in Rust + tokio; this is the underlying Rust implementation: https://github.com/DataDog/libdatadog/blob/3445414c9ba4fefc76be46cf7e2f998986592892/libdd-data-pipeline/src/trace_exporter/mod.rs#L302
self.runtime and self.workers are protected behind an Arc<Mutex<>>. Rust's std::sync::Mutex is built on POSIX pthread_mutex_t on Unix systems, which is not fork-safe.
TL;DR: there may be third-party, non-fork-safe observability add-ins (such as the AWS one in the thread above, or the Datadog one here) which cause deadlocks in forked sub-processes, making it appear as if Airflow spawns "stuck" processes. So if in doubt, check anything that might be observing these tasks as they spawn.
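To make the failure mode concrete, here is a minimal, self-contained sketch of the general hazard described above - plain Python, nothing Datadog- or Airflow-specific. A background thread holds a lock when os.fork() happens; the child inherits the lock in its locked state but not the thread that would release it, so the child blocks forever and looks exactly like a task stuck in Running. Running this script will hang, by design:

```python
import os
import threading
import time

lock = threading.Lock()

def background_writer():
    # Stands in for a telemetry/trace flush thread that frequently holds a lock.
    while True:
        with lock:
            time.sleep(0.1)

threading.Thread(target=background_writer, daemon=True).start()
time.sleep(0.05)  # let the writer start and (very likely) grab the lock

pid = os.fork()
if pid == 0:
    # Child process: the writer thread was not copied by fork, but the lock's
    # "held" state was, so this acquire blocks forever -- the child looks like
    # a process that is Running but doing nothing.
    lock.acquire()
    print("child acquired the lock (you will almost never see this)")
    os._exit(0)
else:
    os.waitpid(pid, 0)  # the parent waits on a child that never exits
```

For the dd-trace-py case specifically, pinning ddtrace to 3.18.1 as described above was enough to avoid the deadlock in our environment.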
-
Apache Airflow version
2.2.1 (latest released)
Operating System
Amazon Linux 2
Versions of Apache Airflow Providers
apache_airflow-2.2.1.dist-info
apache_airflow_providers_amazon-2.3.0.dist-info
apache_airflow_providers_ftp-2.0.1.dist-info
apache_airflow_providers_http-2.0.1.dist-info
apache_airflow_providers_imap-2.0.1.dist-info
apache_airflow_providers_postgres-2.3.0.dist-info
apache_airflow_providers_sqlite-2.0.1.dist-info
Deployment
Other
Deployment details
pip3 installation to an Amazon Linux 2 based AMI running on an EC2 instance (t3.xlarge)
What happened
Our DAG runs 8 parallel tasks on 32 LocalExecutor slots, using the os.fork option for execution. We use custom operators based on the Amazon providers to add Steps to an EMR Cluster.
Seemingly randomly, tasks get stuck in the Running state, producing little log output.
We can see from running ps ax that these "stuck" tasks have two task supervisor processes running. Retrying the task, or killing one of these supervisor processes, causes Airflow to retry the task, which then runs successfully.
This is an intermittent issue that occurs seemingly at random and renders Airflow unusable in a production setting.
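For context, the relevant parts of the configuration look roughly like this (a sketch; the key names are from the Airflow 2.2 configuration reference, and treating execute_tasks_new_python_interpreter = False as the "os.fork option" mentioned above is an assumption):

```ini
[core]
executor = LocalExecutor
# 32 task slots across the instance
parallelism = 32
# False (the default) makes local task execution fork the running process
# via os.fork rather than starting a fresh Python interpreter per task
execute_tasks_new_python_interpreter = False
```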
Running processes:
Airflow UI task log for a hung task:
Airflow UI task log for a successful task:
What you expected to happen
Task should not hang in 'Running' state
How to reproduce
We have found that this issue occurs more often as the number of parallel tasks increases.
Anything else
Roughly one DAG run in 10 with 8 parallel tasks and 40+ pending tasks
Are you willing to submit PR?
Code of Conduct