Airflow tasks hang in 'Running' state #19587
Replies: 8 comments 8 replies
-
Thanks for opening your first issue here! Be sure to follow the issue template!
-
I think this is something with your tasks. It looks like they are simply hanging while doing whatever they are doing - so they are not "stuck", they are simply "running". I think you need to take a close look at your tasks and see whether they are doing something you do not expect them to do - for example, not closing a connection and waiting, etc. Or maybe they are not starting at all but are waiting for a lock that is never granted to them. You will need to share more details with us about what your tasks are doing and check some details about them. I will convert this to a discussion until we get more info about it.
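One way to see where a task that is "running" but silent is actually blocked is to dump the Python stacks of the stuck process. A minimal sketch using only the standard library - registering the handler inside your own task/operator code is an assumption here, it is not an Airflow feature:

```python
import faulthandler
import signal

# Somewhere in your own task/operator code (an assumption, not an Airflow hook):
# register a handler that dumps every thread's Python stack on SIGUSR1.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Then, on the worker host, find the PID of the stuck task process (e.g. via
# `ps ax`) and send it the signal:
#     kill -USR1 <pid>
# The stack traces are written to stderr and should end up in the task log,
# showing whether the task is waiting on a connection, a lock, etc.
```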
-
OK, we'll try adding further logging to our tasks to see if there's something strange happening that causes them to hang, but based on the extra logging we've added so far it seems that our code is not being executed before the task gets stuck. Quick question - is it normal to have two task supervisors for a single worker? Our first thought when we saw that was that maybe one of the supervisors was waiting for notification from the task, while the task was trying to inform (or had informed) the other supervisor that it was ready to proceed. I'll update the discussion with any further logging that we can obtain. Thanks.
-
After running a few more DAGs today we managed to catch one where the task that had hung was an EmrStepSensor. The logged output appears to show that the task stopped at the same point, or at least somewhere after the same log statement. The process list again shows a single worker and two supervisors, but no runner for the task. I don't think this task calls any of our code, as it should just be checking on the state of a step in an EMR job flow - is that correct?
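For reference, the sensor in question is normally wired up roughly like this (a sketch based on the amazon provider around 2.3.0; the DAG id, task ids and the XCom references are made up for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor

# Hypothetical DAG wiring for illustration only; job_flow_id and step_id would
# usually come from XCom values pushed by the cluster/step-creating tasks.
with DAG("emr_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull('add_step', key='return_value')[0] }}",
        aws_conn_id="aws_default",
    )
```

If the sensor really only polls EMR between poke intervals, a hang at this point would have to come from the provider/boto3/logging layers rather than from the DAG author's code.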
-
After another day of running DAGs to try and figure out how to reproduce the issue we've drawn a few conclusions:
So I have three questions:
Thanks.
ps output
-
Our latest investigations appear to point towards this being caused by enabling remote logging to CloudWatch. We've run the DAG with remote logging disabled entirely and also configured it to log to S3, and haven't seen a task hang with either of those configurations so far today. We've seen well over 50 successful runs with remote logging disabled and something like 17 with S3 logging enabled, which is reasonably positive given that we managed to build a DAG that would fail roughly 1 in 3 times. Any thoughts on what to do next?
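For anyone trying to reproduce the comparison, the three setups being contrasted differ only in the logging configuration, roughly as below (a sketch following the Airflow 2.2 / Amazon provider docs; the bucket name, log group ARN and connection id are placeholders):

```ini
[logging]
# (a) remote logging disabled entirely
remote_logging = False

# (b) remote logging to S3 -- use instead of (a)
# remote_logging = True
# remote_log_conn_id = aws_default
# remote_base_log_folder = s3://my-airflow-logs/airflow/logs

# (c) remote logging to CloudWatch -- the configuration that showed the hangs
# remote_logging = True
# remote_log_conn_id = aws_default
# remote_base_log_folder = cloudwatch://arn:aws:logs:eu-west-1:111122223333:log-group:airflow-task-logs
```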
-
Same issue happening here with the same config (remote logging to CloudWatch enabled). @potiuk what would be the steps to reopen issue #19535?
-
Sorry to necro an old thread here, but this may help others: in version 3.19.0 (released 2025-11-21) of dd-trace-py, our pods were getting stuck with a deadlock in forked sub-processes caused by the Datadog trace library. Reverting to 3.18.1 and pinning to that version solved the issue.
This is the line where we saw the pods getting stuck, which is a method from the NativeWriter class. The 3.19 release bumped libdatadog to a new major version (24.0.0) and made NativeWriter the default in the Nov 13th commits. NativeWriter is implemented in Rust + tokio; this is the underlying Rust implementation: https://github.com/DataDog/libdatadog/blob/3445414c9ba4fefc76be46cf7e2f998986592892/libdd-data-pipeline/src/trace_exporter/mod.rs#L302
self.runtime and self.workers are protected behind an Arc<Mutex<>>. Rust's std::sync::Mutex is built on POSIX pthread_mutex_t on Unix systems, which is not fork-safe.
TL;DR: there may be third-party, non-fork-safe observability add-ins (such as the AWS one in the thread above, or the Datadog one here) which cause deadlocks in forked sub-processes, making it appear as if Airflow spawns "stuck" processes. So if in doubt, check anything that might be observing these tasks as they spawn.
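To make the failure mode concrete, here is a minimal, self-contained sketch of the general hazard described above - plain Python, nothing Datadog- or Airflow-specific. A background thread holds a lock when os.fork() happens; the child inherits the lock in its locked state but not the thread that would release it, so the child blocks forever and looks exactly like a task stuck in Running. Running this script will hang, by design:

```python
import os
import threading
import time

lock = threading.Lock()

def background_writer():
    # Stands in for a telemetry/trace flush thread that frequently holds a lock.
    while True:
        with lock:
            time.sleep(0.1)

threading.Thread(target=background_writer, daemon=True).start()
time.sleep(0.05)  # let the writer start and (very likely) grab the lock

pid = os.fork()
if pid == 0:
    # Child process: the writer thread was not copied by fork, but the lock's
    # "held" state was, so this acquire blocks forever -- the child looks like
    # a process that is Running but doing nothing.
    lock.acquire()
    print("child acquired the lock (you will almost never see this)")
    os._exit(0)
else:
    os.waitpid(pid, 0)  # the parent waits on a child that never exits
```

For the dd-trace-py case specifically, pinning ddtrace to 3.18.1 as described above was enough to avoid the deadlock in our environment.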
-
Apache Airflow version
2.2.1 (latest released)
Operating System
Amazon Linux 2
Versions of Apache Airflow Providers
apache_airflow-2.2.1.dist-info
apache_airflow_providers_amazon-2.3.0.dist-info
apache_airflow_providers_ftp-2.0.1.dist-info
apache_airflow_providers_http-2.0.1.dist-info
apache_airflow_providers_imap-2.0.1.dist-info
apache_airflow_providers_postgres-2.3.0.dist-info
apache_airflow_providers_sqlite-2.0.1.dist-info
Deployment
Other
Deployment details
pip3 installation to an Amazon Linux 2 based AMI running on an EC2 instance (t3.xlarge)
What happened
Our DAG runs 8 parallel tasks on 32 LocalExecutor slots, using the os.fork option for execution. We use custom operators based on the Amazon providers to add Steps to an EMR Cluster.
Seemingly randomly, tasks get stuck in the Running state, producing little log output.
We can see from running ps ax that these "stuck" tasks have two task supervisor processes running. Retrying the task, or killing one of these supervisor processes, causes Airflow to retry the task, which then runs successfully.
This is an intermittent issue that occurs seemingly at random and renders Airflow unusable in a production setting.
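For context, the relevant parts of the configuration look roughly like this (a sketch; the key names are from the Airflow 2.2 configuration reference, and treating execute_tasks_new_python_interpreter = False as the "os.fork option" mentioned above is an assumption):

```ini
[core]
executor = LocalExecutor
# 32 task slots across the instance
parallelism = 32
# False (the default) makes local task execution fork the running process
# via os.fork rather than starting a fresh Python interpreter per task
execute_tasks_new_python_interpreter = False
```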
Running processes:
Airflow UI task log for a hung task:
Airflow UI task log for a successful task:
What you expected to happen
Task should not hang in 'Running' state
How to reproduce
We have found that this issue occurs more often as the number of parallel tasks increases.
Anything else
Roughly one DAG run in 10 with 8 parallel tasks and 40+ pending tasks
Are you willing to submit PR?
Code of Conduct