This repository was archived by the owner on Oct 18, 2023. It is now read-only.
Scheduler has to be restarted every few days #4
Open
Labels: bug (Something isn't working)
Description
This appears to have occurred on October 8th and again on October 23rd, 2018. We start seeing messages like this in the logs:
On investigating the ECS instances, it looks like s3fs has also failed on one of them:
$ mount -l | grep airflow
s3fs on /mnt/prod-airflow-ecs type fuse.s3fs (rw,noatime,user_id=0,group_id=0,allow_other)
$ sudo ls /mnt/prod-airflow-ecs/
ls: cannot access /mnt/prod-airflow-ecs/: Transport endpoint is not connected
The second instance is fine:
[ec2-user@ip-10-0-10-132 ~]$ mount -l | grep airflow
s3fs on /mnt/prod-airflow-ecs type fuse.s3fs (rw,noatime,user_id=0,group_id=0,allow_other)
[ec2-user@ip-10-0-10-132 ~]$ sudo ls /mnt/prod-airflow-ecs/
dags plugins
We currently have no monitoring to detect this. The only recourse is to stop the Scheduler task in ECS; it then bounces back and runs the scheduled jobs it missed. If restarting the Scheduler task doesn't work, terminating the ECS instances helps, but it takes time to fail over and restart all the instances.
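One way to close the monitoring gap might be a small health check that lists the mount point and reports the failure seen above. The following is only a sketch, not something that exists in this repository: the function name `check_mount` is hypothetical, and the default `expected` entries (`dags`, `plugins`) are taken from the listing on the healthy instance. It assumes that the "Transport endpoint is not connected" error surfaces in Python as an `OSError` with `errno.ENOTCONN`, which is how Linux reports a dead FUSE mount.

```python
import errno
import os


def check_mount(path, expected=("dags", "plugins")):
    """Return (healthy, detail) for a directory expected to be an s3fs mount.

    `expected` lists entries that should be visible when the bucket is
    mounted (an assumption based on the healthy instance's listing).
    """
    try:
        entries = set(os.listdir(path))
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            # Same condition as "ls: ... Transport endpoint is not connected".
            return False, "stale FUSE mount: transport endpoint is not connected"
        return False, f"cannot list directory: {exc.strerror}"

    # An unmounted (empty) mount point lists fine, so also check the contents.
    missing = sorted(set(expected) - entries)
    if missing:
        return False, "missing expected entries: " + ", ".join(missing)
    return True, f"{len(entries)} entries visible"
```

Calling `check_mount("/mnt/prod-airflow-ecs")` from a cron job or a container health check on each ECS instance, and alerting when the first element of the result is `False`, would flag the broken instance before scheduled jobs are missed. How the alert is delivered is left open here.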
The first occurrence, on October 8th, was due to issues with the Docker volume mount, but it wasn't investigated in depth at the time.

