Scheduler has to be restarted every few days

This appears to have occurred on October 8th and again on October 23rd, 2018. When it happens, we start seeing messages like these in the logs:

![image](https://user-images.githubusercontent.com/19393946/47441507-45b2d880-d77e-11e8-88c9-a0c1c236b00f.png)

![image](https://user-images.githubusercontent.com/19393946/47441563-65e29780-d77e-11e8-88d3-9d2a1954c8f9.png)

On investigating the ECS instances, it looks like s3fs has also failed on one of them:
```
$ mount -l | grep airflow
s3fs on /mnt/prod-airflow-ecs type fuse.s3fs (rw,noatime,user_id=0,group_id=0,allow_other)
$ sudo ls /mnt/prod-airflow-ecs/
ls: cannot access /mnt/prod-airflow-ecs/: Transport endpoint is not connected
```

The second instance is fine:
```
[ec2-user@ip-10-0-10-132 ~]$ mount -l | grep airflow
s3fs on /mnt/prod-airflow-ecs type fuse.s3fs (rw,noatime,user_id=0,group_id=0,allow_other)
[ec2-user@ip-10-0-10-132 ~]$ sudo ls /mnt/prod-airflow-ecs/
dags  plugins
```

We currently have no monitoring to detect this. The only recourse is to stop the Scheduler task in ECS; it then bounces back and runs the scheduled jobs it missed. If restarting the scheduler task doesn't work, terminating the ECS instances helps, but it takes time to fail over and restart all the instances.
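As a starting point for the missing detection, here is a minimal sketch of a probe that tries to list the mount point and reports whether it looks stale. The function name `check_mount` and the status strings are hypothetical, not part of any existing tooling here. It assumes the failure always surfaces as the `Transport endpoint is not connected` error shown above (errno `ENOTCONN`); a mount that hangs instead of erroring would need a timeout around the call.

```python
import errno
import os


def check_mount(path: str) -> str:
    """Probe a mount point by listing it.

    Returns "ok" if the directory can be listed, "stale" if the kernel
    reports ENOTCONN ("Transport endpoint is not connected", the error
    seen on the broken s3fs mount), and "error" for any other OSError.
    """
    try:
        os.listdir(path)
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            return "stale"
        return "error"
    return "ok"
```

Calling `check_mount("/mnt/prod-airflow-ecs")` on each instance from a scheduled job and alerting on anything other than `"ok"` would at least flag the broken mount before scheduled jobs are missed. How the alert is delivered is left open.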

The first occurrence on October 8th was due to issues with the Docker volume mount, but it wasn't investigated in depth at the time.
