This repository was archived by the owner on Oct 18, 2023. It is now read-only.

Scheduler has to be restarted every few days #4

@hackmad

This appears to have occurred on October 8th and again on October 23rd, 2018. We start seeing messages like this in the logs:

(two screenshots of the log messages)

While investigating the ECS instances, it looks like the s3fs mount has also failed on one of them:

$ mount -l | grep airflow
s3fs on /mnt/prod-airflow-ecs type fuse.s3fs (rw,noatime,user_id=0,group_id=0,allow_other)
$ sudo ls /mnt/prod-airflow-ecs/
ls: cannot access /mnt/prod-airflow-ecs/: Transport endpoint is not connected

The second instance is fine:

[ec2-user@ip-10-0-10-132 ~]$ mount -l | grep airflow
s3fs on /mnt/prod-airflow-ecs type fuse.s3fs (rw,noatime,user_id=0,group_id=0,allow_other)
[ec2-user@ip-10-0-10-132 ~]$ sudo ls /mnt/prod-airflow-ecs/
dags  plugins

Currently we don't have monitoring to detect this, and the only recourse is to stop the Scheduler task in ECS. It then bounces back and runs the scheduled jobs it missed. If restarting the scheduler task doesn't work, terminating the ECS instances helps, but it takes time to fail over and restart all the instances.

The first occurrence on October 8th was due to issues with the Docker volume mount, but it wasn't investigated in depth at the time.
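
Since `mount -l` still lists the s3fs mount on the broken instance, a check has to actually read the directory to notice the failure. Something along these lines could be run from cron or a health check on each instance. This is only a rough sketch: the `mount_is_healthy` function name, the `MOUNT_POINT` variable, and the 5-second timeout are assumptions for illustration, not anything we have deployed.

```shell
#!/bin/sh
# Sketch of a mount health check (hypothetical helper, not deployed anywhere).
# The default path comes from this issue; the 5-second timeout is an assumption.

# Returns 0 if the directory can be listed within the timeout, non-zero otherwise.
# A dead FUSE mount still appears in `mount -l`, so we have to touch the path.
mount_is_healthy() {
    timeout 5 ls "$1" >/dev/null 2>&1
}

# Report the state of the mount point; override MOUNT_POINT to check another path.
MOUNT_POINT="${MOUNT_POINT:-/mnt/prod-airflow-ecs}"
if mount_is_healthy "$MOUNT_POINT"; then
    echo "OK: $MOUNT_POINT is readable"
else
    echo "FAIL: $MOUNT_POINT is not accessible"
fi
```

The `timeout` wrapper (GNU coreutils) is there in case a broken mount hangs instead of returning "Transport endpoint is not connected" right away. The `FAIL` branch is where an alert or an automatic restart could be hooked in.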

Labels: bug