[BUG] SLURM autodetection causes OF3 to hang on compute cluster #47

@reedharrison

Description

Describe the bug
When an OpenFold3 job is run under SLURM, the job ends up spawning an excessive number of processes (likely due to the cluster's SLURM configuration). It would be preferable to disable the SLURMEnvironment and use only local resources. In my case, the job is submitted to a VM with finite resources, and OpenFold3 oversubscribes the CPU/GPU, so the job appears to hang on the cluster.

To Reproduce
Reproducing the issue may be difficult if you don't have a compute cluster configured in a similar way; however, SLURM autodetection is a common problem with PyTorch Lightning (see the sketch below). Note that my OpenFold3 job is not run in Docker. While that might be a solution for me in the long term, it would be nice if SLURM autodetection could be disabled via a command-line argument.
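
For reference, this is roughly what the autodetection hinges on: if SLURM_* variables are present in the job's environment, Lightning's Trainer silently selects the SLURM cluster environment and sizes the run from them. A minimal check, assuming a recent pytorch_lightning install (the exact import path differs between the `pytorch_lightning` and `lightning.pytorch` namespaces):

```python
import os

from pytorch_lightning.plugins.environments import SLURMEnvironment

# True whenever Lightning would treat this process as part of a SLURM job
print("SLURM detected:", SLURMEnvironment.detect())

# Environment variables of this kind are what trigger the detection
for var in ("SLURM_NTASKS", "SLURM_JOB_ID", "SLURM_PROCID"):
    print(var, "=", os.environ.get(var))
```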

Expected behavior
To avoid this issue, the run command could accept one or more arguments that disable SLURM autodetection or that manually specify the local resources to use.
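
A rough sketch of what such a flag could look like. The `--no-slurm-detect` argument name and the surrounding wiring are purely illustrative (not part of the current OpenFold3 CLI); the underlying mechanism, passing an explicit `LightningEnvironment` plugin so the Trainer skips SLURM autodetection, is standard PyTorch Lightning:

```python
import argparse

import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import LightningEnvironment

parser = argparse.ArgumentParser()
parser.add_argument("--no-slurm-detect", action="store_true",
                    help="Ignore SLURM_* variables and run with local resources only")
parser.add_argument("--devices", type=int, default=1)
args = parser.parse_args()

# Supplying an explicit ClusterEnvironment bypasses SLURM autodetection;
# leaving plugins=None keeps Lightning's default behaviour.
plugins = [LightningEnvironment()] if args.no_slurm_detect else None

trainer = pl.Trainer(
    accelerator="gpu",
    devices=args.devices,
    num_nodes=1,
    plugins=plugins,
)
```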

Labels: OpenFold Consortium Member, bug, inference
