feat: Add optional max_walltime to prevent infinite looping in Slurm jobs #638
base: main
Conversation
    # Read start time and calculate elapsed time
    _job_chain_start_time=$(cat "$_start_time_file")
    _current_time=$(date +%s)
If I read this correctly, we're not calculating the right thing here. This doesn't compute walltime; it computes the total time that has passed since the job was first started, including the time the resumed job has spent waiting in the queue.
Good catch, fixed that with sacct :)
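To illustrate the distinction raised above: subtracting a stored start timestamp from `date +%s` also counts time spent pending in the queue, whereas summing Slurm's per-job run time does not. The following is a minimal sketch, not the code from this PR. It assumes the job IDs of the chain are tracked somewhere, the helper names `sum_elapsed_seconds` and `chain_walltime_seconds` are hypothetical, and it relies on sacct's `ElapsedRaw` field, which reports a job's elapsed run time in seconds.

```bash
#!/bin/sh
# Sum a column of per-job elapsed seconds read from stdin (one value per line).
sum_elapsed_seconds() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Hypothetical helper: accumulated run time for a comma-separated list of job IDs.
# -X reports allocations only (no job steps), -n drops the header,
# -P prints unpadded, delimiter-separated output.
chain_walltime_seconds() {
    sacct -X -n -P -j "$1" --format=ElapsedRaw | sum_elapsed_seconds
}
```

On a cluster this could be called as `chain_walltime_seconds "1234,1235"`, where the job IDs are placeholders for the original job and its resumes.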
    ntasks_per_node: 1
    gres: gpu:8
    walltime: 01:00:00
    max_walltime: null # Maximum total wall-clock time across all resumes (e.g., "24:00:00"). null = unlimited.
We should consider setting it to some high default (e.g., 48h). Users pointed out that the current setup is error-prone.
I think that a user forgetting to increase it will cause less damage than a user forgetting to set it.
Can we agree on 72h as the default? Some benchmarks really can last 48h, and I would like to minimize the risk of us not delivering on time because someone forgot to increase max_walltime :(
72 sounds good, I just want something < infinity :)
Fixed
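The limit discussed in this thread could be enforced roughly as follows. This is a sketch under stated assumptions rather than the PR's implementation: `max_walltime` is taken to be either `null` (unlimited) or a plain `HH:MM:SS` string, the accumulated run time is already available in seconds, and the function names `hms_to_seconds` and `should_resume` are hypothetical.

```bash
#!/bin/sh
# Convert an HH:MM:SS string to seconds. awk reads "08" as decimal 8,
# so zero-padded fields are not misparsed as octal.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed (exit 0) if the job chain may be resubmitted.
#   $1 = accumulated run time in seconds
#   $2 = max_walltime value ("null" or empty means unlimited)
should_resume() {
    _elapsed=$1
    _max=$2
    if [ -z "$_max" ] || [ "$_max" = "null" ]; then
        return 0
    fi
    _limit=$(hms_to_seconds "$_max")
    [ "$_elapsed" -lt "$_limit" ]
}
```

With the 72h default agreed on above, `hms_to_seconds 72:00:00` yields 259200, so a chain that has already run for 259200 seconds or more would not be resubmitted.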
fcad123 to 2e7da09
/ok to test c70c92c
/ok to test 195a477
/ok to test d95b95f
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…t queue time Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
d95b95f to b7706d3
/ok to test b7706d3
No description provided.