Conversation

@wprazuch (Contributor)

No description provided.

@wprazuch wprazuch requested review from a team as code owners January 16, 2026 14:38
@github-actions github-actions bot added the documentation, nemo-evaluator-launcher, and tests labels on Jan 16, 2026
@wprazuch wprazuch changed the title feature: Add optional max_walltime to prevent infinite looping in Slurm jobs feat: Add optional max_walltime to prevent infinite looping in Slurm jobs Jan 16, 2026
# Read start time and calculate elapsed time
_job_chain_start_time=$(cat "$_start_time_file")
_current_time=$(date +%s)

Contributor:

If I read this correctly, we're not calculating the right thing here. It doesn't calculate walltime, but the total time that has passed since the job was first started, including the time the resumed job has spent in the queue.

Contributor Author:

Good catch, fixed that with sacct :)
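
For reference, a minimal sketch of what an sacct-based check might look like (the variable names and the surrounding resume-chain logic here are assumptions for illustration, not the actual patch). sacct's ElapsedRaw field reports only the seconds a job actually ran, so summing it over the jobs in the chain excludes queue time:

# Hypothetical sketch: sum actual run time (excluding queue time) across the resume chain.
# $_job_chain_ids is assumed to hold the Slurm job IDs of all previous submissions.
_total_elapsed=0
for _jid in $_job_chain_ids; do
    # ElapsedRaw = wall-clock seconds the allocation ran; -X skips job steps, -n drops the header
    _elapsed=$(sacct -j "$_jid" -X -n --format=ElapsedRaw | head -n 1 | tr -d ' ')
    _total_elapsed=$(( _total_elapsed + ${_elapsed:-0} ))
done
# Stop resubmitting once the accumulated run time exceeds max_walltime (assumed already converted to seconds)
if [ -n "$_max_walltime_seconds" ] && [ "$_total_elapsed" -ge "$_max_walltime_seconds" ]; then
    echo "max_walltime reached (${_total_elapsed}s); not resubmitting"
    exit 0
fi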

ntasks_per_node: 1
gres: gpu:8
walltime: 01:00:00
max_walltime: null # Maximum total wall-clock time across all resumes (e.g., "24:00:00"). null = unlimited.

Contributor:

We should consider setting it to some high default (e.g. 48h). Users pointed out that the current setup is error prone.

I think that if the user forgets to increase it, it will cause less damage than forgetting to set it.

Contributor Author:

Can we agree on 72h as the default? Some benchmarks can really last 48h, and I would like to minimize the risk of us not delivering on time because someone forgot to increase max_walltime :(

Contributor:

72 sounds good, I just want something < infinity :)

Contributor Author:

Fixed
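
For reference, with the agreed default the config line quoted above would presumably read something like the following (the exact formatting is an assumption, mirroring the snippet earlier in this thread):

max_walltime: 72:00:00 # Maximum total wall-clock run time across all resumes; increase for benchmarks expected to run longer.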


copy-pr-bot bot commented Jan 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wprazuch wprazuch force-pushed the wprazuch/optional-max-total-time-limit-for-slurm-jobs branch from fcad123 to 2e7da09 on January 21, 2026 13:56
@wprazuch (Contributor Author):

/ok to test c70c92c

@wprazuch (Contributor Author):

/ok to test 195a477

@wprazuch (Contributor Author):

/ok to test d95b95f

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…t queue time

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch wprazuch force-pushed the wprazuch/optional-max-total-time-limit-for-slurm-jobs branch from d95b95f to b7706d3 on January 22, 2026 10:06
@wprazuch (Contributor Author):

/ok to test b7706d3
