Fix hung_task_timeout_secs issue #57
base: master
Has this been shown to work?
Haven't had any issues so far since it was applied.
with_items:
  - { name: kernel.panic, value: 5 }
  - name: kernel.hung_task_timeout_secs
    value: 0
This is definitely going to make the message go away, since it stops the kernel checking for hung tasks. Without the message, we're not going to know if we fixed the underlying issue.
Hmm, well, we can always put it back if we want to debug. It's just quite annoying, as this is part of the reason why nodes won't reboot automatically. By default, we don't want things to hang; this should be opt-in.
But this option only controls whether the kernel reports that tasks have hung; it doesn't affect whether tasks will hang.
I think this is what is preventing automatic reboot though?
It's just for reporting, see https://www.kernel.org/doc/Documentation/sysctl/kernel.txt.
This article suggests you can panic on hung tasks by setting kernel.hung_task_panic=1, available from kernel 2.6.35. Then, to make the kernel reboot on panic, you set kernel.panic= (as you have done).
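For illustration, the suggestion above could be sketched as an Ansible task in the style of this PR's diff. This is only a sketch under assumptions: the diff shows just the `with_items` entries, so the task name and the use of the `ansible.posix.sysctl` module are guesses rather than what the role actually contains. It also assumes kernel.hung_task_timeout_secs is left at its default so that hung-task detection stays enabled.

```yaml
# Sketch only: the task name and module choice are assumptions, not taken from this PR.
- name: Panic on hung tasks and reboot after a panic
  ansible.posix.sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    sysctl_set: true
  with_items:
    # Reboot 5 seconds after a kernel panic (value already used in this PR).
    - { name: kernel.panic, value: 5 }
    # Turn a detected hung task into a panic instead of only logging it.
    - { name: kernel.hung_task_panic, value: 1 }
```

With settings like these the hung-task message is still logged, so the underlying issue stays visible, while a node that hits it should panic and then reboot instead of staying hung.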
5f7faa8 to b1b77f3
We are repeatedly seeing this kernel panic, and this is probably a solution. It is difficult to replicate, as it happens randomly on different nodes when running Slurm jobs. Let's see if this fixes the problem (this article has a good explanation of why this could work: https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/).
Log snippet: