9 changes: 8 additions & 1 deletion ansible/roles/cluster_setup/tasks/sysctl.yml
@@ -9,5 +9,12 @@
reload: yes
sysctl_set: yes
with_items:
- - { name: kernel.panic, value: 5 }
+ - name: kernel.hung_task_timeout_secs
+   value: 0

This is definitely going to make the message go away, since it stops the kernel checking for hung tasks. Without the message, we're not going to know if we fixed the underlying issue.

Author

Hmm, well, we can always put it back if we want to debug... it's just quite annoying, as this is part of the reason why nodes won't reboot automatically. By default, we don't want things to hang; this should be opt-in.

But this option just controls whether the kernel reports that tasks have hung; it doesn't affect whether tasks will hang.

Author

I think this is what is preventing automatic reboot though?

@markgoddard Apr 12, 2019

It's just for reporting, see https://www.kernel.org/doc/Documentation/sysctl/kernel.txt.

This article suggests you can make the kernel panic on hung tasks by setting kernel.hung_task_panic=1 (available from kernel 2.6.35). Then, to make the kernel reboot on panic, you set kernel.panic to a non-zero value (as you have done).
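
As a rough sketch of what that might look like in this task's item list: this assumes the same name/value item format used elsewhere in this diff and reuses the existing kernel.panic value of 5. kernel.hung_task_timeout_secs is deliberately not set, so hung-task detection stays at the kernel's default.

```yaml
# Sketch only: entries for the existing sysctl task's with_items list.
# Assumes the name/value item format shown in this diff.
- name: kernel.hung_task_panic
  value: 1    # panic when a hung task is detected
- name: kernel.panic
  value: 5    # reboot 5 seconds after a panic
```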

+ - name: kernel.panic
+   value: 5
+ - name: vm.dirty_background_ratio
+   value: 5
+ - name: vm.dirty_ratio
+   value: 10
...