Conversation

@brtkwr commented Mar 11, 2019

We are repeatedly seeing this kernel panic, and this is probably a solution. It is difficult to replicate, as it happens randomly on different nodes when running Slurm jobs. Let's see if this fixes the problem (this article gives a good explanation of why it could work: https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/).

Log snippet:

[  241.278263] INFO: task systemd-udevd:140 blocked for more than 120 seconds.
[  241.286338] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  241.295607] systemd-udevd   D ffff88002de7eeb0     0   140     91 0x00100004
[  241.303947] Call Trace:
[  241.306976]  [<ffffffff81718f39>] schedule+0x29/0x70
[  241.312818]  [<ffffffff817168a9>] schedule_timeout+0x239/0x2c0
[  241.319639]  [<ffffffff810cea5a>] ? check_preempt_curr+0x8a/0xa0
[  241.326643]  [<ffffffff810cea89>] ? ttwu_do_wakeup+0x19/0xe0
[  241.333263]  [<ffffffff817192ed>] wait_for_completion+0xfd/0x140
[  241.340269]  [<ffffffff810d2010>] ? wake_up_state+0x20/0x20
[  241.346790]  [<ffffffff810bdd7a>] kthread_create_on_node+0xaa/0x140
[  241.354083]  [<ffffffff810b6400>] ? process_one_work+0x440/0x440
[  241.361092]  [<ffffffff810b9407>] __alloc_workqueue_key+0x327/0x5a0
[  241.368387]  [<ffffffff810cea89>] ? ttwu_do_wakeup+0x19/0xe0
[  241.375004]  [<ffffffff810d1d8c>] ? try_to_wake_up+0x18c/0x350
[  241.381820]  [<ffffffff814a11eb>] scsi_host_alloc+0x39b/0x4b0
[  241.388540]  [<ffffffffc01b03f8>] ata_scsi_add_hosts+0x78/0x1b0 [libata]
[  241.396345]  [<ffffffffc01aad5b>] ata_host_register+0x11b/0x2c0 [libata]
[  241.404133]  [<ffffffffc01aafb6>] ata_host_activate+0xb6/0x130 [libata]
[  241.411817]  [<ffffffffc01162e0>] ? ahci_handle_port_interrupt+0x550/0x550 [libahci]
[  241.426386]  [<ffffffffc0116494>] ahci_host_activate+0x44/0x130 [libahci]
[  241.434271]  [<ffffffff8139b106>] ? pcibios_set_master+0x76/0xa0
[  241.441279]  [<ffffffffc01dfdbd>] ahci_init_one+0x6cd/0xb72 [ahci]
[  241.448486]  [<ffffffff8139c82a>] local_pci_probe+0x4a/0xb0
[  241.455003]  [<ffffffff8139df69>] pci_device_probe+0x109/0x160
[  241.461825]  [<ffffffff814783b5>] driver_probe_device+0xc5/0x3e0
[  241.468828]  [<ffffffff814787b3>] __driver_attach+0x93/0xa0
[  241.475351]  [<ffffffff81478720>] ? __device_attach+0x50/0x50
[  241.482066]  [<ffffffff81475f55>] bus_for_each_dev+0x75/0xc0
[  241.488686]  [<ffffffff81477d2e>] driver_attach+0x1e/0x20
[  241.495016]  [<ffffffff814777d0>] bus_add_driver+0x200/0x2d0
[  241.501630]  [<ffffffff81478e44>] driver_register+0x64/0xf0
[  241.508156]  [<ffffffff8139d7a5>] __pci_register_driver+0xa5/0xc0
[  241.515261]  [<ffffffffc01e9000>] ? 0xffffffffc01e8fff
[  241.521291]  [<ffffffffc01e901e>] ahci_pci_driver_init+0x1e/0x1000 [ahci]
[  241.529174]  [<ffffffff8100210a>] do_one_initcall+0xba/0x240
[  241.535793]  [<ffffffff8111286c>] load_module+0x272c/0x2bc0
[  241.542314]  [<ffffffff8137b460>] ? ddebug_proc_write+0xf0/0xf0
[  241.549224]  [<ffffffff811d8740>] ? vmap_page_range_noflush+0x2c0/0x3f0
[  241.556914]  [<ffffffff81112dc5>] SyS_init_module+0xc5/0x110
[  241.563535]  [<ffffffff8172579b>] system_call_fastpath+0x22/0x27
[  272.987775] scsi host1: ahci


@brtkwr requested review from markgoddard and oneswig on Apr 9, 2019.
@brtkwr changed the title from "Maybe a solution to the hung_task_timeout_secs issue" to "Fix hung_task_timeout_secs issue" on Apr 9, 2019.
@markgoddard commented

Has this been shown to work?

@brtkwr commented Apr 10, 2019

We haven't had any issues so far since it was applied:

with_items:
  - { name: kernel.panic, value: 5 }
  - { name: kernel.hung_task_timeout_secs, value: 0 }
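
For context, here is roughly how those items could plug into a task using Ansible's sysctl module; the task name and surrounding layout are illustrative assumptions, not the actual diff:

# Hypothetical task layout, assuming Ansible's sysctl module; only the
# with_items values come from the actual change.
- name: Set kernel sysctl parameters
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    reload: yes
  with_items:
    - { name: kernel.panic, value: 5 }
    - { name: kernel.hung_task_timeout_secs, value: 0 }

Note that setting kernel.hung_task_timeout_secs to 0 disables the hung-task detector entirely, rather than just lengthening its timeout.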

@markgoddard commented

This is definitely going to make the message go away, since it stops the kernel checking for hung tasks. Without the message, we're not going to know if we fixed the underlying issue.

@brtkwr commented

Hmm, well, we can always put it back if we want to debug. It's just quite annoying, as this is part of the reason why nodes won't reboot automatically. By default, we don't want things to hang; this should be opt-in.

@markgoddard commented

But this option just controls whether the kernel reports that tasks have hung; it doesn't affect whether tasks will hang.

@brtkwr commented

I think this is what is preventing automatic reboot though?

@markgoddard commented Apr 12, 2019

It's just for reporting; see https://www.kernel.org/doc/Documentation/sysctl/kernel.txt.

That document suggests you can panic on hung tasks by setting kernel.hung_task_panic=1, available from kernel 2.6.35. Then, to make the kernel reboot after a panic, you set kernel.panic= (as you have done).
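
A sketch of what that combination could look like in the same with_items style as the change above; the values here are illustrative, not something agreed in this review:

with_items:
  - { name: kernel.panic, value: 5 }           # reboot 5 seconds after a panic
  - { name: kernel.hung_task_panic, value: 1 } # panic when a hung task is detected (kernel >= 2.6.35)

This keeps the hung-task detector enabled, so hangs are still reported, but turns each detection into a panic followed by an automatic reboot.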
