How to fix "BUG: soft lockup - CPU#0 stuck for 17163091968s"?

Solution 1:

Thanks to all commenters. I think I found the answer. There seems to be a timekeeping bug in at least Ubuntu's kernel version 2.6.32-30-server. The bug sometimes (?) kills machines when they reach an uptime of about 200 to 210 days. The halt does not happen immediately once that limit is reached; it is triggered by some later operation (in my case: apt-get install ...).

NB: 200 days is about 2^32 ticks of 1/250 second each (roughly 199 days, to be exact), and 250 is the default value for CONFIG_HZ.
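To check that arithmetic, here is a minimal sketch. It assumes a 32-bit tick counter that increments CONFIG_HZ times per second; the function name `wrap_days` is made up for illustration, and the script only verifies the numbers in the note, not the actual cause of the lockup.

```python
SECONDS_PER_DAY = 86_400


def wrap_days(hz: int, counter_bits: int = 32) -> float:
    """Days until a counter of `counter_bits` bits, ticking `hz` times per second, wraps."""
    return (2 ** counter_bits) / hz / SECONDS_PER_DAY


# 100, 250 and 1000 are common CONFIG_HZ choices; 250 is the one discussed here.
for hz in (100, 250, 1000):
    print(f"CONFIG_HZ={hz}: wraps after about {wrap_days(hz):.1f} days")
```

For CONFIG_HZ=250 this gives a little under 199 days, which matches the "about 200 days" estimate.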

So far I haven't found out whether the problem has been fixed in more recent kernels. I do know that it doesn't seem to affect an older kernel (2.6.32-26-server). From all this I presume that, if it's not fixed yet, it can be avoided by either of the following:

  • reboot the machines at least every 190 days (a good idea for kernel upgrades anyway)
  • set CONFIG_HZ to 100, which pushes the limit out to about 497 days. However, this might have quite unexpected side effects, especially in virtual environments, and it only postpones the problem rather than solving it.
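If you go with the reboot workaround, something like the following could be run from cron to flag machines that are getting close to the limit. This is only a sketch: the 190-day threshold comes from the suggestion above, the function names are hypothetical, and it assumes a Linux host where /proc/uptime is readable.

```python
SECONDS_PER_DAY = 86_400
REBOOT_THRESHOLD_DAYS = 190  # threshold taken from the workaround above


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Return the system uptime in seconds (first field of /proc/uptime)."""
    with open(path) as f:
        return float(f.read().split()[0])


def needs_reboot(uptime_seconds: float,
                 threshold_days: float = REBOOT_THRESHOLD_DAYS) -> bool:
    """True once the uptime has reached the reboot threshold."""
    return uptime_seconds >= threshold_days * SECONDS_PER_DAY


if __name__ == "__main__":
    try:
        uptime = read_uptime_seconds()
    except OSError as exc:
        # /proc/uptime is missing or unreadable (e.g. not a Linux host)
        print(f"could not read uptime: {exc}")
    else:
        days = uptime / SECONDS_PER_DAY
        if needs_reboot(uptime):
            print(f"uptime is {days:.1f} days: schedule a reboot")
        else:
            print(f"uptime is {days:.1f} days: OK")
```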

Here's a bug report for Ubuntu.

Solution 2:

This is actually a kernel bug that was fixed by the following kernel commit:

http://git.kernel.org/?p=linux/kernel/git/tip/tip.git;a=commit;h=4cecf6d401a01d054afc1e5f605bcbfe553cb9b9

You can search LKML for the following thread title (I cannot post more than 2 links): [stable] 2.6.32.21 - uptime related crashes?

And this is the Launchpad bug that brings the kernel fix to Ubuntu:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/902317

Upgrading to the latest kernel in lucid-updates should fix this issue for good.

HTH


Solution 3:

Could it be that the virtualisation host has some power-saving features ("Green IT") enabled that send unused cores into a low-power/sleep mode, disrupting the VMs that use those cores? I've heard this used to be a problem mainly in Hyper-V environments, but it may be something to look into.