Since building my new KVM server in January, I’ve been experiencing random occasional system crashes – sometimes months apart, other times a couple in a week.
I’ve been trying to trace the cause, but this fault is hard to diagnose – there’s never been anything output to display, nor anything in system or even BIOS logs.
- Unlikely to be a kernel panic: there’s no output to the console or syslog, and I’m running CentOS 6 kernels, which are usually pretty damn stable, being rebuilt RHEL kernels…
- Unlikely to be some weird disk fault: I replaced all the disks recently but am still experiencing the same issue – and when the fault occurs, all the RAID arrays get upset, even independent arrays on different controllers.
- Possibly a motherboard firmware issue, however I’ve upgraded to the latest BIOS version and have been unable to find any similar problems online.
- Possibly CPU/Memory/Motherboard hardware faults.
However, with the recent addition of a properly configured Munin server to my network, I’ve started graphing all the temperature sensors from my server. What I found is that the AMD Phenom II 810 CPU is running hot – very hot in fact, at around 60-70°C – and the crashes were occurring once the CPU peaked at 70°C.
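For anyone without Munin set up, the same kind of check can be roughed out by hand. The following is a minimal sketch, not what I actually run: it assumes a Linux kernel that exposes sensors under `/sys/class/hwmon/hwmon*/temp*_input` in millidegrees Celsius (on some older kernels the files sit under a `device/` subdirectory instead, so the glob pattern may need adjusting). The 70°C threshold comes from the crashes described above; the 5 degree warning margin and all the function names are made up for illustration.

```python
import glob

CRASH_THRESHOLD_C = 70.0  # from the post: crashes occurred once the CPU peaked at 70
WARN_MARGIN_C = 5.0       # hypothetical safety margin, chosen for illustration


def parse_millidegrees(raw):
    """Convert a hwmon temp*_input value (millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def classify(temp_c, threshold=CRASH_THRESHOLD_C, margin=WARN_MARGIN_C):
    """Label a reading as 'ok', 'warning' or 'critical' relative to the threshold."""
    if temp_c >= threshold:
        return "critical"
    if temp_c >= threshold - margin:
        return "warning"
    return "ok"


def read_sensors(pattern="/sys/class/hwmon/hwmon*/temp*_input"):
    """Return {sensor_path: degrees_celsius} for every readable sensor file."""
    readings = {}
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                readings[path] = parse_millidegrees(f.read())
        except (OSError, ValueError):
            # Some sensors fail to read or report junk; skip them.
            continue
    return readings


if __name__ == "__main__":
    for path, temp in sorted(read_sensors().items()):
        print(f"{path}: {temp:.1f}C {classify(temp)}")
```

Run from cron and piped to mail, something like this would at least flag a CPU creeping toward the danger zone, though it’s no substitute for proper graphs showing the trend over time.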
I had initially discounted thermal problems, since the case has great cooling and I’ve never historically had cooling problems with the stock cooler on AMD CPUs, especially since the CPU is not being overclocked.
However, unlike many other systems I’ve built, this particular host is always heavily loaded – I’m running about 20-30 KVM virtual machines on it and there’s always a whole bunch of active processes, plus the CPU overhead of disk encryption.
And looking at the stock cooler, it’s not surprising that it’s been overheating – it’s basically a block of plain aluminium – there aren’t even any heat pipes, unlike the stock cooler that ships with the black edition model.
So I’ve replaced the heatsink with a nice new Zalman CNPS9700LED copper cooler. It’s a big beast at 790g and certainly wouldn’t fit in a lot of cases, but once installed, you can feel how the large fan blows air out over all the copper fins – the design gives really good airflow, so heat gets carried away quickly.
Here are the pretty graphs showing the difference that this cooler made to my server – please excuse some of the gaps, Munin has been having a bit of fun with virtualised workloads and timeouts…
It’s well worth the ~$100 NZD based on the thermal difference it made – my CPU has gone from 60-70°C down to 30-40°C and so far, the server is running solid without fault.
From the graph, the CPU (green) is running a good 20+ degrees cooler than it previously did, but in addition, the motherboard chipset (blue) is also running cooler – most likely caused by the CPU cooler fan pulling in air right over the heatsink on the motherboard, assisting it with cooling.
(I’m running the cooler at low speed with the stock Zalman thermal paste too – if you turned the fan speed up higher or used fancier thermal pastes, lower temperatures or more thermal conductivity might be possible.)
In terms of the hardware supplied by Zalman, it’s a pretty good package – along with the cooler itself, you get some decent thermal paste, an application brush, and connectors for various CPU models.
My only complaint is that with the AM3 socket, the cooler’s exhaust doesn’t line up with the rear case fan. However, this is a lesser problem, since the hot air radiates out all over the cooler and is quickly removed by the fan anyway.
In terms of whether it’s fixed my issue, it remains to be seen – the crashes were not always consistent and I won’t call it “fixed” until I get 4 months of solid run time without any occurrences, but I’m optimistic.