I confess to having had a bad case of writer’s block with this blog. Consequently, there have been no updates for the past seven months. I’m correcting that now.
The cause for the blog blackout has been a number of fan failures affecting a specific brand and model of server in the Helsinki Chamber (HC). As a scientist, I’ve tried to find conclusive proof of the cause of the faults, and having been unable to do so, I have been unwilling to publish partial results. My hope is that this blog post might attract the web searches of a system administrator who has experienced similar failures and managed to deduce their root cause.
In our case, the failing servers are 1U-sized HP DL360 G3 models. In the two years we’ve been running free air cooling experiments using direct, unconditioned outside air, these servers have been the only ones to exhibit systematic failures. Such failures are also known as common-mode failures (CMF) or common-cause failures (CCF). They are of particular interest to me, as this phenomenon is the very one I originally set out to study with the direct free air cooling experiments.
We have been using the same HP DL360 G3 models in our regular data center for years, and in no way have they been more or less prone to failures than other server brands or models. I am not claiming that these models are flawed, although they do fail regularly when cooled with outside air. Similarly, I cannot claim that direct free air cooling is an unfeasible technique: the numerous other models we have used have not exhibited CMFs. What can be said is that with reasonable probability, there exists a server fan type which is unsuitable for direct free air cooling, and a number of other server fan types which remain suitable.
The servers in question start to fail with the following warnings in their HP Integrated Management Log (IML):
Event: 20 Added: 06/14/2011 02:32 CRITICAL: Machine Environment - Fan Failure (Fan 1, Location CPU).
Event: 21 Added: 06/14/2011 02:32 CRITICAL: OS Class - Automatic Operating System Shutdown Initiated Due to Fan Failure.
Event: 22 Added: 06/13/2011 23:59 CAUTION: POST Messages - POST Error: 1611-CPU Zone Fan Assembly Failure Detected.
Event: 23 Added: 06/13/2011 23:59 CAUTION: POST Messages - POST Error: Fan Solution Not Sufficient.
Event: 24 Added: 06/14/2011 00:59
The errors are duplicated into syslog through the hpasmd daemon, if it is running:
Oct 23 09:00:15 lost25 hpasmd: CRITICAL: hpasmd: Fan Failure (Fan 1, Location CPU)
Oct 23 09:00:15 lost25 hpasmd: CRITICAL: hpasmd: Automatic Operating System Shutdown Initiated Due to Fan Failure
Oct 23 09:00:22 lost25 hpasmd: NOTICE: hpasmd: Fan Failure (Fan 1, Location CPU) has been repaired
Oct 23 09:00:22 lost25 hpasmd: NOTICE: hpasmd: Automatic Operating System Shutdown Initiated Due to Fan Failure has been repaired
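As a quick way to quantify how often a server flaps between failure and recovery, these hpasmd messages can be scraped from syslog. The sketch below assumes exactly the message formats shown in the excerpt and a known log year (classic syslog timestamps omit the year); the `fan_flaps` helper is my own invention, not part of any HP tooling.

```python
import re
from datetime import datetime

# hpasmd message formats as they appear in our syslog excerpts.
FAIL_RE = re.compile(
    r'^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ hpasmd: CRITICAL: hpasmd: Fan Failure')
REPAIR_RE = re.compile(
    r'^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ hpasmd: NOTICE: hpasmd: '
    r'Fan Failure.*has been repaired')

def fan_flaps(lines, year=2011):
    """Pair each fan-failure message with the following repair notice
    and return the flap durations in seconds."""
    flaps, pending = [], None
    for line in lines:
        m = FAIL_RE.match(line)
        if m:
            # Assume the log year, since syslog timestamps lack it.
            pending = datetime.strptime(f"{year} {m.group(1)}",
                                        "%Y %b %d %H:%M:%S")
            continue
        m = REPAIR_RE.match(line)
        if m and pending is not None:
            repaired = datetime.strptime(f"{year} {m.group(1)}",
                                         "%Y %b %d %H:%M:%S")
            flaps.append((repaired - pending).total_seconds())
            pending = None
    return flaps
```

Running this over the excerpt above pairs the 09:00:15 failure with the 09:00:22 repair, a seven-second flap; growing flap counts and durations would show the degradation described below.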
What happens is that the CPU fan block depicted in Fig.1 tells the system management board that a fan is failing. As there does not seem to be any redundancy in the block despite its four fan assemblies, even a single faulty fan is enough to make the errors surface. I have verified this experimentally by shifting fans one by one from a malfunctioning server to a correctly functioning one, until the latter starts to show the same symptoms as the former.
Initially, the errors are simply warnings and the malfunctioning server recovers, aborting the automatic operating system shutdown. Over time, these recoveries become rarer, and the server enters a delayed reboot loop: the OS shuts down, the server remains off for a varying number of minutes, and then the system board tries again. After a few minutes of operation, a new critical warning is logged and the OS is shut down once more.
Despite the logged errors, visual inspection reveals that the fan assemblies remain operational and keep rotating until the shutdown. Since we have been running the servers at fairly low temperatures, even a completely dead fan block would probably not have compromised normal operation.
In our case, a total of seven HP DL360 G3 fan blocks have failed with this type of problem. We initially installed five units in the HC, and after discovering these problems, I replaced two fan blocks with used fan blocks from spare servers. I also reconstructed three more “correct” fan blocks by marking the failed fans and shifting unmarked fans until a fan block no longer reported problems.
These replacements rule out individual events like power spikes which might have destroyed the fans. Likewise, web searches reveal no error reports supporting the idea that the problems would be caused by faulty firmware, perhaps solvable through an upgrade. As only the fan blocks placed in the HC have failed, the root cause does not seem to be a manufacturing or handling error either. Finally, the problems are not caused by the fan assemblies clogging with pollen or lint, which I have verified by breaking down the assemblies into their base components.
The end result is that all seven fan blocks ultimately caused delayed reboot loops, and we were forced to remove the HP DL360 G3:s from the HC. The identical models purchased at the same time in our regular data center have not caused problems, and neither has the 2U-sized HP DL380 G3 which is still installed in the HC.
The fans used in the CPU fan block are Nidec DR04XLG-12PUS1 40×40×48 mm units. A web search yields other users who have had problems with this fan type, but not enough to claim that the model is systematically faulty. Our own control group also disproves this idea, as the servers in our regular data centers have not failed.
What is peculiar about these fans is that they are double fan units, i.e., two fans connected in series. In addition, the second fan is reversed and rotates in the opposite direction. This design is central to our current best guess at what goes wrong.
Best guess at cause
The motor section of the Nidec fans is visible in Figure 2. As in a normal fan, the motor is located in the center of the fan unit and the blades rotate in front of it. My theory is that in a normal fan, the blades work somewhat like an umbrella, pushing the humidity in the air away from the motor shielded in the middle of the unit.
Since the second fan in these units is reversed, the motor internals are more exposed to any humidity in the air. This might cause transient faults, which the fan unit then reports to the fan block, and onwards to the system management board.
So far, we have found no solution to the problem. What I’m planning is an emulated experiment: removing one of the DL360 G3:s from the control group and trying to make the server fail indoors by raising the relative humidity of the air. If this succeeds, the cause of the problems is humidity combined with the reversed air flow.
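If the humidity experiment goes ahead, the analysis could be as simple as comparing the relative humidity recorded shortly before each failure against the overall average. The sketch below assumes a hypothetical log of (timestamp, RH%) sensor readings and a list of failure timestamps; the function and data format are my own assumptions, not an existing tool.

```python
from datetime import datetime, timedelta

def mean(xs):
    return sum(xs) / len(xs)

def humidity_before_failures(readings, failures, window=timedelta(hours=1)):
    """readings: list of (datetime, relative humidity in %);
    failures: list of failure datetimes.
    Returns (overall mean RH, mean RH within `window` before a failure)."""
    overall = mean([rh for _, rh in readings])
    # Keep only readings that fall in the window preceding some failure.
    near = [rh for t, rh in readings
            if any(f - window <= t <= f for f in failures)]
    return overall, (mean(near) if near else None)
```

If the mean humidity preceding failures turns out to be clearly higher than the overall mean, that would support the reversed-airflow hypothesis; if the two are indistinguishable, the theory needs rethinking.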
If there are any readers with hands-on experience with this type of error, we are interested in hearing from you. My user account is pervila at the Department of CS servers, so you can easily figure out my e-mail address.