A small-ish quick update today: on March 10th, we finished the first iteration of Cold Air Containment at the Exactum data center / server room. Due to popular demand, I wrote about our initial results on the Department’s web page here:
The results were quite interesting! Using just makeshift plastic sheeting and duct tape, we were able to reduce our CRAC requirement from five units to four, and tune the remaining four units to considerably higher supply temperatures.
Please check out the links above for more info.
You know what this thing needs? A Twitter interface, that’s what!
You can now follow eight of the servers inside the Helsinki Chamber via Twitter. Each of the eight machines runs its own little monitoring software that listens to our @helsinkichamber Twitter account and responds with information about itself. We think this is really cool, and urge you to start stalking us there.
A hot day in Helsinki? Just @mention one of our servers and see if it’s frying inside its chassis!
A freakishly cold night? Had to pull another blanket on just so you wouldn’t freeze in your bed? That’s nothing! Our Helsinki Chamber just spent the night standing outside, in the cold, running menial and repetitive tasks! Ask for a graph from all of the servers and see which one froze the most!
Sounds interesting? Let’s introduce our servers:
- lost20: A Dell PowerEdge running a 3.2 GHz Intel Xeon CPU. Its hobbies include tennis and prime numbers.
- lost24: An old HP ProLiant DL380 G3 that still mourns the loss of Michael Jackson. In its free time lost24 tries to moonwalk, but you know, servers ain’t got no legs…
- lost25: The oldest of the four ProLiant DL360 brothers in our rack, it was born 2 minutes before the others. Lost25 was always the responsible one of the four, and is considering a career in politics.
- lost26: The second oldest brother collects pins in its spare time and keeps them inside its spare 3.5″ drive bay.
- lost27: The third of the bunch is kind of a dandy. He prefers classic menswear and reeeeally hates it when the other servers come to work wearing cargo-pants and t-shirts with wolves on them.
- lost28: The youngest of the four likes pizza, is kinda reckless, wears an orange bandana and is deadly with nunchucks. Also thinks it’s a turtle.
- lost29: Still thinks jiggawatts is a real number.
- lost31: Is really just a combination of clever engineering, silicon, plastic, copper and aluminum, and has no persona, soul or consciousness. If we interpreted lost31’s log files correctly, it wishes to join the local community theater. Either that, or sshd is misconfigured.
All kidding aside: @mention @helsinkichamber, name the server (or servers) you want to query, and say what you want from them.
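For the curious, the mention-handling logic each server’s daemon needs can be sketched in a few lines. This is a hypothetical reconstruction, not our actual code: the server names are real, but the command keywords and parsing rules are assumptions, since the exact tweet syntax isn’t reproduced here.

```python
import re

# Hypothetical sketch of how a per-server daemon could parse a mention.
# The server names come from the roster above; the "graph"/"temps" keywords
# are assumed, not the documented command syntax.
KNOWN_SERVERS = {"lost20", "lost24", "lost25", "lost26",
                 "lost27", "lost28", "lost29", "lost31"}

def parse_mention(tweet: str, my_name: str):
    """Return the requested command if this tweet addresses server my_name."""
    words = re.findall(r"[a-z0-9]+", tweet.lower())
    named = KNOWN_SERVERS.intersection(words)
    if my_name not in named:
        return None        # the tweet is aimed at some other server
    if "graph" in words:
        return "graph"     # reply with a 24-hour graph
    return "temps"         # default: reply with the latest temperatures
```

Each machine runs the same loop and simply ignores tweets that don’t name it, so one shared account can front all eight servers.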
So for the latest temperatures from lost20 you would say:
Or for graphs of the latest 24 hours from lost20 you would say:
So, what do you think? Are they crash-y? Could be. Did one of the servers just overheat? Maybe, that would actually be kinda something… Would you like to know more about the servers? Hit us up with a comment.
Very recently we finished instrumenting the power consumption meters attached to all three supply lines providing power to our server room in the Exactum building. Some glitches from our testing phases remain visible, but the SVG graphs can be viewed here:
The surges and overflows are caused by erroneous inserts. Right now, the aggregate consumption is a pretty stable 60 kW — quite a lot less than what we initially feared.
Our server room is supplied with three separate, three-phase power lines. The voltage is 400 V in all cases, whereas the currents are 100 A, 100 A and 200 A.
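For a sense of scale, the nominal capacity of those lines follows from the standard three-phase formula P = √3 · V · I. A quick sketch (assuming balanced loads and unity power factor, so this is an upper bound, not a deliverable figure):

```python
import math

# Nominal capacity of a balanced three-phase line: P = sqrt(3) * V_LL * I.
# Assumes 400 V line-to-line voltage and unity power factor, so the result
# is an upper bound rather than the real usable load.
def three_phase_kw(voltage_ll: float, current_a: float) -> float:
    return math.sqrt(3) * voltage_ll * current_a / 1000.0

lines = [(400, 100), (400, 100), (400, 200)]
capacity = sum(three_phase_kw(v, i) for v, i in lines)
print(f"total nominal capacity: {capacity:.0f} kW")   # prints 277 kW
print(f"utilization at 60 kW:  {60 / capacity:.0%}")  # prints 22%
```

So the measured 60 kW aggregate sits at roughly a fifth of what the supply lines could nominally deliver.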
Posted in Meta
Blocking vents with duct tape
Even though my relationship with duct tape has been both long and amiable, it is pretty rare that I’ve had the pleasure of using duct tape for its proper purpose. The last time was on Feb. 11th, when we finally blocked the mounting holes surrounding the front panels of our rack servers.
After this, we now have functioning Cold Air Containment (CAC), meaning that the cool supply air does not get mixed with the hot exhaust anymore. The duct tape is visible in the picture above as the actual “silver lining” surrounding our servers.
We were a bit worried, since straight after completing the containment Helsinki encountered maybe the coldest week of the year. Check any archive on weather data for some rather amazing periods of minus-twentysomething freezes. Luckily, I was in Denmark for most of the time and saw the drops only as complaints on Facebook: none of the computers have failed at the time of writing.
I will post some temperature graphs as soon as I get some time to download the sensor data from our loggers. As a sneak preview, even with CAC the temperature delta between the front low and high sensors is still 10 °C. Our current best guess is that even without mixing the air flows, there is simply enough heat radiation in the front part of the HC to warrant the rather high delta.
Out since February 2010 -- still going strong
I wonder how hardy these stock issue CAT-5e/6 cables truly are? Ours has spent time outside since February 2010. The same cable has provided connectivity for the old tent and the new Helsinki Chamber. As you can see in the picture above, the cable is pretty much frozen solid. And yet:
--- 2001:708:140:410::10 ping statistics ---
200 packets transmitted, 200 received, 0% packet loss, time 198999ms
rtt min/avg/max/mdev = 0.360/0.411/0.495/0.033 ms
Some anecdotal evidence I remember seeing in a USENET post ca. 1998 claimed that a CAT-5 cable strung between two houses “usually” endured about 1-2 Finnish winters. Of course, the original poster may have had the installation somewhere farther north, where the conditions are even harsher.
This will be an interesting part of our experiments. The cable in the pic above passes just next to the two 16 A power cables (they are certified for outdoor use) as they enter the HC, so the power they draw does not appear to impede the signal quality much either.
A new construction site will soon start just next to the Exactum building where the CS Department resides. A bunch of shielding experts visited the Dept on January 12th, 2011 in order to mitigate the effects of vibrations caused by the site’s demolition work.
It is somewhat uncertain how much nearby explosions can harm mechanical hard drives or other components. Sharp shocks can cause the drive heads to touch the platters, but the low rumble caused by a demolition is less understood. (Please leave a comment if you have better info.)
However, for insurance reasons, the construction company must provide adequate shielding for nearby buildings and their [IT] equipment. For desktop workstations, the shielding comes in the form of small rubber feet inserted between the floor or table and the PC cases. For racks and custom setups, like our Helsinki Chamber (HC), one needs… Bigger feet.
A small amount of snow did not hinder the quick work of our visiting consultants. After some initial head-scratching, we shoveled the snow away from the front and inserted a handy pneumatic lift under the pallets. The men left four rubber squares, one under each corner of the HC, and this researcher can now sleep a bit better while the nearby site blasts away.
- Tilting the HC with a pneumatic lift. Click picture for the complete picture album.
Through an offhand remark in a recent Slashdot post, we finally became aware of Microsoft’s earlier experiment that also involved computer equipment in a tent:
It’s unfortunate, but not totally unexpected, that we missed this while researching our related work section. The “Power of Software” blog has been updated pretty infrequently lately (twice in 2009, once in 2010). Human tendency to assume also became a factor, as we simply couldn’t believe that somebody else would have used a tent in their experiment.
Sadly, there aren’t enough technical details to do a step-by-step comparison of our work with James and Belady’s. If their experiment was operated in the greater Seattle area, we probably ran our servers over a much wider range of temperatures and humidities. In any case, it seems that we’ve independently verified both Microsoft’s and Intel’s previous reports.
An open research problem concerns whether a given machine M is operating correctly at a given point in time t. There are a number of proprietary checking programs that can be run against a specific sub-component, e.g., hard disk, but these are mostly offline tests and unsuitable for our operating environment. It would have quickly become tedious to regularly reboot the servers with MS-DOS diskettes to run diagnostic programs.
Thus, we decided on a more pragmatic test. Our servers are defined to be operating correctly if they can successfully
- Compress the Linux kernel source code directory into a tarball file temp.tar.bz2
- Calculate the md5sum hash over the compressed tarball file temp.tar.bz2
- Verify that the md5sum hash does not differ from its initial value, as calculated indoors and assumed to be correct
This pattern is repeated every 20 minutes. It is designed to stress the disk, CPU, and memory subsystems. If the calculated md5sum deviates from the expected value, the file temp.tar.bz2 is saved for further analysis. All checksum files are also transferred via scp to an external host to mitigate the effects of correlated disk failures (broken arrays, etc.). This results in a small amount of additional load per server.
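One round of the three steps above can be sketched roughly as follows. This is a minimal illustration, not our production script: the paths and the golden hash are placeholders, and the scp transfer step is omitted.

```python
import hashlib
import tarfile

# One check round, assuming a source tree at src_dir and a known-good hash
# computed indoors beforehand. The real setup also copied checksum files to
# an external host with scp; that step is omitted here.
def run_check(src_dir: str, out_path: str) -> str:
    # 1. Compress the source directory into temp.tar.bz2.
    with tarfile.open(out_path, "w:bz2") as tar:
        tar.add(src_dir, arcname="src")
    # 2. Calculate the md5sum over the compressed tarball, 1 MiB at a time.
    digest = hashlib.md5()
    with open(out_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# 3. Verify against the golden value; keep the tarball on a mismatch.
# golden = "..."  # computed indoors, assumed correct
# if run_check("/usr/src/linux", "/tmp/temp.tar.bz2") != golden:
#     keep_for_analysis("/tmp/temp.tar.bz2")
```

Because the input tree never changes, two consecutive rounds on a healthy machine must yield identical digests; any difference flags a fault somewhere in the disk/CPU/memory path.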
Interestingly, we do find broken md5sum hashes. In phase 2 we detected 6 differing md5sum calculations out of a total of 119 516 check executions. Four of the mismatches were found in the control group and two in the test (tent) group. By examining the resulting tarballs with the bzip2recover program, we found that only singular blocks had become corrupted in the packing process.
It is still unknown why the md5sums sometimes fail. Our best hypothesis is that the errors are caused by faults in the memory modules, since none of the rack enclosures equipped with ECC DIMMs have shown problems. The six faulty sums were found on three identical COTS machines, making it possible that the batch of memory modules used in the original setups was partially faulty. By estimating the number of memory pages read and written in the 119 516 check executions, we project the failure ratio to lie in the ballpark of one per 2.5 billion.
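As a rough consistency check, the quoted ratio can be worked backwards. Neither the per-run page count nor the page size is stated above, so the 4 KiB page size below is an assumption; the point is only to see what page traffic the one-per-2.5-billion figure implies.

```python
# Back-of-envelope check on the quoted failure ratio, assuming 4 KiB pages.
# The per-run page count is not given in the post; this only derives what
# the "one per 2.5 billion" figure implies about it.
failures = 6
runs = 119_516
ratio = 2.5e9            # one failure per this many page operations

pages_per_run = failures * ratio / runs
print(f"implied page ops per run: {pages_per_run:,.0f}")
print(f"roughly {pages_per_run * 4096 / 2**20:.0f} MiB of pages touched per run")
```

The implied figure of roughly 125 000 page operations (about half a gigabyte of traffic) per 20-minute round seems plausible for compressing and hashing a kernel source tree.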
For phase 3, the current plan is to let the servers execute some well-known collaborative computing program, for example folding@home. By doing this, our equipment stress would perhaps resemble normal operating environments somewhat better. If you have ideas about suitable client programs, do not hesitate to leave comments on this post.
Very busy day today, so I’ve only got time for a small teaser update. As the blog still lags behind the actual research project, I’m publishing this sneak peek of what the situation looks like this week. The following Picasa web album contains pics of the operational phase 3 Helsinki Chamber:
During our Finnish Independence Day, we got some 43 cm of snow in Kumpula, our campus area. There was a little bit of shoveling to do the next day, but luckily I remembered to take pictures this time.
After the tent had been erected on the roof, we gradually installed and moved servers into it. The attached timeline shows how between Feb. 19th and Mar. 10th each server was moved into the tent and left to execute its synthetic workload. In a later post, I will describe the workload in more detail.
As mentioned, the tent had some air flow issues. The idea was to position the servers so that they would intake air from the back of the tent, from beneath the flap attached to the floor, and also below the raised floor of the terrace. Conversely, hot air would flow naturally upwards to the tent and then gradually exit both through heat radiation and through the front of the tent.
The first problem was the exhaust. As the tent schematic shows, the front of the tent was fairly diagonal and thus a natural barrier to the exhaust flow. We tried to alleviate the situation through a number of means:
- leaving the inner door open
- cutting open the inner tent structure
- cutting open the tarpaulin of the tent floor
- inserting a mechanical fan
- leaving the front door (far right in schematic) partially open
Cutting open the inner tent was a simple extension of leaving the inner door open. The idea behind both kludges was to get the exhaust heat out faster by giving it more space. Similarly, by cutting open the floor tarpaulin we tried to improve the intake air supply. As there was about 40 cm of empty space below the floor, there should have been ample supply air flow.
Finally, we inserted a mechanical fan into the tent for the warmer summer months. By doing this, we gave up the target PUE of 1.0, since we now sacrificed some of the total load for additional cooling. The final PUE for the tent was calculated as 1.0878, which is still a reasonable number, as this figure includes no hidden cooling costs like cooling liquid supplied from elsewhere.
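For reference, PUE is simply total facility power divided by IT equipment power, so a fan that draws power without doing computation pushes the figure above 1.0. A tiny sketch with illustrative wattages (the 2 kW load and ~176 W fan are made-up numbers that happen to reproduce the reported ratio, not measured values):

```python
# PUE = total facility power / IT equipment power. The fan counts as
# facility overhead, not IT load. These wattages are illustrative only.
def pue(it_watts: float, overhead_watts: float) -> float:
    return (it_watts + overhead_watts) / it_watts

print(f"{pue(2000, 175.6):.4f}")  # prints 1.0878
```

In other words, a PUE of 1.0878 means the cooling fan consumed a bit under 9% on top of what the servers themselves drew.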
Posted in Meta, Server Tent