Measuring temperatures: part 1

This is post #1 in a series on how to graph temperatures in a datacenter from scratch. This is an easy and suitable building project for almost all age groups, competence levels and even budgets. Parts 2 and 3 are published on 24 and 26 April, respectively.

Machine room temperatures are often measured using built-in thermal sensors in computers. It is relatively easy to use SNMP to get data from CPU thermal diodes, inlet and outlet air flow temperatures, and then integrate them into some monitoring system with alerts. But the big picture can’t be conveyed using sensor points inside the computers. Design and implementation of a machine room should really include ambient temperature measurements to determine air flow and cooling system supply air temperatures.

Manufacturers tend to have solutions that are prohibitively expensive for bulk use (“starting at $325”), big (the size of a cigarette pack or more) and unwieldy (requiring separate cabling and electricity supply per unit). Enter summer 1989 and Dallas Semiconductor’s (DS) 1-Wire network (also known as MicroLAN™ or µLAN). This is a low-current, low-voltage bus which requires, at minimum, two conductors for data and power, simplifying cabling, design, and implementation. By 1990, the 1-Wire network protocol had matured and DS introduced the first rugged, battery-like memory devices in stainless-steel packages, readable by contact with a reader connected to a 1-Wire network. Small, inexpensive 1-Wire temperature and humidity sensors in TO-92 packages (the size of a small transistor) have been available at least since the early 1990s.

1-Wire can use a multitude of topologies, but reliable networks are most easily implemented using a bus topology, as in old coaxial Ethernet networks. Star topology is not recommended unless 1-Wire switches are used. The term “1-Wire” is a bit misleading, because the network requires at least two conductors: one for data and operating power (for so-called parasitically powered devices) and another for signal/power ground. A common reference level (GND, ground) is required, just as with a regular RS-232 serial port, which requires at least TX, RX and GND. Most 1-Wire devices can also be externally powered. They then have at least one more conductor for a separate operating power supply (usually between 2.5 V and 5.5 V) and possibly, but not necessarily, a fourth conductor as an additional ground lead. Usually signal and power ground can be common.

In other words, 1-Wire networks use either two or three conductors. Parasitically powered networks (two conductors, data and ground) have stricter limits than externally powered networks on network size and the number of devices that can be attached. Externally powered networks can be up to 300 m long and contain tens of sensors. There is probably some upper limit, but DS writes in their application notes that the number of sensors is virtually unlimited, because every sensor has a unique 64-bit ID code.
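To make the 64-bit ID more concrete: on 1-Wire devices it consists of an 8-bit family code, a 48-bit serial number and an 8-bit CRC, with the CRC computed using the polynomial x^8 + x^5 + x^4 + 1. The following is a minimal Python sketch of checking such an ID. The function names and the example ID are illustrative assumptions, not part of any particular 1-Wire library.

```python
def crc8_1wire(data: bytes) -> int:
    """CRC-8 as used by 1-Wire devices (x^8 + x^5 + x^4 + 1, LSB first)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # 0x8C is the bit-reversed form of the polynomial.
            crc = (crc >> 1) ^ 0x8C if crc & 1 else crc >> 1
    return crc


def parse_rom(rom: bytes) -> dict:
    """Split an 8-byte ID (family code first, CRC last) and verify it."""
    if len(rom) != 8:
        raise ValueError("a 1-Wire ID is exactly 8 bytes long")
    return {
        "family": rom[0],
        # The serial number is transmitted least significant byte first.
        "serial": rom[1:7][::-1].hex(),
        "crc_ok": crc8_1wire(rom[:7]) == rom[7],
    }


# Illustrative ID only, not one of our sensors.
example = parse_rom(bytes.fromhex("021cb801000000a2"))
print(example)
```

A failed CRC check usually points to a read error on the bus rather than a broken sensor, so it is a cheap sanity check to run on every device discovered.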

Voltages and currents used in 1-Wire networks are very small. The devices draw less than 1000 nA when idle, and typically less than 1.5 mA when active. Voltage swing is from -0.8 V to +2.2 V (minimum for externally powered devices) or +3.0 V (minimum for parasitically powered devices). Suffice it to say, for long network runs it is advisable to use good-quality, low-capacitance (<50 pF/m), low-resistance twisted-pair cable and to fit the connectors in a professional manner. With a little practice and effort, it is easy to build a reliable network a hundred meters long with 15 or more externally powered 1-Wire devices.
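As a rough sanity check, the figures above can be turned into a worst-case budget for the example network of 100 m and 15 devices. This is only a back-of-the-envelope sketch in Python: it uses the upper bounds quoted in the text, and the function and constant names are made up for illustration.

```python
# Upper bounds taken from the text above.
CABLE_PF_PER_M = 50.0      # cable capacitance, pF per meter
IDLE_CURRENT_NA = 1000.0   # idle current per device, nA
ACTIVE_CURRENT_MA = 1.5    # typical active current per device, mA


def bus_budget(length_m: float, devices: int) -> dict:
    """Worst-case cable capacitance and supply current for one bus."""
    idle_ma_per_device = IDLE_CURRENT_NA / 1e6
    return {
        "cable_capacitance_nf": length_m * CABLE_PF_PER_M / 1000.0,
        "all_idle_ua": devices * IDLE_CURRENT_NA / 1000.0,
        # One device converting while the others sit idle.
        "one_active_ma": ACTIVE_CURRENT_MA + (devices - 1) * idle_ma_per_device,
    }


budget = bus_budget(length_m=100, devices=15)
print(budget)
```

Even with every figure at its upper bound, the bus stays in the nanofarad and low-milliamp range, which is why thin twisted-pair cable is enough.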

Posted in Meta

Experiences from constructing Cold Air Containment

A small-ish quick update today: on March 10th, we finished the first iteration of Cold Air Containment at the Exactum data center / server room. Due to popular demand, I wrote about our initial results on the Department’s web page here:

The results were quite interesting! By just using makeshift plastics and duct tape, we were able to reduce our CRAC requirements from five units to four units, plus tune the remaining four units to considerably higher supply temperatures.

Please check out the links above for more info.

Posted in Meta

Keep your finger on the temperature sensor with Twitter!

You know what this thing needs? A Twitter interface, that’s what!

You can now follow eight of the servers inside the Helsinki Chamber via Twitter. Each of the eight machines is running its own little piece of monitoring software that listens to our @helsinkichamber Twitter account and responds with information about itself. We think this is really cool, and urge you to start stalking us there.

A hot day in Helsinki? Just @mention one of our servers and see if they’re frying inside their chassis!

A freakishly cold night? Had to pull another blanket on just so you wouldn’t freeze in your bed? That’s nothing! Our Helsinki Chamber just spent the night standing outside, in the cold, running menial and repetitive tasks! Ask for a graph from all of the servers and see which one froze the most!

Sounds interesting? Let’s introduce our servers:

  • lost20: A Dell PowerEdge running a 3.2 GHz Intel Xeon CPU. Its hobbies include tennis and prime numbers.
  • lost24: An old HP ProLiant DL380 G3 that still mourns the loss of Michael Jackson. In its free time lost24 tries to moonwalk, but you know, servers ain’t got no legs…
  • lost25: The oldest of the four ProLiant DL360 brothers in our rack, it was born 2 minutes before the others. Lost25 was always the responsible one of the four, and is considering a career in politics.
  • lost26: The second oldest brother collects pins in its spare time and keeps them inside its spare 3.5″ drive bay.
  • lost27: The third of the bunch is kind of a dandy. He prefers classic menswear and reeeeally hates it when the other servers come to work wearing cargo pants and t-shirts with wolves on them.
  • lost28: The youngest of the four likes pizza, is kinda reckless, wears an orange bandana and is deadly with nunchucks. Also thinks it’s a turtle.
  • lost29: Still thinks jiggawatts is a real number.
  • lost31: Is really just a combination of clever engineering, silicon, plastic, copper and aluminum, and has no persona, soul or consciousness. If we interpreted lost31’s log files correctly, it wishes to join the local community theater. Either that, or sshd is misconfigured.

All kidding aside, @mention helsinkichamber, name the server (or servers) you want to query, and what you want from them.

So for the latest temperatures from lost20 you would say:

Or for graphs of the latest 24 hours from lost20 you would say:

So, what do you think? Are they crash-y? Could be. Did one of the servers just overheat? Maybe, that would actually be kinda something… Would you like to know more about the servers? Hit us up with a comment.

Have fun!

Posted in Helsinki Chamber

Power consumption measurements

Very recently we finished instrumenting the power consumption meters attached to all three supply lines providing power to our server room in the Exactum building. Some glitches from our testing phases remain visible, but the SVG graphs can be viewed here:

The surges and overflows are caused by erroneous inserts. Right now, the aggregate consumption is a pretty stable 60 kW — quite a lot less than what we initially feared.

Our server room is supplied with three separate, three-phase power lines. The voltage is 400 V in all cases, whereas the currents are 100 A, 100 A and 200 A.
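To put the 60 kW figure in perspective, the nominal capacity of the three feeds can be estimated with the usual three-phase formula S = √3 · V · I. The sketch below makes several assumptions that are not stated in the post: that 400 V is the line-to-line voltage, that the quoted currents are per-phase ratings, and that the power factor is close to 1 so that kW and kVA can be compared directly.

```python
import math

LINE_VOLTAGE_V = 400.0                  # assumed to be line-to-line
FEED_CURRENTS_A = (100.0, 100.0, 200.0) # assumed per-phase ratings
MEASURED_LOAD_KW = 60.0                 # aggregate consumption quoted above


def three_phase_kva(voltage_v: float, current_a: float) -> float:
    """Apparent power of a balanced three-phase feed, in kVA."""
    return math.sqrt(3) * voltage_v * current_a / 1000.0


capacities_kva = [three_phase_kva(LINE_VOLTAGE_V, i) for i in FEED_CURRENTS_A]
total_kva = sum(capacities_kva)
# Treats kW as kVA, i.e. assumes a power factor of about 1.
utilisation = MEASURED_LOAD_KW / total_kva

print([round(c, 1) for c in capacities_kva])
print(round(total_kva, 1), f"{utilisation:.0%}")
```

Under these assumptions the feeds add up to roughly 277 kVA, so the current load uses a bit over a fifth of the nominal capacity.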

SVG picture showing aggregate power consumption

Posted in Meta

Proper use of duct tape, part 1

Helsinki Chamber from the front

Blocking vents with duct tape

Even though my relationship with duct tape has been both long and amiable, it is pretty rare that I’ve had the pleasure of using duct tape for its proper purpose. The last time was on Feb. 11th, when we finally blocked the mounting holes surrounding the front panels of our rack servers.

After this, we now have functioning Cold Air Containment (CAC), meaning that the cool supply air does not get mixed with the hot exhaust anymore. The duct tape is visible in the picture above as the actual “silver lining” surrounding our servers.

We were a bit worried, since straight after completing the containment Helsinki encountered maybe the coldest week of the year. Check any archive on weather data for some rather amazing periods of minus-twentysomething freezes. Luckily, I was in Denmark for most of the time and saw the drops only as complaints on Facebook: none of the computers have failed at the time of writing.

I will post some temperature graphs as soon as I get some time to download the sensor data from our loggers. As a sneak preview, even with CAC the temperature delta between the front low and high sensors is still 10 °C. Our current best guess is that even without mixing of the air flows, there is simply enough heat radiation in the front part of the HC to explain the rather high delta.

Posted in Helsinki Chamber

The Fortitude of CATs

Frozen CAT-6 cable

Out since February 2010 -- still going strong

I wonder how hardy these stock issue CAT-5e/6 cables truly are? Ours has spent time outside since February 2010. The same cable has provided connectivity for the old tent and the new Helsinki Chamber. As you can see in the picture above, the cable is pretty much frozen solid. And yet:

--- 2001:708:140:410::10 ping statistics ---
200 packets transmitted, 200 received, 0% packet loss, time 198999ms
rtt min/avg/max/mdev = 0.360/0.411/0.495/0.033 ms

Some anecdotal evidence: a USENET post I remember seeing ca. 1998 claimed that a CAT-5 cable drawn between two houses “usually” endured about 1-2 Finnish winters. Of course, the original poster may have had the installation somewhere further north, where the conditions are even harsher.

This will be an interesting part of our experiments. The cable in the pic above passes right next to the two 16 A power cables (they are certified for outdoor use) as they enter the HC, so it seems that the power drawn is not impeding the signal quality much either.

Posted in Helsinki Chamber, Server Tent

Rubber feet vs explosions

A new construction site will soon start just next to the Exactum building where the CS Department resides. A bunch of shielding experts visited the Dept on January 12th, 2011 in order to mitigate the effects of vibrations caused by the site’s demolition work.

It is somewhat uncertain how much nearby explosions can harm mechanical hard drives or other components. Sharp shocks can cause the drive heads to touch the platters, but the low rumble caused by a demolition is less understood. (Please leave a comment if you have better info.)

However, for insurance reasons, the construction company must provide adequate shielding for nearby buildings and their [IT] equipment. For desktop workstations, the shielding comes in the form of small rubber feet that are inserted between the floor or table and the PC cases. For racks and custom setups, like our Helsinki Chamber (HC), one needs… bigger feet.

A small amount of snow did not hinder the quick work of our visiting consultants. After some initial head-scratching, we shoveled the snow away from the front and inserted a handy pneumatic lift under the pallets. The men left four rubber squares under each corner of the HC, and this researcher can now sleep a bit better while the nearby site blasts away.

Tilting the HC with a pneumatic lift.
Posted in Helsinki Chamber

The Other Tent

Through an offhand remark in a recent Slashdot post, we finally became aware of Microsoft’s earlier experiment that also involved computer equipment in a tent:

It’s unfortunate, but not totally unexpected, that we missed this while researching our related work section. The “Power of Software” blog has been updated pretty infrequently lately (twice in 2009, once in 2010). Human tendency to assume also became a factor, as we simply couldn’t believe that somebody else would have used a tent in their experiment.

Sadly, there aren’t enough technical details to do a step-by-step analysis of our work vs James’ and Belady’s. If the experiment was operated in the greater Seattle area, we probably ran our servers over a much wider range of temperatures and humidities. And in any case, it seems that we’ve independently verified both Microsoft’s and Intel’s previous reports.

Posted in Meta, Others

Synthetic workloads

An open research problem concerns whether a given machine M is operating correctly at a given point in time t. There are a number of proprietary checking programs that can be run against a specific sub-component, e.g., hard disk, but these are mostly offline tests and unsuitable for our operating environment. It would have quickly become tedious to regularly reboot the servers with MS-DOS diskettes to run diagnostic programs.

Thus, we decided on a more pragmatic test. Our servers are defined to be operating correctly if they can successfully

  1. Compress the Linux v. source code directory into a tarball file temp.tar.bz2
  2. Calculate the md5sum hash over the compressed tarball file temp.tar.bz2
  3. Verify that the md5sum hash does not differ from its initial value, as calculated indoors and assumed to be correct

This pattern is repeated every 20 minutes. It is designed to stress the disk, CPU, and memory subsystems. If the calculated md5sum deviates from the reference value, the file temp.tar.bz2 is saved for further analysis. All checksum files are also transferred via scp to an external host to mitigate the effects of correlated disk failures (yielding broken arrays etc). This results in a small amount of additional load per server.
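The three steps above can be sketched in Python as follows. This is only an illustration, not the scripts actually run on the servers (those are not shown in this post). The function names, the paths and the ".bad" suffix are all assumptions.

```python
import hashlib
import tarfile
from pathlib import Path


def pack_and_hash(source_dir, tarball_path) -> str:
    """Compress source_dir into a .tar.bz2 file and return its MD5 hex digest."""
    with tarfile.open(str(tarball_path), "w:bz2") as tar:
        tar.add(str(source_dir), arcname=Path(source_dir).name)
    digest = hashlib.md5()
    with open(tarball_path, "rb") as fh:
        # Read in 1 MiB chunks so large tarballs do not need to fit in RAM.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_check(source_dir, tarball_path, reference_md5: str) -> bool:
    """Run one check; keep the tarball for later analysis on a mismatch."""
    actual = pack_and_hash(source_dir, tarball_path)
    if actual == reference_md5:
        return True
    # Hypothetical naming scheme for the saved evidence.
    Path(tarball_path).rename(f"{tarball_path}.{actual}.bad")
    return False
```

The reference value would be produced once by calling pack_and_hash on a machine assumed to be healthy, after which run_check can be scheduled every 20 minutes, for example from cron. The comparison only works if the source tree and its file metadata stay untouched between runs, since tar headers include modification times.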

Interestingly, we do find broken md5sum hashes. In phase 2 we detected 6 differing md5sum calculations out of a total of 119 516 check executions. Four of the mismatches were found in the control group and two in the test (tent) group. By examining the resulting tarballs with the bzip2recover program, we found that only single blocks had become corrupted in the packing process.

It is still unknown why the md5sums sometimes fail. Our best hypothesis is that the errors are caused by memory errors, since none of the rack enclosures equipped with ECC DIMMs have shown problems. The six faulty sums were found on three identical COTS machines, making it possible that the batch of memory modules used in the original setups was partially faulty. By estimating the number of memory pages read and written in the 119 516 check executions, we project the failure ratio to lie in the ballpark of one per 2.5 billion.
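The numbers above can be related to each other with a little arithmetic. The sketch below works backwards from the published figures to see how many pages per check execution the projected ratio implies. The 4 KiB page size is an assumption, and the calculation is not necessarily how the original estimate was made.

```python
MISMATCHES = 6
CHECK_RUNS = 119_516
PAGES_PER_FAILURE = 2.5e9  # projected ratio quoted above
PAGE_SIZE_BYTES = 4096     # assumption: typical x86 page size

# How many check executions passed per observed mismatch.
runs_per_mismatch = CHECK_RUNS / MISMATCHES

# Pages touched in total, and per execution, implied by the projected ratio.
pages_total = PAGES_PER_FAILURE * MISMATCHES
pages_per_run = pages_total / CHECK_RUNS
mib_per_run = pages_per_run * PAGE_SIZE_BYTES / 2**20

print(f"one mismatch per {runs_per_mismatch:.0f} executions")
print(f"about {pages_per_run:.0f} pages ({mib_per_run:.0f} MiB) per execution")
```

With these inputs, the ratio corresponds to roughly 125 000 pages, or about 490 MiB of memory traffic, per check execution.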

For phase 3, the current plan is to let the servers execute some well-known collaborative computing program, for example folding@home. By doing this, our equipment stress would perhaps resemble normal operating environments somewhat better. If you have ideas about suitable client programs, do not hesitate to leave comments on this post.

Posted in Helsinki Chamber, Server Tent

Interlude: Snowdrifts of Independence Day

Very busy day today, so I’ve only got time to do a small teaser update. As the blog still lags behind the actual research project, I’m publishing this sneak peek of what the situation looks like this week. The following Picasa web album contains pics of the operational phase 3 Helsinki Chamber:

During our Finnish Independence Day, we got some 43 cm of snow in Kumpula, our campus area. There was a little bit of shoveling to do the next day, but luckily I remembered to take pictures this time.

Posted in Helsinki Chamber