An open research problem concerns whether a given machine M is operating correctly at a given point in time t. There are a number of proprietary checking programs that can be run against a specific sub-component, e.g., hard disk, but these are mostly offline tests and unsuitable for our operating environment. It would have quickly become tedious to regularly reboot the servers with MS-DOS diskettes to run diagnostic programs.
Thus, we decided on a more pragmatic test. Our servers are defined to be operating correctly if they can successfully:
- Compress the Linux v. 2.6.32.8 source code directory into a tarball file temp.tar.bz2
- Calculate the md5sum hash over the compressed tarball file temp.tar.bz2
- Verify that the md5sum hash does not differ from its initial value, as calculated indoors and assumed to be correct
This pattern is repeated every 20 minutes. It is designed to stress the disk, CPU, and memory subsystems. If the calculated md5sum deviates from the reference value, the file temp.tar.bz2 is saved for further analysis. All checksum files are also transferred via scp to an external host to mitigate the effects of correlated disk failures (yielding broken arrays, etc.). This results in a small amount of additional load per server.
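The check cycle described above can be sketched roughly as follows. This is only an illustration in Python, not the scripts actually running on the servers: the function names `md5_of_file` and `run_check`, the `mismatch-<digest>` naming scheme, and the use of Python's `tarfile` and `hashlib` modules in place of the tar, bzip2 and md5sum command-line tools are all assumptions made for the example.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_check(source_dir, work_dir, reference_md5):
    """Compress source_dir, hash the tarball, compare with the reference.

    Returns (ok, digest). On a mismatch the tarball is copied to a
    digest-specific name (hypothetical naming) for later analysis.
    """
    work_dir = Path(work_dir)
    tarball = work_dir / "temp.tar.bz2"

    # Step 1: pack the source tree into a bzip2-compressed tarball.
    with tarfile.open(str(tarball), "w:bz2") as archive:
        archive.add(str(source_dir), arcname=Path(source_dir).name)

    # Step 2: hash the compressed tarball.
    digest = md5_of_file(tarball)

    # Step 3: compare against the reference value computed beforehand.
    ok = digest == reference_md5
    if not ok:
        shutil.copy2(str(tarball), str(work_dir / f"mismatch-{digest}.tar.bz2"))
    return ok, digest
```

The sketch relies on the source tree staying untouched between runs, so that the tarball is byte-identical every time; for that reason `work_dir` should live outside `source_dir`. Scheduling the check every 20 minutes and copying the checksum files to the external host are left out here.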
Interestingly, we do find mismatching md5sum hashes. In phase 2 we detected 6 differing md5sum calculations out of a total of 119 516 check executions. Four of the mismatches were found in the control group and two in the test (tent) group. By examining the resulting tarballs with the bzip2recover program, we found that only single blocks had become corrupted in the packing process.
It is still unknown why the md5sums sometimes fail. Our best hypothesis is that the errors are caused by memory faults, since none of the rack enclosures equipped with ECC DIMMs has shown problems. The six faulty sums came from three identical COTS machines, making it possible that the batch of memory modules used in the original setups was partially faulty. By estimating the number of memory pages read and written in the 119 516 check executions, we project the failure ratio to lie in the ballpark of one per 2.5 billion.
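To put the numbers in proportion, here is a small back-of-the-envelope calculation. It assumes that the projected ratio is simply the number of mismatches divided by the estimated number of page reads and writes; the page totals below are back-calculated from the figures in this post and are not measured values.

```python
checks = 119_516             # check executions in phase 2
mismatches = 6               # differing md5sum calculations
page_ops_per_failure = 2.5e9 # projected ratio: one failure per 2.5 billion

# Roughly one mismatch per 19,919 check executions.
checks_per_mismatch = checks / mismatches

# Implied by the ratio above (assumption: ratio = mismatches / page operations).
implied_total_page_ops = mismatches * page_ops_per_failure          # 1.5e10
implied_page_ops_per_check = implied_total_page_ops / checks        # roughly 125,500

print(f"one mismatch per {checks_per_mismatch:,.0f} checks")
print(f"implied page operations in total: {implied_total_page_ops:.2e}")
print(f"implied page operations per check: {implied_page_ops_per_check:,.0f}")
```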
For phase 3, the current plan is to let the servers execute some well-known collaborative computing program, for example Folding@home. By doing this, the load on our equipment would perhaps resemble normal operating environments somewhat better. If you have ideas about suitable client programs, do not hesitate to leave comments on this post.