Clusters of message-passing computing nodes provide high-performance platforms for distributed applications. Cost-effective implementations of such systems are based on commercial off-the-shelf (COTS) hardware and software components. One trend in the deployment of such systems is to scale up the number of compute nodes to deliver higher performance levels. The higher component count results in a corresponding higher rate of failure. Another trend is to deploy clusters for mission-critical applications or in harsh environments, where reliability requirements are higher than in a controlled lab setting. Both of these trends point to an increasing need to employ fault tolerance techniques to meet the reliability requirements of the applications being executed.
We present a layered approach to providing fault tolerance for message-passing applications on compute clusters that are based on COTS hardware components, COTS operating systems, and a COTS API for application programmers. This approach relies on highly-resilient cluster management middleware (CMM) that ensures the survival of key system services despite the failure of cluster components. A key feature of this CMM is that it provides services that enable and simplify user-level implementation of fault tolerance for applications without dictating the specific techniques employed. In particular, while application-transparent techniques are supported, the CMM also supports application-specific techniques that are tailored and optimized for the characteristics and requirements of specific applications. To this end, we have developed an API that can be used in the implementation of fault tolerance by the application programmer as well as by developers of user-level libraries that provide application-transparent fault tolerance.
The effectiveness of our layered approach is demonstrated and evaluated with several applications employing different techniques for fault tolerance. The entire system is subjected to a fault injection campaign. We show that the CMM services that support fault tolerance techniques operate reliably and with very low overhead. We also show that application-specific fault tolerance techniques detect and recover from a vast majority of manifested faults while imposing much lower performance overhead than application-transparent schemes.