SFT: Scalable Fault Tolerance
As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed petaflops systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware and software has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component count will inevitably lead to frequent failures. Also the status of a parallel machine can be extremely complex, with thousands of cooperating threads, which can have multiple outstanding messages in transit.
It is therefore of paramount importance to develop new system software solutions to deal with the unavoidable reality of hardware and software faults. The Figure below shows the expected mean time between failures (MTBF) obtained from a simple performance model for systems with three different component reliability levels: MTBFs ranging from 10,000 to 1 million hours. As can be seen, when increasing the number of components in the systemthe MTBF of the entire system dramatically decreases. For projected peta-flosp system, the MTBF is only a few hours even when all are of very high reliability (MTBF of 1 million hours).
Our proposed solution is based on an automatic, frequent, user-transparent, and coordinated checkpoint and rollback-recovery mechanism. In essence, compute nodes are coordinated globally in order to checkpoint the computation state of the parallel program. In case of failure the non-functional portion of the machine is identified, a reallocation of resources is made, the compute nodes roll back to the most recently saved state, and the computation is continued.
Automatic and user-transparent checkpointing mechanisms are increasing in importance in large-scale parallel computing because the problems with user defined and managed checkpointing techniques only become worse with increasing system size.
We claim that highly scalable, highly efficient, transparent, and correct fault tolerance may be obtained using two mechanisms:
- Buffered CoScheduling (BCS) that enforces a sufficient degree of determinism on communication to make the design of efficient, scalable checkpointing algorithms tractable; and,
- Incremental checkpointing (TICK) that exploits the determinism imposed by Buffered CoScheduling. We implemented a prototype that performs incremental checkpointing in the Linux 2.6 kernel called TICK: Transparent Incremental Checkpointing at Kernel-level.