Virtualization-based Transparent System-level Fault Tolerance
Overview
To satisfy the ever-increasing demand for processing power, ever larger supercomputers are being built. But as the number of components in the system grows, the overall mean time between failures (MTBF) decreases, reducing the reliability and availability of the system. The failure of a single component may impact a large fraction of the system and cause any application running on it to fail. If the application state is not stored redundantly, any loss of state is catastrophic.
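The effect of component count on MTBF can be illustrated with a back-of-the-envelope model. The sketch below assumes identical components that fail independently at a constant rate, and a system that fails as soon as any one component fails. Under those assumptions the system MTBF is the component MTBF divided by the number of components. The function name and the sample figures are illustrative, not measurements from this project.

```python
def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """Approximate MTBF of a system that fails when any one component fails.

    Assumes identical, independent components with constant failure rates,
    so the failure rates add and the system MTBF is component MTBF / N.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


if __name__ == "__main__":
    # Illustrative numbers only: 100,000-hour components at growing scale.
    for n in (100, 1_000, 10_000):
        print(n, "components ->", system_mtbf_hours(100_000.0, n), "hours")
```

Even with highly reliable parts, the time between failures of the whole machine shrinks in proportion to its size in this model, which is why checkpointing becomes a necessity at scale.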
A common countermeasure to faults is application-specific, user-level checkpointing, performed at regular intervals. This approach is error-prone and relies completely on the programmer. Virtualization has been proposed as an alternative means of achieving fault tolerance: a Virtual Machine (VM) can be checkpointed easily, without the need for error-prone, manual instrumentation. With virtualization-based fault tolerance, in case of failure, the VM running on the failed node can be moved to or restarted on a spare node, and the parallel computation can continue from the checkpointed state.
Xen has been shown to be an efficient paravirtualization environment, capable of providing migratability at an acceptable performance overhead of a few percentage points. Checkpointing the VMs is not enough to provide fault tolerance to a parallel system, though. Since a number of messages could be in flight at the time of the checkpoint event, the network must also be brought to a quiescent state before checkpointing. The focus of this work is to fill this gap, providing checkpointing capabilities for parallel programs based on the partitioned global address space (PGAS) programming model and running on a cluster that features the latest generation of the InfiniBand network. To this end, we have integrated a Global Recovery Line (GRL) feature in our PGAS environment.
The experimental results show that it is possible to virtualize the communication and the computation with minimal overhead and to provide seamless migration capabilities.
Global Recovery Lines
The purpose of a Global Recovery Line (GRL) is to bring the entire HPC cluster to a quiescent, globally consistent state, which allows safe checkpointing or migration. Operations involved in the GRL are implemented in a scalable way.
A global recovery line requires the cooperation of four kinds of components: a Global Coordinator (GC), Master Processes (MP), non-master processes and Virtual Machine Managers (VMM).
A GRL provides the ability to reach a global silence state. During this state, VMs can be freely checkpointed or migrated across nodes. Hardware interventions can also take place, e.g. shutting down, servicing or rebooting a physical node.
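The sequence of drain, global silence, checkpoint and resume can be sketched as a toy, single-process simulation. This is not the project's implementation: the class and method names are hypothetical, the master-process and VMM roles are collapsed into a single Node object for brevity, and draining is simulated by simply clearing a counter of in-flight messages rather than waiting on a real network.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Process:
    """Hypothetical stand-in for one application process."""
    rank: int
    in_flight: int = 0    # outstanding messages (simulated)
    paused: bool = False

    def drain(self) -> None:
        # Stop issuing new communication, then "wait" for outstanding
        # messages to complete (simulated by zeroing the counter).
        self.paused = True
        self.in_flight = 0


@dataclass
class Node:
    """Hypothetical stand-in for a node: its VM and the processes in it."""
    name: str
    processes: List[Process]
    checkpoints: int = 0

    def is_silent(self) -> bool:
        return all(p.paused and p.in_flight == 0 for p in self.processes)

    def checkpoint_vm(self) -> None:
        # A VM may only be checkpointed once its node is silent.
        if not self.is_silent():
            raise RuntimeError(f"{self.name} is not quiescent")
        self.checkpoints += 1


class GlobalCoordinator:
    """Hypothetical coordinator driving one global recovery line."""

    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = list(nodes)

    def recovery_line(self) -> None:
        # 1. Drain: every process stops communicating.
        for node in self.nodes:
            for proc in node.processes:
                proc.drain()
        # 2. Global silence: verify that the whole cluster is quiescent.
        if not all(node.is_silent() for node in self.nodes):
            raise RuntimeError("global silence not reached")
        # 3. Checkpoint (or migrate) every VM while the network is silent.
        for node in self.nodes:
            node.checkpoint_vm()
        # 4. Resume normal execution.
        for node in self.nodes:
            for proc in node.processes:
                proc.paused = False
```

The point of the ordering is that no VM is checkpointed until every process in the cluster has stopped communicating, so no message can be lost between a checkpointed sender and a checkpointed receiver.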
Figure: Scalability of the terms composing the drain delay, i.e. the delay required to enter the global recovery line.
Figure: Scalability of the terms composing the resume delay, i.e. the delay required to continue normal execution after a global recovery line.
The GRL capabilities rely on a composite execution environment, in which a specialized InfiniBand driver is employed. This driver, developed by Novell in cooperation with PNNL for this project, is the only migratable driver for InfiniBand devices. In addition, we have enhanced the ARMCI (Aggregate Remote Memory Copy Interface) library included in our PGAS system in order to implement the GRLs as described above.
Performance
Though still preliminary, our work shows that it is indeed possible to virtualize all the resources of a processing node, including a high-performance communication interconnect like InfiniBand, with negligible overhead. The cost of a checkpoint or migration is minimal, and is mostly determined by the speed of the I/O devices.
Our experimental results show that most components of our automatic global recovery line detection take just a few milliseconds and are insensitive to the characteristics of the user applications. Overall, a node migration can be achieved in tens of milliseconds, a negligible delay if checkpoints are taken every few minutes.
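To see why a delay of tens of milliseconds is negligible at this checkpoint frequency, the relative overhead can be estimated as the pause time divided by the checkpoint interval. The figures used below (a 50 ms pause and a 5 minute interval) are illustrative assumptions chosen to match "tens of milliseconds" and "every few minutes"; they are not measured values, and the function name is hypothetical.

```python
def checkpoint_overhead_fraction(pause_seconds: float, interval_seconds: float) -> float:
    """Fraction of run time lost to pauses, given one pause per interval."""
    if pause_seconds < 0 or interval_seconds <= 0:
        raise ValueError("pause must be >= 0 and interval must be > 0")
    return pause_seconds / interval_seconds


if __name__ == "__main__":
    # Assumed values: 50 ms pause, one checkpoint every 5 minutes.
    fraction = checkpoint_overhead_fraction(0.05, 5 * 60)
    print(f"overhead: {fraction:.4%}")
```

With these assumed values the overhead is well below a tenth of a percent of the run time.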
Publications
More details on this work can be found in our paper "Transparent System-level Migration of PGAS Applications using Xen on InfiniBand" presented at the Cluster 2007 conference. A draft of the paper is available for download here.
