TICK: Transparent Incremental Checkpointing at Kernel-level
The primary potential problem of frequent, automatic, and user-transparent checkpointing and rollback recovery is the quantity of generated checkpoint data. Frequent checkpointing of the large memory footprints of scientific applications can quickly saturate available bandwidth and fill nonvolatile storage.
TICK is designed to be a modular building block for implementing checkpointing in large-scale Linux clusters. TICK is implemented primarily as a Linux 2.6.11 kernel module, and consists of about 4,000 lines of code, with an additional 400 lines in the static part of the kernel. TICK is neutral with respect to where checkpoint data is saved: its function is to correctly capture and restore process state. The actual management of the checkpoint data is handled by one or more separate agents, which in our prototype implementation are other Linux kernel modules. For lack of space this paper will not describe the algorithms for data movement that can be used to implement a global checkpointer for parallel applications. The checkpoint data may be saved locally if a process restart is all that is needed, for example after a machine crash, or to a file system when the instances of TICK on each CPU of a cluster are globally coordinated.
TICK is available as public domain software. The current version is a kernel patch for Linux 2.6.11.
TICK is described in more detail in the IEEE/ACM Supercomputing 2005 (SC05) paper.
While our primary goal is fault tolerance in large-scale parallel computers, we believe that TICK could be useful in other environments such as distributed or grid computing, and much more directly, for load balancing via process migration. The essential properties of TICK are that it is:
- Kernel level: TICK is implemented at kernel level to allow unrestricted access to processor registers, memory allocation data structures, file descriptors, pending signals, etc.
- Implemented as a kernel module: Writing, debugging and maintaining kernel code can be time consuming and non-portable. Most of TICK's code is in a kernel module that can be loaded and removed dynamically.
- General purpose: The TICK checkpoint/restart mechanism works with any type of user process, and processes may be restarted on any node with the same operating environment.
- Flexibly initiated: The checkpointing mechanisms of TICK can be triggered by a local event, such as a timer, or a remote event, such as a global strobe, in a very short and bounded time interval.
- User transparent: The user processes are not involved in the checkpointing or restarting and there is no need to modify existing applications or libraries. This implies that TICK can support existing legacy software, written in any language, without any changes.
- Efficient: TICK tries to minimize degradation of performance of a user process when checkpointing its state. TICK also implements fast process restarts.
- Incremental: TICK can perform frequent incremental checkpointing on demand.
- Easy to use: TICK provides a simple interface based on the /proc file system that can be used by a user or system administrator to dynamically checkpoint or restart a user process on demand.
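The /proc-based interface mentioned in the last item could be driven from a script along the following lines. This is only a sketch: the source does not give the name of the /proc entry or the format it expects, so the control-file path, the `checkpoint <pid>` / `restart <pid>` command strings, and the function name `request_checkpoint` are all hypothetical placeholders, not TICK's actual interface.

```python
from pathlib import Path


def request_checkpoint(control_file: Path, pid: int, action: str = "checkpoint") -> str:
    """Write a one-line command to a /proc-style control file.

    ASSUMPTION: the "<action> <pid>" format is invented for illustration;
    the real TICK entry and its syntax are not described in the text.
    """
    if action not in ("checkpoint", "restart"):
        raise ValueError(f"unknown action: {action!r}")
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    command = f"{action} {pid}\n"
    # A /proc entry is written like an ordinary file; the kernel module
    # behind it would parse the command and act on the target process.
    control_file.write_text(command)
    return command
```

With a real module loaded, `control_file` would point at the entry that module creates under /proc; here any writable file can stand in for it, which makes the helper easy to try out without kernel support.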
The usefulness of a tool such as TICK depends critically on its performance. We have chosen a set of scientific applications for our performance evaluation: BT, LU and SP, taken from the NAS suite of benchmarks, together with Sage and Sweep3D. In previous work, Sage was found to be the most demanding test for checkpointing algorithms because of its large memory footprint and lack of data locality.
The experimental platform is a dual-processor AMD Opteron cluster. Each processing node contains two AMD Opteron Model 246 processors, 3GB RAM, and a Seagate Cheetah 15K SCSI disk. The proposed checkpoint/restart mechanisms have been implemented in the Linux kernel version 2.6.11 with the page size configured to 4KB.
Figure: Runtime overhead of full checkpointing for Sage and Sweep3D at various checkpoint intervals, with checkpoints stored (a) in main memory and (b) on the local disk.
When checkpoints are stored in main memory once per minute, the worst-case runtime overhead is only 4%, observed with Sage-300MB. With disk checkpointing, the worst case is slightly higher, at 6%.
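To relate these intervals to the data-volume concern raised at the start, the sustained write rate of full checkpointing is simply the memory footprint divided by the checkpoint interval. The sketch below works this out for a 300 MB footprint saved once per minute. It assumes that the "300MB" in Sage-300MB is the checkpointed footprint and that 1 MB is 2^20 bytes; the function name is ours, and the result describes data volume only, not the measured runtime overhead.

```python
from typing import Tuple

PAGE_SIZE_BYTES = 4 * 1024  # 4 KB pages, as in the experimental setup


def full_checkpoint_cost(footprint_mb: float, interval_s: float) -> Tuple[float, int]:
    """Return (sustained MB/s, pages per checkpoint) for full checkpointing.

    ASSUMPTION: 1 MB = 2**20 bytes and the whole footprint is saved
    at every checkpoint.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    bandwidth_mb_s = footprint_mb / interval_s
    pages = int(footprint_mb * 1024 * 1024) // PAGE_SIZE_BYTES
    return bandwidth_mb_s, pages


bandwidth, pages = full_checkpoint_cost(300, 60)
print(f"{bandwidth:.1f} MB/s, {pages} pages")  # 5.0 MB/s, 76800 pages
```

Shortening the interval to a few seconds raises the required rate proportionally, which is why saving only the modified pages becomes attractive at high checkpoint frequencies.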
TICK provides high responsiveness: the checkpoint can be triggered by an external event such as a global heartbeat in as little as 2.5 microseconds. It provides several mechanisms to implement incremental checkpointing at fine granularity with little overhead. It is also very modular, and allows quick prototyping of distributed checkpointing algorithms.
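The idea behind incremental checkpointing can be illustrated in user space: after a first full checkpoint, only the pages that changed since the previous checkpoint are saved. The sketch below detects changed 4 KB pages by comparing per-page hashes. This is a stand-in technique chosen for illustration, not TICK's mechanism, which the text does not detail here; all function names are ours, and the memory image is assumed to keep the same size between checkpoints.

```python
import hashlib
from typing import Dict, List

PAGE_SIZE = 4096  # 4 KB pages


def page_digests(memory: bytes) -> List[bytes]:
    """Hash every page of a memory image."""
    return [
        hashlib.sha256(memory[i:i + PAGE_SIZE]).digest()
        for i in range(0, len(memory), PAGE_SIZE)
    ]


def incremental_checkpoint(memory: bytes, previous: List[bytes]) -> Dict[int, bytes]:
    """Return {page index: page contents} for pages whose hash changed."""
    changed = {}
    for index, digest in enumerate(page_digests(memory)):
        if index >= len(previous) or previous[index] != digest:
            start = index * PAGE_SIZE
            changed[index] = memory[start:start + PAGE_SIZE]
    return changed


def restore(base: bytes, delta: Dict[int, bytes]) -> bytes:
    """Apply saved pages on top of the previous checkpoint image."""
    image = bytearray(base)
    for index, page in delta.items():
        start = index * PAGE_SIZE
        image[start:start + len(page)] = page
    return bytes(image)


# Usage: modify one byte in page 1 of a four-page image.
base = bytes(4 * PAGE_SIZE)
current = bytearray(base)
current[PAGE_SIZE + 10] = 0xFF
delta = incremental_checkpoint(bytes(current), page_digests(base))
print(sorted(delta))  # [1]
```

Only one of the four pages is saved, and `restore(base, delta)` reproduces the modified image. Hashing every page still reads the whole footprint at each checkpoint, so it shows the saving in stored data rather than a low-overhead way of finding modified pages.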
The experimental results, obtained on a state-of-the-art cluster, show that TICK can be used as a building block for various checkpointing algorithms. We have demonstrated that with TICK it is possible to implement frequent incremental checkpointing, with intervals of just a few seconds, with a run-time increase that is less than 10% in most configurations.