Efficiently managing enormous and growing volumes of data remains a challenging problem. Although storage capacities have improved, the cost of moving data between processing nodes and storage devices has not improved at the same rate as disk capacity. In recent years, several parallel file systems have been developed to tackle data management in high-performance computing systems. Some of these parallel file systems, such as Lustre and PVFS, use mainstream server computers as I/O servers. The combined computing capacity of these hundreds, or even thousands, of storage nodes can be considerable. However, this capacity usually goes unexploited because the nodes act purely as I/O elements that only store data.
One approach to reducing the bandwidth requirements between storage and compute devices, and to leveraging the computing capacity of the storage nodes, is to move computation closer to the storage devices whenever possible. We call this approach Active Storage in the context of parallel file systems. By offloading some computing tasks to the storage nodes, close to the data that they manage, Active Storage makes it possible to substantially reduce data movement across the network and, hence, the overall network traffic.
Active Storage targets applications with I/O-intensive stages that operate on fundamentally independent data sets. It can process output files from scientific simulation runs either on-line or off-line. Tasks suitable for Active Storage include compression and archival of output files, statistical analysis of the output data with the results stored in an external database, indexing of the contents of output files, and simple data transformations such as unit conversion, where a set of numbers is multiplied by a scalar. Performing these operations on the storage nodes not only yields the resource-usage benefits described above, but also relieves scientific application programmers from implementing I/O tasks that are "oblivious" to the main application.
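As an illustration of the unit-conversion task mentioned above, the following is a minimal sketch of the kind of job a storage node could run against a file it holds locally. It is not the project's implementation: the function names (`scale_doubles_in_place`, `write_doubles`, `read_doubles`), the chunk size, and the assumption that the file is a flat sequence of native 8-byte doubles are all hypothetical choices made for this example.

```python
from array import array


def write_doubles(path, values):
    """Write `values` to `path` as a flat sequence of native doubles."""
    with open(path, "wb") as f:
        f.write(array("d", values).tobytes())


def read_doubles(path):
    """Read a flat file of native doubles back into a Python list."""
    values = array("d")
    with open(path, "rb") as f:
        values.frombytes(f.read())
    return values.tolist()


def scale_doubles_in_place(path, factor, chunk_elems=1 << 16):
    """Multiply every double stored in `path` by `factor`, chunk by chunk.

    Assumption for this sketch: the file holds nothing but native doubles.
    Only one chunk is held in memory at a time, and no data leaves the
    machine that runs this function.
    """
    item = array("d").itemsize
    with open(path, "r+b") as f:
        while True:
            pos = f.tell()
            raw = f.read(chunk_elems * item)
            usable = len(raw) - len(raw) % item  # skip a trailing partial value
            if usable == 0:
                break
            values = array("d")
            values.frombytes(raw[:usable])
            scaled = array("d", (v * factor for v in values))
            f.seek(pos)                 # rewind to the start of this chunk
            f.write(scaled.tobytes())   # overwrite it with the scaled values
            f.seek(pos + usable)        # continue with the next chunk
```

When such a routine runs on the node that stores the file, only the request and a completion status need to cross the network, rather than the data itself.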
Figure 1. Active Storage reduces the network traffic, and leverages the computing capacity of the storage nodes, improving the overall system performance.
Figure 2. Time to multiply 121 million doubles by a scalar. The graph compares the time taken by a compute node to complete the task with the time taken by Active Storage using 1, 2 or 4 OSTs to complete the same task (which is split into as many subtasks as OSTs). Without Active Storage, more than 2 GB of data are moved across the network. Active Storage reduces this amount to almost zero.
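The split into one subtask per OST described in the Figure 2 caption can be sketched as follows. This is a simplified model under stated assumptions, not the project's scheduling code: the helper names `split_into_subtasks` and `scale_with_subtasks` are hypothetical, the data is divided into contiguous, nearly equal ranges (a real striped file is laid out differently), and threads on a single machine stand in for the storage nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subtasks(total_elems, num_osts):
    """Return (start, count) pairs that cover range(total_elems).

    Assumption for this sketch: one contiguous range per OST, with sizes
    differing by at most one element.
    """
    if num_osts < 1:
        raise ValueError("num_osts must be at least 1")
    base, extra = divmod(total_elems, num_osts)
    ranges = []
    start = 0
    for i in range(num_osts):
        count = base + (1 if i < extra else 0)
        ranges.append((start, count))
        start += count
    return ranges


def scale_with_subtasks(values, factor, num_osts):
    """Scale `values` by `factor`, one independent subtask per simulated OST."""
    def run(subtask):
        start, count = subtask
        return [v * factor for v in values[start:start + count]]

    # Threads only model the independence of the subtasks; in Active
    # Storage each subtask would run on the node that holds the data.
    with ThreadPoolExecutor(max_workers=num_osts) as pool:
        parts = list(pool.map(run, split_into_subtasks(len(values), num_osts)))
    return [v for part in parts for v in part]
```

Because the subtasks touch disjoint ranges, they need no coordination beyond the initial split, which is what makes this kind of workload a good fit for the approach.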
- This effort is a part of the DoE SciDAC SDM project.
Current Focus Areas
- Active Storage for Lustre
- Active Storage for PVFS
- Active Storage with striped files
- Jarek Nieplocha
- Juan Piernas-Canovas
- Evan J. Felix
- Juan Piernas, Jarek Nieplocha. "Efficient Management of Complex Striped Files in Active Storage". Proc. Europar'08, 2008.
- Juan Piernas, Jarek Nieplocha, Evan J. Felix. "Evaluation of Active Storage Strategies for the Lustre Parallel File System". Proceedings of the Supercomputing'07 Conference, November 2007.
- Evan J. Felix, Kevin Fox, Kevin Regimbal, Jarek Nieplocha. "Active Storage Processing in a Parallel File System". 6th LCI International Conference on Linux Clusters: The HPC Revolution. Chapel Hill, North Carolina, April 26, 2005.