A practical approach to performance analysis and modeling of large-scale systems

Adolfy Hoisie and Darren J. Kerbyson

Performance and Architecture Lab (PAL), Pacific Northwest National Laboratory

Half-day Tutorial, ICS 2012

This tutorial presents a practical approach to the performance modeling of large-scale scientific applications on high-performance systems. Its defining characteristic is the description of a proven modeling approach, developed at PAL, for full-blown scientific codes ranging from a few thousand to over 100,000 lines, which has been validated on systems containing thousands of processors. The goal is to impart a detailed understanding of the factors that contribute to the performance of an application when it is mapped onto a given HPC platform. Performance modeling is the only technique that can quantitatively elucidate this understanding. We show how models are constructed and demonstrate how they are used to predict, explain, diagnose, and engineer application performance in existing or future codes and/or systems. Notably, our approach does not require the use of specific tools but is applicable across commonly used environments. Moreover, since our performance models are parametric in terms of machine and application characteristics, they give the user the ability to "experiment ahead" with different system configurations or with different algorithms and coding strategies. Both will be demonstrated in studies emphasizing the application of these modeling techniques, including verifying system performance, comparing large-scale systems, and examining possible future systems.

Description

This tutorial presents a practical approach to the modeling of application performance in a system-independent manner. This approach to analysis and modeling, developed over the last few years in the Performance and Architecture Laboratory (PAL), has been used and refined extensively. It has been highly successful in the modeling of many large-scale applications on a range of tera-scale systems. It can be applied to systems or applications that already exist, as well as to those that are under design or being proposed. This tutorial draws upon a wealth of publications by the PAL team, including papers at SC'03, SC'04, SC'05, SC'06, SC'08, SC'11, IPDPS, Concurrency and Computation, Parallel Processing Letters, Int. J. High Performance Computing Applications, IEEE Computer, IEEE Micro, J. Supercomputing, and Future Generation Computing Systems.

The overarching goal is to understand the expected performance of a particular algorithm or application when mapped onto a given HPC platform. Performance modeling is the only technique that can quantitatively elucidate this mapping. The tutorial will show how performance modeling can be used to provide insight in such key areas as:

accurately estimating the overall workload performance on a prospective new computer system;
distinguishing system 'glitches' from true application performance issues;
accurately identifying the performance bottlenecks in existing systems;
providing a tuning "roadmap" to application developers;
enabling "point-design" studies for computer architects designing new systems.

The tutorial will show how an analytically based modeling approach can be used to explore the performance of different architectural scenarios with reasonable accuracy and within reasonable time constraints. This approach does not require prior knowledge of specific tools, nor lengthy simulation or evaluation processes. Analytically based performance prediction has been shown to be a valuable tool in providing performance expectations for many of the compute-intensive applications representative of the DOE ASC, DOE Office of Science, DARPA, and NSF workloads.

The tutorial encompasses important definitions for analyzing performance, and also rigorous performance metrics for both serial and parallel considerations. The main content of the tutorial will be split between two aspects: modeling system characteristics, and modeling workload (application) characteristics.

System characteristics - These include the computational capability of a single processing node (the CPU, memory hierarchy, and node configuration) and inter-processor communication.

Workload characteristics - These include the resources that are used by the applications, their frequency of use, their potential for resource contention, scalability effects, etc.
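To illustrate how system and workload characteristics might be kept separate in a parametric model, the following is a minimal sketch in Python. It is not the PAL model: the names System, Workload, and predict_iteration_time, the simple "compute time plus latency/bandwidth communication time" form, and all numeric values are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class System:
    """Hypothetical machine parameters (placeholders, not measurements)."""
    time_per_cell: float  # seconds of computation per grid cell per iteration
    latency: float        # seconds of start-up cost per message
    bandwidth: float      # bytes per second on a communication link


@dataclass
class Workload:
    """Hypothetical application parameters."""
    total_cells: int        # global problem size
    bytes_per_cell: int     # bytes exchanged per boundary cell
    messages_per_iter: int  # messages each process sends per iteration


def predict_iteration_time(system, workload, processes, boundary_cells):
    """Toy model: iteration time = computation time + communication time.

    boundary_cells is the number of cells carried by each message.
    """
    cells_per_process = workload.total_cells / processes
    t_comp = cells_per_process * system.time_per_cell
    message_bytes = boundary_cells * workload.bytes_per_cell
    t_comm = workload.messages_per_iter * (
        system.latency + message_bytes / system.bandwidth
    )
    return t_comp + t_comm


# Example with made-up values: 1M cells on 100 processes.
system = System(time_per_cell=1e-7, latency=5e-6, bandwidth=1e9)
workload = Workload(total_cells=1_000_000, bytes_per_cell=8, messages_per_iter=4)
t = predict_iteration_time(system, workload, processes=100, boundary_cells=100)
print(f"predicted time per iteration: {t:.6e} s")
```

Because the machine and the application enter only through their own parameter sets, either one can be swapped out (for example, a system with lower latency) without touching the other.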

The approach will be exemplified in the tutorial using real-world applications. We will not emphasize any particular machine but will instead use widely used parallel systems as examples, such as Roadrunner, Blue Gene/L, ORNL's Jaguar, and cluster systems. In particular, two detailed case studies will be given, based on real experience with large-scale applications. The applications cover a wide spectrum of performance characteristics: a structured grid application (Sweep3D) and an adaptive mesh application (SAGE). The formation of models of these codes will be detailed, along with techniques that can be applied to identify and understand relevant performance issues. Importantly, the value of the performance modeling approach will be illustrated through the use of the models in the following ways:

Prediction of new system configurations (using Roadrunner as an example)
Prediction of new application configurations (examining different data decompositions)
Optimal application parameter specification
Verification of achieved system performance (the method and analysis were awarded best paper at SC'03)
Use of accelerators in large-scale systems
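As one illustration of "experimenting ahead" with data decompositions, the sketch below compares a 1D strip decomposition with a 2D block decomposition of a square grid. It is a simplified, hypothetical model, not the Sweep3D or SAGE model from the tutorial: the function names, the halo-exchange cost formula, and the default machine parameters are all assumptions.

```python
import math


def halo_cells_1d(n, p):
    """Cells exchanged per interior process when an n x n grid is cut into p strips."""
    return 2 * n if p > 1 else 0


def halo_cells_2d(n, p):
    """Cells exchanged per interior process for sqrt(p) x sqrt(p) square blocks.

    Assumes p is a perfect square and that n divides evenly by sqrt(p).
    """
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square")
    return 4 * (n // side) if p > 1 else 0


def iteration_time(n, p, halo_cells, messages,
                   t_cell=1e-7, latency=5e-6, bandwidth=1e9, bytes_per_cell=8):
    """Toy model: compute time plus latency and bandwidth terms (assumed values)."""
    t_comp = (n * n / p) * t_cell
    t_comm = messages * latency + halo_cells * bytes_per_cell / bandwidth
    return t_comp + t_comm


# Compare the two decompositions for a hypothetical 16384 x 16384 grid on 64 processes.
n, p = 16384, 64
t_1d = iteration_time(n, p, halo_cells_1d(n, p), messages=2)
t_2d = iteration_time(n, p, halo_cells_2d(n, p), messages=4)
print(f"1D strips: {t_1d:.6e} s, 2D blocks: {t_2d:.6e} s")
```

Under these assumed parameters the 2D decomposition moves less data but sends more messages, so which one is faster depends on the grid size and on the machine's latency and bandwidth. Exposing that kind of trade-off before any code is changed is the point of a parametric model.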

We will show that performance modeling is a key element in building performance-engineered applications and architectures. Models add insight into the performance of current systems, revealing bottlenecks and showing where tuning efforts would be most effective. They also allow performance on future systems to be predicted and explored. The latter is important for application design, for system architecture design, and for the procurement of supercomputers.

Target Audience

This tutorial is intended for a mixture of computational scientists, computer scientists, and code developers interested in understanding the observed performance of "real-life" applications on high-performance systems. Terms and metrics will be carefully defined, so the diversity of the audience should present no barrier, and the tutorial will provide an in-depth understanding of the issues that is relevant to all backgrounds. The tutorial will also be of interest to those trying to define needs for future-generation, high-end computing systems from either a procurer's or a designer's point of view.

Content Level: 30% Beginner, 50% Intermediate, 20% Advanced.

Brief Biographies

Adolfy Hoisie is a Laboratory Fellow, the group leader of the HPC Group, and Director of the Center for Advanced Architectures at PNNL. He spent 14 years at Los Alamos, where he directed the Center for Advanced Architectures and Usable Supercomputing as well as the Advanced Computing Laboratory. From 1987 to 1997, he was a researcher at Cornell University. His area of research is performance evaluation of high-performance architectures. He has published extensively, lectured at numerous conferences and workshops, often as an invited speaker, taught tutorials in this field at important events worldwide, and organized numerous workshops. He was a winner of the Gordon Bell Award in 1996 and is a co-author of the SIAM monograph on performance optimization and of the edited volume on Engineering the Grid.

Darren Kerbyson is a Laboratory Fellow in High Performance Computing at PNNL. He spent 10 years at Los Alamos, where he led the Performance and Architecture Lab (PAL). He received his BSc in Computer Systems Engineering in 1988 and his PhD in Computer Science in 1993, both from the University of Warwick (UK). Between 1993 and 2001 he was a Senior Faculty member in Computer Science at Warwick. His research interests include performance evaluation, performance and power modeling, and optimization of applications on high-performance systems, as well as image analysis. He has published over 140 papers in these areas over the last 20 years. He is a member of the IEEE Computer Society.