Ian Hinder (Albert Einstein Institute), Erik Schnetter (Perimeter Institute for Theoretical Physics), Gabrielle Allen, Steven Brandt, Frank Loeffler (Louisiana State University)
Code optimization is typically in the domain of the compiler. However, compiler optimizations often fall short. One reason is that compiler input (e.g. C or Fortran) is a very low-level representation of the problem. Another reason is that compiler optimizers are black boxes that a user cannot reasonably understand or control, and also cannot modify or extend.
Despite this difficulty, understanding and controlling code optimization is crucial on modern CPU architectures, where performance is determined by many, often non-orthogonal, features. Without vectorization, for example, the attainable peak performance of a code is reduced by a factor of at least two.
Unfortunately, the development of simulation software which is highly optimized and efficient when running on today's HPC systems falls outside the general expertise of the users who wish to write and run such software.
We have implemented an automated code generation system which constructs components for the Cactus framework (http://cactuscode.org) for solving time-dependent partial differential equations. This system provides a problem-domain API for users to specify the equations to solve and the numerical methods to use, and automatically constructs the low-level optimized code. In this way, we reduce user-visible code complexity and provide control over code optimization.
We implement our code generation framework, Kranc (http://kranccode.org), in Mathematica (MMa). MMa is a well-known and readily available software package that supports mathematical operations, allows equations to be written in a convenient notation (including support for abstract index notation for vectors and tensors), and supports pattern matching which is a natural way to transform code and apply optimizations. Starting from a high-level description of the system of equations and its variables, Kranc generates not only loop kernels or subroutines, but complete Cactus components including their interfaces. Since there is a one-to-one correspondence between the source script and the generated module, the source script provides a complete representation and the user never needs to look at the generated code.
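Kranc itself is written in Mathematica; the following stdlib-Python sketch (all names hypothetical) only illustrates the core idea the abstract describes: a high-level specification of an evolution equation is mechanically rewritten, by pattern substitution, into a low-level loop kernel.

```python
# Hypothetical specification: du/dt = c * D2(u), a 1-D heat equation,
# where D2 is an abstract second-derivative operator.
equation = {"lhs": "dudt", "rhs": "c * D2(u)"}

def generate_kernel(eq):
    """Expand the abstract operator D2 into a finite-difference stencil
    and wrap the result in a C-style loop over interior points."""
    rhs = eq["rhs"].replace(
        "D2(u)", "(u[i-1] - 2.0*u[i] + u[i+1]) / (dx*dx)")
    return (
        "for (int i = 1; i < n - 1; i++) {\n"
        f"  {eq['lhs']}[i] = {rhs};\n"
        "}"
    )

kernel = generate_kernel(equation)
print(kernel)
```

Because the generated kernel is derived entirely from the specification, the user edits only the specification, matching the one-to-one correspondence between script and module described above.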
Though the system is completely general, our primary use case is the Einstein Toolkit, where codes generated using our system calculate the general relativistic geometry of spacetimes in various astrophysical scenarios. Our loop kernels are so large that compilers refuse to vectorize them automatically. For codes with similarly difficult loops, Kranc offers a chance to double performance.
Complementary to our code generation system, we have designed and implemented an architecture-independent run-time API that is targeted by our vectorizer, and which we have used for manual vectorization as well. Apart from vectorized operations and data types, this API also addresses issues of alignment and efficient cache access. All production-ready codes generated using the Kranc/Cactus framework can now be regenerated to take advantage of this optimization and see real-world performance gains.
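The run-time API above is a C-level interface; this stdlib-Python sketch (all names hypothetical) only illustrates the design idea: kernels are written once against abstract vector operations, and an interchangeable backend supplies the architecture-specific vector width and semantics.

```python
class ScalarBackend:
    """Fallback backend: 'vectors' of width 1."""
    width = 1
    def vec_add(self, a, b):
        return [a[0] + b[0]]

class Simd4Backend:
    """Stand-in for a 4-wide SIMD unit (e.g. 4 doubles per AVX register)."""
    width = 4
    def vec_add(self, a, b):
        return [x + y for x, y in zip(a, b)]

def add_arrays(backend, a, b):
    """Kernel written once against the abstract API; the backend decides
    how wide each step is and how an 'add' is carried out."""
    out = []
    w = backend.width
    for i in range(0, len(a), w):
        out.extend(backend.vec_add(a[i:i+w], b[i:i+w]))
    return out
```

A real implementation of this pattern would also handle alignment and remainder loops, which the abstract notes are part of the actual API.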
In addition, our framework includes a preprocessor based on a parsing expression grammar. This tool aids in debugging the MMa input (important for non-experts in MMa).
Our full presentation will describe how authors of simulation software can use Kranc and focus on the high-level equations and numerical methods, instead of the complexities of modern HPC programming.
G. Allen acknowledges that this material is based upon work supported while serving at the National Science Foundation.
Oleg Batrashev, Eero Vainikko (University of Tartu)
There is a lot of effort to make programming for HPC more productive, and we aim to make our contribution. After gaining some experience in programming preconditioned iterative solvers in Fortran and MPI, we propose a new approach based on a mix of ideas from vector parallel languages and parallelizing compilers such as HPF. We follow two rules: first, vectorize the code as much as possible; then, apply static analysis techniques to the new representation to obtain parallelized code automatically. This paper describes our motivation and walks through the major steps of our approach without going into much detail. Our wish is to get feedback from the community at this early stage of the research.
Bryan Marker, Don Batory, Jack Poulson, Robert van de Geijn (The University of Texas at Austin)
Ideally, a domain-specific language (DSL) allows one to code at the same level of abstraction as one reasons about a domain problem. For dense linear algebra (DLA), we demonstrate how an appropriately chosen DSL can automate that reasoning. Elemental is a modern DLA C++ API for distributed-memory architectures that is a successor of ScaLAPACK. It embodies a DSL for the DLA domain and is representative of the FLAME API, which can be viewed as a more general DSL for DLA. We use Elemental to demonstrate the power of a well-structured DSL.
When an expert approaches a DLA algorithm to implement in code, she (implicitly or explicitly) chooses an initial sequential algorithm then, step-by-step, parallelizes and optimizes the algorithm and corresponding code to reach a final, optimized version. This process is very systematic and is repeated for most algorithms in the domain. In fact the process is now sufficiently well-understood that it can be (and has been) automated. With Elemental, we demonstrate automated reasoning by embracing ideas from model-driven engineering (MDE). With MDE we encode knowledge about the operations and algorithms in the DSL and the target architecture. With that knowledge we automate the transformation from algorithm to highly-optimized code.
We report on a prototype that takes a high-level DLA algorithm, applies transformations to it, and outputs optimized code for that algorithm in the Elemental DSL. We show that the modularity and abstraction afforded by Elemental enabled us to encode knowledge about the language's constructs (e.g. computation operations). We also encode knowledge about the target architecture and the common parallelization methods for it. Together, knowledge about the architecture and the DSL operations enables our system to generate many (sometimes thousands of) implementations for an input algorithm. The system then uses cost predictions of the operations to choose the most efficient implementations. Early results for a handful of case studies are output codes that are the same as those hand-produced by an expert. Sometimes, the resulting code is better because the human expert is limited by time, effort, and complexity.
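The search the prototype performs can be sketched as follows (with made-up operation names and costs, purely for illustration): transformation rules expand each high-level operation into several candidate implementations, and a cost model selects the cheapest combination.

```python
from itertools import product

# Hypothetical rules: each DSL operation maps to candidate variants
# with predicted costs (arbitrary units).
rules = {
    "GEMM": [
        ("gemm_broadcast_A", 10.0),
        ("gemm_broadcast_B", 12.0),
        ("gemm_allgather", 7.5),
    ],
    "TRSM": [
        ("trsm_blocked", 4.0),
        ("trsm_unblocked", 9.0),
    ],
}

def best_implementation(algorithm):
    """Enumerate every combination of per-operation variants (in practice
    this space can number in the thousands) and return the cheapest."""
    candidates = product(*(rules[op] for op in algorithm))
    return min(candidates, key=lambda impl: sum(cost for _, cost in impl))

plan = best_implementation(["GEMM", "TRSM"])
```

A real system prunes this exhaustive enumeration and uses architecture-aware cost models, but the generate-then-select structure is the same.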
We present the DSL used for this project, the prototype system we developed, and how Elemental's layering enabled our success. We also explain that our work is an example of an approach that extends to the more general FLAME DSL for DLA. Our work also illustrates next-generation libraries: they should not be developed as instantiations in code. Rather, they should exist as an encoding of algorithms, knowledge about algorithms, and knowledge about target architectures. Optimizing transformations are then applied at a higher-level of abstraction than compilers, so more information about the algorithm is used. Instantiations are a final step that produces executable code. Other examples of this approach can be found in the Spiral and FENICS projects.
Jeff Daily (Pacific Northwest National Laboratory) and Robert R. Lewis (School of EECS, Washington State University)
The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial.
Global Arrays (GA) is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. Using a combination of GA and NumPy, we have reimplemented NumPy as a distributed drop-in replacement called Global Arrays in NumPy (GAiN). This allows certain NumPy applications to leverage GA almost transparently. This also allows new applications to be developed using a combination of GA and GAiN if the NumPy API proves insufficient in some cases.
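GAiN's distribution machinery is internal to GA; this stdlib-only sketch (with hypothetical helper names) shows only the underlying idea of a block-distributed array: each process owns one contiguous block, and an elementwise operation follows the owner-computes rule, touching only local data with no communication.

```python
def distribute(data, nprocs):
    """Split a 1-D array into nprocs contiguous blocks (last may be short)."""
    size = -(-len(data) // nprocs)  # ceiling division
    return [data[i*size:(i+1)*size] for i in range(nprocs)]

def add_distributed(a_blocks, b_blocks):
    """Owner-computes: each 'process' adds its own block independently."""
    return [[x + y for x, y in zip(ab, bb)]
            for ab, bb in zip(a_blocks, b_blocks)]

def gather(blocks):
    """Reassemble the global array (only needed for output/inspection)."""
    return [x for blk in blocks for x in blk]
```

In GAiN the same pattern is hidden behind the NumPy API, which is what makes the replacement nearly transparent to serial scripts.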
GAiN's past, present, and future will be presented. We will cover its design, limitations, interesting hacks and path forward, as well as the challenges of transparent parallelism. Scalability studies will also be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.
Jeff Hammond, Eugene Deprince (Argonne National Laboratory)
The Tensor Contraction Engine (TCE) is an enormously successful project in creating a domain-specific language for quantum many-body theory with an associated code generator for the massively-parallel computational chemistry package NWChem. This collection of tools has enabled hundreds of novel scientific simulations running efficiently on many of the largest supercomputers in the world. This talk will first recount five years of experience developing simulation capability with the TCE (specifically, response properties) and performance analysis of its execution on leadership-class supercomputers, summarizing its many successes with constructive criticism of its few shortcomings. Second, we will describe our recent investigation of quantum many-body methods on heterogeneous compute nodes, specifically GPUs attached to multicore CPUs, and how to evolve the TCE for the next generation of multi-petaflop supercomputers, all of which feature multicore CPUs and many of which will be heterogeneous. We will describe new domain-specific libraries and high-level data structures that can couple to automatic code generation techniques for improved productivity and performance, as well as our efforts to implement them.
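For readers unfamiliar with the domain, a tensor contraction is a generalized matrix product; the TCE generates loop nests of the following kind from symbolic tensor expressions. This hand-written stdlib sketch shows the simplest case, C[i][j] = sum_k A[i][k] * B[k][j].

```python
def contract(A, B):
    """Contract two matrices over their shared index k:
    C[i][j] = sum_k A[i][k] * B[k][j]."""
    ni, nk, nj = len(A), len(B), len(B[0])
    C = [[0.0] * nj for _ in range(ni)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Real coupled-cluster contractions involve tensors of four or more indices and permutational symmetries, which is precisely why generating such code by hand is impractical and a generator like the TCE is needed.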
Beverly Sanders, Erik Deumens (University of Florida), Victor Lotrich (ACES QC) and Nakul Jindal (University of Florida)
Important classes of problems in computational chemistry, notably coupled cluster methods, consist of solutions to complicated expressions defined in terms of tensors. Tensors are represented by multidimensional arrays that are typically extremely large, thus requiring distribution or in some cases backing on disk. We describe a parallel programming environment, the Super Instruction Architecture (SIA) comprising a domain specific programming language SIAL and its runtime system SIP that are specialized for this class of problems. A novel feature of the programming language is that SIAL programmers express algorithms in terms of operations on blocks rather than individual floating point numbers. Efficient implementations of the block operations as well as management of memory, communication, and I/O are provided by the runtime system. The system has been successfully used to develop ACES III, a software package for computational chemistry.
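The block-oriented style SIAL enforces can be sketched in stdlib Python (helper names here are hypothetical, not actual SIAL/SIP primitives): the programmer's loops range over blocks, never over individual floating-point numbers, and the runtime supplies the block kernels.

```python
def block_multiply_add(Cb, Ab, Bb):
    """C-block += A-block * B-block; in SIP terms, one 'super instruction'
    that the runtime can implement efficiently and schedule freely."""
    for i in range(len(Ab)):
        for j in range(len(Bb[0])):
            for k in range(len(Bb)):
                Cb[i][j] += Ab[i][k] * Bb[k][j]

def blocked_matmul(A_blocks, B_blocks, n, bs):
    """The user-level algorithm: loops over block indices only."""
    nb = n // bs
    C_blocks = {(I, J): [[0.0] * bs for _ in range(bs)]
                for I in range(nb) for J in range(nb)}
    for I in range(nb):
        for J in range(nb):
            for K in range(nb):
                block_multiply_add(C_blocks[I, J],
                                   A_blocks[I, K], B_blocks[K, J])
    return C_blocks
```

Because blocks are the unit of work, the runtime is free to place them in memory, on disk, or on remote nodes without changing the user's program.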
Lukasz G. Szafaryn (University of Virginia), Todd Gamblin, Bronis R. De Supinski (Lawrence Livermore National Laboratory), Kevin Skadron (University of Virginia)
The increasing computational needs of parallel applications inevitably require portability across popular parallel architectures, which are becoming heterogeneous. The lack of a common parallel framework results in divergent code bases, difficulty in porting, higher code maintenance costs and, thus, difficulty in achieving optimal performance on target architectures.
The paper examines two representative parallel applications and describes code structuring and annotations required to arrive at a single code base that is parallelizable across representative heterogeneous architectures, such as many-core CPU and GPU. Drawing on previous work in the area, we create a common set of directives and apply them to the codes to illustrate the concept of a unified framework. The execution on the two architectures is implemented by translating these directives to OpenMP and the PGI accelerator API.
Our work demonstrates that we can use a common high-level framework to annotate a common code base sufficiently to execute on heterogeneous architectures efficiently. Our results show that the correct use of our framework with state-of-the-art parallelizing compilers yields comparable performance to that of a custom code or a native language. Moreover, we illustrate that the approach results in increased programmability, reduced code size and decreased maintenance cost.
Justin Holewinski, Thomas Henretty, Kevin Stock, Louis-Noel Pouchet, Atanas Rountev and P. Sadayappan (The Ohio State University)
While the high performance made possible by SIMD vector instruction sets and specialized GPU architectures is a boon to the HPC community, achieving portable performance across different systems from a single program is virtually impossible today.
High-level domain-specific languages offer the potential for high productivity for application developers, while simultaneously enabling performance portability via automatic transformation for efficient execution on different architectural platforms. This talk will present a domain-specific language for expressing stencil computations and initial performance data on multiple target GPU and CPU platforms, for several stencil computations expressed in the language.
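A stencil DSL of the kind described can be sketched in stdlib Python (the declaration format is invented for illustration): the stencil is a data object of offset/weight pairs, and a single generic executor applies it, so separate backends could lower the same declaration to SIMD or GPU code.

```python
# Declarative stencil: u'[i] = 0.5*u[i-1] + 0.5*u[i+1] (1-D Jacobi).
jacobi_1d = [(-1, 0.5), (1, 0.5)]

def apply_stencil(stencil, u):
    """Generic executor: sweep the stencil over the interior of a 1-D grid,
    leaving a halo wide enough for the largest offset."""
    halo = max(abs(off) for off, _ in stencil)
    return [sum(w * u[i + off] for off, w in stencil)
            for i in range(halo, len(u) - halo)]
```

Because the computation is fully described by the offset/weight list, a compiler for such a DSL can retarget it, choosing tiling, vectorization, or GPU thread mappings, without the application programmer rewriting anything.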
Orion Lawlor (U. Alaska Fairbanks)
We present a high-performance GPU programming language, based on OpenCL, that is embedded in C++. Our embedding provides shared data structures, typesafe kernel invocation, and the ability to more naturally interleave CPU and GPU functions, similar to CUDA but with the portability of OpenCL. Our language also provides an abstraction that releases control over data writes to the runtime system, which both improves expressivity and eliminates the possibility of memory race conditions. We benchmark the new language on NVIDIA and ATI hardware for several small applications.
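The write-abstraction idea can be sketched in stdlib Python (this is not the OpenCL embedding itself; all names are illustrative): kernels are pure functions of an output index and never store to memory themselves, so the runtime alone performs writes and no two kernel instances can race on a location.

```python
def run_kernel(kernel, n, *arrays):
    """The 'runtime': invokes the kernel once per output index and is the
    only component that writes the result array."""
    return [kernel(i, *arrays) for i in range(n)]

def saxpy(i, a, x, y):
    """Pure function of the index: reads inputs, returns one value,
    performs no stores of its own."""
    return a * x[i] + y[i]

out = run_kernel(lambda i, x, y: saxpy(i, 2.0, x, y),
                 3, [1, 2, 3], [10, 10, 10])
```

Giving up explicit writes is the trade the abstract describes: the programmer loses control over stores, and in exchange the runtime can schedule them safely on any device.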
Please contact the program chairs at: firstname.lastname@example.org