HPCaML 2019

The First International Workshop on
the Intersection of High Performance Computing and Machine Learning

February 16, 2019, 1:00-5:00pm @ Washington, DC, USA.
Held in conjunction with the International Symposium on Code Generation and Optimization (CGO’19),
co-located with PPoPP, HPCA, and CC.


------------------------------

Program

------------------------------

Keynote Talk: Machine-Learning-Based Performance Modeling and Tuning for High-Performance Computing
Time: 1:00 - 1:45pm
Speaker: Prasanna Balaprakash (Argonne National Laboratory)
Abstract: Heterogeneous nodes, many-core processors, deep memory hierarchies, energy-efficiency demands, and performance variability make application and system management on high-performance computing systems an increasingly daunting task. Current strategies provided by operating and runtime systems are mostly static and present several challenges for porting and running applications at extreme scale. The key challenge is to find new proactive and predictive methodologies that support automated refinement of application mapping on extreme-scale systems. In this talk, we will present our work on machine learning approaches for modeling and tuning the performance of compute, communication, and I/O subsystems. In particular, we will focus on automated data-driven performance modeling, Bayesian approaches for modeling I/O variability, and autotuning search. We will end the talk with perspectives and research challenges on designing reinforcement-learning-based self-improving systems that can observe, predict, and optimize the overall performance of applications and the system automatically over time.
Bio: Prasanna Balaprakash is a computer scientist with a joint appointment in the Mathematics and Computer Science Division and the Leadership Computing Facility at Argonne National Laboratory. His research interests span the areas of artificial intelligence, machine learning, optimization, and high-performance computing. His research focuses on the development of scalable, data-efficient machine learning methods for scientific applications. He is a recipient of the U.S. Department of Energy 2018 Early Career Award. Prior to Argonne, he worked as Chief Technology Officer at Mentis Sprl, a machine learning startup in Brussels, Belgium. He received his PhD from CoDE-IRIDIA (AI Lab), Université Libre de Bruxelles, Brussels, Belgium, where he was a recipient of the Marie Curie and F.R.S.-FNRS Aspirant fellowships.

Talk 1: DVM: A Deep Learning Compilation Framework [SLIDES]
Time: 1:45 - 2:10pm
Authors: Jack Lee (University of Toronto), Amy Wang (Huawei Canada)
Abstract: This paper presents the design of DVM, a deep learning compilation framework. DVM provides facilities to optimize and transform neural network descriptions into program binaries for heterogeneous hardware.

Talk 2: Accelerating Reduction Using Tensor Core Units
Time: 2:10 - 2:35pm
Authors: Abdul Dakkak (UIUC), Cheng Li (UIUC), Jinjun Xiong (IBM), Wen-Mei Hwu (UIUC)
Abstract: Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs come under the guise of different marketing terms and are capable of performing matrix multiplications on small matrices (usually 4x4 or 16x16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. Although TCUs are prevalent and promise increases in performance and/or energy efficiency, they suffer from over-specialization, with only general matrix-matrix multiplication (GEMM) being supported. This limits their applicability to general algorithms and confines them to narrowly specialized libraries and application domains. In this work, we leverage NVIDIA's TCU to express reduction in terms of matrix multiplication and show the benefits in terms of program simplicity, efficiency, and performance compared to state-of-the-art reduction methods on the GPU. Although this work targets GPUs, the motivation, methods, and observations are applicable to a wide range of TCU implementations and microarchitectures.
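The central trick (expressing a reduction as matrix multiplication, so it can run on GEMM-only hardware) can be illustrated without TCU hardware at all. Below is a minimal NumPy sketch under that assumption; it is not the authors' implementation, and the 16x16 tile size simply mirrors the TCU shapes mentioned in the abstract. Summing all entries of a tile T reduces to the two small products ones_row @ T @ ones_col.

    import numpy as np

    def reduction_via_matmul(v, tile=16):
        """Sum a vector by expressing the reduction as small matrix multiplications,
        i.e., the GEMM-shaped work a TCU can perform (illustrative CPU version)."""
        n = v.size
        # Pad to a whole number of tile*tile blocks so the data forms complete tiles.
        num_tiles = (n + tile * tile - 1) // (tile * tile)
        padded = np.zeros(num_tiles * tile * tile)
        padded[:n] = v
        tiles = padded.reshape(-1, tile, tile)      # (num_tiles, 16, 16)

        ones_row = np.ones((1, tile))               # 1 x 16
        ones_col = np.ones((tile, 1))               # 16 x 1

        # ones_row @ T @ ones_col sums all entries of tile T using two matmuls.
        partials = [(ones_row @ t @ ones_col)[0, 0] for t in tiles]
        return sum(partials)

    v = np.random.rand(1000)
    assert np.isclose(reduction_via_matmul(v), v.sum())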

---------------- Coffee Break (20 mins) ----------------

Talk 3: FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference
Time: 2:55 - 3:20pm
Authors: Daya Khudia, Jianyu Huang, Protonu Basu, Summer Deng, Haixin Liu, Jongsoo Park, Mikhail Smelyanskiy (Facebook, Inc)
Abstract: Deep learning models typically use single-precision (FP32) floating point data types for representing activations and weights, but a slew of recent research work has shown that computations with reduced-precision data types (FP16, 16-bit integers, 8-bit integers or even 4- or 2-bit integers) are enough to achieve the same accuracy as FP32 and are much more efficient. Therefore, we designed FBGEMM, a high-performance kernel library, from the ground up to perform high-performance quantized inference on current-generation CPUs. FBGEMM achieves efficiency by fusing common quantization operations with a high-performance GEMM implementation and by shape- and size-specific kernel code generation at runtime. The library has been deployed at Facebook, where it delivers greater than 2× performance gains with respect to our current production baseline.
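For readers unfamiliar with quantized inference, the arithmetic being fused is roughly the following. This is a generic NumPy sketch of 8-bit affine quantization plus an integer GEMM with rescaling, not FBGEMM's API; the function names and the per-tensor quantization scheme are illustrative assumptions.

    import numpy as np

    def quantize(x, num_bits=8):
        """Affine (asymmetric) quantization of a float array to unsigned integers."""
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = int(round(qmin - x.min() / scale))
        q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
        return q, scale, zero_point

    def quantized_matmul(a, b):
        """Int8 GEMM with int32 accumulation and rescaling back to float,
        the kind of computation a quantized-inference kernel fuses."""
        qa, sa, za = quantize(a)
        qb, sb, zb = quantize(b)
        acc = (qa.astype(np.int32) - za) @ (qb.astype(np.int32) - zb)
        return acc * (sa * sb)

    a, b = np.random.rand(64, 32), np.random.rand(32, 16)
    print(np.abs(quantized_matmul(a, b) - a @ b).max())  # small quantization error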

Talk 4: NUMA-Caffe: NUMA-Aware Deep Learning Neural Networks [SLIDES]
Time: 3:20 - 3:45pm
Authors: Probir Roy (William & Mary), Shuaiwen Song (PNNL), Sriram Krishnamoorthy (PNNL), Dipanjan Sengupta (Intel), Xu Liu (William & Mary)
Abstract: Convolutional Neural Networks (CNNs) have become increasingly popular in industry and academia for their powerful capability in pattern classification, image processing, and speech recognition. Current state-of-the-art deep learning frameworks, such as variants of Caffe, have reported promising performance in speedup and scalability on GPU implementations. However, modern CPU-based multi- and many-core architectures employ the Non-Uniform Memory Access (NUMA) technique to integrate multiple sockets, which incurs unique challenges for designing highly efficient CNN frameworks. Without careful design, DNN frameworks can easily suffer from long memory latency due to a large number of memory accesses to remote NUMA domains, resulting in poor scalability. To address this challenge, we propose a NUMA-aware multi-solver-based CNN design, named NUMA-Caffe, for accelerating deep learning neural networks on multi- and many-core CPU architectures. NUMA-Caffe is independent of DNN network topology, does not impact network convergence rates, and provides superior scalability over existing Caffe variants. Through a thorough empirical study on four contemporary NUMA-based multi- and many-core architectures, our experimental results demonstrate that NUMA-Caffe significantly outperforms the state-of-the-art Caffe designs in terms of both throughput and scalability.
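The multi-solver idea can be sketched generically: run one solver per NUMA node, pin it to that node's cores, and let each solver allocate (first-touch) its own working set so memory stays node-local. The sketch below is not NUMA-Caffe; the NODE_CPUS mapping is a hypothetical topology that must match the actual machine, and sched_setaffinity is Linux-only.

    import multiprocessing as mp
    import os
    import numpy as np

    # Hypothetical NUMA-node-to-CPU mapping; on a real machine this comes from
    # `numactl --hardware` or /sys/devices/system/node.
    NODE_CPUS = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

    def solver(node, cpus, shard):
        """One solver per NUMA node: pin to the node's cores, then allocate and
        touch all working buffers locally so the first-touch policy places them
        in node-local memory, avoiding remote accesses."""
        os.sched_setaffinity(0, cpus)      # Linux-only CPU pinning
        data = np.array(shard)             # copy => first-touched on this node
        grad = data.mean(axis=0)           # stand-in for a local training step
        print(f"node {node}: local gradient norm {np.linalg.norm(grad):.3f}")

    if __name__ == "__main__":
        batch = np.random.rand(len(NODE_CPUS) * 128, 256)
        shards = np.array_split(batch, len(NODE_CPUS))
        procs = [mp.Process(target=solver, args=(n, c, s))
                 for (n, c), s in zip(NODE_CPUS.items(), shards)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()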

Talk 5: Auto-tuning Parallel Sparse Matrix-Matrix Multiplication by Deep Learning
Time: 3:45 - 4:10pm
Authors: Zhen Xie, Xin He, Weifeng Liu (China University of Petroleum), Guangming Tan, Ninghui Sun (Institute of Computing Technology, Chinese Academy of Sciences)
Abstract: Sparse Matrix-Matrix Multiplication (SpGEMM) is a widely used sparse kernel in a number of scientific applications. In this work, we first provide a prospective study on format-specific parallel SpGEMM algorithms, and propose a deep learning model named MatNet, which is trained on all the matrices from the SuiteSparse Matrix Collection, to quickly and accurately predict the best format and algorithm from feature parameters and a density representation.
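The inputs such a predictor consumes can be illustrated with SciPy. The sketch below is not MatNet; it only shows the kind of scalar feature parameters and coarse density representation the abstract mentions, which would then be fed to a classifier (the grid size and feature choices are assumptions).

    import numpy as np
    import scipy.sparse as sp

    def matrix_features(A, grid=16):
        """Extract scalar features plus a coarse density image of the sparsity pattern."""
        A = A.tocoo()
        m, n = A.shape
        scalars = np.array([m, n, A.nnz, A.nnz / (m * n)], dtype=np.float64)
        # grid x grid histogram of where the nonzeros fall (a density representation).
        density = np.zeros((grid, grid))
        rows = A.row.astype(np.int64) * grid // m
        cols = A.col.astype(np.int64) * grid // n
        np.add.at(density, (rows, cols), 1.0)
        return scalars, density / max(A.nnz, 1)

    A = sp.random(1000, 800, density=0.01, format="coo")
    scalars, density = matrix_features(A)
    print(scalars, density.sum())  # density histogram sums to 1 over the grid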

Talk 6: Auto-tuning TensorFlow Threading Model for CPU Backend [SLIDES]
Time: 4:10 - 4:35pm
Authors: Niranjan Hasabnis (Intel)
Abstract: TensorFlow is a popular deep learning framework used by data scientists to solve a wide range of machine learning and deep learning problems such as image classification and speech recognition. It also operates at a large scale and in heterogeneous environments: it allows users to train neural network models or deploy them for inference using GPUs, CPUs (such as Intel® Xeon® CPUs), and deep-learning-specific custom-designed hardware such as TPUs. Even though TensorFlow supports a variety of optimized backends, realizing the best performance using a backend may require additional effort. For instance, getting the best performance from a CPU backend requires careful tuning of its threading model. Unfortunately, the best tuning approach used today is manual, tedious, time-consuming, and, more importantly, may not guarantee the best performance.
In this paper, we develop an automatic approach, called TENSORTUNER, to search for optimal parameter settings of TensorFlow's threading model for CPU backends. We evaluate TENSORTUNER on both the Eigen and Intel MKL CPU backends using a set of neural networks from TensorFlow's benchmarking suite. Our evaluation results demonstrate that the parameter settings found by TENSORTUNER produce 2% to 123% performance improvement for the Eigen CPU backend and 1.5% to 28% performance improvement for the MKL CPU backend over the performance obtained using their best-known parameter settings. This highlights that the default parameter settings in the Eigen CPU backend are not ideal, and that even for a carefully hand-tuned MKL backend the settings are sub-optimal. Our evaluations also revealed that TENSORTUNER is efficient at finding the optimal settings: it converges to them quickly by pruning more than 90% of the parameter search space.
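The threading-model parameters being tuned are exposed through TensorFlow's session configuration. The sketch below shows how one candidate setting could be timed using the TF 1.x API (tf.compat.v1 under TF 2.x); it does not reproduce TENSORTUNER's search strategy, and the parameter values and workload are placeholders.

    import time
    import tensorflow as tf  # TF 1.x API; use tf.compat.v1 under TF 2.x

    def benchmark(intra_threads, inter_threads, iters=10):
        """Time a small workload under one candidate threading configuration.
        In practice each configuration is usually measured in a fresh process."""
        config = tf.ConfigProto(intra_op_parallelism_threads=intra_threads,
                                inter_op_parallelism_threads=inter_threads)
        with tf.Session(config=config) as sess:
            a = tf.random_normal((2048, 2048))
            prod = tf.matmul(a, a)
            sess.run(prod)                      # warm-up
            start = time.time()
            for _ in range(iters):
                sess.run(prod)
            return (time.time() - start) / iters

    # A tuner searches this space instead of exhaustively sweeping it.
    for intra in (1, 4, 8, 16):
        for inter in (1, 2):
            print(intra, inter, benchmark(intra, inter))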

Talk 7: Code Region Based Auto-Tuning Enabled Compilers [SLIDES]
Time: 4:35 - 5:00pm
Authors: Michael Kalyan, Xiang Wang, Ahmed Eltantawy, Yaoqing Gao (Huawei Canada)
Abstract: Auto-tuning is a desirable way to improve the performance of compilers as it reduces the work required by compiler developers. Typical compiler auto-tuning utilizes a driver to search the optimization space exposed by the compiler in the form of compiler flags. The driver selects a valid configuration that achieves the best performance within a given tuning budget. The compiler has no direct communication with the auto-tuning driver. The compiler configuration space is limited by the configurations exposed by the compiler and by hard-coded constraints. In this paper, we argue that to maximize the benefit of auto-tuning in compilers, the compiler has to be designed and implemented with auto-tuning in mind. To explore this, we extend parts of a traditional LLVM compiler to be auto-tuning enabled by exposing tuning opportunities on a code-region basis, and allowing them to be applied separately for each code region. The compiler reports the valid configurations for each tuning opportunity. We enable auto-tuning in three different parts of the compiler: phase ordering on a module basis, loop unrolling on a loop basis, and machine instruction scheduling on a basic-block basis. We achieve up to 1.196x performance gain over standard optimization and coarse-grained tuning in 42 or fewer tuning iterations.
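For context, the flag-level driver the abstract contrasts with can be sketched in a few lines: compile, measure, and keep the best configuration within a budget. The compiler invocation, flag set, and bench.c benchmark below are placeholders, and the paper's per-region tuning would replace this flag space with region-level configurations reported by the compiler.

    import random
    import subprocess
    import time

    # Placeholder flag-level search space; a real driver would enumerate many more.
    SEARCH_SPACE = ["-O2", "-O3", "-O3 -funroll-loops", "-O3 -ffast-math"]

    def measure(flags, src="bench.c", runs=3):
        """Compile with one configuration and return the median runtime."""
        subprocess.run(f"cc {flags} -o bench {src}", shell=True, check=True)
        times = []
        for _ in range(runs):
            start = time.time()
            subprocess.run("./bench", shell=True, check=True)
            times.append(time.time() - start)
        return sorted(times)[len(times) // 2]

    def tune(budget=8):
        """Random search within a tuning budget; real drivers use smarter strategies."""
        best_flags, best_time = None, float("inf")
        for _ in range(budget):
            flags = random.choice(SEARCH_SPACE)
            runtime = measure(flags)
            if runtime < best_time:
                best_flags, best_time = flags, runtime
        return best_flags, best_time

    if __name__ == "__main__":
        print(tune())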


------------------------------

Call for Papers

------------------------------

In the last decade, machine learning has shown great power in solving many complex problems, such as image classification, speech recognition, autonomous driving, machine translation, natural language processing, game playing, and healthcare analytics. Recently, it has also attracted attention in scientific computing areas, including quantum chemistry, quantum physics, and mechanics, for developing domain-aware machine learning algorithms. To satisfy these broad needs, machine learning algorithms demand massive computing power, fast response times, and low energy consumption. Innovations in both hardware design and software support are imperative.

On the other side, scientific, data-intensive, and machine learning applications and algorithms need meticulous parameter tuning to achieve high performance. They typically expose a huge tuning space for performance optimization, spanning input features, algorithm variants, accuracy requirements, hardware platform characteristics, and more. Machine learning is a good tool to automate this tuning process and maximize performance gains: it can traverse the tuning space with little human intervention, finding near-optimal configurations while preserving portability and productivity.

The International Workshop on the Intersection of High Performance Computing and Machine Learning (HPCaML) is a new workshop targeting research where the two fields reinforce each other: HPC-powered ML and ML-motivated HPC. The major objective is to bring together researchers from these two domains to communicate their ideas and share knowledge of advanced technologies and new developments on topics including, but not limited to, the following:

  • Performance optimization of machine learning algorithms
  • Programming models and tools for machine learning
  • Machine learning model compression algorithms
  • Hardware-aware machine learning model synthesis
  • Power-efficient algorithms for machine learning
  • Specialized hardware architecture for machine learning
  • Machine learning based performance tuning
  • Machine learning based compiler techniques
  • Machine learning based power efficient algorithms
------------------------------

Important Dates

------------------------------

Paper Submission: December 21, 2018 (extended from December 14, 2018)
Author Notification: January 21, 2019 (extended from January 14, 2019)
Workshop: February 16, 2019
All dates are Anywhere on Earth (AOE).

------------------------------

Submission

------------------------------

Submission Site: https://easychair.org/conferences/?conf=hpcaml19

As a “fresh” workshop, we plan to make it discussion-oriented. Papers describing either in-progress or recently published work with innovative ideas are welcome. We invite 2-page, double-column submissions with a 10-point font, excluding references and appendices. Please follow the ACM sigconf proceedings template (https://www.acm.org/publications/proceedings-template). Note that submissions will not appear in any proceedings, so the work can be further developed and submitted to a conference or journal for formal publication.

------------------------------

Organization

------------------------------

Chairs:

Jiajia Li, Pacific Northwest National Laboratory (Jiajia.Li@pnnl.gov)
Guoyang Chen, Alibaba Group US Inc. (gychen1991@gmail.com)
Shuaiwen Leon Song, Pacific Northwest National Laboratory (Shuaiwen.Song@pnnl.gov)
Guangming Tan, Institute of Computing Technology, Chinese Academy of Sciences (tgm@ncic.ac.cn)
Weifeng Zhang, Alibaba Group US Inc. (weifeng.z@alibaba-inc.com)

Committee:

Prasanna Balaprakash, Argonne National Laboratory
Aparna Chandramowlishwaran, UC Irvine
Shuai Che, Alibaba Group US Inc.
Guoyang Chen, Alibaba Group US Inc.
Jee Choi, IBM TJ Watson
Ang Li, Pacific Northwest National Laboratory
Jiajia Li, Pacific Northwest National Laboratory
Yingmin Li, Alibaba Group
Weifeng Liu, Norwegian University of Science and Technology
Xiaoyong Liu, Alibaba Group
Xu Liu, College of William and Mary
P. (Saday) Sadayappan, Ohio State University
Albert Sidelnik, NVIDIA Research
Shaden Smith, Intel Corporation
Jimeng Sun, Georgia Institute of Technology
Daniel Wong, University of California, Riverside
Hongxu Yin, Princeton University
Peng Zhou, Alibaba Group