Title: The Past, Present and Future of HPC Abstract: A broad perspective on the history and trends in high
performance computing will be presented. We will look at both
hardware and software trends.
Title: Heterogeneous Architecture Programming Abstract: This presentation aims to give an overview of the state of the art in heterogeneous architecture programming.
First, we present current hardware technologies and their characteristics with regard to code performance.
Then, we describe the most popular APIs, such as CUDA and OpenCL, as well as more high-level approaches such as directive-based APIs (e.g., HMPP). Finally, we conclude this course with a methodology to help legacy code migration. We illustrate this course with numerous examples of GPGPU applications.
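To make the offload model concrete, here is a minimal sketch of the kind of kernel the CUDA part of such a course deals with. The kernel, names and sizes are illustrative, not taken from the course material: the host allocates device memory, launches a kernel over the data, and synchronizes.

    // Minimal CUDA offload sketch: one thread per vector element.
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];   // one element per thread
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));   // device allocations
        cudaMalloc(&y, n * sizeof(float));
        // ... initialize x and y (e.g. cudaMemcpy from host arrays) ...
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // offloaded work
        cudaDeviceSynchronize();
        cudaFree(x); cudaFree(y);
        return 0;
    }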
Title: Programming dense linear algebra routines on multicore and multi-GPU architectures using runtime systems Abstract: The design of algorithms and routines that can exploit the potential of multicore and multi-GPU architectures is a preliminary step toward exploiting emerging massively parallel machines. We propose a twofold approach to tackle this challenge in the context of dense linear algebra operations. We first show that standard algorithms may have a limited degree of parallelism and present alternative algorithms that provide a higher degree of parallelism. The second step consists of implementing those algorithms. We show how we can benefit from advanced runtime systems (Quark, StarPU, ...) to schedule our algorithms on complex multicore or heterogeneous multi-GPU architectures, achieving high performance together with very high productivity. We will show some demonstration codes and experimental results.
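As a rough illustration of how a tiled algorithm exposes a higher degree of parallelism, the sketch below shows the classical tiled Cholesky loop nest over NT x NT tiles. The tile kernels carry the standard LAPACK/BLAS names; the tile type and the task-submission layer are left abstract here, and do not follow any particular runtime's API.

    // Tiled Cholesky, written as a sequential loop nest over tiles.
    // Read each call as the submission of a task: a runtime system
    // (Quark, StarPU, ...) tracks which tiles each task reads/writes
    // and runs independent tasks concurrently on CPU cores and GPUs.
    for (int k = 0; k < NT; k++) {
        potrf(A[k][k]);                          // factor diagonal tile
        for (int i = k + 1; i < NT; i++)
            trsm(A[k][k], A[i][k]);              // panel solves: independent in i
        for (int i = k + 1; i < NT; i++) {
            syrk(A[i][k], A[i][i]);              // trailing-matrix updates:
            for (int j = k + 1; j < i; j++)
                gemm(A[i][k], A[j][k], A[i][j]); // independent in (i, j)
        }
    }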
Title: High performance numerical linear algebra kernels Abstract: In response to the combined hurdles of maximum power
dissipation, large memory latency, and little instruction-level
parallelism left to be exploited, all major chip manufacturers have
finally adopted multi-core designs as the only means to exploit the
increasing number of transistors dictated by Moore's Law. Thus,
desktop systems equipped with general-purpose four-core processors and
graphics processors (GPUs) with hundreds of fine-grained cores are
routine today. While these new architectures can potentially yield
much higher performance, this usually comes at the expense of
tuning codes in a few cases or a complete rewrite in many others.
Linear algebra, which is ubiquitous in scientific and engineering
applications, is currently undergoing this change.
In this lecture we will review practical aspects of existing
sequential and parallel dense linear algebra libraries for these new
architectures. The lecture will focus especially on GPUs,
inspecting the implementation of BLAS for NVIDIA processors and
evaluating the implementation of LAPACK on top of these kernels. We
will also describe how dynamic data-driven scheduling yields a
higher degree of parallelism for multi-GPU platforms and how to hide
the PCIe latency by borrowing cache coherence techniques well known
in computer architecture. Finally, we will offer a glimpse of the
parallelization of dense linear algebra libraries for clusters of
nodes equipped with GPUs.
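As a pointer to what "BLAS for NVIDIA processors" looks like from user code, here is a minimal cuBLAS call; the wrapper function and the assumption that the matrices already reside in device memory (column-major, as in the reference BLAS) are illustrative.

    #include <cublas_v2.h>

    // C = alpha*A*B + beta*C on the GPU, with A (m x k), B (k x n),
    // C (m x n) already in device memory, stored column-major.
    void gpu_sgemm(int m, int n, int k,
                   const float *dA, const float *dB, float *dC) {
        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);
        cublasDestroy(handle);
    }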
Title: Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Abstract: The presentation investigates various aspects of the implementation of algebraic multigrid methods on GPU-accelerated hybrid architectures.
Furthermore, different parallelization strategies are explored to achieve optimal performance on a broad range of hardware configurations.
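The abstract does not detail kernels, but the dominant operation in algebraic multigrid (in smoothers, residuals and grid-transfer operators alike) is the sparse matrix-vector product. A minimal CUDA sketch in CSR format gives a flavor of the fine-grained parallelization involved; the one-thread-per-row mapping shown here is simple but not optimal.

    // y = A*x in CSR format: one thread per matrix row. Production AMG
    // codes use more elaborate formats and per-row parallelism to
    // balance work across threads.
    __global__ void spmv_csr(int nrows, const int *rowptr, const int *colidx,
                             const double *vals, const double *x, double *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nrows) {
            double sum = 0.0;
            for (int j = rowptr[row]; j < rowptr[row + 1]; j++)
                sum += vals[j] * x[colidx[j]];
            y[row] = sum;
        }
    }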
Title: Toward Portable Programming of Numerical Linear Algebra on Manycore Nodes Abstract: Manycore architectures exist in several varieties, with more variations to come. Writing portable software during this transition period is challenging. In this talk we present approaches we have used on the Trilinos project to develop portable parallel linear algebra libraries and supporting interfaces. We discuss current efforts to identify and express optimal parallel algorithms that are portable across current manycore nodes and have a good chance to work well on future systems.
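A common way to obtain this kind of portability (used, in spirit, by the Kokkos package within Trilinos) is to write the loop body once as a functor and let a thin layer map it to whatever parallel backend the node provides. The sketch below is a hand-rolled illustration of the idea, not Trilinos code.

    // Write the computation once as a functor...
    struct Axpy {
        float a; const float *x; float *y;
        __host__ __device__ void operator()(int i) const {
            y[i] = a * x[i] + y[i];
        }
    };

    // ...then provide one tiny driver per backend.
    template <typename F>
    __global__ void cuda_for(int n, F f) {      // CUDA backend
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) f(i);
    }

    template <typename F>
    void serial_for(int n, F f) {               // serial backend
        for (int i = 0; i < n; i++) f(i);
    }

    // The application calls a generic parallel_for(n, Axpy{...}); the
    // backend choice (serial, OpenMP, CUDA, ...) becomes a build-time
    // or run-time configuration decision rather than a rewrite.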
Title: Finite element methods on GPU systems Abstract: This session covers techniques for GPU-based Finite Element multigrid solvers. After presenting the necessary preliminaries, we outline the broader context in which the techniques can be applied. We introduce structured, block-structured and unstructured discretization grids for finite element, finite difference and finite volume techniques and discuss their computational and bandwidth requirements. Here, we also describe fine-grained parallelization techniques for the "finite element assembly", i.e., the construction of linear systems of equations from the discretization of a PDE on a mesh covering the computational domain.
The second part of our lecture covers another building block in finite element simulation software, namely iterative solvers for sparse linear systems. We briefly present the building blocks from numerical linear algebra that are needed to implement solvers of multigrid and Krylov subspace type, and then focus on numerically powerful preconditioning and smoothing techniques and the general trade-off between recursive, inherently sequential techniques with advantageous numerical properties and scalable parallelization for fine-grained architectures such as GPUs. In addition, we present mixed-precision methods as a generic performance improvement technique in the context of iterative solvers.
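A sketch of the mixed-precision idea: do the heavy work in fast single precision and recover double-precision accuracy by refining the residual. The iteration cap, the tolerance and the inner solver solve_f32 are placeholders, not the lecture's actual components.

    #include <cmath>
    #include <vector>

    // Placeholder: any single-precision inner solver (multigrid, Krylov, ...).
    void solve_f32(int n, const std::vector<float>& r, std::vector<float>& c);

    // Mixed-precision iterative refinement for a dense system A*x = b:
    // the expensive inner solve runs in float; residual and correction
    // bookkeeping stay in double, so double accuracy is recovered.
    void refine(int n, const std::vector<double>& A,
                const std::vector<double>& b, std::vector<double>& x) {
        std::vector<double> r(n);
        std::vector<float> rf(n), cf(n);
        for (int iter = 0; iter < 50; iter++) {
            double nrm = 0.0;
            for (int i = 0; i < n; i++) {            // r = b - A*x (double)
                double s = b[i];
                for (int j = 0; j < n; j++) s -= A[i * n + j] * x[j];
                r[i] = s; nrm += s * s;
            }
            if (std::sqrt(nrm) < 1e-12) break;       // converged in double
            for (int i = 0; i < n; i++) rf[i] = (float)r[i];   // demote
            solve_f32(n, rf, cf);                    // cheap inner solve
            for (int i = 0; i < n; i++) x[i] += (double)cf[i]; // correct
        }
    }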
The third part of the lecture is then concerned with integrating the GPU-based components from the previous parts into large-scale PDE software that executes on heterogeneous clusters. We discuss the benefits and drawbacks of a "minimally invasive" integration vs. a full re-implementation, technical details of the efficient management of heterogeneous resources (scheduling, overlapping communication with computation, etc.), and present case studies for applications from fluid dynamics, solid mechanics and geophysics.
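One of the resource-management techniques just mentioned, overlapping communication with computation, can be sketched with nonblocking MPI; the neighbor and buffer bookkeeping below is schematic, and compute_interior/compute_boundary are placeholders.

    #include <mpi.h>
    #include <vector>

    void compute_interior();   // placeholder: work needing no remote data
    void compute_boundary();   // placeholder: work consuming received halos

    // Post nonblocking receives/sends for the ghost layers, update the
    // interior while messages are in flight, then finish the boundary.
    void timestep(int nnbr, const int *nbr, double **sendbuf,
                  double **recvbuf, const int *count) {
        std::vector<MPI_Request> req(2 * nnbr);
        for (int p = 0; p < nnbr; p++) {
            MPI_Irecv(recvbuf[p], count[p], MPI_DOUBLE, nbr[p], 0,
                      MPI_COMM_WORLD, &req[p]);
            MPI_Isend(sendbuf[p], count[p], MPI_DOUBLE, nbr[p], 0,
                      MPI_COMM_WORLD, &req[nnbr + p]);
        }
        compute_interior();                      // overlapped computation
        MPI_Waitall(2 * nnbr, req.data(), MPI_STATUSES_IGNORE);
        compute_boundary();
    }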
Title: Methods for performance evaluation and optimization on modern
HPC systems Abstract: General introduction to the performance analysis of parallel
codes, followed by a lesson on communication and synchronization
analysis using the Scalasca performance tool
(http://www.scalasca.org). The lecture will give an overview of
Scalasca, explain its functionality, and discuss case studies
demonstrating its various analysis modes.
Title: Methods for performance evaluation and optimization on modern
HPC systems Abstract: Paraver and Dimemas (www.bsc.es/paraver) are part of
CEPBA-Tools, an open source project developed by BSC. Paraver is based
on traces and can be used to analyze any information expressed in its
input trace format. Dimemas is a simulation tool for the analysis of
message-passing application behavior on a configurable platform. The
talk will present the tools and examples of analysis illustrating the
level of detail and insight that these tools provide.
Title: Programming heterogeneous, accelerator-based multicore machines:
a runtime system's perspective
Abstract: Heterogeneous accelerator-based parallel machines, featuring manycore
CPUs and GPU accelerators, provide an unprecedented amount of
processing power per node. Dealing with such a large number of
heterogeneous processing units -- providing highly unbalanced
computing power -- is one of the biggest challenges that developers of
HPC applications have to face. To fully tap into the potential of
these heterogeneous machines, pure offloading approaches, which consist
in running an application on regular processors while offloading parts
of the code to accelerators, are not sufficient.
In this talk, I will go through the major programming environments
that were specifically designed to harness heterogeneous
architectures, focusing on runtime systems. I will discuss some of the
most critical issues programmers have to consider to achieve
performance portability, and I will show how advanced runtime
techniques can speed up applications in the domain of dense linear
algebra.
Finally, I will give some insights into the main challenges that
designers of programming environments will have to face in the
upcoming years.
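To fix ideas, here is a small StarPU-flavored sketch of the task paradigm such runtime systems expose: the programmer describes the computation once per architecture inside a codelet, registers the data, and lets the runtime decide where each task runs. Details are simplified and the exact API may differ across StarPU versions; consult the StarPU documentation.

    #include <starpu.h>

    // CPU and CUDA implementations of the same task; the runtime picks
    // one per task instance, depending on resource availability.
    void scal_cpu(void *buffers[], void *cl_arg);
    void scal_cuda(void *buffers[], void *cl_arg);

    static struct starpu_codelet scal_cl = {
        .cpu_funcs  = { scal_cpu },
        .cuda_funcs = { scal_cuda },
        .nbuffers   = 1,
        .modes      = { STARPU_RW },  // dependencies are inferred from
    };                                // the declared data accesses

    int main(void) {
        float v[1024] = { 0.0f };
        starpu_data_handle_t h;
        starpu_init(NULL);
        starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                    (uintptr_t)v, 1024, sizeof(float));
        starpu_task_insert(&scal_cl, STARPU_RW, h, 0);  // submit one task
        starpu_task_wait_for_all();
        starpu_data_unregister(h);
        starpu_shutdown();
        return 0;
    }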
Title: Programming paradigms using PGAS-based languages Abstract: PGAS (Partitioned Global Address Space) languages propose an abstract and unified view of the computing architecture (multiple nodes, each with one or more computing cores) and of the memory architecture (private and shared on-node memory, remote memory direct access). Algorithm implementation on parallel machines is easier with these languages, which offer flexible programming models mixing task and data parallelism. But these languages are still in an experimental state, and performance must be improved. We will present a few test cases with several available languages of this family.
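Since PGAS syntax varies from language to language (UPC, Coarray Fortran, Chapel, ...), the sketch below illustrates the underlying remote-memory-access model with MPI one-sided operations in plain C; this is an analogous mechanism, not a PGAS language, but it shows the same "write directly into a remote partition of the address space" idea without language-level sugar.

    #include <mpi.h>

    // Each rank exposes a window (its "partition" of the global address
    // space); any rank can then write into a remote partition directly,
    // without the target posting a matching receive.
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local[4] = { 0.0 };
        MPI_Win win;
        MPI_Win_create(local, 4 * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 // open an access epoch
        double v = (double)rank;
        // Write our rank into slot 0 of the next rank's partition.
        MPI_Put(&v, 1, MPI_DOUBLE, (rank + 1) % size, 0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                 // close epoch: data visible

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }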
Title: GPU Accelerated Discontinuous Galerkin Methods Abstract: The discontinuous Galerkin (DG) methods can be viewed as a combination of finite volume and finite element methods, building high-order polynomial approximations on standard finite element meshes. However, the inter-element continuity of DG solutions is only weakly enforced. Discretized DG operators are typically sparse with dense sub-blocks, and this structure makes them ideal candidates for general purpose graphics processing unit (GPGPU) implementations. We will describe how these methods map onto the current GPGPU thread hierarchy models prevalent in CUDA and OpenCL. After reviewing the basic formulation and implementations, we will discuss how a new family of DG methods, for solving conservation laws on domains meshed with curvilinear elements, was motivated by GPGPU architectural considerations. Performance results and simulation results will be presented.
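A sketch of the mapping described above: each element's small dense local operator is applied by one CUDA thread block, with one thread per nodal value. The operator layout and size are illustrative, not taken from the talk.

    // One thread block per element, one thread per local degree of
    // freedom. Each element applies a dense Np x Np operator D to its
    // nodal values: exactly the "sparse with dense sub-blocks"
    // structure of DG. Row-major D and element-major u are illustrative.
    #define NP 35   // e.g. nodes per tetrahedron at polynomial order 4

    __global__ void dg_apply(int nelem, const float *D,
                             const float *u, float *du) {
        int e = blockIdx.x;          // element index
        int i = threadIdx.x;         // local node index
        __shared__ float ue[NP];
        ue[i] = u[e * NP + i];       // stage this element's nodal values
        __syncthreads();
        float acc = 0.0f;
        for (int j = 0; j < NP; j++)
            acc += D[i * NP + j] * ue[j];   // dense row times local vector
        du[e * NP + i] = acc;
    }
    // Launch: dg_apply<<<nelem, NP>>>(nelem, D, u, du);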
Luigi Genovese
CEA/INAC/SP2M
Laboratoire de Simulation Atomistique
luigi.genovese@esrf.fr
Title: Wavelet-Based DFT calculations on Massively Parallel Hybrid Architectures
Abstract: Electronic structure calculations (DFT codes) are certainly among the
disciplines for which an increase in computational power corresponds to an
advancement in scientific results.
In this contribution, we present an implementation of a full DFT code
that can run on massively parallel hybrid CPU-GPU clusters. Our
implementation is based on modern GPU architectures which support
double-precision floating-point numbers. This DFT code, named BigDFT,
is delivered under the GNU-GPL license, either in a stand-alone version
or integrated in the ABINIT software package. Hybrid BigDFT routines were
initially ported with NVIDIA's CUDA language, and recently more
functionalities have been added with new routines written in the Khronos
OpenCL standard.
The formalism of this code is based on Daubechies wavelets, which form a
systematic real-space basis set. As we will see in the presentation, the
properties of this basis set are well suited for an extension to a
GPU-accelerated environment. In addition to focusing on the implementation
of the operators of the BigDFT code, this presentation also covers the
usage of GPU resources in a complex code with different kinds of
operations. A discussion of the present and expected performance of hybrid
architectures in the framework of electronic structure calculations is
also addressed.
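In a Daubechies wavelet basis, the code's operators reduce to short separable convolutions applied along each axis of the grid, which is what makes the basis set GPU-friendly. The hedged CUDA sketch below shows the 1D building block; the filter length, coefficients and boundary treatment are illustrative, not BigDFT's actual kernels.

    // 1D building block of the wavelet operators: a short filter
    // applied along one axis with periodic wrap-around. The real code
    // applies such convolutions separably along the three grid axes.
    #define FLEN 16   // illustrative filter length
    __constant__ double filt[FLEN];   // filter coefficients (uploaded by host)

    __global__ void conv1d(int n, const double *in, double *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            double s = 0.0;
            for (int l = 0; l < FLEN; l++)
                s += filt[l] * in[(i + l + n - FLEN / 2) % n];  // periodic BC
            out[i] = s;
        }
    }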
Title: ProActive Hybrid Workflows with CPUs and GPUs Abstract: The presentation will give an overview of the issues at hand when accelerating demanding applications with multi-cores, clusters, servers, GPUs and clouds. The point will be illustrated with ProActive Parallel Suite, an OW2 open source
library for parallel, distributed, and concurrent computing, which allows us to showcase an interactive GUI and tools.
A strong theoretical foundation ensures many properties of the resulting parallel programs. Typical SPMD programming will also be discussed.
An important aspect of HPC today is the capacity to appropriately map various parallel tasks onto
today's complex heterogeneous hardware (NUMA, hybrid with CPU and GPU). We shall introduce how one can
control such mapping in a fine-grained manner, potentially selecting all hardware and software specifics on nodes,
and selecting proximity down to the latency between nodes.
Title: Large Eddy Simulation and Multi-physics for Engine Computations on Massively Parallel Machines Abstract:
Efficient numerical tools taking advantage of the ever-increasing power of high-performance computers have become key elements in the fields of energy supply and transportation, not only from a purely scientific point of view, but also at the design stage in industry. Indeed, flow phenomena that occur in or around industrial applications such as gas turbines or aircraft are still not mastered. In fact, most Computational Fluid Dynamics (CFD) predictions produced today focus on reduced or simplified versions of the real systems and are usually solved with a steady-state assumption. The presentation shows how recent developments in CFD codes and parallel computer architectures can help overcome this barrier. With this new environment, new scientific and technological challenges can be addressed, provided that thousands of computing cores are efficiently used in parallel. Strategies of modern flow solvers are discussed with particular emphasis on mesh partitioning, load balancing and communication. These concepts are used in CFD codes developed by CERFACS. Leading-edge computations obtained with these high-end massively parallel CFD codes are illustrated and discussed in the context of aircraft, turbomachinery and gas turbine applications. Finally, current developments in multi-physics simulations based on code coupling are discussed. It is shown how code coupling can be used to provide leading-edge tools that directly benefit from high performance computing, with strong industrial implications at the design stage of the next generation of aircraft and gas turbines.
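For the mesh-partitioning step mentioned above, a typical building block is a graph partitioner. The hedged sketch below shows a METIS 5 call that splits the element-connectivity graph of a mesh into load-balanced parts; building the CSR graph from the mesh, and the wrapper itself, are assumptions for illustration.

    #include <metis.h>

    // Partition the element-connectivity graph of a mesh (CSR arrays
    // xadj/adjncy, built from the mesh elsewhere) into nparts
    // load-balanced parts, minimizing cut edges, i.e. the communication
    // volume between the resulting subdomains.
    void partition_mesh(idx_t nvtxs, idx_t *xadj, idx_t *adjncy,
                        idx_t nparts, idx_t *part /* out: part[v] */) {
        idx_t ncon = 1;        // one balance constraint (element count)
        idx_t objval;          // returns the edge-cut of the partition
        METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                            NULL, NULL, NULL,   // no vertex/edge weights
                            &nparts, NULL, NULL, NULL,
                            &objval, part);
    }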