CEA-EDF-INRIA - Program

Toward petaflop numerical simulation on parallel hybrid architectures

june, 6-10, 2011
INRIA
Sophia Antipolis, France

with the support of

logo cnrs

logo my planet

Program

You could find on this page all the pdf files of the presentations.

Monday 6 june

8h15 - 8h45	Registration
8h45 - 9h00	Introduction
9h00 - 10h30	Session 1 - 1rst part
	Jack Dongarra University of Tennessee dongarra@cs.utk.edu Title: The Past, Present and Future of HPC Abstract: A broad perspective on the history and trends in high performance computing will be presented. We will look at both hardware and software trends.
10h30 - 11h00	Coffee break
11h00 - 12h30	Session 1 - 2nd part
	Jack Dongarra University of Tennessee dongarra@cs.utk.edu Title: The Past, Present and Future of HPC Abstract: A broad perspective on the history and trends in high performance computing will be presented. We will look at both hardware and software trends.
12h30 - 14h00	Lunch
14h00 - 15h30	Session 2 - 1rst part
	François Bodin CAPS entreprise bodin@irisa.fr Title: Heterogeneous Architecture Programming Abstract: This presentation aims at giving an overview of the state of the art of heterogeneous architecture programming. First, we present current hardware technology and their characteristics in regards to code performance. Then, we describe the most popular API such as CUDA, OpenCL and more high-level approaches such as directive based API. (e.g. HMPP). Finally, we conclude this course by a methodology to help legacy code migration. We illustrate this course with numerous examples of GPGPU applications.
15h30 - 16h00	Coffee break
16h00 - 17h30	Session 2 - 2nd part
	François Bodin CAPS entreprise bodin@irisa.fr Title: Heterogeneous Architecture Programming Abstract: This presentation aims at giving an overview of the state of the art of heterogeneous architecture programming. First, we present current hardware technology and their characteristics in regards to code performance. Then, we describe the most popular API such as CUDA, OpenCL and more high-level approaches such as directive based API. (e.g. HMPP). Finally, we conclude this course by a methodology to help legacy code migration. We illustrate this course with numerous examples of GPGPU applications.
17h30 - 18h00	Debriefing

Tuesday, 7 june

9h00 - 10h30	Session 3 - 1rst part
	Emmanuel Agullo INRIA Bordeaux – Sud-Ouest emmanuel.agullo@inria.fr Julien Langou University of Colorado at Denver julien.langou@ucdenver.edu Title: Programming dense linear algebra routines on multicore and multi-GPU architectures using runtime systems Abstract: The design of algorithms and routines that can exploit the potential of multicore and multi-GPU architectures is a preliminary step to exploit emergent massively parallel machines. We propose a twofold approach to tackle this challenge in the context of dense linear algebra operations. We first show that standard algorithms may have a limited degree of paralellism and present alternative algorithms that provide a higher degree of parallelism. The second step consists of implementing those algorithms. We show how we can benefit from advanced runtime systems (Quark, StarPU, ...) to schedule our algorithms on complex multicore or heterogeneous multi-GPU architectures, achieving high performance together with a very high productivity. We will show some demonstration codes and some experimental results.
10h30 - 11h00	Coffee break
11h00 - 12h30	Session 3 - 2nd part
	Enrique S. Quintana Ortí Universidad Jaime I quintana@icc.uji.es Title: High performance numerical linear algebra kernels Abstract: In response to the combined hurdles of maximum power dissipation, large memory latency, and little instruction-level parallelism left to be exploited, all major chip manufacturers have finally adopted multi-core designs as the only means to exploit the increasing number of transistors dictated by Amdahl's Law. Thus, desktop systems equipped with general-purpose four-core processors and graphics processors (GPUs) with hundreds of fine-grained cores are routine today. While these new architectures can potentially yield a much higher performance, this usually will come at the expense of tuning codes in a few cases or a complete rewrite in many others. Linear algebra, which is ubiquitous in scientific and engineering applications, is currently undergoing this change. In this lecture we will review practical aspects of existing sequential and parallel dense linear algebra libraries for these new architectures. The lecture will be specially focused on GPUs, inspecting the implementation of BLAS for NVIDIA processors, and evaluating the implementation of LAPACK on top of these kernels. We will also describe how dynamic data-driven scheduling also yields a higher degree of parallelism for multi-GPU platforms and how to hide the PCI-e latency by borrowing cache coherence techniques well-known in computer architecture. Finally, we will offer a glimpse on the parallelization of dense linear algebra libraries for clusters of nodes equipped with GPUs.
12h30 - 14h00	Lunch
14h00 - 15h30	Session 4 - 1rst part
	Manfred Liebmann Karl-Franzens-University Graz manfred.liebmann@uni-graz.at Title: Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Abstract: The presentation investigates various aspects of the implementation of algebraic multigrid methods on GPU-accelerated hybrid architectures. Furthermore, different parallelization strategies are investigated to achieve optimal performance on a broad range of hardware configurations.
15h30 - 16h00	Coffee break
16h00 - 17h30	Session 4 - 2nd part
	Michael Heroux Sandia National Labs. maherou@sandia.gov Title: Toward Portable Programming of Numerical Linear Algebra on Manycore Nodes Abstract: Manycore architectures exist in several varieties, with more variations to come. Writing portable software during this transition period is challenging. In this talk we present approaches we have used on the Trilinos project to develop portable parallel linear algebra libraries and supporting interfaces. We discuss current efforts to identify and express optimal parallel algorithms that are portable across current manycore nodes and have a good chance to work well on future systems.
17h30 - 18h00	Debriefing

Wednesday 8 june

9h00 - 10h30	Session 5 - 1rst part
	Robert Strzodka Max Planck Institut Informatik et Stanford University strzodka@mpi-inf.mpg.de Title: Finte element methods on GPU systems Abstract: This session covers techniques for GPU-based Finite Element multigrid solvers. After presenting the necessary preliminaries, we outline the broader context in which the techniques can be applied. We introduce structured, block-structured and unstructured discretization grids for finite element, finite difference and finite volume techniques and discuss their computational and bandwidth requirements. Here, we also describe fine-grained parallelization techniques for the "finite element assembly", i.e., the construction of linear systems of equations from the discretization of a PDE on a mesh covering the computational domain. The second part of our lecture covers another building block in finite element simulation software, namely iterative solvers for sparse linear systems. We briefly present the necessary building blocks from numerical linear algebra that are needed to implement solvers of multigrid and Krylov subspace type; and then focus on numerically powerful preconditioning and smoothing techniques and the general trade-off between recursive, inherently sequential numerically advantageous properties and scalable parallelization for fine-grained architectures such as GPUs. In addition, we present mixed precision methods as a generic performance improvement technique in the context of iterative solvers. The third part of the lecture is then concerned with integrating the GPU-based components from the previous parts into large-scale PDE software that executes on heterogeneous clusters. We discuss the benefits and drawbacks of a "minimally invasive" integration vs. a full re-implementation in such clusters, technical details of an efficient management of heterogeneous resources (scheduling, overlapping communication with computation, etc.) and present case studies for applications from fluid dynamics, solid mechanics and geophysics.
10h30 - 11h00	Coffee break
11h00 - 12h30	Session 5 - 2nd part
	Dominik Goeddeke Institute of Applied Mathematics, TU Dortmund, Germany. dominik.goeddeke@math.tu-dortmund.de Title: Finte element methods on GPU systems Abstract: This session covers techniques for GPU-based Finite Element multigrid solvers. After presenting the necessary preliminaries, we outline the broader context in which the techniques can be applied. We introduce structured, block-structured and unstructured discretization grids for finite element, finite difference and finite volume techniques and discuss their computational and bandwidth requirements. Here, we also describe fine-grained parallelization techniques for the "finite element assembly", i.e., the construction of linear systems of equations from the discretization of a PDE on a mesh covering the computational domain. The second part of our lecture covers another building block in finite element simulation software, namely iterative solvers for sparse linear systems. We briefly present the necessary building blocks from numerical linear algebra that are needed to implement solvers of multigrid and Krylov subspace type; and then focus on numerically powerful preconditioning and smoothing techniques and the general trade-off between recursive, inherently sequential numerically advantageous properties and scalable parallelization for fine-grained architectures such as GPUs. In addition, we present mixed precision methods as a generic performance improvement technique in the context of iterative solvers. The third part of the lecture is then concerned with integrating the GPU-based components from the previous parts into large-scale PDE software that executes on heterogeneous clusters. We discuss the benefits and drawbacks of a "minimally invasive" integration vs. a full re-implementation in such clusters, technical details of an efficient management of heterogeneous resources (scheduling, overlapping communication with computation, etc.) and present case studies for applications from fluid dynamics, solid mechanics and geophysics.
12h30 - 14h00	Lunch
14h00 - 15h30	Session 6 - 1ère partie
	Felix Wolf German Research School for Simulation Sciences f.wolf@grs-sim.de Title: Methods for performance evaluation and optimization on modern HPC systems Abstract: General introduction to the performance analysis of parallel codes, followed by a lesson on communication and synchronization analysis using the Scalasca performance tool (http://www.scalasca.org). The lecture will give an overview of Scalasca, explain its functionality, and discuss case studies demonstrating its various analysis modes.
15h30 - 16h00	Coffee break
16h00 - 17h30	Session 6 - 2nd part
	Judith Gimenez Barcelona Supercomputing center judith@bsc.es Title: Methods for performance evaluation and optimization on modern HPC systems Abstract: Paraver and Dimemas (www.bsc.es/paraver) are part of CEPBA-Tools, an open source project developed by BSC. Paraver is based on traces and can be used to analyze any information expressed on its input trace format. Dimemas is a simulation tool for the analysis of message-passing applications behavior on a configurable platform. The talk would present the tools and examples of analysis illustrating the level of details and insight that these tools provide.
17h30 - 18h00	Debriefing

Thursday 9 june

9h00 - 10h30	Session 7 - 1rst part
	Raymond Namyst Equipe RUNTIME, INRIA Bordeaux - Sud–Ouest raymond.namyst@inria.fr Title: Programming heterogeneous, accelerator-based multicore machines: a runtime system's perspective Abstract: Heterogeneous accelerator-based parallel machines, featuring manycore CPUs and with GPU accelerators provide an unprecedented amount of processing power per node. Dealing with such a large number of heterogeneous processing units -- providing a highly unbalanced computing power -- is one of the biggest challenge that developpers of HPC applications have to face. To Fully tap into the potential of these heterogeneous machines, pure offloading approaches, that consist in running an application on regular processors while offloading part of the code on accelerators, are not sufficient. In this talk, I will go through the major programming environments that were specifically designed to harness heterogeneous architectures, focusing on runtime systems. I will discuss some of the most critical issues programmers have to consider to achieve portability of performance, and I will show how advanced runtime techniques can speed up applications in the domain of dense linear algebra. Eventually, I will give some insights about the main challenges designers of programming environments will have to face in upcoming years.
10h30 - 11h00	Coffee break
11h00 - 12h30	Session 7 - 2nd part
	Marc Tajchman CEA Marc.Tajchman@cea.fr Title: Programming paradigms using PGAS-based languages Abstract: PGAS (Partitioned Global Address Space) languages propose an abstract and unified view of the computing architecture (multiple nodes, each with one or more computing cores) and memory architecture (private and shared on-node memory, remote memory direct access). Algorithm implementation on parallel machines is easier with these languages, which offer flexible programming models mixing task and data parallelism. But these languages are still in an experimental state and performance must be improved. We will present a few test cases with several available languages of this family
12h30 - 14h00	Lunch
14h00 - 15h30	Session 8 - 1rst part
	Tim Warburton Rice University tim.warburton@gmail.com Title: GPU Accelerated Discontinuous Galerkin Methods Abstract: The discontinuous Galerkin (DG) methods can be viewed as a combination of finite volume and finite element methods, building high-order polynomial approximations on standard finite element meshes. However, the inter-element continuity of DG solutions is only weakly enforced. Discretized DG operators are typically sparse with dense sub blocks, and this structure makes them ideal candidates for general purpose graphics processing unit (GPGPU) implementations. We will describe how these methods map onto the current GPGPU thread hierarchy models prevalent in CUDA and OpenCL. After reviewing the basic formulation and implementations we will discuss how a new family of DG methods, for solving conservation laws on domains meshed with curvilinear elements, was motivated by GPGPU architectural considerations. Performance results and simulation results will be presented.
15h30 - 16h00	Coffee break
16h00 - 17h30	Session 8 - 2nd part
	Luigi Genovese CEA/INAC/SP2M Laboratoire de Simulation Atomistique luigi.genovese@esrf.fr Title: Wavelet-Based DFT calculations on Massively Parallel Hybrid Architectures Abstract: Electronic structure calculations (DFT codes) are certainly among the disciplines for which an increasing of the computational power correspond to an advancement in the scientific results. In this contribution, we present an implementation of a full DFT code that can run on massively parallel hybrid CPU-GPU clusters. Our implementation is based on modern GPU architectures which support double-precision floating-point numbers. This DFT code, named BigDFT, is delivered within the GNU-GPL license either in a stand-alone version or integrated in the ABINIT software package. Hybrid BigDFT routines were initially ported with NVidia's CUDA language, and recently more functionalities have been added with new routines writeen within Kronos' OpenCL standard. The formalism of this code is based on Daubechies wavelets, which is a systematic real-space based basis set. As we will see in the presentation, the properties of this basis set are well suited for an extension on a GPU-accelerated environment. In addition to focusing on the implementation of the operators of the BigDFT code, this presentation also relies of the usage of the GPU resources in a complex code with different kinds of operations. A discussion on the interest of present and expected performances of Hybrid architectures computation in the framework of electronic structure calculations is also adressed.
17h30 - 18h00	Debriefing

Friday 10 june

9h00 - 10h30	Session 9 - 1rst part
	Denis Caromel INRIA - ActiveEon denis.caromel@inria.fr Title: ProActive Hybrid Workflows with CPUs and GPUs Abstract: The presentation will give an overview of issues at hand when accelerating demanding applications with Multi-Cores, Clusters, Servers, GPUs and Clouds. The point will be illustrated with ProActive Parallel Suite, an OW2 Open Source library for parallel, distributed, and concurrent computing, allowing to showcase Interactive GUI and tools. Strong theoretical foundation ensures many properties for the resulting parallel programs. Typical SPMD programming will also be discussed. An important aspect of HPC today is the capacity to appropriately map various parallel tasks onto nowadays complex heterogeneous hardware (NUMA, Hybrid with CPU and GPU). We shall introduce how one can control such mapping in a fine manner, potentially selecting all hardware and software specificity on nodes, selecting proximity up to the latency between nodes.
10h30 - 11h00	Coffee break
11h00 - 12h30	Session 9 - 2nd part
	Florent Duchaine CERFACS Florent.Duchaine@cerfacs.fr Title: Large Eddy Simulation and Muti-physics for Engine computations on Massively Parallel Machines Abstract: Efficient numerical tools taking advantage of the ever-increasing power of high-performance computers become key elements in the fields of energy supply and transportation, not only from a purely scientific point of view, but also at the design stage in industry. Indeed, flow phenomena that occur in or around the industrial applications such as gas turbines or aircraft are still not mastered. In fact, most Computational Fluid Dynamics (CFD) predictions produced today focus on reduced or simplified versions of the real systems and are usually solved with a steady state assumption. The presentation shows how recent developments of CFD codes and parallel computer architectures can help overcoming this barrier. With this new environment, new scientific and technological challenges can be addressed provided that thousands of computing cores are efficiently used in parallel. Strategies of modern flow solvers are discussed with particular emphases on mesh-partitioning, load balancing and communication. These concepts are used in CFD codes developed by CERFACS. Leading edge computations obtained with these high-end massively parallel CFD codes are illustrated and discussed in the context of aircrafts, turbo-machinery and gas turbine applications. Finally, current developments of multi-physics simulations based on code coupling are discussed. It is shown how code coupling can be used to provide leading edge tools that directly benefit from high performance computing with strong industrial implications at the design stage of the next generation of aircraft and gas turbines.
12h30 - 14h00	Debriefing and buffet