# Research and development activities

Research and development activities undertaken in the project are organized along 5 thematic poles.

Pole 1: Numerical linear algebra

Coordinators: Jean-Yves L’Excellent (Roma project-team) and Luc Giraud (Hiepacs project-team)

The design of the extreme-scale computing platforms that are expected to become available in the forthcoming decade will represent a convergence of technological trends and the boundary conditions imposed by over half a century of algorithm and application software development. These platforms will be hierarchical as they will provide coarse grain parallelism between nodes and fine grain parallelism within each node. They are also expected to be very heterogeneous since multicore chips and accelerators have completely different architectures and potentials. It is clear that such a degree of complexity will embody radical changes regarding the software infrastructure for large-scale scientific applications. Central numerical kernels such as fast transforms or numerical linear algebra solvers –dense or sparse– are intensively used in many large-scale computer simulations in science and engineering, where they often account for the most time-consuming part of the computations. It is widely recognized that there is not a single strategy that outperforms all the others for a large class or problems. To address these challenges, we consider in the project numerical kernels that exhibit a natural hierarchical dependency, which can be summarized in a bottom-up description as follows:

- Core numerical kernels. Many HPC applications intensively rely on the same core and critical kernels such as dense linear algebra solvers and Fast Fourier Transforms. Their efficient implementation via innovative algorithms is consequently central for the overall performance. Those kernels usually have a regular data access pattern which makes them candidates of choice for designing novel algorithms on each new hardware generation.
- Sparse direct solvers. They are based on Gaussian elimination and intensively use dense linear algebra kernels. They are very attractive because of their robustness and accuracy. Although the size of the problems that can be solved with direct solvers keeps increasing thanks to the evolution of algorithms and architectures their intrinsic memory and time complexity is not always not affordable for huge 3D problems. Nevertheless, direct solvers can be used as building-blocks for iterative methods.
- Preconditioned iterative solvers. We consider general purpose techniques, namely preconditioned Krylov subspace solvers. The targeted preconditioning techniques only exploit information available in the matrices of the linear systems associated with the problem (so called algebraic solvers). Compared with direct solvers, the complexity is significantly reduced by attempting to provide a trade-off between memory/processing cost on one side and robustness/accuracy on the other side. Algebraic domain decomposition techniques exhibit such computational features and exploit sparse direct solver capabilities on subproblems, their convergence can also be enhanced thanks to deflating or filtering techniques. Although generic, they may still be suboptimal for some classes of problems where information from the continuous problems (giving rise to the linear system) might be exploited.
- Continuous solvers. They consist of domain decomposition techniques that take into account the mathematical properties of the differential equations. They are often optimal but they still require the solution of some differential equations on the subproblems. Those local solutions can be obtained using either sparse direct solvers or iterative solvers depending on the size of the local problems and features of the target computing platforms.

The combination of the above-mentioned techniques will enable to exploit many levels of parallelism with various granularity suited for an effective use of the forthcoming heterogeneous extreme-scale platforms. The advances in the methods will benefit in particular but not exclusively to the solution of the systems of differential equations characterizing the uses cases considered in the project.

Pole 2: Numerical schemes for PDE models

Coordinators: Jocelyne Erhel (Sage project-team) and Philippe Helluy (Calvi project-team)

The availability of massively parallel systems with theoretical
floating point performances in the petascale range, and in the
exascale range in a near future, will enable the numerical treatment
of more challenging problems involving, on the one hand, discretized
models with higher spatial resolution and, on the other hand, the
access to longer time scales but also to more complex physical models
possibly encompassing multiple space and time scales. However, the
simulation of such complex physical phenomena will require very
accurate and efficient numerical schemes that will ideally be able to
automatically switch between arbitrary high order accuracy in regions
where the problem solution is smooth, and low order accuracy combined
to local adaptivity of the discretization mesh in less regular
regions. Moreover, the targeted simulation tools will possibly have to
couple several mathematical models in a multiscale space-time
setting. From the algorithmic point of view, the implementation of the
proposed numerical schemes will have to be optimized for maximizing
both the single core computational performances and the scalability in
view of exploiting massive parallelism. In this context, the objective
of the activities undertaken in this thematic pole is to address these
needs for some general widely used types of differential equations, as
well as to develop complex solvers tuned for some specific challenging
physical problems.

The expertise of the project-teams that are active in this thematic
pole is about the mathematical analysis and numerical study of
differential equations modeling various types of physical problems.
Nuclear energy production and radioactive waste management are two
application domains that are addressed in the first place. The
corresponding scientific and engineering use cases deal with
transport, diffusion and convection-diffusion problems. These use
cases are proposed by the two external partners that are participating
to the project at its start: ANDRA (French National Agency for
Radioactive Waste Management) and CEA (French Alternative Energies and
Atomic Energy Commission). In addition to these two central
application challenges, other physical problems (such as seismis wace
propagation and electromagnetic wave propagation) are considered by
the project partners for the purpose of demonstrating the benefits of
the methodological contributions of this thematic pole (and the other
thematic poles as well).

Pole 3: Optimization of performances of numerical solvers

Coordinators: François Pellegrini (Bacchus project-team) and Olivier Aumage (Runtime project-team)

The research and development activities at the heart of the thematic poles 1 and 2 deal with core ingredients of the numerical treatment of the systems of differential equations modeling the computational physics problems considered in the project. A feature shared by the techniques adopted for the discretization of these systems of differential equations is that they rely on unstructured meshes and modern principles for building high order, possibly auto-adaptive (i.e. hp-adaptive) finite element or finite volume type methods. At the discrete level, these methods result in large sparse or hybrid sparse-dense linear systems of equations for which we propose to study solution strategies that combine, in a hierarchical way, iterative and direct building blocks. Then, the activities undertaken in these poles aim at designing differential equation solvers and associated numerical kernels that exploit as far as possible the architectural characteristics of modern massively parallel computing platforms. Besides, these solvers will be characterized by complex data structures and highly irregular data access patterns which will drastically impact the sustained computational performances, in particular if one does not take into consideration crucial complementary questions related to the mapping of data sets and numerical algorithms to the targeted computing platforms. This is exactly the objective of the thematic pole 3 which is concerned with the optimization of the performances of numerical solvers by considering topics that often make a link or are at the interface between computer science techniques and tools for exploiting high performance computing systems and application software. While doing so, two architectural considerations that must be dealt with are the hierarchical organization of the memory and the heterogeneity of processing units and networks. The topics to be addressed in this thematic pole are the following:

- Resource management and scheduling strategies. The resource management topic can be subdivided in two parts. First, as applications become more complex, their mapping to resource also becomes more complex. For example, some parts of an application can depend on the input data size or the number of actual resources. The current solution is to mix application optimization code with the functional code of an application. While it may work for static applications, it is even less satisfactory for dynamic applications. Therefore, the issue is to revisit interactions between application and resource management systems. Second, system services should be able to operate efficiently at very large scale. Hence, the issue is to design distributed algorithms for example for deployment, discovery or composition/orchestration services. Finally, applications can be written as workflows with data dependencies between tasks (which in turn can be parallel). These applications need specific resource and data management heuristics. Moreover, typical high performance applications require both an initial mapping of data and tasks and periodic re-scheduling to address the dynamic behavior of the system environment (e.g. failure) or of the application itself. The programming model and the role of the programmer are important in order to allow for flexibility in the scheduling algorithm. For instance, using an abstract data flow graph representation allows using, transparently with respect to the application code, graph partitioning algorithms in order to redistribute the workload together with a local (or global) work stealing algorithm to adapt variation of the load among the cores.
- Runtime systems. Traditional processors have reached architectural limits which heterogeneous multicore designs and hardware specialization (e.g. coprocessors, accelerators, etc.) intend to address. However, exploiting such machines introduces numerous challenging issues at all levels, ranging from programming models and compilers to the design of scalable hardware solutions. The design of efficient runtime systems for these architectures is a critical issue. In the project, we intend to address research topics that are all essential steps in solving the general parallel composability issue at the runtime system level: (1) micro runtime system supporting multiple parallel paradigms, (2) framework for experimenting with hybrid programs, (3) adaptive parallelism and code auto-tuning and, (4) validation using well known parallel kernels.
- Techniques and tools for static and dynamic processing of numerical data sets. Finite element or finite volume differential equation solvers such as those considered in the project are naturally adapted to the use of unstructured meshes for the discretization of the underlying computational domain. Such meshes are amenable to local refinement or coarsening so as to match the complex geometrical features or/and the steep variations of the solved physical quantities. However, this enhanced flexibility comes at the expense of an increased difficulty to optimize and preserve data locality. Moreover, very large-scale 3D simulations generally involve unstructured meshes sizes ranging from a few tens up to several hundreds million elements if not more. Distributing such meshes on several thousands of processing units in view of their parallel processing by the PDE solvers is a task that can hardly be tackled as a sequential process, especially if this partitioning has to be performed more than once during a simulation as a result of an auto-adaptive discretization process. Overall, several subtopics have to be considered here that are related to the manipulation of very large numerical data sets for their distribution either before (i.e. static partitioning) or during (i.e. dynamic repartitioning - or re-meshing) a numerical simulation, dynamic load balancing, and for their runtime analysis (i.e. computational steering).

Pole 4: Programming models

Coordinators: Thierry Gautier (Moais project-team) and Christian Perez (Avalon project-team)

Exascale computing will offer to the programmer the opportunity to use millions of cores. Cores will certainly be heterogeneous, where a simple chip may contains numerous simple cores and few complex cores (CPU); memory will be hierarchically organized. Moreover, as exascale computing will provide more computing and storage capabilities, it will enable new kinds of applications. For example, multiphysics applications will be able to integrate more phenomena. It will be unimaginable that programmers will be exposed directly to these hardware and software complexities, which must be carried out by the software stack between the application and the hardware rather than by the programmers themselves. In this thematic pole, programming models will be investigated to reduce the previously mentioned complexities. The common idea is to structure the parallelism and its description using known skeletons or patterns with (guaranteed) good performances on multiple architectures. The absence of consensus to write or compose HPC applications motivates the research of alternative ways to program them. Two research directions have been identified:

- Component models. Here a hierarchical approach is adopted where, at the lower level, each component is assumed to efficiently exploit the hardware, while at the upper level the coordination between components is responsible to manage communication and synchronization taking into account the hierarchical and heterogeneous characteristics of the underlying architecture as well as the knowledge of interactions between distributed components. However, current HPC applications are still written according to a low level of abstraction. This puts a heavy burden on application developers as they have to be expert of the target machine to make efficient implementation choices. Moreover, as this work as to be done for each new machine, porting an application to a new machine is a very long, difficult and expensive task. As resources become hierarchical and heterogeneous, this approach seems less and less acceptable.

Without suitable coordination steps the approach of aggregating multiple parallel codes, libraries and frameworks into composite applications is bound to fail at some point, as computing units quantity and heterogeneity levels keep increasing. Indeed, most existing codes have been written with the assumption of a mostly exclusive access to the underlying hardware. Now, application development has to evolve together with runtime systems (pole 3) to make every piece of code cooperate rather than interfere to ensure an efficient overall behavior on exascale architectures and an efficient use of their power. - High level parallel programming models. Although very attractive, this direction is difficult to manage: on one hand, standard programming models, such as MPI, Openmp are well adopted by the HPC community that have billion of lines of codes that reclaims huge amounts of work to port to new standards; on the other hand with the complexity of new exascale architectures, several researchers have made criticisms about the capacity of these programming models to be able to be productive with good portability of efficiency of the programs. Several new languages, such as X10 (IBM), Fortress (Sun) or Chapel (Cray) which are funded by DARPA’s High Productivity Computing Systems project, have been proposed in order to abstract the architecture (physical parallelism) to allow the development of architecture-oblivious algorithms. High-level programming offers a high level and uniform view of the (heterogeneous) hardware. Most high-level programming models are based on a program-centric view to describe interaction between tasks. Researchers aims at adapting the required number of tasks and studying scheduling algorithms that maps logical parallelism of the application to the physical available parallelism.

Widely adopted standards are invaluable however as a foundation for experimenting with new programming model ideas by testing carefully selected extensions to improve expressiveness and better guide underlying runtime systems. Moreover annotation-based programming models present several advantages of their own. They allow incremental ports of existing applications, thus avoiding expensive and hazardous whole application rewrites. They enable the developers to continuously test their applications during the port process and thus to quickly address issues that may arise from such a port instead of waiting for the whole port to be complete. Finally they bring in the often underused analysis power of the compiler which can be exploited to generate hints during the processing of pragmas.

The key points here are providing a correct abstraction of the underlying architecture, finding good ways to describe the application parallelism or to identify computational patterns and, providing a scheduling algorithm that maps the logical parallelism of the application to the physical available hardware resources. Right features and design decisions allow reducing the overhead inherent to architecture abstraction. This is why the thematic pole 4 is strongly connected with thematic pole 3 that address runtime tools and algorithms to exploit parallel architecture. Resource managers, scheduling algorithms, graph partitioning and load balancing algorithms that are studied in the context of the thematic pole 3 are viewed as basic parallel building blocks that will be coordinated at an upper level by taking into account both architecture abstraction and structural properties of parallel programs.

Pole 5: Resilience for exascale computing

Coordinators: Frédéric Vivien (Roma project-team) and Laura Grigori (Alpines project-team)

Simulation is developing as a third methodological pillar for many science disciplines, in addition to theory and experiments. Simulations, which are done on high performance computers, are slowed down by preventive and corrective actions used to mask hardware or software component faults. At extreme scale (such as the PRACE systems, NSF Track 1 and Track 2 systems or the Department of Energy Leadership computing systems), where high performance computers are pushed to their limits in scale and technologies, faults are so frequent that not only they challenge the robustness of the mechanisms that are supposed to ensure reliable executions, but they also force to use these mechanisms much more frequently, thereby significantly decreasing the simulation performance. With the advent of exascale, computing, involving 10,000,000 processor cores or more, experts of fault tolerance for HPC systems consider that the time between two consecutive faults will be lower than the time required for a restart-checkpoint cycle. At this point, no actual progress of the application can be expected. An important objective is to develop new algorithmic techniques to solve the exascale resilience problem. Solving this problem implies a rupture from current approaches, and calls for yet-to-be-discovered algorithms, protocols and software tools. In the project the following research directions will be considered:

- Energy effective fault tolerant protocols for HPC applications.
- Design of efficient fault tolerant protocols and algorithm-basedfault tolerance.
- Performance execution models for fault-tolerant applications.
- Resilience for sparse linear algebra.