A simple bulk-synchronous scaling argument suggests that continued expansion of the number of processors is feasible as long as the architecture provides a global reduction operation whose time complexity is sublinear in the number of processors. However, the cost-effectiveness of this brute-force approach towards petaflop/s is highly sensitive to the frequency and latency of global reduction operations, and to modest departures from perfect load balance.
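To make this sensitivity concrete, here is a deliberately crude sketch in our own notation (not part of the scaling argument's original statement): under memory-constrained scaling, each processor performs a fixed amount $w$ of local work per iteration and participates in $r$ global reductions, each costing $C(P)$ on $P$ processors,
\[
  T(P) \;\approx\; w + r\,C(P), \qquad
  E(P) \;=\; \frac{w}{w + r\,C(P)}, \qquad
  C(P) \;\approx\; \alpha\,\lceil \log_2 P \rceil ,
\]
where $\alpha$ is the latency of one reduction stage. With a logarithmic-depth reduction the overhead grows only as $\log_2 P$, so efficiency erodes slowly as processors are added, whereas a reduction whose cost grew linearly in $P$ would eventually swamp the local work. The erosion is multiplied directly by the reduction frequency $r$ and the latency $\alpha$, and any load imbalance inflates the critical-path value of $w$, which is why modest departures from perfect balance are so costly.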
Looking inside a processor, we argue that only two intermediate levels of the memory hierarchy are essential to a typical domain-decomposed PDE simulation, and therefore that most of the system cost and performance cost of maintaining a deep multilevel memory hierarchy could be better invested in improving access to the relevant working sets: those associated with individual local stencils (matrix rows) and those associated with entire subdomains. Improvement of local memory bandwidth, together with multithreading and intelligent prefetching (perhaps via processors in memory) to exploit it, could contribute approximately an order of magnitude of performance within a processor relative to present architectures. Sparse problems will never have the locality advantages of dense problems, but it is only necessary to stream data at the rate at which the processor can consume it, and what sparse problems lack in locality they can make up for through scheduling. With statically discretized PDEs, the schedule is deterministic. The usual increases in processor clock rate and in the width or multiplicity of instructions issued are other obvious avenues for improving per-processor computational rates, but only if memory bandwidth is raised proportionally.
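As an illustrative C sketch (ours, not drawn from the text) of the streaming behavior appealed to above, a compressed-sparse-row matrix-vector product touches its matrix arrays exactly once per sweep, so its flop rate is set by how fast memory can deliver them; the per-row working set is one stencil, and the vector working set is the whole subdomain. Because the discretization is static, the index arrays, and hence the access schedule into the vector, are known in advance and could drive prefetching or processor-in-memory staging.
\begin{verbatim}
/* Sketch of a CSR sparse matrix-vector product, y = A*x.
 * The per-row working set is one stencil's worth of val/col entries;
 * the per-subdomain working set is the vectors x and y.  The arrays
 * val, col, and rowptr are streamed once per product, so sustained
 * memory bandwidth, not peak issue rate, bounds the flop rate. */
void spmv_csr(int n, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];  /* one multiply-add per streamed nonzero */
        y[i] = sum;
    }
}
\end{verbatim}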
Besides these two classes of architectural improvements --- more and better-suited processor/memory elements --- we consider two classes of algorithmic improvements: some that improve the raw flop rate and some that increase the scientific value of what can be squeezed out of the average flop.
In the first category, we mention higher-order discretization schemes (especially of discontinuous or mortar type), orderings that improve data locality, and iterative methods that are less synchronous than today's.
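As one hedged example of the second item, a locality-improving ordering: the C sketch below (ours, assuming a symmetric adjacency structure in compressed-sparse-row form) performs the breadth-first traversal at the heart of the classical Cuthill-McKee ordering; reversing the resulting permutation gives reverse Cuthill-McKee, which tends to reduce matrix bandwidth and improve cache reuse in row-by-row sweeps.
\begin{verbatim}
#include <stdlib.h>

/* Breadth-first Cuthill-McKee-style ordering of a connected graph
 * given in CSR form (rowptr/col).  perm[] receives the new-to-old
 * vertex numbering.  The classical algorithm also sorts each level
 * by vertex degree; that refinement is omitted here for brevity. */
void cuthill_mckee(int n, const int *rowptr, const int *col,
                   int root, int *perm)
{
    char *visited = calloc(n, 1);
    int head = 0, tail = 0;

    visited[root] = 1;
    perm[tail++] = root;

    while (head < tail) {
        int v = perm[head++];
        for (int k = rowptr[v]; k < rowptr[v + 1]; k++) {
            int w = col[k];
            if (!visited[w]) {        /* enqueue each neighbor once */
                visited[w] = 1;
                perm[tail++] = w;
            }
        }
    }
    /* Assumes the graph is connected from root; a disconnected graph
     * would require reseeding from an unvisited vertex. */
    free(visited);
}
\end{verbatim}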
It can be argued that the second category of algorithmic improvements does not belong at all in a discussion focused on computational rates. However, since the ultimate purpose of computing is insight, not petaflop/s, it must be mentioned as part of a balanced program, especially since it is not conveniently orthogonal to the other approaches. We make a brief pitch for revolutionary improvements in the practical use of problem-driven algorithmic adaptivity in PDE solvers --- not just better system software support for well-understood discretization-error-driven adaptivity, but true polyalgorithmic and multiple-model adaptivity. To plan for a ``bee-line'' port of existing PDE solvers to petaflop/s architectures while ignoring the demands of the next generation of solvers would lead to petaflop/s platforms whose effectiveness in scientific and engineering computing might be no greater than that of less powerful but more versatile platforms. The danger of such a Pyrrhic victory is real.
Keyes is active in SIAM and in the Domain Decomposition, Parallel CFD, and Petaflops architecture communities. His academic background includes degrees in Mechanical Engineering (BSE, Princeton, 1978) and Applied Mathematics (PhD, Harvard, 1984), and postdoctoral work in Computer Science (Yale, 1984--1985).