A simple bulk-synchronous scaling argument suggests that continued expansion of the number of processors is feasible as long as the architecture provides a global reduction operation whose time complexity is sublinear in the number of processors. However, the cost-effectiveness of this brute-force approach towards petaflop/s is highly sensitive to the frequency and latency of global reduction operations, and to modest departures from perfect load balance.
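To make this sensitivity concrete, here is a deliberately crude sketch in our own notation (not part of the scaling argument's original statement): under memory-constrained scaling, each processor performs a fixed amount $w$ of local work per iteration and participates in $r$ global reductions, each costing $C(P)$ on $P$ processors,
\[
  T(P) \;\approx\; w + r\,C(P), \qquad
  E(P) \;=\; \frac{w}{w + r\,C(P)}, \qquad
  C(P) \;\approx\; \alpha\,\lceil \log_2 P \rceil ,
\]
where $\alpha$ is the latency of one reduction stage. With a logarithmic-depth reduction the overhead grows only as $\log_2 P$, so efficiency erodes slowly as processors are added, whereas a reduction whose cost grew linearly in $P$ would eventually swamp the local work. The erosion is multiplied directly by the reduction frequency $r$ and the latency $\alpha$, and any load imbalance inflates the critical-path value of $w$, which is why modest departures from perfect balance are so costly.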
Looking inside a processor, we argue that only two intermediate levels of the memory hierarchy are essential to a typical domain-decomposed PDE simulation, and therefore that most of the system cost and performance cost of maintaining a deep multilevel memory hierarchy could be better invested in improving access to the relevant working sets: those associated with individual local stencils (matrix rows) and those associated with entire subdomains. Improvement of local memory bandwidth, together with multithreading and intelligent prefetching (perhaps via processors in memory) to exploit it, could contribute approximately an order of magnitude of performance within a processor relative to present architectures. Sparse problems will never have the locality advantages of dense problems, but it is only necessary to stream data at the rate at which the processor can consume it, and what sparse problems lack in locality they can make up for through scheduling. With statically discretized PDEs, the schedule is deterministic. The usual increases in processor clock rate and in the width or multiplicity of instructions issued are other obvious avenues for improving per-processor computational rates, but only if memory bandwidth is raised proportionally.
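As an illustrative C sketch (ours, not drawn from the text) of the streaming behavior appealed to above, a compressed-sparse-row matrix-vector product touches its matrix arrays exactly once per sweep, so its flop rate is set by how fast memory can deliver them; the per-row working set is one stencil, and the vector working set is the whole subdomain. Because the discretization is static, the index arrays, and hence the access schedule into the vector, are known in advance and could drive prefetching or processor-in-memory staging.
\begin{verbatim}
/* Sketch of a CSR sparse matrix-vector product, y = A*x.
 * The per-row working set is one stencil's worth of val/col entries;
 * the per-subdomain working set is the vectors x and y.  The arrays
 * val, col, and rowptr are streamed once per product, so sustained
 * memory bandwidth, not peak issue rate, bounds the flop rate. */
void spmv_csr(int n, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];  /* one multiply-add per streamed nonzero */
        y[i] = sum;
    }
}
\end{verbatim}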
Besides these two classes of architectural improvements --- more and better-suited processor/memory elements --- we consider two classes of algorithmic improvements: some that improve the raw flop rate and some that increase the scientific value of what can be squeezed out of the average flop.
In the first category, we mention higher-order discretization schemes (especially of discontinuous or mortar type), orderings that improve data locality, and iterative methods that are less synchronous than today's.
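As one hedged example of the second item, a locality-improving ordering: the C sketch below (ours, assuming a symmetric adjacency structure in compressed-sparse-row form) performs the breadth-first traversal at the heart of the classical Cuthill-McKee ordering; reversing the resulting permutation gives reverse Cuthill-McKee, which tends to reduce matrix bandwidth and improve cache reuse in row-by-row sweeps.
\begin{verbatim}
#include <stdlib.h>

/* Breadth-first Cuthill-McKee-style ordering of a connected graph
 * given in CSR form (rowptr/col).  perm[] receives the new-to-old
 * vertex numbering.  The classical algorithm also sorts each level
 * by vertex degree; that refinement is omitted here for brevity. */
void cuthill_mckee(int n, const int *rowptr, const int *col,
                   int root, int *perm)
{
    char *visited = calloc(n, 1);
    int head = 0, tail = 0;

    visited[root] = 1;
    perm[tail++] = root;

    while (head < tail) {
        int v = perm[head++];
        for (int k = rowptr[v]; k < rowptr[v + 1]; k++) {
            int w = col[k];
            if (!visited[w]) {        /* enqueue each neighbor once */
                visited[w] = 1;
                perm[tail++] = w;
            }
        }
    }
    /* Assumes the graph is connected from root; a disconnected graph
     * would require reseeding from an unvisited vertex. */
    free(visited);
}
\end{verbatim}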
It can be argued that the second category of algorithmic improvements does not belong at all in a discussion focused on computational rates. However, since the ultimate purpose of computing is insight, not petaflop/s, it must be mentioned as part of a balanced program, especially since it is not conveniently orthogonal to the other approaches. We make a brief pitch for revolutionary improvements in the practical use of problem-driven algorithmic adaptivity in PDE solvers --- not just better system software support for well-understood discretization-error-driven adaptivity, but true polyalgorithmic and multiple-model adaptivity. To plan for a ``bee-line'' port of existing PDE solvers to petaflop/s architectures while ignoring the demands of the next generation of solvers would lead to petaflop/s platforms whose effectiveness in scientific and engineering computing might be no greater than that of less powerful but more versatile platforms. The danger of such a Pyrrhic victory is real.
Keyes is active in SIAM and in the Domain Decomposition, Parallel CFD, and Petaflops architecture communities. His academic background includes degrees in Mechanical Engineering (BSE, Princeton, 1978) and Applied Mathematics (PhD, Harvard, 1984), and postdoctoral work in Computer Science (Yale, 1984--1985).