HPCwire
 The global publication of record for High Performance Computing / July 23, 2004: Vol. 13, No. 29

Previous Article   |  Table of Contents  |  

Features:

HEC ANALYSIS: WHO NEEDS HIGH-BANDWIDTH APPLICATIONS?
Commentary from the High-End Crusader

The Earth Simulator is a sensitive litmus test among computer professionals: those who admire the Japanese machine appreciate the need for high-bandwidth applications in high-end computing, while those who deprecate it (think IBM) view high-bandwidth applications as a sideshow of interest to a small minority of high-end users.

Some aspects of this debate are problematic for the field of high-end computing; these issues _should_ have been clearly articulated and debated rationally. As the very diplomatic HECRTF report says, "We now have evidence that there are applications of national importance that would benefit significantly from an alternative to [low-bandwidth systems]". The report's authors identify "improvements in bandwidth and latency for both [local] memory and [system-interconnect] fabric, which for many applications largely determine performance", as an important goal for future high-end computing systems.

If the vital importance of high-bandwidth applications is obvious to some but not obvious to all, there must be some subtlety in the argument that permits bandwidth doubters to hold their view in the face of significant empirical evidence to the contrary. After all, we have made no real progress on any of the Grand Challenge problems articulated nearly a decade ago. Arguably, this lack of progress is largely due to our not having enough global bandwidth in our systems. None of this is rocket science (although it _is_ computational science). One goal of this article is to expose the subtlety in the argument for bandwidth.

Rick Stevens of Argonne National Laboratory once identified the main requirements for PFs/s computing as: 1) memory and cache footprints, i.e., the amount of memory required at each level of the memory hierarchy; 2) the degree of data reuse associated with core kernels of the application; 3) the scaling of these kernels; and 4) the associated estimates of the required local-memory and system-interconnect bandwidths. This is fine. But then Stevens stepped too close to a slippery slope.

In many cases, Stevens said, application analysis can be reduced to understanding the required memory capacity and the bandwidth requirements for the kernel algorithms in the application, as well as their scaling properties as the problem size grows. Here comes the slip. Traditional scientific application areas such as general circulation models, quantum chromodynamics, and fluid dynamics in astrophysics are relatively well understood and are not ---according to Stevens---particularly stressed by major memory or bandwidth requirements. In contrast, other application areas such as data mining and computer-aided reasoning are far more stressed.

The slip here is to identify a high-end application area with the individual (existing and not-yet-written) applications in that area. Application areas are surprisingly diverse. For example, there is a continuum of computational fluid dynamics applications from very high bandwidth to very low bandwidth. We need to direct attention away from application areas to the diversity of individual applications within an area. We need simple, intuitive application attributes to distinguish applications.

High-Bandwidth Systems

It is still the case that there are two basic types of supercomputer, which differ primarily in their system-interconnect bandwidths. However, this is not to deny that this distinction is a matter of degree. For clarity, we exhibit the pure types.

A weakly parallel, low-bandwidth system has four main properties: 1) it spends most of its dollar budget on transistors, 2) its performance is characterized by Linpack R_max, 3) it is satisfied with a low-bandwidth system interconnect, and 4) it has weakly parallel processors whose only hardware mechanisms for tolerating memory latency are long cache lines and large MPI messages.

A strongly parallel, high-bandwidth system has four corresponding properties: 1) it spends more than half its dollar budget on system interconnect, 2) its performance is characterized by sparse matrix-vector multiply (and perhaps also by Gups), 3) it requires a high-bandwidth system interconnect, and 4) it has strongly parallel processors with powerful hardware mechanisms for tolerating memory latency; these are forms of parallelism---such as vectors and multithreading---that produce the memory-reference concurrency that is consumed by the memory pipeline. (Latency tolerance is basically just pipelined data transport; for small messages, this requires concurrency).

This distinction is a matter of degree, with the possibility of many intermediate design points. For example, a system with a high-bandwidth system interconnect but weakly parallel processors (think Red Storm) makes good sense for applications with infrequent global communication, say, those with predominantly nearest-neighbor communication, and/or global communication only via large messages. The hardware bandwidth alone makes such a system perform well on the HPC Challenge Ptrans benchmark, i.e., parallel matrix transpose, but not on global Gups, which both limits message size and trashes the cache.

Sparse matrix-vector multiply characterizes the performance of strongly parallel, high-bandwidth systems. This point is so important that we come back to it. Sparse-matrix specialists are requested to be patient with the following elementary exposition.

We use compressed row storage (CRS) to multiply sparse matrix 'A' by source vector 'x' producing result vector 'y'. We show code best suited to scalar machines.

  'unsigned' i, j;
  'for' (j = 0; j < A.rows; j++) {
    'double' s = 0.0;
    'for' (i = A.row_start[j]; i < A.row_start[j+1]; i++)
       s += A.val[i] * x[A.col[i]];
     y[j] = s;
   }

The assignments to the result vector 'y' are seldom and need not concern us. The loads from 'A.val' and 'A.col' are unit-stride memory accessing, with no reuse of values, across arrays whose length is the number of nonzero elements in matrix 'A'. The operand bandwidth for unit-stride accessing by itself is characterized by Stream triad (if the references are local) or HPCC Ptrans (if the references are global).

A good weakly parallel, low-bandwidth system can have a Linpack R_max that is an appreciable fraction of its peak performance. But its unit-stride operand bandwidth in GBs/s will be only a small fraction of its Linpack performance in GFs/s. Only strongly parallel, high-bandwidth systems can achieve an operand bandwidth of up to 10 GBs/s per Linpack GF/s.

The real killer of weakly parallel, low-bandwidth systems in sparse MVM are the loads from source vector 'x'. Array 'x' is accessed at random stride with some reuse of values. If the data reuse (temporal locality) is very low, the operand bandwidth for such accessing is characterized by local or global Gups, a much, much smaller fraction of Linpack R_max. Still, if one were able to intelligently reorder the rows and columns of 'A', one could significantly increase the data reuse in 'x'. This is equivalent to the feasibility of matrix reordering via graph partitioning. But, there are some applications in which either there are no good graph partitions or they are too expensive to compute on-the-fly.

Moreover, there is a memory indirection in x[A.col[i]], forcing us to go to memory twice. Thus, the two cited sources of inefficiency in sustaining operand bandwidth in weakly parallel, low-bandwidth systems are combined. In contrast, strongly parallel, high-bandwidth systems sustain high operand bandwidth for fine-grained random memory accessing, both local and global.

Even at this elementary level, we can draw a preliminary general conclusion. We are discussing the sources of parallelism and locality in sparse matrix-vector multiply to determine if such operations are best performed by strongly parallel, high-bandwidth systems. If a good graph partition exists and is feasible to compute, then domain decomposition succeeds and we can both load balance and minimize communication. In such a case, any old parallel computer will do. However, in many applications, finding a good graph partition may _not_ be feasible. Then, we cannot find the temporal locality required to save (some of) our bacon. In such cases, sparse matrix-vector multiply becomes a bandwidth-intensive operation.

High-Bandwidth Applications

At the simplest level, high-bandwidth applications are those with extensive fine-grained long-range communication and/or load-balancing issues. When hardware locality mechanisms and careful programming do not guarantee local data access, industrial-strength hardware parallelism mechanisms, and system bandwidth, are required to tolerate the memory latency of global data access. When an a priori, static partitioning of computational elements and assignment to processors is not persistent, and dynamic load rebalancing is performed, industrial-strength system bandwidth is required to move the data structures around.

There are many reasons why a given application might require long-range communication. For example, a simulation of a very stiff physical system might need an implicit finite-difference scheme. This leads to taking big time steps. Since the spatial displacement is the product of the time step and the velocity field, this forces long-range communication.

This is a mainstream need. As David Keyes has noted, PDE simulation codes require implicit solvers for multiscale, multiphase, multiphysics phenomena from hydrodynamics, electromagnetism, radiation transport, chemical kinetics, and quantum chemistry. He also said that parallelism and locality in such codes is limited by the "global domains of influence in many problems presented to the computer as implicitly discretized operator equations--- implicitness being all but mandated for the multiscale systems of global climate, transonic airliners, petroleum reservoirs, tokamaks, etc.", which is what high-end (and Grand Challenge) simulation is all about.

There are other reasons. An application may make extensive use of iterative solvers to solve sparse linear systems. This involves many sparse matrix-vector multiplies and, as we have seen, this can force long-range communication. An application may require access to a database that is not localizable. The database may be too voluminous to replicate and no good graph partition for the database may exist. This forces long-range communication. And so on.

Note that stiffness, implicitness, ill-conditioned matrices, and good graph partitioning are---each in their own way---also questions of degree.

There are many reasons why a given application might require dynamic load balancing. An application might use irregular, adaptive meshes because it needs to accommodate lots of variability in the computation: the focus of computational effort may need to shift rapidly depending on where the need for resolution is greatest. This creates load imbalance. In counterterrorism data mining, an application may use parallel graph algorithms on large, dynamic, irregular, sparse graphs with a rapid shifting of the computational focus of attention. Invariably, the graph either cannot be partitioned and/or won't sit still long enough for you to partition it. (The soft real-time requirements in CT data mining do not allow interleaved graph repartitioning, much less dynamic data redistribution, between every pair of analytic-search steps). This creates load imbalance. And so on.

Here too, these load-balancing factors are questions of degree.

All these various "high-bandwidth" factors can combine. A shop may have a number of applications containing various kinds of PDEs and ODEs solved on an adaptive mesh similar to the one used in Alla-ET. One application might be time-dependent Euler and Navier-Stokes computational fluid dynamics.

The goal is a well-resolved solution achieved in the shortest wall-clock time possible. Getting a good speedup may be difficult due to the highly dynamic nature of the spatial and temporal adaptation. This typically leads to an irregular and rapidly changing distribution of data in computer memory. Another difficulty may be the complex physics involved. This leads to rapid and often unpredictable branching, and wide variations in the amount of work per computational element. On top of all this, the best algorithms to solve a problem may be implicit. This requires solutions of sparse linear systems.

All of the above requires bandwidth.

Low-Bandwidth Applications

At the simplest level, low-bandwidth applications are those with a successful domain-decomposition strategy (possibly in some generalized form). When a good static partition of computational elements is available, the load is balanced, the needs for storage are balanced, and the need to communicate is minimized. The all-to-all communication patterns typical of high-bandwidth applications have then been successfully replaced by localized communication, of which nearest-neighbor communication is the simplest example.

There are many reasons why an application might not require high bandwidth, without any sophisticated forcing on our part. We might be solving an ODE in which an explicit method such as Euler's method is appropriate. (Ultimately, we may wish to tackle a set of _coupled_ ODEs). Here, to solve a single ODE, a simple algorithm, namely, one sparse matrix-vector multiply per time step, is sufficient. If the stiffness is not too pronounced, we can avoid implicit methods such as Backward Euler's method. That would require larger time steps and a more difficult algorithm, namely, solving one sparse linear system per time step.

The application may involve only dense linear algebra so that direct solvers such as Gaussian elimination are available to solve linear systems. Regular, non-adaptive meshes may be sufficient to achieve the desired accuracy because there is no need to change resolution during the solution process to put the computational effort where it is needed. An application may be decomposable into naturally independent pieces. For example, the basic Monte Carlo algorithm is particle tracing. Particles are created in certain states according to source distribution functions and make transitions to other states according to scattering distribution functions. Finally, particles are annihilated according to absorption distribution functions.

In classic Monte Carlo, each particle is independent of the others and thus the algorithm splits naturally into independent pieces. This is in marked contrast to the situation where radiative transport is calculated in complex geometries. Here, the geometric database may be very large and not easily replicated. Also, the interactions of particles with the database are not localized, since particles may be in different parts of the environment. The richly non-local Monte Carlo algorithm MCNP (Monte Carlo N-Particle) is a classic example of a widely used, supremely useful high-bandwidth application. (MCNP is a general-purpose, continuous-energy, generalized-geometry, time-dependent, coupled neutron/photon/electron Monte Carlo transport code).

As a last example, parallel graph algorithms on static dense graphs are easily decomposed. Thus, in the dense case, it is easy to find a partition for the data structures in single-source shortest paths so that computation time is not swamped by communication time. Such parallel algorithms succeed because there are no load-balancing issues and communication is localized. However, these partitioning strategies fail when the graph is sparse and/or dynamic.

A Canonical High-Bandwidth Application

The following application structure occurs in at least two application areas: 1) climate simulation, and 2) high-resolution 3-D hydrodynamics with radiation transfer ("radiation hydro").

Imagine a simulation problem with a vast spectrum of space and time scales. For reasons of numerical stability, simulation of such a _stiff_ physical system may require an implicit finite-difference scheme. Convergence may demand such an approach; often, the reason is merely computational feasibilty.

Long-range communication will occur if the resulting time steps are big (the reason we chose the implicit scheme in the first place). Implicit methods for nonlinear PDEs solve a sparse linear system several times per time step. Often, the sparse matrix is not very well-conditioned. The intensive use ---once for each of _many_ linear systems---of iterative solvers with embedded sparse matrix-vector multiplies makes this simulation application massively bandwidth intensive,

In radiation hydro, the slow half of the calculation is the hydrodynamics calculation, which iterates over big time steps. The fast half is the radiosity calculation, which requires the solution of systems of nonlinear equations using Newton's method. This is iterated over small time steps. The Newton corrections are found by solving sparse systems of linear equations. Each big time step thus requires multiple invocations of a linear-system solver. These solvers are themselves iterative. The end result is that radiation hydro codes are massively bandwidth intensive.

System stiffness is one attribute that leads to high-bandwidth problems; a second attribute with the same result is system heterogeneity. For heterogeneous, e.g., multiphysics, problems, load imbalance is a serious issue. Irregular, adaptive meshes are another source of imbalance. To simplify, we could say that _stiffness_ and _adaptivity_ are the two primary sources of high-bandwidth computational fluid dynamics problems (more generally, problems of scientific simulation that depend on partial differential equations).

Wriggling Off The Bandwidth Hook

It widely accepted in scientific computing that, if you need high-quality results in the shortest possible execution time, there are many applications that appear to be ideally suited to the high-bandwidth form of high-end computers. Such applications are especially well-suited to symmetric multiprocessors (SMPs), such as the classical flat shared-memory Cray parallel vector processors and the Cray MTA-2, a multithreaded multiprocessor with flat shared-memory. (In early incarnations, Cascade will not be completely flat). They are not naturally suited to scalable MPP and cluster systems, and they are not even ideally suited to emerging parallel vector systems such as the Cray X1, because of the X1's nonuniform memory access.

Lacking an appropriate high-bandwidth system may prevent you from writing a fresh application, even if the results would be very valuable to you. But if you are dependent on a legacy application and the bandwidth of your machine decays underneath you, you will probably try to adapt. For better or worse, this is already happening.

The HECRTF report (p.63) states, "Numerous attempts are currently under way to retool codes in application areas such as stockpile stewardship, global climate modeling, computational fluid dynamics, local and regional weather forecasting, aircraft design, cosmology, biomolecular simulation, materials science, and quantum chemistry to run more efficiently on MPP architectures, simply because they are the most plentiful systems currently available for high-end computing". Obviously, these efforts divert attention away from the urgent task of developing at least some high-bandwidth systems that would run the original applications superbly without the need for retooling.

Because it affects the main argument for high-bandwidth systems, we need to assess this retooling process to precisely characterize its prospects for success---if this is at all possible. It turns out that, for large-scale problems, iterative linear solvers need to be preconditioned so that their convergence rates become acceptable. We examine a very representative (and instructive) example strategy, due to David Keyes, in which domain-decomposed multilevel methods are used to construct solvers with certain desirable algorithmic properties.

Adapting Terascale PDE Solvers

David Keyes of Columbia University has a research program focused on the discovery of preconditioners that would allow terascale optimal PDE solvers to run well on systems whose bandwidth requirements have been relaxed, at least in part. This has to be done without losing the ability to resolve any relevant scale.

Once we resign ourselves to low-bandwidth systems, it follows that algorithms must be: 1) highly concurrent and straightforward to load balance, 2) not communication bound, 3) cache friendly (with strong temporal and spatial locality of reference), and 4) highly scalable (in the sense of convergence rate) as problem size and number of processors are increased in proportion.

Keyes believes that domain-decomposed multilevel methods based on Schwarz preconditioners are "natural" for all these objectives. Here, "optimal" means that the convergence rate is nearly independent of discretization parameters (multilevel schemes for rapid linear convergence of linear problems and Newton-like schemes for quadratic convergence of nonlinear problems) and also that the convergence rate is as independent as possible of physical parameters (continuation schemes and physics-based preconditioning).

The scientific imperative is to handle multiple-scale applications. There may be multiple spatial scales (e.g., interfaces, fronts, and layers that are thin relative to domain size). There may be multiple temporal scales (e.g., fast waves with small transit times relative to convection, diffusion, or group-velocity dynamics). The analyst must isolate the dynamics of interest and model the rest in a system that can be discretized over a computable, i.e., manageable, range of scales.

Spatial resolution stresses condition number; with ill-conditioning, a small error in input may lead to a large error in output. Standard iterative methods fail due to growth in the number of iterations. Direct methods fail due to memory growth and bounded concurrency. What we can still try are hierarchical (multilevel) iterative methods.

Temporal resolution stresses stiffness; failure to track the fastest mode may lead to exponentially growing error in other modes. By definition, multiple-timescale problems contain phenomena with very different relaxation rates. The number of time steps corresponding to a given finite simulated time grows very large in order to preserve stability. The solution is to _step over_ fast modes by assuming quasi-equilibrium. (These are the big time steps in implicit schemes). But this throws temporally stiff problems into spatially ill-conditioned regimes.

Spatial resolution stresses memory size. Temporal resolution stresses raw arithmetic performance. Both stress long-range communication latency, which requires bandwidth to tolerate it, and together they _severely_ stress local-memory bandwidth. Less severely stressed for PDEs, in principle, are local-memory latency and long-range communication bandwidth, at least according to Keyes.

What decomposition strategies are available to us in our search for sources of parallelism and locality in PDE solvers?

Our starting point is a PDE operator 'L' acting on a velocity field 'u' equated to a force term 'f' in some domain 'Omega'. Operator decomposition (e.g., ADI) requires all-to-all, bulk data exchanges in each time step (for transpose). Function-space decomposition (e.g., Fourier or spectral methods) requires all-to-all, bulk data exchanges in each time step (for transform). Domain decomposition (e.g., Schwarz) requires local (nearest-neighbor) data exchanges, global reductions, and an optional small global problem (not needed for the _parabolic_ case) in each time step. We have few options if global communication is a bad thing!

Keyes is a proponent of Newton-Krylov-Schwarz algorithms. The Krylov method needs to be preconditioned for acceptable inner-iteration convergence rates, and this preconditioning is the "make-or-break" aspect of the implicit code. The job of a preconditioner is to approximate the action of the Jacobian inverse with acceptable computational cost.

It is not obvious that a good preconditioner approximating this inverse can avoid extensive global communication. But, in _some_ cases, an additive Schwarz preconditioner accomplishes this in a localized manner, with an approximate solve in each subdomain of a partitioning of the global PDE domain 'Omega'. We precondition Ax = b to B^-1 Ax = B^-1 b, where finding a good B^-1 is crucial to success. If we decompose a problem into a set of overlapping subdomains 'Omega_i', we compute B^-1 by: 1) gathering data from neighboring subdomains, 2) computing a local linear solve on each subdomain, and 3) scattering partial solutions to neighboring subdomains.

In words, we derive an approximate solver for the problem as the sum of the inverses of the restrictions of 'A' to a set of overlapping subdomains that covers 'Omega'. Graph partitioning is one of the tools used in decomposing a domain into subdomains.

How scalable is domain decomposition in practice? Keyes says, choosing good decompositions is a balance between conditioning (scalable convergence) and parallel computational complexity (e.g., the cost of computing restrictions and prolongations). We might add another important trade-off in selecting an effective preconditioner: choose between 1) a more complicated preconditioner that engenders more global communication, and 2) a simpler preconditioner with less global communication but a slower convergence rate.

What is the domain of applicability of domain-decomposition Schwarz methods? Keyes says, the value of domain decomposition in practical problems is ultimately problem specific. Your correspondent would like to go farther. In his view, it is possible to find "good enough" Schwarz preconditioners for a certain _class_ of PDE problems. There are many, many important problems where no one has any idea for a good preconditioner, indeed, where it would be a major mathematical discovery to find an effective preconditioner of _any_ kind.

Practitioners talk about the "difficulty" of the matrix 'A' in the linear system Ax = b. How to say this mathematically is above your correspondent's pay grade. Keyes has suggested a continuum from "good" to "bad" Helmholtz problems (this is related to diagonal dominance in the Jacobian).

Up to now, the improvement to the convergence rate has been largely in the constant term with little or no asymptotic improvement. This does not lift us far enough to tackle Grand Challenge problems on conventional parallel machines. Moreover, coupled systems of equations can lead to difficult differential operators for which we lose optimality, with resulting poor efficiency (slow convergence) as the problem size grows.

Conclusion: Bandwidth Expands Algorithm Choice

The Grand Challenge applications are characterized by 1) first-principles or multiple-scale modeling, 2) billions of degrees of freedom, 3) real-time simulation, 4) parametric studies, and 5) immersive interaction. We could simulate lots of truly new physics if solver limits were steadily pushed back (finer meshes, higher-fidelity models, more-complex coupling, etc.) on the broadest possible range of problems. However, we cannot move to the next level of science without employing new, high-bandwidth methods.

When high-bandwidth systems, such as the Earth Simulator, are available, we can run bandwidth-intensive applications such as spectral atmospheric simulation (AGCM). When they are not available, we tend to make do, or do without: we settle for low-bandwidth methods.

We have identified fully implicit methods, applications with nonlocalizable data bases, and adaptive mesh-refinement algorithms as three instances of high-bandwidth methods. The first two are instances of computing with global domains of influence, while the third is an instance of a constantly shifting computational focus of attention.

The quantum-chemistry code GAMESS has a database of 2-electron integrals that does not partition. For this reason, it scales to only 32 or 64 processors, because of communication requirements. So, the chemistry community only runs problems that complete in a reasonable amount of time on that number of processors.

Apparently, the adaptive-grid community either doesn't use adaptive grids or restructures rarely, again, because of communication requirements.

Implicit methods are vital. In any solution method, information flow across the domain creates a gap between 1) physical applications involving, at some fundamental level, all-to-all dependences between the unknown quantities; and 2) low-bandwidth parallel systems with high communication costs. (Runtime is how long it takes the information to cross the domain). Domain-decomposition Schwarz preconditioning methods seek to bridge this gap (there are other preconditioning strategies). When a "good" Schwarz decomposition can be found, we can approximate the inverse action of a discretized PDE operator with minimal global communication.

But why should we tie our hands by excluding effective preconditioners that engender significant global communication? We will make no progress on Grand Challenge problems until we have deep insight into their mathematical and physical structure. The result might well be a preconditioner that has lots of global communication and only runs well on a high-bandwidth system. Similarly, why should we tie our hands by avoiding dynamic load rebalancing?

We have made no real progress on Grand Challenge problems because decoupling unknowns in a fully implicit method is anything but a routine task; sometimes it is even unmanageable. In general, by catering to low-bandwidth systems, our algorithm choice has been too greatly restricted. For high sustained performance and scalability, Grand Challenge applications must employ new, high-bandwidth methods. Such methods only run well on new, high-bandwidth systems.


The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for these commentaries. Replies are welcome and may be sent to HPCwire editor Tim Curns at tim@hpcwire.com.


Top of Page

Previous Article   |  Table of Contents  |