
Features:
HEC ANALYSIS: WHO NEEDS HIGH-BANDWIDTH APPLICATIONS?
Commentary from the High-End Crusader
The Earth Simulator is a sensitive litmus test among computer professionals:
those who admire the Japanese machine appreciate the need for high-bandwidth
applications in high-end computing, while those who deprecate it (think IBM)
view high-bandwidth applications as a sideshow of interest to a small minority
of high-end users.
Some aspects of this debate are problematic for the field of high-end
computing; these issues _should_ have been clearly articulated and debated
rationally. As the very diplomatic HECRTF report says, "We now have evidence
that there are applications of national importance that would benefit
significantly from an alternative to [low-bandwidth systems]". The report's
authors identify "improvements in bandwidth and latency for both [local]
memory and [system-interconnect] fabric, which for many applications largely
determine performance", as an important goal for future high-end computing
systems.
If the vital importance of high-bandwidth applications is obvious to some but
not obvious to all, there must be some subtlety in the argument that permits
bandwidth doubters to hold their view in the face of significant empirical
evidence to the contrary. After all, we have made no real progress on any of
the Grand Challenge problems articulated nearly a decade ago. Arguably, this
lack of progress is largely due to our not having enough global bandwidth in
our systems. None of this is rocket science (although it _is_ computational
science). One goal of this article is to expose the subtlety in the argument
for bandwidth.
Rick Stevens of Argonne National Laboratory once identified the main
requirements for PFs/s computing as: 1) memory and cache footprints, i.e., the
amount of memory required at each level of the memory hierarchy; 2) the degree
of data reuse associated with core kernels of the application; 3) the scaling
of these kernels; and 4) the associated estimates of the required local-memory
and system-interconnect bandwidths. This is fine. But then Stevens stepped
too close to a slippery slope.
In many cases, Stevens said, application analysis can be reduced to
understanding the required memory capacity and the bandwidth requirements for
the kernel algorithms in the application, as well as their scaling properties
as the problem size grows. Here comes the slip. Traditional scientific
application areas such as general circulation models, quantum chromodynamics,
and fluid dynamics in astrophysics are relatively well understood and are not
---according to Stevens---particularly stressed by major memory or bandwidth
requirements. In contrast, other application areas such as data mining and
computer-aided reasoning are far more stressed.
The slip here is to identify a high-end application area with the individual
(existing and not-yet-written) applications in that area. Application areas
are surprisingly diverse. For example, there is a continuum of computational
fluid dynamics applications from very high bandwidth to very low bandwidth.
We need to direct attention away from application areas to the diversity of
individual applications within an area. We need simple, intuitive application
attributes to distinguish applications.
High-Bandwidth Systems
It is still the case that there are two basic types of supercomputer, which
differ primarily in their system-interconnect bandwidths. However, this is
not to deny that this distinction is a matter of degree. For clarity, we
exhibit the pure types.
A weakly parallel, low-bandwidth system has four main properties: 1) it spends
most of its dollar budget on transistors, 2) its performance is characterized
by Linpack R_max, 3) it is satisfied with a low-bandwidth system interconnect,
and 4) it has weakly parallel processors whose only hardware mechanisms for
tolerating memory latency are long cache lines and large MPI messages.
A strongly parallel, high-bandwidth system has four corresponding properties:
1) it spends more than half its dollar budget on system interconnect, 2) its
performance is characterized by sparse matrix-vector multiply (and perhaps
also by Gups), 3) it requires a high-bandwidth system interconnect, and 4)
it has strongly parallel processors with powerful hardware mechanisms for
tolerating memory latency; these are forms of parallelism---such as vectors
and multithreading---that produce the memory-reference concurrency that is
consumed by the memory pipeline. (Latency tolerance is basically just
pipelined data transport; for small messages, this requires concurrency).
This distinction is a matter of degree, with the possibility of many
intermediate design points. For example, a system with a high-bandwidth
system interconnect but weakly parallel processors (think Red Storm) makes
good sense for applications with infrequent global communication, say, those
with predominantly nearest-neighbor communication, and/or global communication
only via large messages. The hardware bandwidth alone makes such a system
perform well on the HPC Challenge Ptrans benchmark, i.e., parallel matrix
transpose, but not on global Gups, which both limits message size and trashes
the cache.
Sparse matrix-vector multiply characterizes the performance of strongly
parallel, high-bandwidth systems. This point is so important that we come
back to it. Sparse-matrix specialists are requested to be patient with the
following elementary exposition.
We use compressed row storage (CRS) to multiply sparse matrix 'A' by source
vector 'x' producing result vector 'y'. We show code best suited to scalar
machines.
'unsigned' i, j;
'for' (j = 0; j < A.rows; j++) {
'double' s = 0.0;
'for' (i = A.row_start[j]; i < A.row_start[j+1]; i++)
s += A.val[i] * x[A.col[i]];
y[j] = s;
}
The assignments to the result vector 'y' are seldom and need not concern us.
The loads from 'A.val' and 'A.col' are unit-stride memory accessing, with no
reuse of values, across arrays whose length is the number of nonzero elements
in matrix 'A'. The operand bandwidth for unit-stride accessing by itself is
characterized by Stream triad (if the references are local) or HPCC Ptrans (if
the references are global).
A good weakly parallel, low-bandwidth system can have a Linpack R_max that is
an appreciable fraction of its peak performance. But its unit-stride operand
bandwidth in GBs/s will be only a small fraction of its Linpack performance in
GFs/s. Only strongly parallel, high-bandwidth systems can achieve an operand
bandwidth of up to 10 GBs/s per Linpack GF/s.
The real killer of weakly parallel, low-bandwidth systems in sparse MVM are
the loads from source vector 'x'. Array 'x' is accessed at random stride with
some reuse of values. If the data reuse (temporal locality) is very low, the
operand bandwidth for such accessing is characterized by local or global Gups,
a much, much smaller fraction of Linpack R_max. Still, if one were able to
intelligently reorder the rows and columns of 'A', one could significantly
increase the data reuse in 'x'. This is equivalent to the feasibility of
matrix reordering via graph partitioning. But, there are some applications in
which either there are no good graph partitions or they are too expensive to
compute on-the-fly.
Moreover, there is a memory indirection in x[A.col[i]], forcing us to go to
memory twice. Thus, the two cited sources of inefficiency in sustaining
operand bandwidth in weakly parallel, low-bandwidth systems are combined.
In contrast, strongly parallel, high-bandwidth systems sustain high operand
bandwidth for fine-grained random memory accessing, both local and global.
Even at this elementary level, we can draw a preliminary general conclusion.
We are discussing the sources of parallelism and locality in sparse
matrix-vector multiply to determine if such operations are best performed by
strongly parallel, high-bandwidth systems. If a good graph partition exists
and is feasible to compute, then domain decomposition succeeds and we can both
load balance and minimize communication. In such a case, any old parallel
computer will do. However, in many applications, finding a good graph
partition may _not_ be feasible. Then, we cannot find the temporal locality
required to save (some of) our bacon. In such cases, sparse matrix-vector
multiply becomes a bandwidth-intensive operation.
High-Bandwidth Applications
At the simplest level, high-bandwidth applications are those with extensive
fine-grained long-range communication and/or load-balancing issues. When
hardware locality mechanisms and careful programming do not guarantee local
data access, industrial-strength hardware parallelism mechanisms, and system
bandwidth, are required to tolerate the memory latency of global data access.
When an a priori, static partitioning of computational elements and assignment
to processors is not persistent, and dynamic load rebalancing is performed,
industrial-strength system bandwidth is required to move the data structures
around.
There are many reasons why a given application might require long-range
communication. For example, a simulation of a very stiff physical system
might need an implicit finite-difference scheme. This leads to taking big
time steps. Since the spatial displacement is the product of the time step
and the velocity field, this forces long-range communication.
This is a mainstream need. As David Keyes has noted, PDE simulation codes
require implicit solvers for multiscale, multiphase, multiphysics phenomena
from hydrodynamics, electromagnetism, radiation transport, chemical kinetics,
and quantum chemistry. He also said that parallelism and locality in such
codes is limited by the "global domains of influence in many problems
presented to the computer as implicitly discretized operator equations---
implicitness being all but mandated for the multiscale systems of global
climate, transonic airliners, petroleum reservoirs, tokamaks, etc.", which
is what high-end (and Grand Challenge) simulation is all about.
There are other reasons. An application may make extensive use of iterative
solvers to solve sparse linear systems. This involves many sparse
matrix-vector multiplies and, as we have seen, this can force long-range
communication. An application may require access to a database that is not
localizable. The database may be too voluminous to replicate and no good
graph partition for the database may exist. This forces long-range
communication. And so on.
Note that stiffness, implicitness, ill-conditioned matrices, and good graph
partitioning are---each in their own way---also questions of degree.
There are many reasons why a given application might require dynamic load
balancing. An application might use irregular, adaptive meshes because it
needs to accommodate lots of variability in the computation: the focus of
computational effort may need to shift rapidly depending on where the need for
resolution is greatest. This creates load imbalance. In counterterrorism
data mining, an application may use parallel graph algorithms on large,
dynamic, irregular, sparse graphs with a rapid shifting of the computational
focus of attention. Invariably, the graph either cannot be partitioned and/or
won't sit still long enough for you to partition it. (The soft real-time
requirements in CT data mining do not allow interleaved graph repartitioning,
much less dynamic data redistribution, between every pair of analytic-search
steps). This creates load imbalance. And so on.
Here too, these load-balancing factors are questions of degree.
All these various "high-bandwidth" factors can combine. A shop may have a
number of applications containing various kinds of PDEs and ODEs solved on an
adaptive mesh similar to the one used in Alla-ET. One application might be
time-dependent Euler and Navier-Stokes computational fluid dynamics.
The goal is a well-resolved solution achieved in the shortest wall-clock time
possible. Getting a good speedup may be difficult due to the highly dynamic
nature of the spatial and temporal adaptation. This typically leads to an
irregular and rapidly changing distribution of data in computer memory.
Another difficulty may be the complex physics involved. This leads to rapid
and often unpredictable branching, and wide variations in the amount of work
per computational element. On top of all this, the best algorithms to solve
a problem may be implicit. This requires solutions of sparse linear systems.
All of the above requires bandwidth.
Low-Bandwidth Applications
At the simplest level, low-bandwidth applications are those with a successful
domain-decomposition strategy (possibly in some generalized form). When a
good static partition of computational elements is available, the load is
balanced, the needs for storage are balanced, and the need to communicate is
minimized. The all-to-all communication patterns typical of high-bandwidth
applications have then been successfully replaced by localized communication,
of which nearest-neighbor communication is the simplest example.
There are many reasons why an application might not require high bandwidth,
without any sophisticated forcing on our part. We might be solving an ODE in
which an explicit method such as Euler's method is appropriate. (Ultimately,
we may wish to tackle a set of _coupled_ ODEs). Here, to solve a single ODE,
a simple algorithm, namely, one sparse matrix-vector multiply per time step,
is sufficient. If the stiffness is not too pronounced, we can avoid implicit
methods such as Backward Euler's method. That would require larger time steps
and a more difficult algorithm, namely, solving one sparse linear system per
time step.
The application may involve only dense linear algebra so that direct solvers
such as Gaussian elimination are available to solve linear systems. Regular,
non-adaptive meshes may be sufficient to achieve the desired accuracy because
there is no need to change resolution during the solution process to put the
computational effort where it is needed. An application may be decomposable
into naturally independent pieces. For example, the basic Monte Carlo
algorithm is particle tracing. Particles are created in certain states
according to source distribution functions and make transitions to other
states according to scattering distribution functions. Finally, particles are
annihilated according to absorption distribution functions.
In classic Monte Carlo, each particle is independent of the others and thus
the algorithm splits naturally into independent pieces. This is in marked
contrast to the situation where radiative transport is calculated in complex
geometries. Here, the geometric database may be very large and not easily
replicated. Also, the interactions of particles with the database are not
localized, since particles may be in different parts of the environment.
The richly non-local Monte Carlo algorithm MCNP (Monte Carlo N-Particle) is a
classic example of a widely used, supremely useful high-bandwidth application.
(MCNP is a general-purpose, continuous-energy, generalized-geometry,
time-dependent, coupled neutron/photon/electron Monte Carlo transport code).
As a last example, parallel graph algorithms on static dense graphs are easily
decomposed. Thus, in the dense case, it is easy to find a partition for the
data structures in single-source shortest paths so that computation time is
not swamped by communication time. Such parallel algorithms succeed because
there are no load-balancing issues and communication is localized. However,
these partitioning strategies fail when the graph is sparse and/or dynamic.
A Canonical High-Bandwidth Application
The following application structure occurs in at least two application areas:
1) climate simulation, and 2) high-resolution 3-D hydrodynamics with radiation
transfer ("radiation hydro").
Imagine a simulation problem with a vast spectrum of space and time scales.
For reasons of numerical stability, simulation of such a _stiff_ physical
system may require an implicit finite-difference scheme. Convergence may
demand such an approach; often, the reason is merely computational feasibilty.
Long-range communication will occur if the resulting time steps are big (the
reason we chose the implicit scheme in the first place). Implicit methods
for nonlinear PDEs solve a sparse linear system several times per time step.
Often, the sparse matrix is not very well-conditioned. The intensive use
---once for each of _many_ linear systems---of iterative solvers with embedded
sparse matrix-vector multiplies makes this simulation application massively
bandwidth intensive,
In radiation hydro, the slow half of the calculation is the hydrodynamics
calculation, which iterates over big time steps. The fast half is the
radiosity calculation, which requires the solution of systems of nonlinear
equations using Newton's method. This is iterated over small time steps.
The Newton corrections are found by solving sparse systems of linear
equations. Each big time step thus requires multiple invocations of a
linear-system solver. These solvers are themselves iterative. The end
result is that radiation hydro codes are massively bandwidth intensive.
System stiffness is one attribute that leads to high-bandwidth problems;
a second attribute with the same result is system heterogeneity. For
heterogeneous, e.g., multiphysics, problems, load imbalance is a serious
issue. Irregular, adaptive meshes are another source of imbalance. To
simplify, we could say that _stiffness_ and _adaptivity_ are the two
primary sources of high-bandwidth computational fluid dynamics problems
(more generally, problems of scientific simulation that depend on partial
differential equations).
Wriggling Off The Bandwidth Hook
It widely accepted in scientific computing that, if you need high-quality
results in the shortest possible execution time, there are many applications
that appear to be ideally suited to the high-bandwidth form of high-end
computers. Such applications are especially well-suited to symmetric
multiprocessors (SMPs), such as the classical flat shared-memory Cray parallel
vector processors and the Cray MTA-2, a multithreaded multiprocessor with flat
shared-memory. (In early incarnations, Cascade will not be completely flat).
They are not naturally suited to scalable MPP and cluster systems, and they
are not even ideally suited to emerging parallel vector systems such as the
Cray X1, because of the X1's nonuniform memory access.
Lacking an appropriate high-bandwidth system may prevent you from writing a
fresh application, even if the results would be very valuable to you. But if
you are dependent on a legacy application and the bandwidth of your machine
decays underneath you, you will probably try to adapt. For better or worse,
this is already happening.
The HECRTF report (p.63) states, "Numerous attempts are currently under way
to retool codes in application areas such as stockpile stewardship, global
climate modeling, computational fluid dynamics, local and regional weather
forecasting, aircraft design, cosmology, biomolecular simulation, materials
science, and quantum chemistry to run more efficiently on MPP architectures,
simply because they are the most plentiful systems currently available for
high-end computing". Obviously, these efforts divert attention away from the
urgent task of developing at least some high-bandwidth systems that would run
the original applications superbly without the need for retooling.
Because it affects the main argument for high-bandwidth systems, we need to
assess this retooling process to precisely characterize its prospects for
success---if this is at all possible. It turns out that, for large-scale
problems, iterative linear solvers need to be preconditioned so that their
convergence rates become acceptable. We examine a very representative (and
instructive) example strategy, due to David Keyes, in which domain-decomposed
multilevel methods are used to construct solvers with certain desirable
algorithmic properties.
Adapting Terascale PDE Solvers
David Keyes of Columbia University has a research program focused on the
discovery of preconditioners that would allow terascale optimal PDE solvers to
run well on systems whose bandwidth requirements have been relaxed, at least
in part. This has to be done without losing the ability to resolve any
relevant scale.
Once we resign ourselves to low-bandwidth systems, it follows that algorithms
must be: 1) highly concurrent and straightforward to load balance, 2) not
communication bound, 3) cache friendly (with strong temporal and spatial
locality of reference), and 4) highly scalable (in the sense of convergence
rate) as problem size and number of processors are increased in proportion.
Keyes believes that domain-decomposed multilevel methods based on Schwarz
preconditioners are "natural" for all these objectives. Here, "optimal" means
that the convergence rate is nearly independent of discretization parameters
(multilevel schemes for rapid linear convergence of linear problems and
Newton-like schemes for quadratic convergence of nonlinear problems) and also
that the convergence rate is as independent as possible of physical parameters
(continuation schemes and physics-based preconditioning).
The scientific imperative is to handle multiple-scale applications. There may
be multiple spatial scales (e.g., interfaces, fronts, and layers that are thin
relative to domain size). There may be multiple temporal scales (e.g., fast
waves with small transit times relative to convection, diffusion, or
group-velocity dynamics). The analyst must isolate the dynamics of interest
and model the rest in a system that can be discretized over a computable,
i.e., manageable, range of scales.
Spatial resolution stresses condition number; with ill-conditioning, a small
error in input may lead to a large error in output. Standard iterative
methods fail due to growth in the number of iterations. Direct methods fail
due to memory growth and bounded concurrency. What we can still try are
hierarchical (multilevel) iterative methods.
Temporal resolution stresses stiffness; failure to track the fastest mode may
lead to exponentially growing error in other modes. By definition,
multiple-timescale problems contain phenomena with very different relaxation
rates. The number of time steps corresponding to a given finite simulated
time grows very large in order to preserve stability. The solution is to
_step over_ fast modes by assuming quasi-equilibrium. (These are the big time
steps in implicit schemes). But this throws temporally stiff problems into
spatially ill-conditioned regimes.
Spatial resolution stresses memory size. Temporal resolution stresses raw
arithmetic performance. Both stress long-range communication latency, which
requires bandwidth to tolerate it, and together they _severely_ stress
local-memory bandwidth. Less severely stressed for PDEs, in principle, are
local-memory latency and long-range communication bandwidth, at least
according to Keyes.
What decomposition strategies are available to us in our search for sources of
parallelism and locality in PDE solvers?
Our starting point is a PDE operator 'L' acting on a velocity field 'u'
equated to a force term 'f' in some domain 'Omega'. Operator decomposition
(e.g., ADI) requires all-to-all, bulk data exchanges in each time step (for
transpose). Function-space decomposition (e.g., Fourier or spectral methods)
requires all-to-all, bulk data exchanges in each time step (for transform).
Domain decomposition (e.g., Schwarz) requires local (nearest-neighbor) data
exchanges, global reductions, and an optional small global problem (not needed
for the _parabolic_ case) in each time step. We have few options if global
communication is a bad thing!
Keyes is a proponent of Newton-Krylov-Schwarz algorithms. The Krylov method
needs to be preconditioned for acceptable inner-iteration convergence rates,
and this preconditioning is the "make-or-break" aspect of the implicit code.
The job of a preconditioner is to approximate the action of the Jacobian
inverse with acceptable computational cost.
It is not obvious that a good preconditioner approximating this inverse can
avoid extensive global communication. But, in _some_ cases, an additive
Schwarz preconditioner accomplishes this in a localized manner, with an
approximate solve in each subdomain of a partitioning of the global PDE domain
'Omega'. We precondition Ax = b to B^-1 Ax = B^-1 b, where finding a good
B^-1 is crucial to success. If we decompose a problem into a set of
overlapping subdomains 'Omega_i', we compute B^-1 by: 1) gathering data from
neighboring subdomains, 2) computing a local linear solve on each subdomain,
and 3) scattering partial solutions to neighboring subdomains.
In words, we derive an approximate solver for the problem as the sum of the
inverses of the restrictions of 'A' to a set of overlapping subdomains that
covers 'Omega'. Graph partitioning is one of the tools used in decomposing
a domain into subdomains.
How scalable is domain decomposition in practice? Keyes says, choosing good
decompositions is a balance between conditioning (scalable convergence) and
parallel computational complexity (e.g., the cost of computing restrictions
and prolongations). We might add another important trade-off in selecting an
effective preconditioner: choose between 1) a more complicated preconditioner
that engenders more global communication, and 2) a simpler preconditioner with
less global communication but a slower convergence rate.
What is the domain of applicability of domain-decomposition Schwarz methods?
Keyes says, the value of domain decomposition in practical problems is
ultimately problem specific. Your correspondent would like to go farther.
In his view, it is possible to find "good enough" Schwarz preconditioners for
a certain _class_ of PDE problems. There are many, many important problems
where no one has any idea for a good preconditioner, indeed, where it would be
a major mathematical discovery to find an effective preconditioner of _any_
kind.
Practitioners talk about the "difficulty" of the matrix 'A' in the linear
system Ax = b. How to say this mathematically is above your correspondent's
pay grade. Keyes has suggested a continuum from "good" to "bad" Helmholtz
problems (this is related to diagonal dominance in the Jacobian).
Up to now, the improvement to the convergence rate has been largely in the
constant term with little or no asymptotic improvement. This does not lift
us far enough to tackle Grand Challenge problems on conventional parallel
machines. Moreover, coupled systems of equations can lead to difficult
differential operators for which we lose optimality, with resulting poor
efficiency (slow convergence) as the problem size grows.
Conclusion: Bandwidth Expands Algorithm Choice
The Grand Challenge applications are characterized by 1) first-principles or
multiple-scale modeling, 2) billions of degrees of freedom, 3) real-time
simulation, 4) parametric studies, and 5) immersive interaction. We could
simulate lots of truly new physics if solver limits were steadily pushed back
(finer meshes, higher-fidelity models, more-complex coupling, etc.) on the
broadest possible range of problems. However, we cannot move to the next
level of science without employing new, high-bandwidth methods.
When high-bandwidth systems, such as the Earth Simulator, are available, we
can run bandwidth-intensive applications such as spectral atmospheric
simulation (AGCM). When they are not available, we tend to make do, or do
without: we settle for low-bandwidth methods.
We have identified fully implicit methods, applications with nonlocalizable
data bases, and adaptive mesh-refinement algorithms as three instances of
high-bandwidth methods. The first two are instances of computing with global
domains of influence, while the third is an instance of a constantly shifting
computational focus of attention.
The quantum-chemistry code GAMESS has a database of 2-electron integrals that
does not partition. For this reason, it scales to only 32 or 64 processors,
because of communication requirements. So, the chemistry community only runs
problems that complete in a reasonable amount of time on that number of
processors.
Apparently, the adaptive-grid community either doesn't use adaptive grids or
restructures rarely, again, because of communication requirements.
Implicit methods are vital. In any solution method, information flow across
the domain creates a gap between 1) physical applications involving, at some
fundamental level, all-to-all dependences between the unknown quantities; and
2) low-bandwidth parallel systems with high communication costs. (Runtime is
how long it takes the information to cross the domain). Domain-decomposition
Schwarz preconditioning methods seek to bridge this gap (there are other
preconditioning strategies). When a "good" Schwarz decomposition can be
found, we can approximate the inverse action of a discretized PDE operator
with minimal global communication.
But why should we tie our hands by excluding effective preconditioners that
engender significant global communication? We will make no progress on
Grand Challenge problems until we have deep insight into their mathematical
and physical structure. The result might well be a preconditioner that has
lots of global communication and only runs well on a high-bandwidth system.
Similarly, why should we tie our hands by avoiding dynamic load rebalancing?
We have made no real progress on Grand Challenge problems because decoupling
unknowns in a fully implicit method is anything but a routine task; sometimes
it is even unmanageable. In general, by catering to low-bandwidth systems,
our algorithm choice has been too greatly restricted. For high sustained
performance and scalability, Grand Challenge applications must employ new,
high-bandwidth methods. Such methods only run well on new, high-bandwidth
systems.
The High-End Crusader, a noted expert in high-performance computing and
communications, shall remain anonymous. He alone bears responsibility for
these commentaries. Replies are welcome and may be sent to HPCwire editor
Tim Curns at tim@hpcwire.com.
|