
Features:
HIGH-END CRUSADER REFUTES BLUE GENE/L DEFENDER
Commentary from the High-End Crusader
An anonymous respondent, possibly an employee of the U.S. government with a
special interest in Blue Gene/L, this week criticizes HEC's (short) article
"High-End Crusader Questions IBM's 'Fastest' Computer Claim" [108466], which
appeared in last week's HPCwire. For memory, the thesis of that article was
that Blue Gene/Light is a special-purpose parallel computer that is only
suited to applications that persistently inhabit a small portion of the
3-dimensional locality "phase space", in contrast to other applications that
sweep out vast swaths of locality phase space during the same or different
computations controlled by the same program.
The issues raised by the respondent, here dubbed "AR", touch on some of the
central myths of the conventional wisdom on high-end computing. Since these
myths play a major role in _enabling_ the continuing productivity crisis in
high-end computing, their rational examination is of the highest importance.
AR begins by pointing out that HEC did not comment on the advanced packaging
technology that allows Blue Gene/L to deliver the same peak arithmetic
performance for significantly less cost, less power, and less space---in
comparison to some other high-performance computers (the example we both used
was the Japanese Earth Simulator). This is true. Working against a tight
deadline, HEC did not discuss packaging at all, clearly a significant and
praiseworthy achievement by the Blue Gene engineers. But the argument from
cost is disingenuous, to say the least.
The consistent framework for HEC's argumentation over the last decade has been
that there are _two_ distinct kinds of high-performance computer system. On
the one hand, there are strongly parallel, high-bandwidth systems (capability
systems, called "Type C" systems by Burton Smith), whose dominant cost lies in
their (expensive) system interconnect. On the other hand, there are weakly
parallel, low-bandwidth systems (capacity systems, called "Type T" systems by
Burton Smith), whose dominant cost lies in their (cheap) transistors---in
processors and memory. ("C" is for [inter]connect; "T" is for transistor).
Parallelism strength is determined by the number and kind of parallelism
mechanisms in the processor. Such mechanisms are used to tolerate memory,
network, and synchronization latencies, among others. While a Type-T system
is sufficient to achieve a respectable Linpack R_max value, only a Type-C
system can achieve a respectable Gups number or respectable performance on
sparse matrix-vector multiply (in the general case).
Of course, the latest thinking about Type-C systems is that they should
incorporate diverse, compatible locality mechanisms in order to complement
their (strong) parallelism mechanisms. The new Type-C slogans are: "No
parallelism left behind!" and "No locality left behind!".
The first punchline of the rebuttal, therefore, is that achieving Linpack
performance is inexpensive: if you need only floating-point performance and
very little in the way of memory access, communication, or synchronization,
then you don't need to spend money on the routers, wires, connectors, flex
circuits, etc., etc., that are necessary to move data around, not just compute
rapidly with them if they happen to be close at hand.
In "Top500 And HPCC Benchmarks---What They Can And Can't Do" [107896]), HEC
proposed an empirical theory of program-locality behavior. Locality "phase
space" was construed as consisting of the 3-dimensional Cartesian product of:
1) short-range or long-range communication, 2) temporally local or not, and
3) spatially locally or not. (Analysis shows that not all eight phases are
distinct). The idea was that, during a computation, programs migrate from one
locality phase to another---in essence, making locality phase transitions---
very possibly tracing out different phase-space trajectories on different runs
of the same program because of different inputs. The requirement was that
parallel computation needs to be efficient in _every_ region of locality phase
space.
Now, if someone wants to build a computer that specializes in one small corner
of this phase space, this is fine---as long as (spurious) claims of having
achieved general-purpose parallel computing do not damage big science, big
national security, or big industry (such as energy or aerospace) by misleading
people into thinking that they now have the mix of computers they need. There
is nothing wrong with special-purpose parallel computers. But they can't
replace general-purpose parallel computers, which---by definition---compute
with high parallel efficiency throughout _all_ of locality phase space.
Used as the only benchmark, Linpack is largely uninformative because it is
representative only of programs that lie at the origin of this locality phase
space (Cartesian product), namely, in locality phase (short-range, temporally
local, spatially local).
First Myth
Having warmed up, AR launches into defending a truly pernicious myth: that
price/performance is a universably reasonable guideline for procuring high-end
systems. This is flat wrong: price/performance can be justified only if the
project is time insensitive, i.e., if the utility function that measures the
value delivered to the project as a function of the elapsed time to solution
is a constant independent of time.
But are utility functions constant? Certainly not, in general. Often a
national-security utility function will be a step function, constant for 'x'
amount of time and zero after that. In any situation where there is great
value in knowing the solution as soon as possible (perhaps because expensive
resources are sitting idle), we expect the utility function to monotonically
decrease over time.
Intuitively, if time means nothing to you, then, by all means, buy a cheap
computer. But if time means something to you, perhaps you should give some
thought to absolute performance. Indeed, when time to solution matters, you
should buy a capability computer. How much capability is reasonable to buy is
determined by precisely _how_ important time to solution is to you.
More often than not, price/performance is the wrong procurement metric. What
is needed is the ratio of the utility function to the cost function, as will
now be explained. This rejoins some work done by Marc Snir for the PERCS
project---funded by DARPA's HPCS program---when Snir was still at IBM.
Let us fix the application/algorithm 'A' and allow the system 'S' to vary.
Determined by system 'S' (considering both programming and execution times),
there will be a _given_ time to solution, denoted tts_S, required to compute
application/algorithm 'A'. Let u(t) be the utility function and let c(t) be
the cost function (for example, total cost of ownership from procurement to
solution). By taking the ratio u(tts_S)/c(tts_S), we get some measure of the
return on investment of buying and then programming system 'S' to compute
application/algorithm 'A' relative to utility function u(t).
Note that letting system 'S' compute longer just decreases u(t) while
increasing c(t)---because of the extra runtime---thus lowering the return on
investment. The quantity tts_S is thus the _minimal_ time to solution for
application 'A' on system 'S'. By maximizing the ratio u(tts_S)/c(tts_S) over
all systems 'S', you maximize your return on investment.
One might think that being guided by the utility/cost ratio is mandated by
high-end problems only. This would be a mistake. Utility/cost ratios should
replace price/performance metrics even in time-sensitive mid-range projects.
However, it is true that time-dependent utility functions with wide variations
are more _visible_ in high-end projects.
The astute reader may ask if this utility/cost productivity metric assumes
that we only run one application/algorithm 'A' on system 'S'. Not really.
The general productivity crisis in high-end computing is that we have entered
a period of decreasing productivity as measured by the increasing programming
skill required, each time a significantly higher-capability machine becomes
available to an organization, for that organization's programmers to rapidly
make a deep impact on the organization's mission, over an extended period of
time, by repeatedly developing new applications for the new machine. The
utility/cost metric can easily be generalized to handle this case.
Again, price/performance _is_ the appropriate metric when the project is time
insensitive. The logical flavor of all these myths is to assume that the
particular subsumes the general.
A Dangerous Rival
AR then attacks the HPC Challenge benchmark suite itself, which is apparently
a dangerous rival to the continued use of the Linpack benchmark alone, by
claiming that the reporting procedures for HPCC results haven't been properly
institutionalized. To be polite, your correspondent doesn't grasp what AR is
driving at. In some sense, it is even beside the point: if the HPC Challenge
benchmark suite didn't exist (or wasn't properly institutionalized), then we
would have to invent it (or fix it up).
Benchmarks are used to estimate the performance of real-world applications on
new computer systems. Arithmetic performance is paced by the rate at which
data operands can be supplied to the processor. In turn, this rate is
determined by the system, surely, but also by the algorithm, that is, by:
1) whether communication is short-range or long-range, 2) whether there is
temporal locality or not, and 3) whether there is spatial locality or not.
For example, is the program currently engaged in unit-stride or random memory
accessing? Two questions are key. What is its current locality phase? What
is its current computational rate?
These differences in memory-accessing patterns (cf. [107896]) mean that there
are particular quantities of different types of work that are performed by the
system at different computational rates. We can estimate the quantities of
these different types of work by analyzing our application, but we have no
idea of what the computational rate for performing work of type 'abc' on
system 'S' is---unless we have a benchmark that measures how fast system 'S'
performs work of type 'abc'. Multiple benchmark numbers are required
precisely because programs are heterogeneous in their computational rates.
In simple English, programs undergo transitions from one computational rate
to another as the computation unfolds; a suite of carefully designed
benchmarks is required to cover (provide information about) the main
computational rates in order to estimate the performance of the entire
program.
Linpack is representative of only _one_ of those computational rates, namely,
the one that corresponds to extremely high local and global spatial and
temporal locality. It is not that Linpack is embarrassingly parallel; rather,
it is that the dense linear-algebra Linpack is embarrassingly localizable or,
more precisely, _embarrassingly local_ after years of tuning.
Second Myth
Having hit his stride, AR introduces another pernicious myth, namely, that
over time, all (or most) applications migrate to the origin of the locality
phase space as the result of heroic efforts by algorithm designers. In
"HEC Analysis: Who Needs High-Bandwidth Applications?" [108052], HEC studied
---starting from first principles---the prospects for success of the on-going
efforts to "retool codes in application areas such as stockpile stewardship,
global climate modeling, computational fluid dynamics, local and regional
weather forecasting, aircraft design, cosmology, biomolecular simulation,
materials science, and quantum chemistry to run more efficiently on MPP
architectures", as the HECRTF report puts it (p. 63).
HEC offered a canonical high-bandwidth application that could not be retooled
without loss of either resolution or acceptable runtime. To quote HEC:
"The following application structure occurs in at least two application areas:
1) climate simulation, and 2) high-resolution 3-D hydrodynamics with radiation
transfer ('radiation hydro').
Imagine a simulation problem with a vast spectrum of space and time scales.
For reasons of numerical stability, simulation of such a _stiff_ physical
system may require an implicit finite-difference scheme. Convergence may
demand such an approach; often, the reason is merely computational
feasibilty.
Long-range communication will occur if the resulting time steps are big (the
reason we chose the implicit scheme in the first place). Implicit methods
for nonlinear PDEs solve a sparse linear system several times per time step.
Often, the sparse matrix is not very well-conditioned. The intensive use
---once for each of _many_ linear systems---of iterative solvers with
embedded sparse matrix-vector multiplies makes this simulation application
massively bandwidth intensive.
In radiation hydro, the slow half of the calculation is the hydrodynamics
calculation, which iterates over big time steps. The fast half is the
radiosity calculation, which requires the solution of systems of nonlinear
equations using Newton's method. This is iterated over small time steps.
The Newton corrections are found by solving sparse systems of linear
equations. Each big time step thus requires multiple invocations of a
linear-system solver. These solvers are themselves iterative. The end
result is that radiation hydro codes are massively bandwidth intensive.
System stiffness is one attribute that leads to high-bandwidth problems;
a second attribute with the same result is system heterogeneity. For
heterogeneous, e.g., multiphysics, problems, load imbalance is a serious
issue. Irregular, adaptive meshes are another source of imbalance. To
simplify, we could say that _stiffness_ and _adaptivity_ are the two
primary sources of high-bandwidth computational fluid dynamics problems
(more generally, problems of scientific simulation that depend on partial
differential equations)".
Your correspondent does not deny that amazing algorithmic progress has
occurred---in some instances. Realistically, though, what are the chances
that there is a general trend of algorithm changes that localize applications?
There is an extensive history of that not happening. Hard experience gained
over a decade or more has made it abundantly clear that there are many
applications that simply do not migrate to the desired small corner of
locality phase space. These are the Type-C applications, characterized by
significant fine-grained long-range communication, poorly balanced workloads,
sparse linear algebra, implicit methods, irregular, adaptive meshes, and in
which the programming effort is _not_ amortizable.
An Outrageous Suggestion
Finally, having implicitly granted that high-bandwidth systems may be easier
to program, AR makes the truly outrageous suggestion that it may be a good
thing that our high-performance systems are so burdensome to program. This
is ludicrous and wrong and pernicious. This attitude is more than enough to
guarantee that general-purpose parallel computing will never enter the
computing mainstream. Indeed, it is inimical to the evolution that parallel
computing must undergo if computing as a whole hopes to ever tackle those
Grand Challenge problems that were once our goal but which have been quietly
de-emphasized as we repeatedly fail to solve them. In part, the Grand
Challenge problems are high-bandwidth problems that cannot be computed on
low-bandwidth systems.
AR's language is mind boggling: "Have we applied the right computer science
to these applications? ... The debate then is whether public funds should be
used to train better computer scientists who can move these applications down
to run on cheap machines, or whether it should go to build expensive machines
that can accomodate inexperienced programmers and tolerate sloppy
algorithms/programming?".
Gifted computational scientists with solid experience at the national labs,
and exposure to several instances of both high-bandwidth and low-bandwidth
systems (often implementing the same project on widely diverse machines),
have reported their intense frustration in spending months on performance
tuning for low-bandwidth machines with very mediocre performance improvement.
Again, the productivity crisis in high-end computing is that, as time passes,
we require higher and higher degrees of programming skill---present in smaller
and smaller fractions of our elite workforce---to extract proportionate value
from each major architectural advance in high-end systems.
Significant mission impact comes from creating fresh applications to compute
new things that have never been computed before, and the time to program and
tune these fresh applications is increasing. The reason is that we don't have
a good solution to the problem of parallel programming. Our systems don't
support good programming abstractions and we don't have a good programming
language to make these abstractions persistent.
Some observers go so far as to report, for the highest-end machines, a
decreasing number of users of these systems, a decreasing number of people
writing programs for them, a decreasing number of new programs written for
them per unit time, and a decreasing number of institutions that care.
We all agree that the true high-end market has shrunk. The applications are
out there, but by and large users have abandoned their attempts to program
them. Algorithms are becoming more sophisticated. The chief result of
sophistication is increased complexity. No programmer can deal with this
complexity unless architectures and language systems reduce the burdens of
parallel programming to the point where skilled computational scientists and
software engineers have at least a fighting chance to rapidly develop fresh
applications that provide deep mission impact---in part because it is not a
sisyphean task (cf. Greek mythology) to achieve acceptable performance.
It is not our sloppy programmers who are complaining (or leaving the field);
it is our _best_ programmers.
Conclusion
There are two consistent and diametrically opposed worldviews here. The
simplest distinction is that one view says that computing needs _both_ Type-T
_and_ Type-C systems, while the other view hopes to get by with Type-T systems
alone. The other divergences follow. 1) One view tries to complement Linpack
with the HPC Challenge benchmark suite; the other view attacks HPCC. 2) One
view introduces utility/cost as a productivity metric; the other view insists
we stick with price/performance. 3) One view stresses the difficulty of
localizing demanding applications (considering locality in all its forms,
including load balancing); the other view claims that all (or most)
applications eventually localize, if only we try hard enough. 4) One view
bemoans the difficulty of parallel programming; the other view exults in it.
Faced with such a comprehensive (and misguided) ideology, your correspondent
almost doesn't know where to begin. He only says: Look at the record; history
is the best guide.
The High-End Crusader, a noted expert in high-performance computing and
communications, shall remain anonymous. He alone bears responsibility for the
opinions expressed in this article. For all articles, replies are welcome and
may be sent to HPCwire editor Tim Curns at tim@hpcwire.com.
|