
Features:
HIGH-END CRUSADER QUESTIONS IBM'S "FASTEST" COMPUTER CLAIM
Commentary from the High-End Crusader
On Wednesday, IBM claimed title to the world's fastest supercomputer by
reporting that a Blue Gene/L system sited at the IBM lab in Rochester,
Minnesota had achieved a Linpack-benchmark performance of 36.01 TFs/s,
narrowly edging out the Linpack-benchmark performance of the Earth Simulator,
which is only 35.86 TFs/s. Since the Earth Simulator has a peak performance
of 40.96 TFs/s and this particular Blue Gene/L system has a peak performance
of just over 45 TFs/s, the Earth Simulator is sustaining a somewhat higher
fraction of its peak on Linpack (roughly 87.5% versus roughly 80%). IBM has
accomplished something. Still, before breaking out the champagne, we are well
advised to step back and try to gain
some perspective about what these numbers mean. They certainly mean something
but may not bear all the weight that is being put on them by the media and
marketing people.
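For readers who want to check the fraction-of-peak comparison, here is a
minimal Python sketch of the arithmetic. The Linpack and peak figures are the
ones quoted above; the Blue Gene/L peak is given only as "just over 45 TFs/s",
so the value 45.0 is an assumption and its efficiency is a slight overestimate.
The function and variable names are illustrative only.

```python
# Figures quoted in the article, in TFs/s.
# ASSUMPTION: the Blue Gene/L peak is stated only as "just over 45",
# so 45.0 is an approximation, not a reported number.
SYSTEMS = {
    "Earth Simulator": {"linpack": 35.86, "peak": 40.96},
    "Blue Gene/L (Rochester)": {"linpack": 36.01, "peak": 45.0},
}


def linpack_efficiency(linpack, peak):
    """Return the fraction of peak performance sustained on Linpack."""
    return linpack / peak


for name, figures in SYSTEMS.items():
    eff = linpack_efficiency(figures["linpack"], figures["peak"])
    print(f"{name}: {eff:.1%} of peak")
```

On these inputs the Earth Simulator comes out near 87.5% of peak and this
Blue Gene/L configuration near 80%.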
In truth, the Linpack race is becoming a private party for Linux clusters. The
winner is just the biggest cluster at the time that doesn't break and has a
few robust locality mechanisms that exploit the abundance of local and global
spatial and temporal locality in the Linpack benchmark. If the machine being
developed at NASA achieves a peak of 50 TFs/s and does not sink beneath 80%
efficiency on Linpack, it will beat both the current Earth Simulator and this
particular configuration of Blue Gene/L.
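The NASA scenario is easy to check. The sketch below, in Python, takes the
hypothetical 50 TFs/s peak and 80% efficiency from the paragraph above and
also computes the break-even efficiency such a machine would need. The helper
names are illustrative, and nothing here is a measured result for any real
system.

```python
# Linpack results quoted in the article, in TFs/s.
BLUE_GENE_L_LINPACK = 36.01
EARTH_SIMULATOR_LINPACK = 35.86

# Hypothetical machine from the text: 50 TFs/s peak at 80% efficiency.
HYPOTHETICAL_PEAK = 50.0
HYPOTHETICAL_EFFICIENCY = 0.80


def sustained(peak, efficiency):
    """Sustained Linpack rate for a given peak and fraction of peak."""
    return peak * efficiency


def break_even_efficiency(peak, target):
    """Fraction of peak needed to match a target Linpack rate."""
    return target / peak


rate = sustained(HYPOTHETICAL_PEAK, HYPOTHETICAL_EFFICIENCY)
needed = break_even_efficiency(HYPOTHETICAL_PEAK, BLUE_GENE_L_LINPACK)
print(f"sustained at 80%: {rate:.2f} TFs/s")
print(f"beats Blue Gene/L: {rate > BLUE_GENE_L_LINPACK}")
print(f"beats Earth Simulator: {rate > EARTH_SIMULATOR_LINPACK}")
print(f"break-even efficiency: {needed:.1%}")
```

At 80% efficiency the hypothetical machine sustains about 40 TFs/s, and it
would need only about 72% efficiency to pass the Blue Gene/L figure.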
But just how valuable is it to win the Linpack race? How much, if anything,
does this have to do with developing a general-purpose parallel computer that
can tackle the full range of problems this nation needs to solve? What is a
_general-purpose_ parallel computer anyway?
The original working title of the most recent High-End Crusader article in
HPCwire ("High-End Computing Needs Radical Programming Change" [108384]) was
"High-Productivity General-Purpose Parallel Computing". The plan for that
article was to combine two themes. First, as has been often argued here,
there is a natural division between high-bandwidth applications/algorithms,
which---in the _most_ demanding case---engage in frequent fine-grained long-
range communication and thus require strongly parallel, high-bandwidth systems
in order to be computed efficiently, and low-bandwidth
applications/algorithms, which---in the _least_ demanding case---engage in
infrequent coarse-grained short-range communication and thus may be computed
efficiently on almost any parallel architecture, including weakly parallel,
low-bandwidth systems such as Linux clusters.
Second, as has often been suggested here, conventional parallel machines,
i.e., clusters of scalar SMP nodes that communicate among themselves using
MPI, have become increasingly burdensome to program. In particular, severe
nonuniformity of memory access has led to tight coupling of control and data
decomposition. This has produced an unfortunate tradeoff between locality and
parallelism for high performance on a given architecture. The solution to
this problem is a synergistic mix of architectural improvements and
improvements in the programming-language system, in particular the design of
new programming abstractions and language constructs for general-purpose
parallel programming.
Why does language enter here? Well, do computer architects need reminding
that architecture and language are inextricably linked and that, to improve
either, we need to improve both? Programming languages obviously need
architectural support but they themselves lead to such things as 1) relieving
the burden of parallel programming to enhance programmer productivity, 2)
allowing fine-grained anonymous communication, 3) exploiting diverse forms of
parallelism and locality, and 4) driving computer architecture in the right
direction.
Also, the current design thrust in high-bandwidth systems is to combine a
broad range of parallelism mechanisms with a broad range of locality
mechanisms---all compatible with each other---so that no form of parallelism
and no form of locality need be left on the "compiler-room" floor. But given
our failure to design localizing compilers, we should abandon our reliance, in
the general---often dynamic irregular---case, on general-purpose compiler
algorithms that synthesize---without any help from the programmer---programs
that handle _all_ forms of localization: moving data up, down, and across the
memory hierarchy as the computation unfolds. Instead, linguistic
specification of locality in programs was proposed as a feasible division of
labor between the programmer and the localizing (and parallelizing) compiler.
Indeed, the linguistic specification of locality in a general and portable way
is a desirable goal for a parallel programming language designer. The extent
to which it is effective will determine the mix of latency tolerance (use of
parallelism mechanisms) and latency avoidance (use of locality mechanisms) in
general-purpose machines.
In another recent High-End Crusader article in HPCwire ("Top500 And HPCC
Benchmarks---What They Can And Can't Do" [107896]), your correspondent argued
that Linpack is woefully inadequate when it is used as the _only_ benchmark
for parallel computers. Essentially, Linpack only gives us performance
estimates for programs with artificial locality characteristics. A Linpack
result only tells you how well a parallel computer runs a program with an
abundance of local and global spatial and temporal locality. Linpack needs to
be complemented with other benchmarks that tell you how well programs with
totally different locality characteristics run.
This is not to say that no programs have (varying-length) intervals of
abundant locality. It _is_ to say that not all programs have nothing but
abundant locality throughout their run. The truth is that real programs pass
through locality phases. This is the underlying justification for the HPC
Challenge benchmark suite, which proposes an empirical theory of locality
phases. To repeat (cf. [107896]), the locality phases are the elements in the
Cartesian product of: 1) long-range or short-range communication, 2)
temporally local or not, and 3) spatially local or not. Linpack lies at the
origin of this Cartesian product (short-range, temporally local, spatially
local).
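The Cartesian product can be written out explicitly. The following Python
sketch simply enumerates the eight locality phases implied by the three binary
axes. The labels and the ordering (which puts the Linpack phase first, at the
"origin") are choices made for this illustration, not part of the HPC
Challenge definition.

```python
from itertools import product

# The three binary axes named in the text. Label strings and ordering
# are illustrative choices, not taken from the benchmark suite itself.
COMMUNICATION = ("short-range", "long-range")
TEMPORAL = ("temporally local", "not temporally local")
SPATIAL = ("spatially local", "not spatially local")


def locality_phases():
    """Return all combinations of the three axes as a list of tuples."""
    return list(product(COMMUNICATION, TEMPORAL, SPATIAL))


phases = locality_phases()
# With this ordering the first element is the phase the text assigns
# to Linpack: short-range, temporally local, spatially local.
LINPACK_PHASE = phases[0]

for phase in phases:
    marker = "  <- Linpack" if phase == LINPACK_PHASE else ""
    print(", ".join(phase) + marker)
```

Seven of the eight phases are ones that Linpack, by itself, does not exercise.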
It is the _non-Linpack_ benchmarks in the HPC Challenge benchmark suite that
make useful distinctions between, for example, the Earth Simulator and Blue
Gene/L. Let us roughly characterize these two architectures to decide which
one comes closer to being a general-purpose parallel computer.
The Earth Simulator is a cluster of vector SMP nodes that communicate among
themselves using MPI. The ES is characterized by an unusually high bandwidth
to both local memory _and_ remote memory across the interconnection network.
If your correspondent remembers correctly, the bandwidth to local memory is 16
GBs/s, while the inter-node bandwidth is also 16 GBs/s if the MPI overhead is
neglected (otherwise, it falls to 12 GBs/s). The vector processors are
strongly parallel. Hence, all latency to local memory can be tolerated. In
contrast, MPI has a parallelism equal to the message length. Therefore, the
ES will compute applications/algorithms superbly well if they have any kind of
short-range communication but only if their long-range communication is
essentially coarse grained. For coarse-grained long-range communication, the
ES has an effective bandwidth equal to its hardware bisection bandwidth, which
is possibly 10.24 TBs/s (there are some technical subtleties here but the
bisection bandwidth is clearly a lot).
In benchmark terms, the ES would have great performance numbers on Linpack,
Stream, local Gups, and Ptrans, but not on global Gups (again, cf. [107896]).
It would be an HPC Challenge winner.
Blue Gene/L was originally conceived for protein-folding simulation. Again,
if memory serves, this particular application/algorithm essentially engages in
fine-grained short-range communication. By extension, we should ask if Blue
Gene/L is _only_ suitable for applications/algorithms with this pattern of
accessing memory. Your correspondent suspects that this is the case. Compared
to the ES, Blue Gene/L is _not_ a general-purpose parallel computer.
Blue Gene/L is a hierarchically local architecture. Mini single-chip, dual-
processor "units" are combined with SDRAM-DDR external-memory chips on compute
cards. Sixteen compute cards are grouped in a single node board. Thirty-two
node boards are grouped to form a cabinet. Sixty-four cabinets are grouped to
form a system.
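To get a feel for the scale of this hierarchy, the counts above can be
multiplied out. This Python sketch uses only the grouping factors given in the
text. The number of dual-processor units per compute card is not stated here,
so it is left as a parameter the caller must supply; all names are
illustrative.

```python
# Grouping factors stated in the text.
CARDS_PER_NODE_BOARD = 16
NODE_BOARDS_PER_CABINET = 32
CABINETS_PER_SYSTEM = 64
PROCESSORS_PER_UNIT = 2  # the "units" are described as dual-processor


def compute_cards(cabinets=CABINETS_PER_SYSTEM):
    """Number of compute cards in the given number of cabinets."""
    return CARDS_PER_NODE_BOARD * NODE_BOARDS_PER_CABINET * cabinets


def processors(units_per_card, cabinets=CABINETS_PER_SYSTEM):
    """Processor count for a given number of units per compute card.

    ASSUMPTION: units_per_card is not given in the article, so it has
    no default and must be supplied by the caller.
    """
    return compute_cards(cabinets) * units_per_card * PROCESSORS_PER_UNIT


print("compute cards per cabinet:", compute_cards(cabinets=1))
print("compute cards per full system:", compute_cards())
```

A single cabinet holds 512 compute cards and a full 64-cabinet system holds
32,768, which is why latency between node boards matters so much.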
Blue Gene/L does _not_ have the rich bandwidths of the ES. According to one
report, the bandwidth to local memory is 6 GBs/s, while the system bandwidth
for MPI communication between node boards is 2 GBs/s. The PowerPC 440GX
processors are weakly parallel. Latency won't be a problem for communication
within a compute card. Your correspondent suspects that it will be a problem
within a node board and a major problem between node boards. This system
looks like it will perform superbly only on strongly localized applications.
Blue Gene/L, more than most systems, needs a language system with locality
specification if it wishes to target more than a narrow locality spectrum of
applications.
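One rough way to see the contrast is to compare the ratio of remote to local
bandwidth for the two machines. The Python sketch below uses the figures
quoted in this article, which are recalled from memory or taken from a single
report and may not be measured at the same granularity, so the ratios are
indicative only. The function name is hypothetical.

```python
# Bandwidth figures quoted in the article, in GBs/s.
# ASSUMPTION: the figures are comparable across machines; the article
# gives them from memory and from "one report", so treat with caution.
BANDWIDTHS = {
    "Earth Simulator (MPI overhead neglected)": {"local": 16.0, "remote": 16.0},
    "Earth Simulator (with MPI overhead)": {"local": 16.0, "remote": 12.0},
    "Blue Gene/L (between node boards)": {"local": 6.0, "remote": 2.0},
}


def remote_to_local_ratio(local, remote):
    """Remote bandwidth as a fraction of local-memory bandwidth."""
    return remote / local


for name, bw in BANDWIDTHS.items():
    ratio = remote_to_local_ratio(bw["local"], bw["remote"])
    print(f"{name}: remote/local = {ratio:.2f}")
```

On these numbers the ES keeps remote bandwidth at 75% to 100% of local
bandwidth, while Blue Gene/L drops to about a third, in addition to starting
from a much lower local figure.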
IBM could report the results for Blue Gene/L on the entire HPC Challenge
benchmark suite, as separate performance numbers. In benchmark terms, your
correspondent would expect the ES to outperform Blue Gene/L on all remaining
benchmarks (he doesn't have all the speeds and feeds, and doesn't know how
compute cards communicate). Except very locally, we appear to have both a
parallelism deficit and a bandwidth deficit. This is a very preliminary
assessment.
One thing is clear: edging out the ES on Linpack doesn't mean much for
anything other than strongly localizable applications.
With the Bush administration's failure to fund HECRTF (don't believe
statements to the contrary), we are hanging by a thread called DARPA. Some
people are smart enough to "get" high-end computing. Some of these people
helped define DARPA's HPCS program, which identified a productivity---not
just a performance---crisis in HEC. This involves architectures and
languages, among other things.
HECRTF is necessary in order to get university research in high-end computing
restarted. With the exception of very few individuals, academic research in
high-end computing is a vast wasteland. It would be a bitter irony if
satisfaction that an American vendor can squeak past the ES on Linpack (so
what?) were to remove the sense of crisis---in part caused by the ES---and
allow policy makers to get away with the _national scandal_ of not funding
HECRTF with fresh money. My God, what are people's priorities?
The High-End Crusader, a noted expert in high-performance computing and
communications, shall remain anonymous. He alone bears responsibility for the
opinions expressed in this piece. As with regular articles, replies are
welcome and may be sent to HPCwire editor Tim Curns at tim@hpcwire.com.