
Features:
LLNL SCIENTIST REFUTES HIGH-END CRUSADER'S LINPACK CLAIMS
LETTER TO THE EDITOR
"In another recent High-End Crusader article in HPCwire ("Top500 And HPCC
Benchmarks---What They Can And Can't Do" [107896]), your correspondent argued
that Linpack is woefully inadequate when it is used as the _only_ benchmark
for parallel computers. Essentially, Linpack only gives us performance
estimates for programs with artificial locality characteristics. A Linpack
result only tells you how well a parallel computer runs a program with an
abundance of local and global spatial and temporal locality. Linpack needs to
be complemented with other benchmarks that tell you how well programs with
totally different locality characteristics run." -- The High-End Crusader,
108466.html
I do not disagree with HEC's claim that the Linpack benchmark by itself tells
us only a limited amount about the computer on which it is run, though as a
minor aside, I find this somewhat tautological in that *any* single data point
is necessarily of limited use. Were I buying a large supercomputer, I most
certainly would insist on more performance data than just the Linpack
benchmark result.
However, as the lead for Lawrence Livermore National Lab's Linpack benchmark
team on ASCI Blue in 1999 and as a member of the ScaLapack team at the
University of Tennessee before that, I have spent considerable time working on
algorithms and software for optimizing large, dense linear algebra
computations, including the LU factorization with partial pivoting computation
that defines the Linpack benchmark, and I can say with confidence that the
Linpack is anything *but* an easily parallelizable computation and that it
involves a huge amount of very fine-grained communications and
synchronizations. What localization of communication and computation is
exploited in today's Linpack benchmark codes is the result of extremely clever
algorithmic and data-structure developments, as evidenced by the complexity of
HPCLinpack, the code developed at the University of Tennessee (largely by
Antoine Pettitet and Clint Whaley) that served as the basis of my Linpack
benchmarking effort and I believe several other subsequent efforts: HPCLinpack
is at least one if not two orders of magnitude more complex than a
straightforward implementation of LU factorization with pivoting, that
complexity needed to implement the recursive algorithms that maximize spatial
and temporal locality.
At each step of the computation, where a step corresponds to a single (block)
row of the matrix being factored, fine-grained communication must be used to
communicate information between *every single processor* in the parallel
computer (it is not an all to all communication, but rather a set of parallel
one-to-many communications), and computation cannot proceed until that
communication is completed. This is definitely *not* the kind of easy, coarse-
grained parallelism that the HEC seems to be implying: there is a distinctly
global, tightly synchronized nature to the Linpack computation. Many other
computations map much, much more easily onto parallel computers, e.g. those
with very decoupled physical domains on which considerable computation can be
done independently and only nearest-neighbor communication is required.
If there is an aspect of the Linpack computation that makes it "easier" than
other parallel benchmarks, it is *regularity*. The Linpack computation is
fine-grained and requires much communication, but the computation is extremely
*regular*, and that regularity can be exploited to minimize synchronization
delays and thus keep the processors performing useful work throughout the bulk
of the computation, as long as the communication requirements do not create a
delay. As I said, algorithmic developments developed over a considerable body
of research on the subject have provided the advances needed to keep the
computation flowing, at least until the point that the remaining matrix
becomes too small to cover the communication delays: the efficiency of the
calculation decreases toward the end of the benchmark. Many realistic codes do
not share this property of regularity.
I think it is important to properly characterize the Linpack benchmark as not
being "embarrassingly parallel", which seems to be a possible inference from
HEC's comments on the benchmark. It is anything but: researchers have had to
work very hard to extract the parallel efficiency that they currently get, and
even then, that efficiency is only achieved on enormous problems.
Best Regards,
Andy Cleary
Andrew J. Cleary, Software Engineer/Computational Scientist
Lawrence Livermore National Labs, 925-424-5890.
|