HPCwire
 The global publication of record for High Performance Computing / October 8, 2004: Vol. 13, No. 40

  |  Table of Contents  |  

Features:

LLNL SCIENTIST REFUTES HIGH-END CRUSADER'S LINPACK CLAIMS
LETTER TO THE EDITOR

"In another recent High-End Crusader article in HPCwire ("Top500 And HPCC Benchmarks---What They Can And Can't Do" [107896]), your correspondent argued that Linpack is woefully inadequate when it is used as the _only_ benchmark for parallel computers. Essentially, Linpack only gives us performance estimates for programs with artificial locality characteristics. A Linpack result only tells you how well a parallel computer runs a program with an abundance of local and global spatial and temporal locality. Linpack needs to be complemented with other benchmarks that tell you how well programs with totally different locality characteristics run." -- The High-End Crusader, 108466.html


I do not disagree with HEC's claim that the Linpack benchmark by itself tells us only a limited amount about the computer on which it is run, though as a minor aside, I find this somewhat tautological in that *any* single data point is necessarily of limited use. Were I buying a large supercomputer, I most certainly would insist on more performance data than just the Linpack benchmark result.

However, as the lead for Lawrence Livermore National Lab's Linpack benchmark team on ASCI Blue in 1999 and as a member of the ScaLapack team at the University of Tennessee before that, I have spent considerable time working on algorithms and software for optimizing large, dense linear algebra computations, including the LU factorization with partial pivoting computation that defines the Linpack benchmark, and I can say with confidence that the Linpack is anything *but* an easily parallelizable computation and that it involves a huge amount of very fine-grained communications and synchronizations. What localization of communication and computation is exploited in today's Linpack benchmark codes is the result of extremely clever algorithmic and data-structure developments, as evidenced by the complexity of HPCLinpack, the code developed at the University of Tennessee (largely by Antoine Pettitet and Clint Whaley) that served as the basis of my Linpack benchmarking effort and I believe several other subsequent efforts: HPCLinpack is at least one if not two orders of magnitude more complex than a straightforward implementation of LU factorization with pivoting, that complexity needed to implement the recursive algorithms that maximize spatial and temporal locality.

At each step of the computation, where a step corresponds to a single (block) row of the matrix being factored, fine-grained communication must be used to communicate information between *every single processor* in the parallel computer (it is not an all to all communication, but rather a set of parallel one-to-many communications), and computation cannot proceed until that communication is completed. This is definitely *not* the kind of easy, coarse- grained parallelism that the HEC seems to be implying: there is a distinctly global, tightly synchronized nature to the Linpack computation. Many other computations map much, much more easily onto parallel computers, e.g. those with very decoupled physical domains on which considerable computation can be done independently and only nearest-neighbor communication is required.

If there is an aspect of the Linpack computation that makes it "easier" than other parallel benchmarks, it is *regularity*. The Linpack computation is fine-grained and requires much communication, but the computation is extremely *regular*, and that regularity can be exploited to minimize synchronization delays and thus keep the processors performing useful work throughout the bulk of the computation, as long as the communication requirements do not create a delay. As I said, algorithmic developments developed over a considerable body of research on the subject have provided the advances needed to keep the computation flowing, at least until the point that the remaining matrix becomes too small to cover the communication delays: the efficiency of the calculation decreases toward the end of the benchmark. Many realistic codes do not share this property of regularity.

I think it is important to properly characterize the Linpack benchmark as not being "embarrassingly parallel", which seems to be a possible inference from HEC's comments on the benchmark. It is anything but: researchers have had to work very hard to extract the parallel efficiency that they currently get, and even then, that efficiency is only achieved on enormous problems.

Best Regards,
Andy Cleary

Andrew J. Cleary, Software Engineer/Computational Scientist Lawrence Livermore National Labs, 925-424-5890.


Top of Page

  |  Table of Contents  |