HPCwire
 The global publication of record for High Performance Computing / October 1, 2004: Vol. 13, No. 39

Features:

HIGH-END CRUSADER QUESTIONS IBM'S "FASTEST" COMPUTER CLAIM
Commentary from the High-End Crusader

On Wednesday, IBM claimed title to the world's fastest supercomputer by reporting that a Blue Gene/L system sited at the IBM lab in Rochester, Minnesota had achieved a Linpack-benchmark performance of 36.01 TFs/s, narrowly edging out the Linpack-benchmark performance of the Earth Simulator, which stands at 35.86 TFs/s. Since the Earth Simulator has a peak performance of 40.96 TFs/s and this particular Blue Gene/L system has a peak performance of just over 45 TFs/s, the Earth Simulator is sustaining a somewhat higher fraction of its peak on Linpack. IBM has accomplished something. Still, before breaking out the champagne, we would be well advised to step back and try to gain some perspective about what these numbers mean. They certainly mean something but may not bear all the weight that is being put on them by the media and marketing people.
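The "fraction of peak" comparison is simple arithmetic, and a minimal sketch makes it concrete. The function name is illustrative only, and the Blue Gene/L peak is taken as exactly 45 TFs/s, which is an assumption, since the figure quoted above is "just over 45".

```python
def linpack_efficiency(sustained_tflops, peak_tflops):
    """Return the fraction of peak performance sustained on Linpack."""
    return sustained_tflops / peak_tflops


# Figures quoted in the article, in TFs/s.
es_efficiency = linpack_efficiency(35.86, 40.96)

# Assumption: "just over 45 TFs/s" is rounded down to 45.0 here, so the
# true Blue Gene/L fraction is slightly lower than the value computed.
bgl_efficiency = linpack_efficiency(36.01, 45.0)

print(f"Earth Simulator: {es_efficiency:.1%} of peak")
print(f"Blue Gene/L:     {bgl_efficiency:.1%} of peak")
```

On these figures the Earth Simulator sustains roughly 87.5% of its peak, against roughly 80% for this Blue Gene/L configuration.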

In truth, the Linpack race is becoming a private party for Linux clusters. The winner is just the biggest cluster at the time that doesn't break and has a few robust locality mechanisms that exploit the abundance of local and global spatial and temporal locality in the Linpack benchmark. If the machine being developed at NASA achieves a peak of 50 TFs/s and does not sink beneath 80% efficiency on Linpack, it will beat both the current Earth Simulator and this particular configuration of Blue Gene/L.
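The claim about the NASA machine can be checked the same way. The sketch below uses only the figures given above (a 50 TFs/s peak and an 80% Linpack efficiency); the function name is hypothetical, and the break-even calculation is added purely for illustration.

```python
def projected_linpack(peak_tflops, efficiency):
    """Projected sustained Linpack performance for a given peak and efficiency."""
    return peak_tflops * efficiency


EARTH_SIMULATOR_LINPACK = 35.86  # TFs/s, from the article
BLUE_GENE_L_LINPACK = 36.01      # TFs/s, from the article

# A 50 TFs/s machine that holds 80% efficiency on Linpack.
nasa_projection = projected_linpack(50.0, 0.80)

# Efficiency at which a 50 TFs/s machine would merely tie Blue Gene/L.
break_even_efficiency = BLUE_GENE_L_LINPACK / 50.0

print(f"Projected Linpack result: {nasa_projection:.2f} TFs/s")
print(f"Break-even efficiency:    {break_even_efficiency:.1%}")
```

At 80% efficiency the projection is 40 TFs/s, comfortably ahead of both systems; such a machine would only need to stay above roughly 72% to hold the lead.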

But just how valuable is it to win the Linpack race? How much, if anything, does this have to do with developing a general-purpose parallel computer that can tackle the full range of problems this nation needs to solve? What is a _general-purpose_ parallel computer anyway?

The original working title of the most recent High-End Crusader article in HPCwire ("High-End Computing Needs Radical Programming Change" [108384]) was "High-Productivity General-Purpose Parallel Computing". The plan for that article was to combine two themes. First, as has been often argued here, there is a natural division between high-bandwidth applications/algorithms, which---in the _most_ demanding case---engage in frequent fine-grained long-range communication and thus require strongly parallel, high-bandwidth systems in order to be computed efficiently, and low-bandwidth applications/algorithms, which---in the _least_ demanding case---engage in infrequent coarse-grained short-range communication and thus may be computed efficiently on almost any parallel architecture, including weakly parallel, low-bandwidth systems such as Linux clusters.

Second, as has often been suggested here, conventional parallel machines, i.e., clusters of scalar SMP nodes that communicate among themselves using MPI, have become increasingly burdensome to program. In particular, severe nonuniformity of memory access has led to tight coupling of control and data decomposition. This has produced an unfortunate tradeoff between locality and parallelism for high performance on a given architecture. The solution to this problem is a synergistic mix of architectural improvements and improvements in the programming-language system, in particular the design of new programming abstractions and language constructs for general-purpose parallel programming.

Why does language enter here? Well, do computer architects need reminding that architecture and language are inextricably linked and that, to improve either, we need to improve both? Programming languages obviously need architectural support but they themselves lead to such things as 1) relieving the burden of parallel programming to enhance programmer productivity, 2) allowing fine-grained anonymous communication, 3) exploiting diverse forms of parallelism and locality, and 4) driving computer architecture in the right direction.

Also, the current design thrust in high-bandwidth systems is to combine a broad range of parallelism mechanisms with a broad range of locality mechanisms---all compatible with each other---so that no form of parallelism and no form of locality need be left on the "compiler-room" floor. But given our failure to design localizing compilers, we should abandon our reliance, in the general---often dynamic irregular---case, on general-purpose compiler algorithms that synthesize---without any help from the programmer---programs that handle _all_ forms of localization: moving data up, down, and across the memory hierarchy as the computation unfolds. Instead, linguistic specification of locality in programs was proposed as a feasible division of labor between the programmer and the localizing (and parallelizing) compiler. Indeed, the linguistic specification of locality in a general and portable way is a desirable goal for a parallel programming language designer. The extent to which it is effective will determine the mix of latency tolerance (use of parallelism mechanisms) and latency avoidance (use of locality mechanisms) in general-purpose machines.

In another recent High-End Crusader article in HPCwire ("Top500 And HPCC Benchmarks---What They Can And Can't Do" [107896]), your correspondent argued that Linpack is woefully inadequate when it is used as the _only_ benchmark for parallel computers. Essentially, Linpack only gives us performance estimates for programs with artificial locality characteristics. A Linpack result only tells you how well a parallel computer runs a program with an abundance of local and global spatial and temporal locality. Linpack needs to be complemented with other benchmarks that tell you how well programs with totally different locality characteristics run.

This is not to say that no programs have (varying-length) intervals of abundant locality. It _is_ to say that not all programs have nothing but abundant locality throughout their run. The truth is that real programs pass through locality phases. This is the underlying justification for the HPC Challenge benchmark suite, which proposes an empirical theory of locality phases. To repeat (cf. [107896]), the locality phases are the elements in the Cartesian product of: 1) long-range or short-range communication, 2) temporally local or not, and 3) spatially local or not. Linpack lies at the origin of this Cartesian product (short-range, temporally local, spatially local).
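The Cartesian product described above is small enough to enumerate. The following sketch lists all eight locality phases; the label strings are illustrative wording, not terminology taken from the HPC Challenge suite itself.

```python
from itertools import product

# The three binary axes named in the text. The first entry on each axis
# is the "local" choice, so the first element of the product is the origin.
COMMUNICATION = ("short-range", "long-range")
TEMPORAL = ("temporally local", "not temporally local")
SPATIAL = ("spatially local", "not spatially local")

locality_phases = list(product(COMMUNICATION, TEMPORAL, SPATIAL))

# Linpack sits at the origin: local on every axis.
linpack_phase = locality_phases[0]

for phase in locality_phases:
    marker = "  <- Linpack" if phase == linpack_phase else ""
    print(", ".join(phase) + marker)
```

Seven of the eight phases are left unexamined when Linpack is the only benchmark reported.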

It is the _non-Linpack_ benchmarks in the HPC Challenge benchmark suite that make useful distinctions between, for example, the Earth Simulator and Blue Gene/L. Let us roughly characterize these two architectures to decide which one comes closer to being a general-purpose parallel computer.

The Earth Simulator is a cluster of vector SMP nodes that communicate among themselves using MPI. The ES is characterized by an unusually high bandwidth to both local memory _and_ remote memory across the interconnection network. If your correspondent remembers correctly, the bandwidth to local memory is 16 GBs/s, while the inter-node bandwidth is also 16 GBs/s if the MPI overhead is neglected (otherwise, it falls to 12 GBs/s). The vector processors are strongly parallel. Hence, all latency to local memory can be tolerated. In contrast, MPI has a parallelism equal to the message length. Therefore, the ES will compute applications/algorithms superbly well if they have any kind of short-range communication but only if their long-range communication is essentially coarse grained. For coarse-grained long-range communication, the ES has an effective bandwidth equal to its hardware bisection bandwidth, which is possibly 10.24 TBs/s (there are some technical subtleties here but the bisection bandwidth is clearly a lot).
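One way to arrive at the 10.24 TBs/s figure is to multiply the per-node inter-node bandwidth by the number of nodes. The sketch below assumes 640 nodes, the commonly cited node count for the Earth Simulator, which is not stated in this article. Whether this aggregate figure is a true bisection bandwidth is exactly the kind of technical subtlety mentioned above.

```python
# Assumption: 640 nodes is the commonly cited Earth Simulator node count;
# it does not appear in the article itself.
ES_NODES = 640


def aggregate_bandwidth_tbs(nodes, per_node_gbs):
    """Aggregate inter-node bandwidth in TBs/s, using 1 TB = 1000 GB."""
    return nodes * per_node_gbs / 1000.0


# Per-node inter-node bandwidths quoted in the article, in GBs/s.
raw_aggregate = aggregate_bandwidth_tbs(ES_NODES, 16.0)  # MPI overhead neglected
mpi_aggregate = aggregate_bandwidth_tbs(ES_NODES, 12.0)  # with MPI overhead

print(f"Aggregate, overhead neglected: {raw_aggregate:.2f} TBs/s")
print(f"Aggregate, with MPI overhead:  {mpi_aggregate:.2f} TBs/s")
```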

In benchmark terms, the ES would have great performance numbers on Linpack, Stream, local Gups, Ptrans, but not global Gups (again, cf. [107896]). It would be an HPC Challenge winner.

Blue Gene/L was originally conceived for protein-folding simulation. Again, if memory serves, this particular application/algorithm essentially engages in fine-grained short-range communication. By extension, we should ask if Blue Gene/L is _only_ suitable for applications/algorithms with this pattern of accessing memory. Your correspondent suspects that this is the case. Compared to the ES, Blue Gene/L is _not_ a general-purpose parallel computer.

Blue Gene/L is a hierarchically local architecture. Mini single-chip, dual-processor "units" are combined with SDRAM-DDR external-memory chips on compute cards. Sixteen compute cards are grouped in a single node board. Thirty-two node boards are grouped to form a cabinet. Sixty-four cabinets are grouped to form a system.
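The packaging hierarchy multiplies out as follows. This is a sketch for the full 64-cabinet system described above, not for the smaller configuration that ran Linpack. The number of dual-processor units per compute card is not given here, so it is left as a parameter, and the value passed in the example call is a placeholder, not a specification.

```python
# Grouping factors taken from the article.
CARDS_PER_NODE_BOARD = 16
NODE_BOARDS_PER_CABINET = 32
CABINETS_PER_SYSTEM = 64
PROCESSORS_PER_UNIT = 2  # each unit is a single-chip, dual-processor part


def compute_cards_per_system():
    """Total compute cards in a full 64-cabinet system."""
    return CARDS_PER_NODE_BOARD * NODE_BOARDS_PER_CABINET * CABINETS_PER_SYSTEM


def processors_per_system(units_per_card):
    """Total processors, given an assumed number of units per compute card."""
    return compute_cards_per_system() * units_per_card * PROCESSORS_PER_UNIT


print(f"Compute cards per system: {compute_cards_per_system()}")

# Placeholder assumption: one unit per compute card (not stated in the article).
print(f"Processors at 1 unit/card: {processors_per_system(units_per_card=1)}")
```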

Blue Gene/L does _not_ have the rich bandwidths of the ES. According to one report, the bandwidth to local memory is 6 GBs/s, while the system bandwidth for MPI communication between node boards is 2 GBs/s. The PowerPC 440GX processors are weakly parallel. Latency won't be a problem for communication within a compute card. Your correspondent suspects that it will be a problem within a node board and a major problem between node boards. This system looks like it will perform superbly only on strongly localized applications. Blue Gene/L, more than most systems, needs a language system with locality specification if it wishes to target more than a narrow locality spectrum of applications.
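A rough way to see the difference is to compare remote bandwidth with local bandwidth on each machine. The sketch below uses only the figures quoted in this article, which are themselves recalled or second-hand, so the ratios are indicative at best. The dictionary keys and function name are illustrative.

```python
# (local memory bandwidth, remote bandwidth) in GBs/s, as quoted in the article.
BANDWIDTHS = {
    "Earth Simulator (MPI overhead neglected)": (16.0, 16.0),
    "Earth Simulator (with MPI overhead)": (16.0, 12.0),
    "Blue Gene/L (between node boards)": (6.0, 2.0),
}


def remote_to_local_ratio(local_gbs, remote_gbs):
    """Remote bandwidth as a fraction of local-memory bandwidth."""
    return remote_gbs / local_gbs


ratios = {
    name: remote_to_local_ratio(local, remote)
    for name, (local, remote) in BANDWIDTHS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

On these numbers the ES delivers between three-quarters and all of its local bandwidth to remote memory, while Blue Gene/L delivers about a third, from a lower local figure to begin with.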

IBM could report the results for Blue Gene/L on the entire HPC Challenge benchmark suite, as separate performance numbers. In benchmark terms, your correspondent would expect the ES to outperform Blue Gene/L on all remaining benchmarks (he doesn't have all the speeds and feeds, and doesn't know how compute cards communicate). Except very locally, Blue Gene/L appears to have both a parallelism deficit and a bandwidth deficit. This is a very preliminary assessment.

One thing is clear: edging out the ES on Linpack doesn't mean much for anything other than strongly localizable applications.

With the Bush administration's failure to fund HECRTF (don't believe statements to the contrary), we are hanging by a thread called DARPA. Some people are smart enough to "get" high-end computing. Some of these people helped define DARPA's HPCS program, which identified a productivity---not just a performance---crisis in HEC. This involves architectures and languages, among other things.

HECRTF is necessary in order to get university research in high-end computing restarted. With the exception of very few individuals, academic research in high-end computing is a vast wasteland. It would be a bitter irony if satisfaction that an American vendor can squeak past the ES on Linpack (so what?) were to remove the sense of crisis---in part caused by the ES---and allow policy makers to get away with the _national scandal_ of not funding HECRTF, with fresh money. My God, what are people's priorities?


The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for the opinions expressed in this piece. As with regular articles, replies are welcome and may be sent to HPCwire editor Tim Curns at tim@hpcwire.com.

