HPCwire
 The global publication of record for High Performance Computing / June 25, 2004: Vol. 13, No. 25

Features:

LETTERS TO THE EDITOR: RESPONDING TO THE HIGH-END CRUSADER

In an article entitled "REVITALIZING HECRTF: A FOCUSED PLAN FOR HIGH-END COMPUTING", the "High-End Crusader" provides us with a rambling, but generally insightful, monologue concerning recent efforts to generate renewed federal support for HPC.

Although the article's explicit messages are generally reasonable (ignoring the obvious biases toward one particular vendor's product plans), there are some implicit messages that are not so reasonable and that can cause confusion.

The biggest confusion is associated with the implication throughout the article that "high performance" applications are necessarily "high bandwidth" applications. This is an incorrect, though widely held, belief.

The article also asserts that: "[t]he bulk of the high-performance market, perhaps 98%, is occupied by mid-range servers that have been optimized to suit commercial applications such as transaction processing and information retrieval." Both the percentage and the assumption about the vendors' optimization strategies are incorrect here, but it is the latter mistake that is most misleading.

The combination of the incorrect assertion about bandwidth requirements and the incorrect assertion about vendor optimization strategies could easily lead one to believe that computer vendors are stubbornly refusing to build the high-bandwidth systems that HPC customers want and that if the vendors would just provide more bandwidth then performance would dramatically increase and everyone would be happy.

As appealing as that idea is to the conspiracy theorist in me, it is not even close to being true.

As I have shown in many venues over the last several years (e.g., http://www.cs.virginia.edu/~mccalpin/wwc-keynote.html), significant data exists to show that the majority of HPC applications are limited in performance by computation and/or cache bandwidth, not by memory bandwidth, when run on microprocessor-based systems with reasonably large caches. Some of the examples are well known (e.g., most computational chemistry, most seismic data processing), while others are more surprising (weather and climate models, automotive crash simulations, several important nuclear stockpile stewardship applications).

Once this is understood, the behavior of computer vendors can be seen as rational -- vendors tend to provide products that customers choose to purchase. That most servers sold for HPC applications have modest sustainable memory and interconnect bandwidth is not the result of some secret conspiracy, but rather of the "invisible hand" of the free market. The systems that get purchased tend to be the ones that optimize price/performance for the customers' applications. These are indeed "balanced" systems, but the "right" balance between cost and performance is determined by the aggregate market buying behavior of HPC customers, not by oversimplified theoretical analyses.

(Perhaps most HPCwire readers are not aware that many vendors have offered versions of their systems with increased memory bandwidth per processor, but in every case that I am aware of, these have been unsuccessful, even in HPC. Many of these "higher-bandwidth" versions have been cancelled shortly after their introductions due to a lack of customer interest.)

Over the last 15 years, the "balance" that has been most successful in the overall HPC market (for servers with cached/hierarchical memories) has corresponded to a sustainable memory bandwidth of between 1 Byte/FLOP (more precisely, 1 Byte/second per peak FP operation/second) and 0.5 Bytes/FLOP, with a slow but apparently significant trend toward the lower values (less bandwidth per peak FLOP). Significantly less data is available for interconnect bandwidth requirements. For MPI-based applications, the required interconnect bandwidth appears to be about an order of magnitude less than the local memory bandwidth, but it seems likely that a large fraction of this "bandwidth reduction" is associated with the programmer effort required to port the application to MPI, so this ratio may not apply to programming models based on global namespaces (and requiring significantly less porting effort). Even less data exists that would allow us to quantitatively investigate second-order issues, such as locality of communication between nodes for HPC applications run on large clusters.
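
To make the "balance" figures above concrete, here is a minimal Python sketch of the arithmetic. The system names and the bandwidth and peak numbers are hypothetical, chosen only for illustration; they are not measurements of any real machine. The vector update a[i] = b[i] + q*c[i] is included as an assumed example of a bandwidth-hungry loop, counting only its three 8-byte array accesses and two floating-point operations per iteration.

```python
def machine_balance(sustained_gbytes_per_s, peak_gflops):
    """Sustainable memory bandwidth per peak FLOP, in Bytes/FLOP."""
    if peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    return sustained_gbytes_per_s / peak_gflops


def kernel_demand(bytes_moved, flops):
    """Memory traffic a kernel needs per floating-point operation, in Bytes/FLOP."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return bytes_moved / flops


def in_observed_range(balance, low=0.5, high=1.0):
    """True if a balance falls in the 0.5 to 1 Byte/FLOP range cited in the text."""
    return low <= balance <= high


# Hypothetical systems: (sustained GB/s, peak GFLOPS). Illustrative numbers only.
HYPOTHETICAL_SYSTEMS = {
    "node_a": (6.0, 8.0),
    "node_b": (4.0, 8.0),
    "node_c": (2.0, 8.0),
}

# a[i] = b[i] + q*c[i]: two 8-byte loads and one 8-byte store, one multiply
# and one add. Write-allocate traffic is deliberately not counted here.
TRIAD_DEMAND = kernel_demand(bytes_moved=3 * 8, flops=2)

if __name__ == "__main__":
    for name, (bandwidth, peak) in HYPOTHETICAL_SYSTEMS.items():
        balance = machine_balance(bandwidth, peak)
        print(f"{name}: {balance:.2f} Bytes/FLOP, "
              f"within 0.5-1.0: {in_observed_range(balance)}")
    print(f"vector update demand: {TRIAD_DEMAND:.1f} Bytes/FLOP")
```

A loop whose demand sits well above a machine's balance will tend to be limited by memory bandwidth whenever its data do not fit in cache; a loop that reuses cached data heavily has a much lower effective demand and is more likely to be limited by computation.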

There remain, of course, application areas in HPC for which the most desirable algorithms require high memory bandwidth and/or high interconnect bandwidth on current architectures. (I developed the STREAM benchmark because my application area of large-scale ocean circulation modelling had this characteristic.) As correctly stated in the article, the problem is that these application areas are not associated with sufficient "buying power" to create a self-sustaining market in high-bandwidth computers.

There are fundamental changes in the optimum balance of systems at extreme scales (e.g., >10,000 processors). Most algorithms for "Grand Challenge" problems are characterized by scaling laws that lead to more node-to-node communication per unit of computation as the system size grows (given the typical "fixed time to solution" scaling methodology). The communications also tend toward shorter message lengths, and if the algorithm includes any collective operations, the number of these operations per unit of computation grows very rapidly. In line with Stalin's famous aphorism, there comes a point when these quantitative differences in system balance become qualitative.

These trends are clear in recent experience with clusters delivering sustained performance in the TFLOPS range, and the trends are expected to continue for larger system sizes (especially if semiconductor technology fails to continue its recent rapid rate of advance, so that performance increases must be obtained primarily from increasing the number of processors cooperating on each job). On the other hand, it does not appear that local memory bandwidth requirements increase in these extreme-scale systems. Fixed time to solution scaling leads to smaller data sets per node, which are often slightly more cache-friendly. A more important long-term trend is that increasingly complex models tend to be increasingly cache-friendly, since they require more computation per unit of data.
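
One common way to see the effects described above is a surface-to-volume argument. The sketch below assumes, purely for illustration, a cubic n x n x n block of grid cells per node, a one-cell-deep boundary exchange across six faces, and 8 bytes per cell. The function name halo_profile and all of these specifics are hypothetical; none of them come from the applications discussed in this letter.

```python
def halo_profile(n, bytes_per_cell=8):
    """Per-node profile for an assumed n x n x n subdomain with a one-cell halo.

    Computation is taken to be proportional to the number of interior cells,
    and communication to the number of cells on the six faces.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    interior_cells = n ** 3
    face_cells = n ** 2
    return {
        # Data held by one node: shrinks as n shrinks, so it fits cache more easily.
        "working_set_bytes": interior_cells * bytes_per_cell,
        # Size of one face message: shorter messages as n shrinks.
        "message_bytes": face_cells * bytes_per_cell,
        # Cells communicated per cell computed: grows as n shrinks.
        "comm_per_compute": 6 * face_cells / interior_cells,
    }


if __name__ == "__main__":
    # Smaller subdomains per node stand in for the smaller per-node data sets
    # that result when more processors cooperate on each job.
    for n in (256, 128, 64, 32):
        profile = halo_profile(n)
        print(f"n={n:4d}  working set={profile['working_set_bytes']:>10d} B  "
              f"message={profile['message_bytes']:>7d} B  "
              f"comm/compute={profile['comm_per_compute']:.4f}")
```

Under these assumptions, halving n doubles the communication per unit of computation, quarters the message length, and cuts the per-node working set by a factor of eight, which is consistent with the direction of all three trends noted above. Collective operations are not modeled here.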

These scaling laws lead to a divergence between the "optimal" design points for modest-scale parallelism (<$1M system price) and extreme-scale parallelism (>>$10M system price) that makes HPC system design very challenging. Note that only about 5% of HPC server revenue is associated with the >$10M systems, and even that small amount of money is associated with a rather wide variety of application performance profiles.

DISCLAIMERS: I currently work for, but do not speak for, IBM. At IBM, I am the Large Scale System Architecture team leader for the IBM project (PERCS) funded under the DARPA HPCS program, and am also one of the lead performance analysts for HPC applications in the IBM POWER microprocessor development group.

John D. McCalpin, Ph.D.
john@mccalpin.com "Dr. Bandwidth"
http://www.streambench.org

