HPCwire
 The global publication of record for High Performance Computing / October 8, 2004: Vol. 13, No. 40

Previous Article   |  Table of Contents  |  

Features:

HIGH-END CRUSADER REFUTES BLUE GENE/L DEFENDER
Commentary from the High-End Crusader

An anonymous respondent, possibly an employee of the U.S. government with a special interest in Blue Gene/L, this week criticizes HEC's (short) article "High-End Crusader Questions IBM's 'Fastest' Computer Claim" [108466], which appeared in last week's HPCwire. For memory, the thesis of that article was that Blue Gene/Light is a special-purpose parallel computer that is only suited to applications that persistently inhabit a small portion of the 3-dimensional locality "phase space", in contrast to other applications that sweep out vast swaths of locality phase space during the same or different computations controlled by the same program.

The issues raised by the respondent, here dubbed "AR", touch on some of the central myths of the conventional wisdom on high-end computing. Since these myths play a major role in _enabling_ the continuing productivity crisis in high-end computing, their rational examination is of the highest importance.

AR begins by pointing out that HEC did not comment on the advanced packaging technology that allows Blue Gene/L to deliver the same peak arithmetic performance for significantly less cost, less power, and less space---in comparison to some other high-performance computers (the example we both used was the Japanese Earth Simulator). This is true. Working against a tight deadline, HEC did not discuss packaging at all, clearly a significant and praiseworthy achievement by the Blue Gene engineers. But the argument from cost is disingenuous, to say the least.

The consistent framework for HEC's argumentation over the last decade has been that there are _two_ distinct kinds of high-performance computer system. On the one hand, there are strongly parallel, high-bandwidth systems (capability systems, called "Type C" systems by Burton Smith), whose dominant cost lies in their (expensive) system interconnect. On the other hand, there are weakly parallel, low-bandwidth systems (capacity systems, called "Type T" systems by Burton Smith), whose dominant cost lies in their (cheap) transistors---in processors and memory. ("C" is for [inter]connect; "T" is for transistor).

Parallelism strength is determined by the number and kind of parallelism mechanisms in the processor. Such mechanisms are used to tolerate memory, network, and synchronization latencies, among others. While a Type-T system is sufficient to achieve a respectable Linpack R_max value, only a Type-C system can achieve a respectable Gups number or respectable performance on sparse matrix-vector multiply (in the general case).

Of course, the latest thinking about Type-C systems is that they should incorporate diverse, compatible locality mechanisms in order to complement their (strong) parallelism mechanisms. The new Type-C slogans are: "No parallelism left behind!" and "No locality left behind!".

The first punchline of the rebuttal, therefore, is that achieving Linpack performance is inexpensive: if you need only floating-point performance and very little in the way of memory access, communication, or synchronization, then you don't need to spend money on the routers, wires, connectors, flex circuits, etc., etc., that are necessary to move data around, not just compute rapidly with them if they happen to be close at hand.

In "Top500 And HPCC Benchmarks---What They Can And Can't Do" [107896]), HEC proposed an empirical theory of program-locality behavior. Locality "phase space" was construed as consisting of the 3-dimensional Cartesian product of: 1) short-range or long-range communication, 2) temporally local or not, and 3) spatially locally or not. (Analysis shows that not all eight phases are distinct). The idea was that, during a computation, programs migrate from one locality phase to another---in essence, making locality phase transitions--- very possibly tracing out different phase-space trajectories on different runs of the same program because of different inputs. The requirement was that parallel computation needs to be efficient in _every_ region of locality phase space.

Now, if someone wants to build a computer that specializes in one small corner of this phase space, this is fine---as long as (spurious) claims of having achieved general-purpose parallel computing do not damage big science, big national security, or big industry (such as energy or aerospace) by misleading people into thinking that they now have the mix of computers they need. There is nothing wrong with special-purpose parallel computers. But they can't replace general-purpose parallel computers, which---by definition---compute with high parallel efficiency throughout _all_ of locality phase space.

Used as the only benchmark, Linpack is largely uninformative because it is representative only of programs that lie at the origin of this locality phase space (Cartesian product), namely, in locality phase (short-range, temporally local, spatially local).

First Myth

Having warmed up, AR launches into defending a truly pernicious myth: that price/performance is a universably reasonable guideline for procuring high-end systems. This is flat wrong: price/performance can be justified only if the project is time insensitive, i.e., if the utility function that measures the value delivered to the project as a function of the elapsed time to solution is a constant independent of time.

But are utility functions constant? Certainly not, in general. Often a national-security utility function will be a step function, constant for 'x' amount of time and zero after that. In any situation where there is great value in knowing the solution as soon as possible (perhaps because expensive resources are sitting idle), we expect the utility function to monotonically decrease over time.

Intuitively, if time means nothing to you, then, by all means, buy a cheap computer. But if time means something to you, perhaps you should give some thought to absolute performance. Indeed, when time to solution matters, you should buy a capability computer. How much capability is reasonable to buy is determined by precisely _how_ important time to solution is to you.

More often than not, price/performance is the wrong procurement metric. What is needed is the ratio of the utility function to the cost function, as will now be explained. This rejoins some work done by Marc Snir for the PERCS project---funded by DARPA's HPCS program---when Snir was still at IBM.

Let us fix the application/algorithm 'A' and allow the system 'S' to vary. Determined by system 'S' (considering both programming and execution times), there will be a _given_ time to solution, denoted tts_S, required to compute application/algorithm 'A'. Let u(t) be the utility function and let c(t) be the cost function (for example, total cost of ownership from procurement to solution). By taking the ratio u(tts_S)/c(tts_S), we get some measure of the return on investment of buying and then programming system 'S' to compute application/algorithm 'A' relative to utility function u(t).

Note that letting system 'S' compute longer just decreases u(t) while increasing c(t)---because of the extra runtime---thus lowering the return on investment. The quantity tts_S is thus the _minimal_ time to solution for application 'A' on system 'S'. By maximizing the ratio u(tts_S)/c(tts_S) over all systems 'S', you maximize your return on investment.

One might think that being guided by the utility/cost ratio is mandated by high-end problems only. This would be a mistake. Utility/cost ratios should replace price/performance metrics even in time-sensitive mid-range projects. However, it is true that time-dependent utility functions with wide variations are more _visible_ in high-end projects.

The astute reader may ask if this utility/cost productivity metric assumes that we only run one application/algorithm 'A' on system 'S'. Not really. The general productivity crisis in high-end computing is that we have entered a period of decreasing productivity as measured by the increasing programming skill required, each time a significantly higher-capability machine becomes available to an organization, for that organization's programmers to rapidly make a deep impact on the organization's mission, over an extended period of time, by repeatedly developing new applications for the new machine. The utility/cost metric can easily be generalized to handle this case.

Again, price/performance _is_ the appropriate metric when the project is time insensitive. The logical flavor of all these myths is to assume that the particular subsumes the general.

A Dangerous Rival

AR then attacks the HPC Challenge benchmark suite itself, which is apparently a dangerous rival to the continued use of the Linpack benchmark alone, by claiming that the reporting procedures for HPCC results haven't been properly institutionalized. To be polite, your correspondent doesn't grasp what AR is driving at. In some sense, it is even beside the point: if the HPC Challenge benchmark suite didn't exist (or wasn't properly institutionalized), then we would have to invent it (or fix it up).

Benchmarks are used to estimate the performance of real-world applications on new computer systems. Arithmetic performance is paced by the rate at which data operands can be supplied to the processor. In turn, this rate is determined by the system, surely, but also by the algorithm, that is, by: 1) whether communication is short-range or long-range, 2) whether there is temporal locality or not, and 3) whether there is spatial locality or not. For example, is the program currently engaged in unit-stride or random memory accessing? Two questions are key. What is its current locality phase? What is its current computational rate?

These differences in memory-accessing patterns (cf. [107896]) mean that there are particular quantities of different types of work that are performed by the system at different computational rates. We can estimate the quantities of these different types of work by analyzing our application, but we have no idea of what the computational rate for performing work of type 'abc' on system 'S' is---unless we have a benchmark that measures how fast system 'S' performs work of type 'abc'. Multiple benchmark numbers are required precisely because programs are heterogeneous in their computational rates. In simple English, programs undergo transitions from one computational rate to another as the computation unfolds; a suite of carefully designed benchmarks is required to cover (provide information about) the main computational rates in order to estimate the performance of the entire program.

Linpack is representative of only _one_ of those computational rates, namely, the one that corresponds to extremely high local and global spatial and temporal locality. It is not that Linpack is embarrassingly parallel; rather, it is that the dense linear-algebra Linpack is embarrassingly localizable or, more precisely, _embarrassingly local_ after years of tuning.

Second Myth

Having hit his stride, AR introduces another pernicious myth, namely, that over time, all (or most) applications migrate to the origin of the locality phase space as the result of heroic efforts by algorithm designers. In "HEC Analysis: Who Needs High-Bandwidth Applications?" [108052], HEC studied ---starting from first principles---the prospects for success of the on-going efforts to "retool codes in application areas such as stockpile stewardship, global climate modeling, computational fluid dynamics, local and regional weather forecasting, aircraft design, cosmology, biomolecular simulation, materials science, and quantum chemistry to run more efficiently on MPP architectures", as the HECRTF report puts it (p. 63).

HEC offered a canonical high-bandwidth application that could not be retooled without loss of either resolution or acceptable runtime. To quote HEC:

"The following application structure occurs in at least two application areas: 1) climate simulation, and 2) high-resolution 3-D hydrodynamics with radiation transfer ('radiation hydro').

Imagine a simulation problem with a vast spectrum of space and time scales. For reasons of numerical stability, simulation of such a _stiff_ physical system may require an implicit finite-difference scheme. Convergence may demand such an approach; often, the reason is merely computational feasibilty.

Long-range communication will occur if the resulting time steps are big (the reason we chose the implicit scheme in the first place). Implicit methods for nonlinear PDEs solve a sparse linear system several times per time step. Often, the sparse matrix is not very well-conditioned. The intensive use ---once for each of _many_ linear systems---of iterative solvers with embedded sparse matrix-vector multiplies makes this simulation application massively bandwidth intensive.

In radiation hydro, the slow half of the calculation is the hydrodynamics calculation, which iterates over big time steps. The fast half is the radiosity calculation, which requires the solution of systems of nonlinear equations using Newton's method. This is iterated over small time steps. The Newton corrections are found by solving sparse systems of linear equations. Each big time step thus requires multiple invocations of a linear-system solver. These solvers are themselves iterative. The end result is that radiation hydro codes are massively bandwidth intensive.

System stiffness is one attribute that leads to high-bandwidth problems; a second attribute with the same result is system heterogeneity. For heterogeneous, e.g., multiphysics, problems, load imbalance is a serious issue. Irregular, adaptive meshes are another source of imbalance. To simplify, we could say that _stiffness_ and _adaptivity_ are the two primary sources of high-bandwidth computational fluid dynamics problems (more generally, problems of scientific simulation that depend on partial differential equations)".

Your correspondent does not deny that amazing algorithmic progress has occurred---in some instances. Realistically, though, what are the chances that there is a general trend of algorithm changes that localize applications? There is an extensive history of that not happening. Hard experience gained over a decade or more has made it abundantly clear that there are many applications that simply do not migrate to the desired small corner of locality phase space. These are the Type-C applications, characterized by significant fine-grained long-range communication, poorly balanced workloads, sparse linear algebra, implicit methods, irregular, adaptive meshes, and in which the programming effort is _not_ amortizable.

An Outrageous Suggestion

Finally, having implicitly granted that high-bandwidth systems may be easier to program, AR makes the truly outrageous suggestion that it may be a good thing that our high-performance systems are so burdensome to program. This is ludicrous and wrong and pernicious. This attitude is more than enough to guarantee that general-purpose parallel computing will never enter the computing mainstream. Indeed, it is inimical to the evolution that parallel computing must undergo if computing as a whole hopes to ever tackle those Grand Challenge problems that were once our goal but which have been quietly de-emphasized as we repeatedly fail to solve them. In part, the Grand Challenge problems are high-bandwidth problems that cannot be computed on low-bandwidth systems.

AR's language is mind boggling: "Have we applied the right computer science to these applications? ... The debate then is whether public funds should be used to train better computer scientists who can move these applications down to run on cheap machines, or whether it should go to build expensive machines that can accomodate inexperienced programmers and tolerate sloppy algorithms/programming?".

Gifted computational scientists with solid experience at the national labs, and exposure to several instances of both high-bandwidth and low-bandwidth systems (often implementing the same project on widely diverse machines), have reported their intense frustration in spending months on performance tuning for low-bandwidth machines with very mediocre performance improvement. Again, the productivity crisis in high-end computing is that, as time passes, we require higher and higher degrees of programming skill---present in smaller and smaller fractions of our elite workforce---to extract proportionate value from each major architectural advance in high-end systems.

Significant mission impact comes from creating fresh applications to compute new things that have never been computed before, and the time to program and tune these fresh applications is increasing. The reason is that we don't have a good solution to the problem of parallel programming. Our systems don't support good programming abstractions and we don't have a good programming language to make these abstractions persistent.

Some observers go so far as to report, for the highest-end machines, a decreasing number of users of these systems, a decreasing number of people writing programs for them, a decreasing number of new programs written for them per unit time, and a decreasing number of institutions that care.

We all agree that the true high-end market has shrunk. The applications are out there, but by and large users have abandoned their attempts to program them. Algorithms are becoming more sophisticated. The chief result of sophistication is increased complexity. No programmer can deal with this complexity unless architectures and language systems reduce the burdens of parallel programming to the point where skilled computational scientists and software engineers have at least a fighting chance to rapidly develop fresh applications that provide deep mission impact---in part because it is not a sisyphean task (cf. Greek mythology) to achieve acceptable performance.

It is not our sloppy programmers who are complaining (or leaving the field); it is our _best_ programmers.

Conclusion

There are two consistent and diametrically opposed worldviews here. The simplest distinction is that one view says that computing needs _both_ Type-T _and_ Type-C systems, while the other view hopes to get by with Type-T systems alone. The other divergences follow. 1) One view tries to complement Linpack with the HPC Challenge benchmark suite; the other view attacks HPCC. 2) One view introduces utility/cost as a productivity metric; the other view insists we stick with price/performance. 3) One view stresses the difficulty of localizing demanding applications (considering locality in all its forms, including load balancing); the other view claims that all (or most) applications eventually localize, if only we try hard enough. 4) One view bemoans the difficulty of parallel programming; the other view exults in it.

Faced with such a comprehensive (and misguided) ideology, your correspondent almost doesn't know where to begin. He only says: Look at the record; history is the best guide.


The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for the opinions expressed in this article. For all articles, replies are welcome and may be sent to HPCwire editor Tim Curns at tim@hpcwire.com.


Top of Page

Previous Article   |  Table of Contents  |