HPCwire
 The global publication of record for High Performance Computing / April 16, 2004: Vol. 13, No. 15

  |  Table of Contents  |  

Features:

HEC ANALYSIS: THE HIGH-END COMPUTING PRODUCTIVITY CRISIS
Commentary from the High-End Crusader

Eventually, some of the most important pillars of homeland defense will come to depend on innovative parallel computer architectures that have not been invented yet and that may never be invented within the United States.

Two obvious examples of hard, nearly impossible, computational problems upon which the survival of this nation may come to depend are: 1) increasingly sophisticated target discrimination in full-scale national missile defense (NMD), as systems with a given resolution hand off suspected targets to systems with higher resolution, and 2) near-real-time manipulation of huge irregular sparse knowledge-representation graphs in counterterrorism data mining (CDM), as massive incoming streams of global raw intelligence data are fused and mined---in a semi-automated fashion---for subtle patterns indicative of potential foreign terrorists and on-going foreign-terrorist activity.

Most defense analysts see long-range ballistic missiles deployed by rogue--- or simply adversarial---states, and quasi-apocalyptic Sunni Islamic terrorism of the al-Qaeda variety, as two of the major long-term existential threats to the United States, both exacerbated by the quasi-controlled proliferation of weapons of mass destruction.

If the senior technical advisers in this administration viewed the high-end computing issues bound up in such security concerns as not just important but _urgent_, they might be less complacent about the time still remaining to stand up an Integrated High-End Computing Research and Development Initiative. Given the woefully anemic state of the HEC pipeline, only such an initiative could enable a steady stream of innovative parallel computer architectures to come on-line _repeatedly_ as they are required by the nation. The HECRTF final report, and the recent fine reports that preceded it, are excellent places to start.

Alas, there is no consensus here. The forward-looking ("progressive") camp agrees that COTS-based high-performance computing systems are appropriate to many computational problems, and should be used whenever possible. They stress, however, that commercial market forces have seriously unbalanced the field of high-end computing, and urge sustained countervailing federal investment in Custom-enabled computer systems to rebalance the field. The backward-looking ("conservative") camp believes that COTS-based systems are, or will be, sufficient to solve the vast majority of computational problems, and see no pressing need for fundamental reform. Some of these issues have been discussed in an earlier HPCwire article [107292, March 26].

High-end conservatives tend to reuse the same counter-argument to deny that major change is required. Every time a high-end progressive describes some demanding application that cannot be computed on _any_ of the machines that pass for high-end systems these days, the conservatives reply, how many people need a computer that fast? A better question is, how many people need a computer that fast so much that they are willing to pay that much for it? Better still, ask the equally important question, how programmable will the computer be when they get it?

A reader (Rob Schreiber) inadvertently drew attention to programmability issues when he wrote in recently to inform the readership of the latest fancy ways to order rows and columns of matrices when using compressed row storage (CRS) format to perform sparse matrix-vector multiply. The point here is that it is often unclear, to a non-specialist, what approach might be best for a given application. Insisting that a programmer be up to date in the latest mathematical methods required to achieve good performance on cache-based machines is a good example of the _absence_ of programmability. (A general treatment of sparse-matrix operations, and their varying needs for random memory accessing, must wait for another occasion).

Senior (high-end conservative) technical advisers in this administration are simply camouflaging our nation's HEC slippage when they argue that only a few niche users require exceptional Custom-enabled parallel architectures. If this were so, then funding R&D to ensure their steady availability for national needs wouldn't be a national imperative, would it? High-end progressives must make the case that almost everybody needs a computer that fast, and that new applications for computers---which can have a _profound_ impact on the organizations that compute them---tend to spring into existence when they become possible on new parallel computing systems with vastly improved performance and vastly improved programmability.

HEC Productivity Is Sorely Lacking

The raison d'etre for DARPA's High-Productivity Computing Systems (HPCS) program is the widely shared (and totally accurate) perception that we have entered a period of decreasing productivity as measured by the increasing programming skill required, each time a significantly higher-capability machine becomes available to an organization, for that organization's programmers to rapidly make a deep impact on the organization's mission, over an extended period of time, by repeatedly developing new applications for the new machine.

Whether our task is cracking a cryptographic code, incorporating new physics into a simulation, or detecting elusive targets, the real output of high-end computing, i.e., its ultimate value to the mission, is increased insight or understanding. In all the interesting cases, fast time to insight requires getting a new application up and running (the programming time), waiting for it to run (the execution time), and finally interpreting the results (the interpretation time). Since we are getting machines on a reasonably regular basis with increased raw arithmetic performance, why is our productivity going down? Is it something more fundamental than just poor application performance?

We have productivity problems because the time to program these new machines is increasing. Porting an old application that runs well on an old machine to run slightly better on a new but similar machine is not the issue here; significant mission impact comes from creating fresh applications to compute new things that have never been computed before, and the time to program and tune these fresh applications is increasing. The reason is that we don't have a good solution to the problem of parallel programming. Our systems don't support good programming abstractions and we don't have a good programming language to make these abstractions persistent.

Some observers go so far as to report, for the highest-end machines, a decreasing number of users of these systems, a decreasing number of people writing programs for them, a decreasing number of new programs written for them per unit time, and a decreasing number of institutions that care.

All these trends are deeply disturbing. Consider the following. Roughly speaking, every four to five years we get a new computer that dramatically increases our computational capability. Often it takes eighteen months to two years before we start writing new applications for the new machine that have a real impact on the mission. The delay until the "impact zone" is too long and and growing longer. The length of the impact zone---the interval during which major new applications are written---is too short. And, for whatever reasons, the depth of the impact zone is too shallow.

There is considerable evidence that, as time passes, we require higher and higher degrees of programming skill to extract proportionate value from each major architectural advance in high-end systems. It is quite clear that the HEC productivity crisis goes beyond poor application performance to include an HEC _programming_ crisis.

Often, the system's poor programmability and unstable performance make the programmer's job exceedingly difficult. General-purpose architectures should give predictable performance. A compiler cannot do a good job of generating code for a system whose performance is hard to model or wildly variable when perturbed. Certainly the programmer cannot fix this. Sensitivity to slight changes in memory-access patterns that affect memory contention is perhaps the most common cause of irregular performance.

The demanding problems of the future require performance and programmability in equal measure.

A Demanding Defense Problem

One defense application that relies critically on high-performance computing is comprehensive aerospace vehicle design. For example, this is necessary for the design of the F-35 Joint Strike Fighter (JSF). This reliance will only accelerate. Future aerospace development programs will involve hypersonic capabilities requiring more comprehensive physics models for accurate simulation in these harsh flight regimes.

Two distinct types of computational science are required. Computational Fluid Dynamics (CFD) is used in engineering design of complex flow configurations, including external airflow, and for predicting the interactions of chemistry with fluid flow for combustion and propulsion. Computational Electromagnetics (CEM) is used to compute electromagnetic signatures of tactical ground, air, sea, and space vehicles.

Currently, we have the capability to model the external airflow, propulsion performance, vehicle signature, and materials properties in vehicle design with reasonable predictive accuracy on current systems _provided_ these aspects are computed independently. But what is desired is the ability to combine these independent modeling efforts into an interactive modeling capability that would account for the interplay among model components. For example, engineers could quickly see the effect of proposed changes in the propulsion design on the vehicle stealth signature. What exceptional performance and what exceptional programmability are _jointly_ required to to enable a fine-grained full-airframe combined CFD and CEM simulation of the Joint Strike Fighter?

Such capabilities are wholly beyond current systems.

Productivity Models And Metrics

DARPA's High-Productivity Computing Systems (HPCS) program has challenged the high-end community to design a framework of quantifiable productivity metrics that could be used to assess the ultimate value delivered to an end-user mission by a particular high-end hardware/software computing system. This framework (or model) should explain the productivity (or lack thereof) of existing machines, predict the productivity of future machines, and lead to strategies that would allow us to design for productivity.

Early on, time to solution (of a computational problem) was recognized as a key productivity metric. High system productivity can be achieved by first providing the appropriate computational resources, e.g., bandwidth, and then designing parallelism and locality mechanisms that allow these resources to be used efficiently, which minimizes execution time. High human productivity can be achieved by providing appropriate high-level programming models and language systems that allow human resources to be used efficiently, which minimizes programming time.

Are strategies for efficient use of human and system resources the only way to enhance productivity? Not necessarily. The logo for the annual Salishan Conference on High-Speed Computing is a triangle whose three sides are labeled: "Algorithms", "Architecture", and "Language". Does this triangle contain a productivity clue?

Important top-level productivity factors include: 1) application performance, i.e., wall-clock time, 2) ease and speed of application construction, i.e., efficient use of human capital, 3) efficient use of critical system resources, and 4) machine availability and reliability. For example, we minimize time to solution by maximizing both system efficiency (execution time) and programmer efficiency (programming time).

At first glance, algorithms---unlike, say, programmers and systems---are not resources that can be used efficiently. It would be pleasing to characterize the productivity of a hardware/software computing system as the product of 1) its algorithm productivity, 2) its system productivity, and 3) its human productivity. The algorithm productivity of a computing system is the degree to which it permits algorithms of widely-varying algorithmic characteristics; a highly-productive computing system allows, without performance penalty, the use of any algorithm that might be appropriate to solve the problem.

The most demanding application/algorithms conceivable are those with frequent irregular fine-grained long-range communication, frequent fine-grained synchronization, and load-balancing issues. Full-scale climate simulation _should_ be solved with an implicit finite-difference scheme, which necessarily leads to long-range communication. The multi-physics character of the problem leads to load-balancing issues. If your machine forces you to "constrain" your climate algorithm to remove both long-range communication and load balancing, then it is degrading your scientific productivity.

Change the triangle's labels. "Algorithm" becomes "algorithm/object code"; "language" becomes "programming model/language system"; and "architecture" becomes "architecture/runtime". What interactions among these three components best enhance productivity?

A good architecture/runtime implements a clean programming model at a fairly low level of abstraction for the compiler, but also for the heroic "parallel assembly-language" programmer, giving great flexibility to both. A good programming language embodies, and a good language system implements, a clean programming model at a much higher level of abstraction for the programmer who prefers to "work" the parallelizing compiler hard.

Cray MTA programs are largely sequential loops written without concern for data distribution. Parallel code is produced for each loop by the compiler. The compiler generates code to acquire multiple threads and distribute iterations of the loops to those threads. This makes it easy to write and tune programs. In the Cray Cascade project, the compiler decomposes programs into threads to extract parallelism, and shapes operations into threads to concentrate locality. New programming abstractions (a new programming model) must be designed to match Cascade's new execution model.

Good object code utilizes the system's hardware resources well and in turn is executed efficiently by them. How so? Algorithms are either Type T (i.e., limited communication and well-balanced workloads) or Type C (i.e., long-range communication and poorly-balanced workloads). Similarly, systems are either Type T (i.e., weakly-parallel processors and low-bandwidth system interconnect) or Type C (i.e., strongly-parallel processors and high-bandwidth system interconnect). "Parallel" means able to sustain high instruction-issue bandwidth, including memory-reference issue bandwidth, in the face of all forms of latency. Processor parallelism (e.g., vectors, threads, streams) is the only source of memory-reference concurrency in the network/memory pipeline.

Imagine running Type-C code on a Type-T machine. Starved for data, the processors stall, using the dominant-cost transistors very poorly. Now, imagine running Type-T code on a Type-C machine. With little need to communicate, bandwidth is not required, using the dominant-cost wires very poorly.

A machine executes an algorithm efficiently when its architecture facilitates the extraction and exploitation, for greater performance, of all forms and granularities of parallelism and locality. The architectural features of the Cray MTA were expressly designed to facilitate decomposition of programs into large numbers of threads, which in turn are sufficient to fill all of the MTA's streams (a stream is a hardware resource that executes threads). The MTA focuses on using the parallelism in a program well. Cascade improves on executing algorithms efficiently by introducing locality strategies that both extend standard techniques for using temporal locality well and invent innovative techniques for using spatial locality well.

A Cautionary (Productivity) Tale

Many people are trying to design better benchmarks for high-end computing. While there are proposals for programs that engage in different communication behaviors (short-range or long-range, regular or irregular), we don't have a good benchmark for synchronization behavior. One suggestion is to use parallel list ranking. This is a basic problem in list processing that also arises in other contexts, such as computing tree functions and solving graph- theoretic problems. The problem is as follows.

We are given a linked list of objects and are asked to compute the distance of each node from the end of the list.

There is a standard PRAM algorithm for this problem in which one processor is assigned to each object and all processors access shared memory in lock step. Each node in the list stores a 'dist' value in which the distance to the end of the list will be recorded, and a 'next' value that indicates the successor node. The initial value of 'dist' is 1 in every object except the tail object, and 0 in the tail object.

The PRAM algorithm works by having each object (processor) "splice out" its successor object from the list after adding the 'dist' value of the successor to the 'dist' value of the object. Given unbounded machine parallelism, the PRAM algorithm ranks the list in logarithmic time.

The correctness invariant is that the sum of the 'dist' values in any sublist headed by object 'j' is the distance of 'j' from the end of the original list. The performance invariant is that each pointer-jumping step (iteration) doubles the number of lists and halves their lengths. If P is the number of objects, then after log P iterations, each list contains precisely one object.

We need to program this algorithm, or some other list-ranking algorithm, for an actual parallel machine.

Your correspondent's default programming model is that of the Cray MTA. That model may be summarized as: 1) an abundance of cheap flexible parallelism, 2) a flat shared memory, and 3) an abundance of cheap flexible synchronization. In particular, inexpensive future variables are implemented by full-empty bits.

Instead of wisely allowing the parallelizing compiler to discover and express the parallelism that would have been implicit in an appropriately chosen serial algorithm, your correspondent impetuously decided to take full responsibility for thread scheduling and resource management. In short, he decided to program the raw machine in parallel assembly language,

The list is given in the 0-th row of a 2-dimensional log P x P array 'pi' such that pi[0,j] indicates the parent node of node 'j' in the (degenerate) tree. Thread 'j' is assigned to list object 'j'. By supposing a notional dummy tail node 'root', each iteration creates a new rooted directed tree that is twice as bushy as the previous tree (the lists hang down from 'root').

In the MTA implementation proposed here, 'pi' is an array of future variables, which are initially undefined. Thread 'j' loops at most log P times, defining the 'pi' and 'ds' fields of object 'j' in the new tree that results from each iteration. If the thread responsible for parent(j) in the new tree is slow, thread 'j' suspends, waiting for the 'pi' and 'ds' values it needs to become defined. Upon termination, the rank of node 'j' is in the 'dist' variable of thread 'j'.

Here is the code for thread 'j' (with comments):

   'shared' 'synch' pi 'int' 'array'[0..logP,1..P] := undefined if row <> 0;
   'shared'         ds 'int' 'array'[0..logP,1..P];
   'private'        tree 'int' := 0;               // tree 0 is the original
   'private'        dist 'int' := ds[0,j];         // list with defined 'pi'
   'private'        next, next.next 'int';         // and initial 'ds' values

   'do' pi[tree,j] <> root --->
      next         := pi[tree,j]                   // parent(j) in this tree
      next.next    := pi[tree,next]                // this read may block
      dist         += ds[tree,next]
      ds[tree+1,j] := dist                         // the new tree is defined
      pi[tree+1,j] := next.next                    // in the next row
      tree++
   'od'

In both 'pi' and 'ds', each of n = P log P - P + 1 distinct array locations is written precisely once and read precisely once.

What is the problem? This code is marvelously efficient, indeed, logarithmic, provided that the number of instruction streams available to support threads is at least as large as the number of list objects. However, it is woefully inefficient if the list is much longer than the number of available streams. In the latter case, it is wiser to divide the list into 's' sublists ('s' is the number of streams), express the list ranking of each sublist as a serial algorithm, let the compiler schedule the cyclic reduction of the (temporary) ranks of the heads of the sublists, and express the final ranking of the nodes as a serial algorithm. All told, your correspondent didn't get very far in coding a scalable synchronization benchmark.

The real programming difficulty arose because the processor resource was not fully virtualized; the programmer (your correspondent) had to tailor the algorithm as a function of the available stream-level parallelism. In truth, for the coded algorithm and for very long lists, the stream-level parallelism in the MTA is insufficiently abundant (we really need the _surabundance_ of multithreaded lightweight processors in Cascade).

HEC Programming-Language Design Is The Key

We need a full-scale language-design effort to find a good solution to the problem of parallel programming. We need to find the right programming abstractions and, to provide continuity to the programmer, tie them together in a _successful_ higher-level programming language. Parallel programming is hard for many reasons. One important reason is location sensitivity. This forces programmers to explicitly manage the mapping of abstract computations onto physically distributed hardware resources. What is true of locality is also true of parallelism. Indeed, any resource that is not sufficiently virtualized will force the programmer to pay attention to that resource for performance reasons.

Programming languages embody programming models. The key is achieving a sufficiently high degree of abstraction without suffering performance degradation. Architectural changes can help with language performance, especially with communication and synchronization, but we need to find the right abstractions.

Some of the basic questions are clear. What are the joint responsibilities of programmer and compiler to specify/generate parallelism? What are the joint responsibilities of programmer and compiler to specify/generate locality? What are the appropriate communication and synchronization semantics? What are the performance characteristics of the language primitives?

Work has begun on the Cascade High-Productivity Language (CHPL). Some of the reported design criteria are to provide support for explicit parallel programming, locality-aware programming, and generic programming. There is both a concrete language that enhances HPF and ZPL, and an abstract language that supports generic programming. (In part, generic programming is a technique for generalizing software components so that they can be reused in a wide variety of situations).

A good language can make parallel programming easier. If we want a bigger market for high-performance computing, we have to make our high-end systems easier to use. If we want to address vital national needs, we have to make our high-end systems more productive. Becoming productive is just a matter of reversing current negative trends. We want the freedom to choose the appropriate algorithm for our problem; we want the system to extract the maximum possible application performance out of each quantum of use of the system's critical resources; and finally we want parallel programming to be as high-level and as abstract as sequential programming _ever_ was.


The High-End Crusader, a noted expert in high-performance computing and communications, wishes to remain anonymous. The opinions expressed here are his and his alone. All comments are welcome and may be sent to the Editor-in- Chief, Alan Beck, at alan@tgc.com.


Top of Page

  |  Table of Contents  |