
Features:
HEC ANALYSIS: THE HIGH-END COMPUTING PRODUCTIVITY CRISIS
Commentary from the High-End Crusader
Eventually, some of the most important pillars of homeland defense will come
to depend on innovative parallel computer architectures that have not been
invented yet and that may never be invented within the United States.
Two obvious examples of hard, nearly impossible, computational problems upon
which the survival of this nation may come to depend are: 1) increasingly
sophisticated target discrimination in full-scale national missile defense
(NMD), as systems with a given resolution hand off suspected targets to
systems with higher resolution, and 2) near-real-time manipulation of huge
irregular sparse knowledge-representation graphs in counterterrorism data
mining (CDM), as massive incoming streams of global raw intelligence data are
fused and mined---in a semi-automated fashion---for subtle patterns indicative
of potential foreign terrorists and on-going foreign-terrorist activity.
Most defense analysts see long-range ballistic missiles deployed by rogue---
or simply adversarial---states, and quasi-apocalyptic Sunni Islamic terrorism
of the al-Qaeda variety, as two of the major long-term existential threats to
the United States, both exacerbated by the quasi-controlled proliferation of
weapons of mass destruction.
If the senior technical advisers in this administration viewed the high-end
computing issues bound up in such security concerns as not just important but
_urgent_, they might be less complacent about the time still remaining to
stand up an Integrated High-End Computing Research and Development Initiative.
Given the woefully anemic state of the HEC pipeline, only such an initiative
could enable a steady stream of innovative parallel computer architectures to
come on-line _repeatedly_ as they are required by the nation. The HECRTF final
report, and the recent fine reports that preceded it, are excellent places to
start.
Alas, there is no consensus here. The forward-looking ("progressive") camp
agrees that COTS-based high-performance computing systems are appropriate to
many computational problems, and should be used whenever possible. They
stress, however, that commercial market forces have seriously unbalanced the
field of high-end computing, and urge sustained countervailing federal
investment in Custom-enabled computer systems to rebalance the field. The
backward-looking ("conservative") camp believes that COTS-based systems are,
or will be, sufficient to solve the vast majority of computational problems,
and see no pressing need for fundamental reform. Some of these issues have
been discussed in an earlier HPCwire article [107292, March 26].
High-end conservatives tend to reuse the same counter-argument to deny that
major change is required. Every time a high-end progressive describes some
demanding application that cannot be computed on _any_ of the machines that
pass for high-end systems these days, the conservatives reply, how many people
need a computer that fast? A better question is, how many people need a
computer that fast so much that they are willing to pay that much for it?
Better still, ask the equally important question, how programmable will the
computer be when they get it?
A reader (Rob Schreiber) inadvertently drew attention to programmability
issues when he wrote in recently to inform the readership of the latest fancy
ways to order rows and columns of matrices when using compressed row storage
(CRS) format to perform sparse matrix-vector multiply. The point here is that
it is often unclear, to a non-specialist, what approach might be best for a
given application. Insisting that a programmer be up to date in the latest
mathematical methods required to achieve good performance on cache-based
machines is a good example of the _absence_ of programmability. (A general
treatment of sparse-matrix operations, and their varying needs for random
memory accessing, must wait for another occasion).
Senior (high-end conservative) technical advisers in this administration are
simply camouflaging our nation's HEC slippage when they argue that only a few
niche users require exceptional Custom-enabled parallel architectures. If this
were so, then funding R&D to ensure their steady availability for national
needs wouldn't be a national imperative, would it? High-end progressives must
make the case that almost everybody needs a computer that fast, and that new
applications for computers---which can have a _profound_ impact on the
organizations that compute them---tend to spring into existence when they
become possible on new parallel computing systems with vastly improved
performance and vastly improved programmability.
HEC Productivity Is Sorely Lacking
The raison d'etre for DARPA's High-Productivity Computing Systems (HPCS)
program is the widely shared (and totally accurate) perception that we have
entered a period of decreasing productivity as measured by the increasing
programming skill required, each time a significantly higher-capability
machine becomes available to an organization, for that organization's
programmers to rapidly make a deep impact on the organization's mission, over
an extended period of time, by repeatedly developing new applications for the
new machine.
Whether our task is cracking a cryptographic code, incorporating new physics
into a simulation, or detecting elusive targets, the real output of high-end
computing, i.e., its ultimate value to the mission, is increased insight or
understanding. In all the interesting cases, fast time to insight requires
getting a new application up and running (the programming time), waiting for
it to run (the execution time), and finally interpreting the results (the
interpretation time). Since we are getting machines on a reasonably regular
basis with increased raw arithmetic performance, why is our productivity going
down? Is it something more fundamental than just poor application
performance?
We have productivity problems because the time to program these new machines
is increasing. Porting an old application that runs well on an old machine to
run slightly better on a new but similar machine is not the issue here;
significant mission impact comes from creating fresh applications to compute
new things that have never been computed before, and the time to program and
tune these fresh applications is increasing. The reason is that we don't have
a good solution to the problem of parallel programming. Our systems don't
support good programming abstractions and we don't have a good programming
language to make these abstractions persistent.
Some observers go so far as to report, for the highest-end machines, a
decreasing number of users of these systems, a decreasing number of people
writing programs for them, a decreasing number of new programs written for
them per unit time, and a decreasing number of institutions that care.
All these trends are deeply disturbing. Consider the following. Roughly
speaking, every four to five years we get a new computer that dramatically
increases our computational capability. Often it takes eighteen months to two
years before we start writing new applications for the new machine that have a
real impact on the mission. The delay until the "impact zone" is too long and
and growing longer. The length of the impact zone---the interval during which
major new applications are written---is too short. And, for whatever reasons,
the depth of the impact zone is too shallow.
There is considerable evidence that, as time passes, we require higher and
higher degrees of programming skill to extract proportionate value from each
major architectural advance in high-end systems. It is quite clear that the
HEC productivity crisis goes beyond poor application performance to include an
HEC _programming_ crisis.
Often, the system's poor programmability and unstable performance make the
programmer's job exceedingly difficult. General-purpose architectures should
give predictable performance. A compiler cannot do a good job of generating
code for a system whose performance is hard to model or wildly variable when
perturbed. Certainly the programmer cannot fix this. Sensitivity to slight
changes in memory-access patterns that affect memory contention is perhaps the
most common cause of irregular performance.
The demanding problems of the future require performance and programmability
in equal measure.
A Demanding Defense Problem
One defense application that relies critically on high-performance computing
is comprehensive aerospace vehicle design. For example, this is necessary for
the design of the F-35 Joint Strike Fighter (JSF). This reliance will only
accelerate. Future aerospace development programs will involve hypersonic
capabilities requiring more comprehensive physics models for accurate
simulation in these harsh flight regimes.
Two distinct types of computational science are required. Computational Fluid
Dynamics (CFD) is used in engineering design of complex flow configurations,
including external airflow, and for predicting the interactions of chemistry
with fluid flow for combustion and propulsion. Computational Electromagnetics
(CEM) is used to compute electromagnetic signatures of tactical ground, air,
sea, and space vehicles.
Currently, we have the capability to model the external airflow, propulsion
performance, vehicle signature, and materials properties in vehicle design
with reasonable predictive accuracy on current systems _provided_ these
aspects are computed independently. But what is desired is the ability to
combine these independent modeling efforts into an interactive modeling
capability that would account for the interplay among model components. For
example, engineers could quickly see the effect of proposed changes in the
propulsion design on the vehicle stealth signature. What exceptional
performance and what exceptional programmability are _jointly_ required to to
enable a fine-grained full-airframe combined CFD and CEM simulation of the
Joint Strike Fighter?
Such capabilities are wholly beyond current systems.
Productivity Models And Metrics
DARPA's High-Productivity Computing Systems (HPCS) program has challenged the
high-end community to design a framework of quantifiable productivity metrics
that could be used to assess the ultimate value delivered to an end-user
mission by a particular high-end hardware/software computing system. This
framework (or model) should explain the productivity (or lack thereof) of
existing machines, predict the productivity of future machines, and lead to
strategies that would allow us to design for productivity.
Early on, time to solution (of a computational problem) was recognized as a
key productivity metric. High system productivity can be achieved by first
providing the appropriate computational resources, e.g., bandwidth, and then
designing parallelism and locality mechanisms that allow these resources to be
used efficiently, which minimizes execution time. High human productivity can
be achieved by providing appropriate high-level programming models and
language systems that allow human resources to be used efficiently, which
minimizes programming time.
Are strategies for efficient use of human and system resources the only way to
enhance productivity? Not necessarily. The logo for the annual Salishan
Conference on High-Speed Computing is a triangle whose three sides are
labeled: "Algorithms", "Architecture", and "Language". Does this triangle
contain a productivity clue?
Important top-level productivity factors include: 1) application performance,
i.e., wall-clock time, 2) ease and speed of application construction, i.e.,
efficient use of human capital, 3) efficient use of critical system resources,
and 4) machine availability and reliability. For example, we minimize time to
solution by maximizing both system efficiency (execution time) and programmer
efficiency (programming time).
At first glance, algorithms---unlike, say, programmers and systems---are not
resources that can be used efficiently. It would be pleasing to characterize
the productivity of a hardware/software computing system as the product of 1)
its algorithm productivity, 2) its system productivity, and 3) its human
productivity. The algorithm productivity of a computing system is the degree
to which it permits algorithms of widely-varying algorithmic characteristics;
a highly-productive computing system allows, without performance penalty, the
use of any algorithm that might be appropriate to solve the problem.
The most demanding application/algorithms conceivable are those with frequent
irregular fine-grained long-range communication, frequent fine-grained
synchronization, and load-balancing issues. Full-scale climate simulation
_should_ be solved with an implicit finite-difference scheme, which
necessarily leads to long-range communication. The multi-physics character of
the problem leads to load-balancing issues. If your machine forces you to
"constrain" your climate algorithm to remove both long-range communication and
load balancing, then it is degrading your scientific productivity.
Change the triangle's labels. "Algorithm" becomes "algorithm/object code";
"language" becomes "programming model/language system"; and "architecture"
becomes "architecture/runtime". What interactions among these three
components best enhance productivity?
A good architecture/runtime implements a clean programming model at a fairly
low level of abstraction for the compiler, but also for the heroic "parallel
assembly-language" programmer, giving great flexibility to both. A good
programming language embodies, and a good language system implements, a clean
programming model at a much higher level of abstraction for the programmer who
prefers to "work" the parallelizing compiler hard.
Cray MTA programs are largely sequential loops written without concern for
data distribution. Parallel code is produced for each loop by the compiler.
The compiler generates code to acquire multiple threads and distribute
iterations of the loops to those threads. This makes it easy to write and
tune programs. In the Cray Cascade project, the compiler decomposes programs
into threads to extract parallelism, and shapes operations into threads to
concentrate locality. New programming abstractions (a new programming model)
must be designed to match Cascade's new execution model.
Good object code utilizes the system's hardware resources well and in turn is
executed efficiently by them. How so? Algorithms are either Type T (i.e.,
limited communication and well-balanced workloads) or Type C (i.e., long-range
communication and poorly-balanced workloads). Similarly, systems are either
Type T (i.e., weakly-parallel processors and low-bandwidth system
interconnect) or Type C (i.e., strongly-parallel processors and high-bandwidth
system interconnect). "Parallel" means able to sustain high instruction-issue
bandwidth, including memory-reference issue bandwidth, in the face of all
forms of latency. Processor parallelism (e.g., vectors, threads, streams) is
the only source of memory-reference concurrency in the network/memory
pipeline.
Imagine running Type-C code on a Type-T machine. Starved for data, the
processors stall, using the dominant-cost transistors very poorly. Now,
imagine running Type-T code on a Type-C machine. With little need to
communicate, bandwidth is not required, using the dominant-cost wires very
poorly.
A machine executes an algorithm efficiently when its architecture facilitates
the extraction and exploitation, for greater performance, of all forms and
granularities of parallelism and locality. The architectural features of the
Cray MTA were expressly designed to facilitate decomposition of programs into
large numbers of threads, which in turn are sufficient to fill all of the
MTA's streams (a stream is a hardware resource that executes threads). The
MTA focuses on using the parallelism in a program well. Cascade improves on
executing algorithms efficiently by introducing locality strategies that both
extend standard techniques for using temporal locality well and invent
innovative techniques for using spatial locality well.
A Cautionary (Productivity) Tale
Many people are trying to design better benchmarks for high-end computing.
While there are proposals for programs that engage in different communication
behaviors (short-range or long-range, regular or irregular), we don't have a
good benchmark for synchronization behavior. One suggestion is to use
parallel list ranking. This is a basic problem in list processing that also
arises in other contexts, such as computing tree functions and solving graph-
theoretic problems. The problem is as follows.
We are given a linked list of objects and are asked to compute the distance of
each node from the end of the list.
There is a standard PRAM algorithm for this problem in which one processor is
assigned to each object and all processors access shared memory in lock step.
Each node in the list stores a 'dist' value in which the distance to the end
of the list will be recorded, and a 'next' value that indicates the successor
node. The initial value of 'dist' is 1 in every object except the tail
object, and 0 in the tail object.
The PRAM algorithm works by having each object (processor) "splice out" its
successor object from the list after adding the 'dist' value of the successor
to the 'dist' value of the object. Given unbounded machine parallelism, the
PRAM algorithm ranks the list in logarithmic time.
The correctness invariant is that the sum of the 'dist' values in any sublist
headed by object 'j' is the distance of 'j' from the end of the original list.
The performance invariant is that each pointer-jumping step (iteration)
doubles the number of lists and halves their lengths. If P is the number of
objects, then after log P iterations, each list contains precisely one object.
We need to program this algorithm, or some other list-ranking algorithm, for
an actual parallel machine.
Your correspondent's default programming model is that of the Cray MTA. That
model may be summarized as: 1) an abundance of cheap flexible parallelism, 2)
a flat shared memory, and 3) an abundance of cheap flexible synchronization.
In particular, inexpensive future variables are implemented by full-empty
bits.
Instead of wisely allowing the parallelizing compiler to discover and express
the parallelism that would have been implicit in an appropriately chosen
serial algorithm, your correspondent impetuously decided to take full
responsibility for thread scheduling and resource management. In short, he
decided to program the raw machine in parallel assembly language,
The list is given in the 0-th row of a 2-dimensional log P x P array 'pi' such
that pi[0,j] indicates the parent node of node 'j' in the (degenerate) tree.
Thread 'j' is assigned to list object 'j'. By supposing a notional dummy tail
node 'root', each iteration creates a new rooted directed tree that is twice
as bushy as the previous tree (the lists hang down from 'root').
In the MTA implementation proposed here, 'pi' is an array of future variables,
which are initially undefined. Thread 'j' loops at most log P times, defining
the 'pi' and 'ds' fields of object 'j' in the new tree that results from each
iteration. If the thread responsible for parent(j) in the new tree is slow,
thread 'j' suspends, waiting for the 'pi' and 'ds' values it needs to become
defined. Upon termination, the rank of node 'j' is in the 'dist' variable of
thread 'j'.
Here is the code for thread 'j' (with comments):
'shared' 'synch' pi 'int' 'array'[0..logP,1..P] := undefined if row <> 0;
'shared' ds 'int' 'array'[0..logP,1..P];
'private' tree 'int' := 0; // tree 0 is the original
'private' dist 'int' := ds[0,j]; // list with defined 'pi'
'private' next, next.next 'int'; // and initial 'ds' values
'do' pi[tree,j] <> root --->
next := pi[tree,j] // parent(j) in this tree
next.next := pi[tree,next] // this read may block
dist += ds[tree,next]
ds[tree+1,j] := dist // the new tree is defined
pi[tree+1,j] := next.next // in the next row
tree++
'od'
In both 'pi' and 'ds', each of n = P log P - P + 1 distinct array locations is
written precisely once and read precisely once.
What is the problem? This code is marvelously efficient, indeed, logarithmic,
provided that the number of instruction streams available to support threads
is at least as large as the number of list objects. However, it is woefully
inefficient if the list is much longer than the number of available streams.
In the latter case, it is wiser to divide the list into 's' sublists ('s' is
the number of streams), express the list ranking of each sublist as a serial
algorithm, let the compiler schedule the cyclic reduction of the (temporary)
ranks of the heads of the sublists, and express the final ranking of the nodes
as a serial algorithm. All told, your correspondent didn't get very far in
coding a scalable synchronization benchmark.
The real programming difficulty arose because the processor resource was not
fully virtualized; the programmer (your correspondent) had to tailor the
algorithm as a function of the available stream-level parallelism. In truth,
for the coded algorithm and for very long lists, the stream-level parallelism
in the MTA is insufficiently abundant (we really need the _surabundance_ of
multithreaded lightweight processors in Cascade).
HEC Programming-Language Design Is The Key
We need a full-scale language-design effort to find a good solution to the
problem of parallel programming. We need to find the right programming
abstractions and, to provide continuity to the programmer, tie them together
in a _successful_ higher-level programming language. Parallel programming is
hard for many reasons. One important reason is location sensitivity. This
forces programmers to explicitly manage the mapping of abstract computations
onto physically distributed hardware resources. What is true of locality is
also true of parallelism. Indeed, any resource that is not sufficiently
virtualized will force the programmer to pay attention to that resource for
performance reasons.
Programming languages embody programming models. The key is achieving a
sufficiently high degree of abstraction without suffering performance
degradation. Architectural changes can help with language performance,
especially with communication and synchronization, but we need to find the
right abstractions.
Some of the basic questions are clear. What are the joint responsibilities of
programmer and compiler to specify/generate parallelism? What are the joint
responsibilities of programmer and compiler to specify/generate locality? What
are the appropriate communication and synchronization semantics? What are the
performance characteristics of the language primitives?
Work has begun on the Cascade High-Productivity Language (CHPL). Some of the
reported design criteria are to provide support for explicit parallel
programming, locality-aware programming, and generic programming. There is
both a concrete language that enhances HPF and ZPL, and an abstract language
that supports generic programming. (In part, generic programming is a
technique for generalizing software components so that they can be reused in a
wide variety of situations).
A good language can make parallel programming easier. If we want a bigger
market for high-performance computing, we have to make our high-end systems
easier to use. If we want to address vital national needs, we have to make
our high-end systems more productive. Becoming productive is just a matter of
reversing current negative trends. We want the freedom to choose the
appropriate algorithm for our problem; we want the system to extract the
maximum possible application performance out of each quantum of use of the
system's critical resources; and finally we want parallel programming to be as
high-level and as abstract as sequential programming _ever_ was.
The High-End Crusader, a noted expert in high-performance computing and
communications, wishes to remain anonymous. The opinions expressed here are
his and his alone. All comments are welcome and may be sent to the Editor-in-
Chief, Alan Beck, at alan@tgc.com.
|