
Features:
HIGH-END COMPUTING NEEDS RADICAL PROGRAMMING CHANGE
Commentary from the High-End Crusader
The concern behind DARPA's High-Productivity Computing Systems (HPCS) program
has always been the unacceptably high total cost of ownership, for the value
delivered, of the high-end systems required in the national-security and
industrial-user communities. Typical applications here are theater weather
prediction, materials science, and aerospace vehicle design. Ironically, the
greatest impact of the HPCS program could eventually be on computing as a
whole, which would be the case precisely if general-purpose parallel computing
could somehow be brought into the mainstream of computing as a whole.
The HPCS program focuses on the performance, programmability, portability,
robustness, and ultimately productivity of future high-end computing systems.
To counter excessive hardware and software costs, this program targets new
machines that integrate innovative computer architectures and new programming
language systems to foster radically more efficient use of both human and
machine resources in high-end computing.
HPCS is particularly concerned about a vexing question: how shall we program
our high-performance computers? This leads to several subsidiary questions.
What must future programming models be like? What was wrong with previous
models? Which models should be used instead? These interrogations are the
subject of this article.
Not only do current high-end computers run too slowly, making it impossible
to compute important national problems, they cost too much. Price per peak
megaflop, which you won't get anyway, is excessive. There are also indirect
costs. Power and housing add significant expense. Paying highly skilled
programmers to write highly complex codes for these machines is an even more
significant expense. Indeed, programmability (or its absence) is often a
major factor in determining the total cost of ownership of a high-end system
over its productive lifetime.
While the high-performance computing community struggles heroically with
parallel computing, the mainstream commercial computing community simply
ignores it, focusing instead on mid-range throughput servers where the only
thing that matters is (scalar) single-thread performance. As a result, the
human and financial resources needed to address and solve the many new
problems that parallel computing brings to the table are sorely lacking.
Apart from unsuitable architectures, the main thing inhibiting the growth of
general-purpose parallel computing in both communities is the fact that
parallel programming is still much too hard. If you accept this fact (not
everyone does), then how should we proceed to improve programmability from a
research-agenda point of view? What high-level programming abstractions
should we seek? What features should we demand from new purpose-built
languages for parallel programming?
How We Got Into Trouble
Much of the vitality and breadth of the (sequential) computer revolution
arises from the ease of use and utility of general-purpose computer systems.
General-purpose computers perform well on programs written in standard
high-level (sequential) languages, and little programmer effort is required
to achieve that performance. Moreover, the same program will run well on
other general-purpose computers. Indeed, a standardized (sequential)
programming model has made it possible to spread the costs of hardware and
software development over many customers, with a resulting continuous
improvement in both performance and ease of use over many years.
Unfortunately, the "killer micros" led us astray. From 1979 to 1989, there
was a 50% improvement in processor performance per year. It therefore made
sense to apply increasingly available transistors---as Moore's law promises---
to _implicit_ parallelism, by pipelining the processor and shortening the
clock cycle. All of this could be done while keeping one program counter per
processor, so that parallel programming was never much of an issue.
However, the next decade (1989 - 1999) was an era of diminishing returns.
The rate of processor performance improvement as a function of processor cost
(die area) slowly leveled off, as the last implicit parallelism---also known
as ILP---was squeezed out of individual threads (using out-of-order execution,
branch prediction, etc.). This was justified in part because there was an
immense software pool of sequential code ready to run on these implicitly
parallel machines.
Actually, the inevitable stagnation of (scalar) single-thread processor
performance almost follows logically from the weak parallelism mechanism
(caches used spatially) available in conventional processor architectures.
We do not repeat here our view that large numbers of commodity processors
_cannot_ be cobbled together to build easy-to-use general-purpose parallel
computers to meet all the needs of the high-performance computing community.
The killer is locality: sometimes, our programmers and systems simply cannot
sustain enough of it as the problem size scales. Instead, imagine that---as
computing evolves---the single-thread processor abstraction withers away and
is replaced by multiple program counters per processor. (We do not study
vector pipelining as a parallelism mechanism here but this does not affect the
main theses in our discussion of programmabilty).
The primary question now becomes, what problems do we face in programming
high-performance computers with an abundance of fine-grained threads?
Programming Explicitly Parallel Machines
Broadly speaking, a parallel computer provides high performance as the joint
result of its _parallelism_ mechanisms and its _locality_ mechanisms.
Parallelism mechanisms tolerate latency by doing something else while waiting.
Locality mechanism avoid latency by copying data nearby. Latency tolerance
scales up if parallelism increases with size. Latency avoidance scales up if
locality increases with size.
The most familiar form of latency is (local or remote) memory latency. Recall
that remote memory latency includes the network latency. Locality mechanisms
that avoid memory latency include the use of processor registers, caches used
temporally, and nearby memory. Parallelism mechanisms that tolerate memory
(and other forms of) latency include vector pipelining, caches used spatially,
and multithreading.
Historically, different parallel computers have often dealt with memory
latency by either depending more on their parallelism mechanisms for
performance or depending more on their locality mechanisms (e.g., cacheless
vector machines versus conventional scalar machines). However, given the
performance regimes they target, next-generation parallel computers must have
a good balance between scalable parallelism mechanisms and scalable locality
mechanisms.
Massive parallelism is easily understood as an abundance of fine-grained
threads together with an abundance of fine-grained synchronization. On the
other hand, massive locality should be seen as the result of successful
partitioning: we want threads to be close to the data structures they compute,
we want the load to be balanced, we want the needs for storage to be balanced,
and we want the need for communication to be minimized.
The fundamental programming question is two-fold: 1) how (if at all) does the
programmer specify both parallelism and locality in his program text?, and
2) what is the appropriate division of labor between programmer and language
system for the extraction or generation of both parallelism and locality?
The two subsidiary questions are: 1) is the automatic extraction of
parallelism from large programs generally well-understood and generally
successful?, and 2) is the automatic extraction of locality from large
programs generally well-understood and generally successful?
One of your correspondent's theses is that the answer to the first subsidiary
question is a resounding "yes", while the answer to the second subsidiary
question is a resounding "no".
A program text for an explicitly parallel machine may specify parallelism
either implicitly or explicitly. With explicit specification, the programmer
creates independent threads of control that may interact with shared variables
and explicitly synchronize. With implicit specification, the programmer
writes a serial program that is then transformed by the compiler into parallel
form. The compiler may discover parallelism itself and/or be instructed via
programmer directives.
For current architectures that provide both abundant lightweight parallelism
and abundant lightweight synchronization, we understand very well how to
access these features from high-level languages.
For example, in Cray MTA programs, parallel code is produced for each loop by
the compiler. The compiler generates code to acquire multiple threads and
distribute iterations of the loops to these threads. Barriers are inserted by
the complier to separate the loops. The three general forms that the compiler
can handle are parallel loops, recurrences, and reductions. Probably the most
impressive examples of automatic extraction of parallelism come from the
compiler's ability to transform complicated loop nests in sophisticated ways.
In sharp contrast to the specification of parallelism, decades of experience
suggest that we do _not_ have a free choice between implicit (i.e., no) and
explicit specification of locality in our program texts. For example, when a
programmer is designing parallel graph algorithms for a static dense graph,
finding a partition for the data structures that avoids load-balancing issues
and provides localized communication is straightforward. However, for
applications with data- and/or time-dependent graph structures such as sparse
linear solvers, logic simulation, particle-in-cell codes, and "sort-middle"
polygon rendering, extracting locality becomes extremely challenging even for
the (human) programmer.
Although there are many research programs in many domains that have sought
automatic extraction of locality by some (logical) variant of graph
partitioning, the reasoned conclusion is that they have been unsuccessful in
the general case. One famous example of (attempted) automatic extraction is
David Keyes' research program to localize implicit solvers used in high-end
simulations (see HPCwire article [108052]). In the end, it seems unrealistic
to expect an automated system to handle _every_ demanding locality problem
without significant input from a programmer with a sophisticated understanding
of the computational problem. Deriving locality is not something that
general-purpose software can do. The best such software can do is to make it
easy for intelligent programmers to insert locality specifications into their
program texts.
We can draw a few conclusions already. We have adequate language constructs,
such as futures and synchronization, for expressing explicit parallelism when
we choose to do so. We can also rely on automatic extraction of parallelism
from loop nests (implicitly specified parallelism), with minor annotation.
However, since automatic extraction of locality is in general unsuccessful, we
need to design fresh language constructs that would be adequate for explicitly
specifying locality, in both the regular and irregular cases.
This problem is more difficult than it looks. One valiant effort was
automatic scheduling in High-Performance Fortran (HPF). HPF offers block or
cyclic data distribution, and schedules via the _owner-computes rule_. This
was fine for the layout of regular data structures, but programmers could not
find any way to use (original) HPF to express locality requirements for
_irregular_ data structures. HPF has fallen into disfavor: although it had a
good strategy for array distribution, it was a little too regular for its own
good.
How do you say things like, with high probability, this data element is local
to the thread that computes it? We might introduce special _objects_ as units
of locality, i.e., address spaces within which data accesses are regarded as
local and hence cheap. Thus, a thread executing in such an object is
guaranteed local access to any data structure declared within that object.
This seems too restrictive. A more general notion is that irregular data
structures are composed of many objects, which can be added or removed. For
example, when new objects are added to a data structure and new threads are
spawned to compute them, we could have a rule that says each new thread is
local to the added object it computes. This has some of the flavor of a
lightweight thread that _migrates_ to be close to the data it computes.
These are idle thoughts of your correspondent. We will have a more concrete
proposal that can and should be critically assessed by the high-performance
computing community after Cray releases Chapel (Cascade High-Productivity
Programming Language). Chapel is a new locality-aware parallel programming
language under development for the HPCS program.
Problems In Previous Models
Parallel programming, whether using conventional message passing or
conventional shared memory, is too tedious. Each of these two traditional
paradigms imposes significant burdens on the programmer, both when he tries
to express parallelism and when he tries to express locality.
Suppose that you are given Pthreads in C++ and a global address space, and
must live with that. In this case, you need to handle synchronization and to
schedule values into shared variables. This is painful. While having a
shared address space is generally desirable in a parallel computer, shared
variables are much less desirable in a parallel programming language. Since
stores do not commute with loads or other stores, data races are common in
programs written in such languages. Avoiding data races is the first part
of the programming burden.
We should avoid vanilla shared memory as a programming model. In this model,
variables directly reflect the memory hardware and programs schedule values
into variables. For parallel programs, this can be pretty tricky. Variable
references must be properly synchronized. Barriers are straightforward but
tend to oversynchronize the computation. Wait and signal are more fine
grained (hence better), but require accurate dependence information. There
are good alternatives to ordinary variables, which will be discussed below.
The absence of language constructs for the specification of locality in a
general and portable way makes locality expression in shared-memory parallel
programming even more painful than data races and synchronization. Locality
optimization competes with load balancing. Dynamic load re-balancing changes
"who's where". Data layout for irregular data structures is painful. Again,
we need a language system that makes it easy for intelligent programmers to
insert locality specifications into their program texts. Locality extraction
can, at best, only be a _semi-automated_ process.
In conventional message passing, on the other hand, the programmer burden
comes from the fact that there is a partitioned, i.e., fragmented, address
space and from the fact that the programming model only supports explicit
parallelism.
Every data structure must be explicitly partitioned so that each data element
belongs to one of the partitions of the address space. Clearly, this adds
complexity to programming. You might think that this would be good for
specifying locality. It is not. Most message-passing programs are written
using the single-program, multiple-data (SPMD) approach, for reasons of simple
practicality. The MPI threads execute within identical images, and these must
be well balanced. For dynamic and/or unstructured problems, the ability to
specify locality with MPI programming is actually quite limited.
Moreover, the message-passing programming model requires that all parallelism
be coded explicitly by the programmer. This is quite painful. We have a
number of threads, each with its own address space. Communication requires
the cooperation of the sender and the receiver. What is worse, the sender
must know too much about the receiver: does the thread still exist?; where is
it?; is it running?; is it ready for this message? The thread that has the
data must participate in the communication even if it has no logical
connection to the computation at the requesting thread.
If we are serious about programmability, we must resolutely phase out message
passing between threads (as in MPI) as a programming model. If we do not,
parallel computing will be essentially dead within ten years.
A Better Programming Model
A high-productivity parallel programming language must provide abstractions
that are higher level than those of the underlying architecture, simplifying
the structure of software and reducing the difficulty of its construction.
In particular, we don't want to expose either message passing or shared memory
as the programming model. For example, we could replace message passing with
remote procedure call (RPC), in which an RPC client calls a _stub_ procedure
that is local to the RPC server.
Programming using fine-grained message passing is too limiting, for specifying
either parallelism or locality. In contrast, shared memory allows reasonable
specification of both parallelism and locality; it also has great value as a
vehicle for delivering high bandwidth: 1) it provides low communication
overhead so that latency tolerance works, and 2) it allows fine-grained access
to data structures. Nevertheless, for reasons of programmability, we propose
to replace shared memory with a particular combination of functional and
imperative programming. (Another idle thought of your correspondent is that
there might just possibly be a distinction between the declarative and the
imperative specification of locality).
Existing programming languages are classified into families based on their
model of computation. One scheme groups the _declarative_ languages
(including the functional languages Lisp/Scheme, ML, and Haskell) separately
from the _imperative_ languages (including the von Neumann languages Fortran
and C, and the object-oriented languages Smalltalk, C++, and Java). One often
says that declarative languages are higher level, while imperative languages
are performance oriented. There are also differences in expressiveness.
For our purposes, we note that functional languages are based on expressions
that have values, while imperative languages are based on assignment
statements that influence subsequent computation by changing the value of
memory (i.e., they modify the state). Incidentally, these distinctions are
orthogonal to whether we have a _parallel_ programming language. To be
sufficiently expressive, a functional language must include some imperative
features.
Speaking strictly, _functional programming_ defines the outputs of a program
as a mathematical function of the inputs, with no notion of internal state,
and thus no side effects. For example, Haskell and Sisal are purely
functional, while Lisp/Scheme and ML include imperative features.
Experience with a variety of large commercial projects suggests that the
absence of side effects makes functional programs easier to write, debug, and
maintain than their imperative counterparts. For a given set of arguments, a
function always returns the same results. Bad scheduling of values into
variables, e.g., misordered updates, simply does not occur.
On the other hand, there are common programming idioms in which assignment
statements, i.e., modifications of the state, play a central role. Familiar
examples include initialization of complex data structures, computing
histograms, and modifying data structures in place without creating copies.
The obvious solution is to view computation as consisting of a combination of
functional evaluation and state modification.
We need language constructs that allow us to access arbitrary elements of a
complex data structure, and an implementation that is able to distinguish
when the old version of the data structure will never be used again, so that
it may be updated in place. Sisal is a nice example of a language that
combines array types and iterative syntax with purely functional semantics.
In our integrated functional/imperative parallel programming language, we
don't want to use shared memory as the programming model for the imperative
aspects. This is done in several ways.
To start with, we employ alternatives to ordinary single-word shared
variables, namely, producer-consumer variables, single-assignment variables,
and linear variables.
Producer-consumer variables, efficiently implemented with full-empty bits,
force alternation of loads and stores (premature references are forced to
wait). Thus, a store waits for empty and sets full, while a load waits for
full and sets empty. Producer-consumer variables support _value passing_
between threads (for example, in reductions and recurrences). They can also
implement barriers and wait/signal.
Single-assignment variables, which have a functional flavor, are not variables
at all, but dynamic constants (any loads that precede the store are forced to
wait). The value of a single-assignment variable is initially undefined. A
thread that attempts to read the variable will wait until it is assigned a
value by some other thread. It is a runtime error for any thread to assign to
a single-assignment variable that already has a value.
Single-assignment variables can be used to eliminate data races. For example,
we can use layers of single-assignment variables instead of barriers. A nice
example is in the code for cyclic reduction. In line with their functional
flavor, a key issue is when to reclaim them. For efficiency, dependence
analysis is required. The programmer usually knows which load is last.
Finally, linear variables are single-assignment variables that are only _used_
once. This does away for the need for functions that make copies of values.
Also, there is no uncertainty about which load is last.
We have agreed to avoid functional programming in its pure form. Dealing with
state in (relatively pure) functional languages can be awkward. Yet support
for "stateful" computation is important for efficiency and expressiveness. In
a parallel language, we also need a mechanism to perform multiple-word atomic
updates (transactions, for short). This radically changes the programmability
of atomically modifying several words in a complex data structure or
performing an operation that requires the exclusive use of several objects.
Logically, we want any one of multiple concurrent threads to be able to
acquire multiple fine-grained locks without the risk of deadlock. This is a
problem when there is no agreed-upon order for lock acquisition.
For correctness, each state operation (transaction) has a guard that defines
when the operation can be performed, and restores that guard before exiting.
In other words, there is some invariant that, say, expresses the integrity of
a complex data structure. Each atomic transaction must preserve this
invariant. When multiple transactions execute concurrently, exactly which
state transitions occur in which order is nondeterministic, but the (logical)
guard ensures that only correct states are reachable. Most often, a
particular computation contains a finite set of transactions so that, even
though the state trajectory of intermediate states is nondeterministic, the
final state is not.
Not having to worry about deadlock gives rise to an enormous increase in
programmability. The programmer must have some invariant in mind, and he
must identify a set of fine-grained locks (essentially, the footprint of the
transaction within the entire set of variables that are constrained by the
invariant) such that holding all of them is sufficient to ensure the atomicity
of the update. The language system must implement this so that the programmer
can dynamically acquire locks, e.g., as they become identified through pointer
chasing, and no lock-acquisition order ever leads to deadlock.
The language construct required for transactions is just a guard, expressing
the nested acquisition of a set of locks, protecting a statement list. The
system implementation extends the use of full-empty bits. Producer-consumer
variables are supported by full-empty bits. A trap occurs after a thread has
waited for a while; the trap handler enqueues the thread for later resumption.
Transactions are implemented by locking objects incrementally, creating a
linked list of objects that have been locked. If all locks are acquired, the
transaction commits and all locks are released. If the acquisition of locks
blocks midway, all acquired locks are released in reverse order and the
transaction aborts. A separate retry mechanism is required.
Architectural And System Support
The architectural support required for new programming languages is closely
conditioned on the "additional features" they provide. In general, we require
special hardware support both to implement single-assignment languages and
when fine-grained synchronization and fine-grained atomicity are vital
elements of the programming model. Architecture and language are inextricably
linked: to improve either, we must improve both. For example, MPI directly
reflects an architectural idea, that of "nodes" communicating via heavyweight
messages.
Still, we may have very different programming goals from those permitted by
the two status-quo programming systems, OpenMP and MPI. For example, we might
want: 1) to allow fine-grained, anonymous communication between threads, 2) to
enable dynamic scheduling and load balancing, 3) to exploit diverse forms of
parallelism, and 4) to exploit diverse forms of locality.
A language is also supported by its compiler technology. Not by accident,
some of the enabling compiler optimization technologies that support compiling
for Cascade lightweight threads, which migrate to the data they require, also
support the Chapel programming language. These technologies include: 1)
dependence analysis, 2) interprocedural fact propagation, 3) dynamic loop-nest
scheduling, and 4) procedure inlining. It will be interesting to see how
Chapel's syntactic constructs for specifying locality, and their underlying
implementation, interact with Cascade's main locality mechanisms (if they do).
But compiler technology will improve programmer productivity only if it is
used to implement appropriate language features. We have suggested that such
features include 1) proper elements of functional programming, 2) concepts
from functional programming such as single-assignment variables, augmented by
transactions, to tame "stateful" computation when using the imperative
elements, as well as 3) distinct syntactic constructs for specifying
parallelism and locality, to enforce separation of concerns but also to allow
general locality extraction in the first place.
More traditional architectural features (traditional, in any case, in the
high-bandwidth camp) also improve programmability. We mention (or recall):
1) shared memory with lightweight synchronization that supports both
fine-grained access to data structures and zero-overhead communication, 2) a
high-bandwidth interconnection network enabled by a sophisticated network
topology and intelligent routers, and in Cascade, 3) distinct processor types
that execute, respectively, heavyweight threads with high temporal locality
and lightweight threads with (at least some) spatial locality, which maximizes
the amount of computation that is obtained by consuming a given amount of
bandwidth.
But Chapel's main contribution to parallel programming appears to lie in a
vast improvement in expressiveness, with large performance consequences.
There is a global view of computation and data structures. There is support
for structured data and task parallelism, including domains such as sparse
arrays and graphs, and user-extensible distributions, reductions, and
iterators. There is the ability to tune for (or ignore) locality using
_domains_.
Chapel retains the composability that is such a desirable feature of
functional languages while improving generality by allowing manipulation of
the state. Functional languages are deterministic; introducing state leads to
nondeterminism (e.g., races). We have discussed transactions as a mechanism
that allows only good nondeterminism in our programs: we proceed
nondeterministically through a trajectory of correct intermediate states to
a (deterministic) correct final state. As with locality, we are allowing the
programmer to specify something that the system itself does not understand.
Conclusion
High-end computing is experiencing a full-blown productivity crisis, whose
dimensions are only now beginning to be understood. We have entered a period
of decreasing productivity as measured by the programming skill required, each
time a significantly higher-capability machine is acquired, to make a lasting
impact on an applications area by repeatedly developing new applications for
the new machine. Moreover, this productivity shortfall inhibits
general-purpose parallel computing from ever claiming a larger share of the
vast commercial market.
Weakly parallel, low-bandwidth systems do not support fine-grained parallelism
due to communication and synchronization costs. These systems are suited only
to _easily localizable_ applications, i.e., those applications without
significant long-range communication due to algorithmic structure, and where
load balancing is straightforward due to regularity and non-adaptivity.
Whether explicit locality specification will improve the performance
programming of these machines is unclear (but interesting).
Modern strongly parallel, high-bandwidth systems _do_ support fine-grained
parallelism, and already possess many of the hardware hooks that would be
required to implement an integrated functional/imperative parallel programming
language that uses neither message passing nor shared memory as its
programming model. Designing an efficient optimizing compiler for such a
language and such systems might lead to improvements in both languages and
architectures.
The improvement in productivity in the proposed scheme comes from the
interaction between 1) combined functional and imperative elements, and 2)
combined (but orthogonal) specification of parallelism and locality. The goal
is to enhance productivity by achieving both generality and composability for
the harried programmer of our high-end systems.
The High-End Crusader, a noted expert in high-performance computing and
communications, shall remain anonymous. He alone bears responsibility for
these commentaries. Replies are welcome and may be sent to HPCwire editor
Tim Curns at tim@hpcwire.com.
|