HPCwire
 The global publication of record for High Performance Computing / September 17, 2004: Vol. 13, No. 37

Previous Article   |  Table of Contents  |  

Features:

HIGH-END COMPUTING NEEDS RADICAL PROGRAMMING CHANGE
Commentary from the High-End Crusader

The concern behind DARPA's High-Productivity Computing Systems (HPCS) program has always been the unacceptably high total cost of ownership, for the value delivered, of the high-end systems required in the national-security and industrial-user communities. Typical applications here are theater weather prediction, materials science, and aerospace vehicle design. Ironically, the greatest impact of the HPCS program could eventually be on computing as a whole, which would be the case precisely if general-purpose parallel computing could somehow be brought into the mainstream of computing as a whole.

The HPCS program focuses on the performance, programmability, portability, robustness, and ultimately productivity of future high-end computing systems. To counter excessive hardware and software costs, this program targets new machines that integrate innovative computer architectures and new programming language systems to foster radically more efficient use of both human and machine resources in high-end computing.

HPCS is particularly concerned about a vexing question: how shall we program our high-performance computers? This leads to several subsidiary questions. What must future programming models be like? What was wrong with previous models? Which models should be used instead? These interrogations are the subject of this article.

Not only do current high-end computers run too slowly, making it impossible to compute important national problems, they cost too much. Price per peak megaflop, which you won't get anyway, is excessive. There are also indirect costs. Power and housing add significant expense. Paying highly skilled programmers to write highly complex codes for these machines is an even more significant expense. Indeed, programmability (or its absence) is often a major factor in determining the total cost of ownership of a high-end system over its productive lifetime.

While the high-performance computing community struggles heroically with parallel computing, the mainstream commercial computing community simply ignores it, focusing instead on mid-range throughput servers where the only thing that matters is (scalar) single-thread performance. As a result, the human and financial resources needed to address and solve the many new problems that parallel computing brings to the table are sorely lacking.

Apart from unsuitable architectures, the main thing inhibiting the growth of general-purpose parallel computing in both communities is the fact that parallel programming is still much too hard. If you accept this fact (not everyone does), then how should we proceed to improve programmability from a research-agenda point of view? What high-level programming abstractions should we seek? What features should we demand from new purpose-built languages for parallel programming?

How We Got Into Trouble

Much of the vitality and breadth of the (sequential) computer revolution arises from the ease of use and utility of general-purpose computer systems. General-purpose computers perform well on programs written in standard high-level (sequential) languages, and little programmer effort is required to achieve that performance. Moreover, the same program will run well on other general-purpose computers. Indeed, a standardized (sequential) programming model has made it possible to spread the costs of hardware and software development over many customers, with a resulting continuous improvement in both performance and ease of use over many years.

Unfortunately, the "killer micros" led us astray. From 1979 to 1989, there was a 50% improvement in processor performance per year. It therefore made sense to apply increasingly available transistors---as Moore's law promises--- to _implicit_ parallelism, by pipelining the processor and shortening the clock cycle. All of this could be done while keeping one program counter per processor, so that parallel programming was never much of an issue.

However, the next decade (1989 - 1999) was an era of diminishing returns. The rate of processor performance improvement as a function of processor cost (die area) slowly leveled off, as the last implicit parallelism---also known as ILP---was squeezed out of individual threads (using out-of-order execution, branch prediction, etc.). This was justified in part because there was an immense software pool of sequential code ready to run on these implicitly parallel machines.

Actually, the inevitable stagnation of (scalar) single-thread processor performance almost follows logically from the weak parallelism mechanism (caches used spatially) available in conventional processor architectures. We do not repeat here our view that large numbers of commodity processors _cannot_ be cobbled together to build easy-to-use general-purpose parallel computers to meet all the needs of the high-performance computing community. The killer is locality: sometimes, our programmers and systems simply cannot sustain enough of it as the problem size scales. Instead, imagine that---as computing evolves---the single-thread processor abstraction withers away and is replaced by multiple program counters per processor. (We do not study vector pipelining as a parallelism mechanism here but this does not affect the main theses in our discussion of programmabilty).

The primary question now becomes, what problems do we face in programming high-performance computers with an abundance of fine-grained threads?

Programming Explicitly Parallel Machines

Broadly speaking, a parallel computer provides high performance as the joint result of its _parallelism_ mechanisms and its _locality_ mechanisms. Parallelism mechanisms tolerate latency by doing something else while waiting. Locality mechanism avoid latency by copying data nearby. Latency tolerance scales up if parallelism increases with size. Latency avoidance scales up if locality increases with size.

The most familiar form of latency is (local or remote) memory latency. Recall that remote memory latency includes the network latency. Locality mechanisms that avoid memory latency include the use of processor registers, caches used temporally, and nearby memory. Parallelism mechanisms that tolerate memory (and other forms of) latency include vector pipelining, caches used spatially, and multithreading.

Historically, different parallel computers have often dealt with memory latency by either depending more on their parallelism mechanisms for performance or depending more on their locality mechanisms (e.g., cacheless vector machines versus conventional scalar machines). However, given the performance regimes they target, next-generation parallel computers must have a good balance between scalable parallelism mechanisms and scalable locality mechanisms.

Massive parallelism is easily understood as an abundance of fine-grained threads together with an abundance of fine-grained synchronization. On the other hand, massive locality should be seen as the result of successful partitioning: we want threads to be close to the data structures they compute, we want the load to be balanced, we want the needs for storage to be balanced, and we want the need for communication to be minimized.

The fundamental programming question is two-fold: 1) how (if at all) does the programmer specify both parallelism and locality in his program text?, and 2) what is the appropriate division of labor between programmer and language system for the extraction or generation of both parallelism and locality? The two subsidiary questions are: 1) is the automatic extraction of parallelism from large programs generally well-understood and generally successful?, and 2) is the automatic extraction of locality from large programs generally well-understood and generally successful?

One of your correspondent's theses is that the answer to the first subsidiary question is a resounding "yes", while the answer to the second subsidiary question is a resounding "no".

A program text for an explicitly parallel machine may specify parallelism either implicitly or explicitly. With explicit specification, the programmer creates independent threads of control that may interact with shared variables and explicitly synchronize. With implicit specification, the programmer writes a serial program that is then transformed by the compiler into parallel form. The compiler may discover parallelism itself and/or be instructed via programmer directives.

For current architectures that provide both abundant lightweight parallelism and abundant lightweight synchronization, we understand very well how to access these features from high-level languages.

For example, in Cray MTA programs, parallel code is produced for each loop by the compiler. The compiler generates code to acquire multiple threads and distribute iterations of the loops to these threads. Barriers are inserted by the complier to separate the loops. The three general forms that the compiler can handle are parallel loops, recurrences, and reductions. Probably the most impressive examples of automatic extraction of parallelism come from the compiler's ability to transform complicated loop nests in sophisticated ways.

In sharp contrast to the specification of parallelism, decades of experience suggest that we do _not_ have a free choice between implicit (i.e., no) and explicit specification of locality in our program texts. For example, when a programmer is designing parallel graph algorithms for a static dense graph, finding a partition for the data structures that avoids load-balancing issues and provides localized communication is straightforward. However, for applications with data- and/or time-dependent graph structures such as sparse linear solvers, logic simulation, particle-in-cell codes, and "sort-middle" polygon rendering, extracting locality becomes extremely challenging even for the (human) programmer.

Although there are many research programs in many domains that have sought automatic extraction of locality by some (logical) variant of graph partitioning, the reasoned conclusion is that they have been unsuccessful in the general case. One famous example of (attempted) automatic extraction is David Keyes' research program to localize implicit solvers used in high-end simulations (see HPCwire article [108052]). In the end, it seems unrealistic to expect an automated system to handle _every_ demanding locality problem without significant input from a programmer with a sophisticated understanding of the computational problem. Deriving locality is not something that general-purpose software can do. The best such software can do is to make it easy for intelligent programmers to insert locality specifications into their program texts.

We can draw a few conclusions already. We have adequate language constructs, such as futures and synchronization, for expressing explicit parallelism when we choose to do so. We can also rely on automatic extraction of parallelism from loop nests (implicitly specified parallelism), with minor annotation. However, since automatic extraction of locality is in general unsuccessful, we need to design fresh language constructs that would be adequate for explicitly specifying locality, in both the regular and irregular cases.

This problem is more difficult than it looks. One valiant effort was automatic scheduling in High-Performance Fortran (HPF). HPF offers block or cyclic data distribution, and schedules via the _owner-computes rule_. This was fine for the layout of regular data structures, but programmers could not find any way to use (original) HPF to express locality requirements for _irregular_ data structures. HPF has fallen into disfavor: although it had a good strategy for array distribution, it was a little too regular for its own good.

How do you say things like, with high probability, this data element is local to the thread that computes it? We might introduce special _objects_ as units of locality, i.e., address spaces within which data accesses are regarded as local and hence cheap. Thus, a thread executing in such an object is guaranteed local access to any data structure declared within that object. This seems too restrictive. A more general notion is that irregular data structures are composed of many objects, which can be added or removed. For example, when new objects are added to a data structure and new threads are spawned to compute them, we could have a rule that says each new thread is local to the added object it computes. This has some of the flavor of a lightweight thread that _migrates_ to be close to the data it computes.

These are idle thoughts of your correspondent. We will have a more concrete proposal that can and should be critically assessed by the high-performance computing community after Cray releases Chapel (Cascade High-Productivity Programming Language). Chapel is a new locality-aware parallel programming language under development for the HPCS program.

Problems In Previous Models

Parallel programming, whether using conventional message passing or conventional shared memory, is too tedious. Each of these two traditional paradigms imposes significant burdens on the programmer, both when he tries to express parallelism and when he tries to express locality.

Suppose that you are given Pthreads in C++ and a global address space, and must live with that. In this case, you need to handle synchronization and to schedule values into shared variables. This is painful. While having a shared address space is generally desirable in a parallel computer, shared variables are much less desirable in a parallel programming language. Since stores do not commute with loads or other stores, data races are common in programs written in such languages. Avoiding data races is the first part of the programming burden.

We should avoid vanilla shared memory as a programming model. In this model, variables directly reflect the memory hardware and programs schedule values into variables. For parallel programs, this can be pretty tricky. Variable references must be properly synchronized. Barriers are straightforward but tend to oversynchronize the computation. Wait and signal are more fine grained (hence better), but require accurate dependence information. There are good alternatives to ordinary variables, which will be discussed below.

The absence of language constructs for the specification of locality in a general and portable way makes locality expression in shared-memory parallel programming even more painful than data races and synchronization. Locality optimization competes with load balancing. Dynamic load re-balancing changes "who's where". Data layout for irregular data structures is painful. Again, we need a language system that makes it easy for intelligent programmers to insert locality specifications into their program texts. Locality extraction can, at best, only be a _semi-automated_ process.

In conventional message passing, on the other hand, the programmer burden comes from the fact that there is a partitioned, i.e., fragmented, address space and from the fact that the programming model only supports explicit parallelism.

Every data structure must be explicitly partitioned so that each data element belongs to one of the partitions of the address space. Clearly, this adds complexity to programming. You might think that this would be good for specifying locality. It is not. Most message-passing programs are written using the single-program, multiple-data (SPMD) approach, for reasons of simple practicality. The MPI threads execute within identical images, and these must be well balanced. For dynamic and/or unstructured problems, the ability to specify locality with MPI programming is actually quite limited.

Moreover, the message-passing programming model requires that all parallelism be coded explicitly by the programmer. This is quite painful. We have a number of threads, each with its own address space. Communication requires the cooperation of the sender and the receiver. What is worse, the sender must know too much about the receiver: does the thread still exist?; where is it?; is it running?; is it ready for this message? The thread that has the data must participate in the communication even if it has no logical connection to the computation at the requesting thread.

If we are serious about programmability, we must resolutely phase out message passing between threads (as in MPI) as a programming model. If we do not, parallel computing will be essentially dead within ten years.

A Better Programming Model

A high-productivity parallel programming language must provide abstractions that are higher level than those of the underlying architecture, simplifying the structure of software and reducing the difficulty of its construction. In particular, we don't want to expose either message passing or shared memory as the programming model. For example, we could replace message passing with remote procedure call (RPC), in which an RPC client calls a _stub_ procedure that is local to the RPC server.

Programming using fine-grained message passing is too limiting, for specifying either parallelism or locality. In contrast, shared memory allows reasonable specification of both parallelism and locality; it also has great value as a vehicle for delivering high bandwidth: 1) it provides low communication overhead so that latency tolerance works, and 2) it allows fine-grained access to data structures. Nevertheless, for reasons of programmability, we propose to replace shared memory with a particular combination of functional and imperative programming. (Another idle thought of your correspondent is that there might just possibly be a distinction between the declarative and the imperative specification of locality).

Existing programming languages are classified into families based on their model of computation. One scheme groups the _declarative_ languages (including the functional languages Lisp/Scheme, ML, and Haskell) separately from the _imperative_ languages (including the von Neumann languages Fortran and C, and the object-oriented languages Smalltalk, C++, and Java). One often says that declarative languages are higher level, while imperative languages are performance oriented. There are also differences in expressiveness.

For our purposes, we note that functional languages are based on expressions that have values, while imperative languages are based on assignment statements that influence subsequent computation by changing the value of memory (i.e., they modify the state). Incidentally, these distinctions are orthogonal to whether we have a _parallel_ programming language. To be sufficiently expressive, a functional language must include some imperative features.

Speaking strictly, _functional programming_ defines the outputs of a program as a mathematical function of the inputs, with no notion of internal state, and thus no side effects. For example, Haskell and Sisal are purely functional, while Lisp/Scheme and ML include imperative features.

Experience with a variety of large commercial projects suggests that the absence of side effects makes functional programs easier to write, debug, and maintain than their imperative counterparts. For a given set of arguments, a function always returns the same results. Bad scheduling of values into variables, e.g., misordered updates, simply does not occur.

On the other hand, there are common programming idioms in which assignment statements, i.e., modifications of the state, play a central role. Familiar examples include initialization of complex data structures, computing histograms, and modifying data structures in place without creating copies. The obvious solution is to view computation as consisting of a combination of functional evaluation and state modification.

We need language constructs that allow us to access arbitrary elements of a complex data structure, and an implementation that is able to distinguish when the old version of the data structure will never be used again, so that it may be updated in place. Sisal is a nice example of a language that combines array types and iterative syntax with purely functional semantics.

In our integrated functional/imperative parallel programming language, we don't want to use shared memory as the programming model for the imperative aspects. This is done in several ways.

To start with, we employ alternatives to ordinary single-word shared variables, namely, producer-consumer variables, single-assignment variables, and linear variables.

Producer-consumer variables, efficiently implemented with full-empty bits, force alternation of loads and stores (premature references are forced to wait). Thus, a store waits for empty and sets full, while a load waits for full and sets empty. Producer-consumer variables support _value passing_ between threads (for example, in reductions and recurrences). They can also implement barriers and wait/signal.

Single-assignment variables, which have a functional flavor, are not variables at all, but dynamic constants (any loads that precede the store are forced to wait). The value of a single-assignment variable is initially undefined. A thread that attempts to read the variable will wait until it is assigned a value by some other thread. It is a runtime error for any thread to assign to a single-assignment variable that already has a value.

Single-assignment variables can be used to eliminate data races. For example, we can use layers of single-assignment variables instead of barriers. A nice example is in the code for cyclic reduction. In line with their functional flavor, a key issue is when to reclaim them. For efficiency, dependence analysis is required. The programmer usually knows which load is last.

Finally, linear variables are single-assignment variables that are only _used_ once. This does away for the need for functions that make copies of values. Also, there is no uncertainty about which load is last.

We have agreed to avoid functional programming in its pure form. Dealing with state in (relatively pure) functional languages can be awkward. Yet support for "stateful" computation is important for efficiency and expressiveness. In a parallel language, we also need a mechanism to perform multiple-word atomic updates (transactions, for short). This radically changes the programmability of atomically modifying several words in a complex data structure or performing an operation that requires the exclusive use of several objects.

Logically, we want any one of multiple concurrent threads to be able to acquire multiple fine-grained locks without the risk of deadlock. This is a problem when there is no agreed-upon order for lock acquisition.

For correctness, each state operation (transaction) has a guard that defines when the operation can be performed, and restores that guard before exiting. In other words, there is some invariant that, say, expresses the integrity of a complex data structure. Each atomic transaction must preserve this invariant. When multiple transactions execute concurrently, exactly which state transitions occur in which order is nondeterministic, but the (logical) guard ensures that only correct states are reachable. Most often, a particular computation contains a finite set of transactions so that, even though the state trajectory of intermediate states is nondeterministic, the final state is not.

Not having to worry about deadlock gives rise to an enormous increase in programmability. The programmer must have some invariant in mind, and he must identify a set of fine-grained locks (essentially, the footprint of the transaction within the entire set of variables that are constrained by the invariant) such that holding all of them is sufficient to ensure the atomicity of the update. The language system must implement this so that the programmer can dynamically acquire locks, e.g., as they become identified through pointer chasing, and no lock-acquisition order ever leads to deadlock.

The language construct required for transactions is just a guard, expressing the nested acquisition of a set of locks, protecting a statement list. The system implementation extends the use of full-empty bits. Producer-consumer variables are supported by full-empty bits. A trap occurs after a thread has waited for a while; the trap handler enqueues the thread for later resumption. Transactions are implemented by locking objects incrementally, creating a linked list of objects that have been locked. If all locks are acquired, the transaction commits and all locks are released. If the acquisition of locks blocks midway, all acquired locks are released in reverse order and the transaction aborts. A separate retry mechanism is required.

Architectural And System Support

The architectural support required for new programming languages is closely conditioned on the "additional features" they provide. In general, we require special hardware support both to implement single-assignment languages and when fine-grained synchronization and fine-grained atomicity are vital elements of the programming model. Architecture and language are inextricably linked: to improve either, we must improve both. For example, MPI directly reflects an architectural idea, that of "nodes" communicating via heavyweight messages.

Still, we may have very different programming goals from those permitted by the two status-quo programming systems, OpenMP and MPI. For example, we might want: 1) to allow fine-grained, anonymous communication between threads, 2) to enable dynamic scheduling and load balancing, 3) to exploit diverse forms of parallelism, and 4) to exploit diverse forms of locality.

A language is also supported by its compiler technology. Not by accident, some of the enabling compiler optimization technologies that support compiling for Cascade lightweight threads, which migrate to the data they require, also support the Chapel programming language. These technologies include: 1) dependence analysis, 2) interprocedural fact propagation, 3) dynamic loop-nest scheduling, and 4) procedure inlining. It will be interesting to see how Chapel's syntactic constructs for specifying locality, and their underlying implementation, interact with Cascade's main locality mechanisms (if they do).

But compiler technology will improve programmer productivity only if it is used to implement appropriate language features. We have suggested that such features include 1) proper elements of functional programming, 2) concepts from functional programming such as single-assignment variables, augmented by transactions, to tame "stateful" computation when using the imperative elements, as well as 3) distinct syntactic constructs for specifying parallelism and locality, to enforce separation of concerns but also to allow general locality extraction in the first place.

More traditional architectural features (traditional, in any case, in the high-bandwidth camp) also improve programmability. We mention (or recall): 1) shared memory with lightweight synchronization that supports both fine-grained access to data structures and zero-overhead communication, 2) a high-bandwidth interconnection network enabled by a sophisticated network topology and intelligent routers, and in Cascade, 3) distinct processor types that execute, respectively, heavyweight threads with high temporal locality and lightweight threads with (at least some) spatial locality, which maximizes the amount of computation that is obtained by consuming a given amount of bandwidth.

But Chapel's main contribution to parallel programming appears to lie in a vast improvement in expressiveness, with large performance consequences. There is a global view of computation and data structures. There is support for structured data and task parallelism, including domains such as sparse arrays and graphs, and user-extensible distributions, reductions, and iterators. There is the ability to tune for (or ignore) locality using _domains_.

Chapel retains the composability that is such a desirable feature of functional languages while improving generality by allowing manipulation of the state. Functional languages are deterministic; introducing state leads to nondeterminism (e.g., races). We have discussed transactions as a mechanism that allows only good nondeterminism in our programs: we proceed nondeterministically through a trajectory of correct intermediate states to a (deterministic) correct final state. As with locality, we are allowing the programmer to specify something that the system itself does not understand.

Conclusion

High-end computing is experiencing a full-blown productivity crisis, whose dimensions are only now beginning to be understood. We have entered a period of decreasing productivity as measured by the programming skill required, each time a significantly higher-capability machine is acquired, to make a lasting impact on an applications area by repeatedly developing new applications for the new machine. Moreover, this productivity shortfall inhibits general-purpose parallel computing from ever claiming a larger share of the vast commercial market.

Weakly parallel, low-bandwidth systems do not support fine-grained parallelism due to communication and synchronization costs. These systems are suited only to _easily localizable_ applications, i.e., those applications without significant long-range communication due to algorithmic structure, and where load balancing is straightforward due to regularity and non-adaptivity. Whether explicit locality specification will improve the performance programming of these machines is unclear (but interesting).

Modern strongly parallel, high-bandwidth systems _do_ support fine-grained parallelism, and already possess many of the hardware hooks that would be required to implement an integrated functional/imperative parallel programming language that uses neither message passing nor shared memory as its programming model. Designing an efficient optimizing compiler for such a language and such systems might lead to improvements in both languages and architectures.

The improvement in productivity in the proposed scheme comes from the interaction between 1) combined functional and imperative elements, and 2) combined (but orthogonal) specification of parallelism and locality. The goal is to enhance productivity by achieving both generality and composability for the harried programmer of our high-end systems.


The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for these commentaries. Replies are welcome and may be sent to HPCwire editor Tim Curns at tim@hpcwire.com.


Top of Page

Previous Article   |  Table of Contents  |