The global publication of record for High Performance Computing / February 4, 2005: Vol. 14, No. 5


Commentary from the High-End Crusader

The long-awaited, comprehensive, and tightly argued report "Getting Up to Speed: The Future of Supercomputing", issued in November by the National Academies' National Research Council, deserves careful consideration. The report demonstrates conclusively -- at least to this observer -- that the government must take primary responsibility for the problem of supercomputing. This proposal merits the close attention of the high-performance computing community -- quite apart from the immense scope of what is at stake -- both because it has the requisite boldness to make a difference to our desperate plight and because there is sustained opposition to it from three broad sources: 1) certain retrograde sectors of some federal agencies, notably within both DOE and DoD, 2) certain computer vendors, who shall remain nameless, and 3) the administration's Office of Management and Budget (OMB), which recently emasculated the High-End Computing Revitalization Task Force (HECRTF).

In one sense, the central proposal of "Getting Up to Speed" is an immediate corollary of the core conviction of what may be called the "progressive camp" in high-end computing, which apparently held a slim voting majority within the committee that wrote the report. High-end computing progressives share the well-founded belief that supercomputing is in deep trouble. Three simple situations illustrate this trouble: 1) supercomputer architectures stagnate as PC-based clusters dominate, 2) parallel programming languages stagnate as MPI reigns supreme, and 3) computational-engineering applications stagnate as inappropriate platforms and programming difficulties cause industry to "think small".

The committee on the future of supercomputing concluded that strong U.S. government leadership and bold new government policies are required to meet obvious national needs for supercomputing given the inevitable technological consequences of continuing with a status quo in which technology advances are driven almost exclusively by commercial market forces. The committee noted: "Several factors have led to the recent reexamination of the rationale for federal investment in research and development in support of high-performance computing, including 1) continuing changes in the various component technologies and their markets, 2) the evolution of the computing market, particularly the high-end supercomputing segment, 3) experience with several systems using the clustered processor architecture, and 4) the evolution of the problems, many of them mission driven, for which supercomputers are used".

The committee's overall recommendation is this: "To meet the current and future needs of the United States, the government agencies that depend on supercomputing, together with the U.S. Congress, need to take primary responsibility for accelerating advances in supercomputing and ensuring that there are multiple strong domestic suppliers of both hardware and software".

In simple language, the government must own the problem of supercomputing, just the way, for example, it would need to own the proposed mission to Mars or the way it currently _does_ own the war in Iraq. Such problems cannot be left to the private sector.

There are two potential misconceptions here. First, to say that the government owns, i.e., takes primary responsibility for, the problem of supercomputing is _not_ to say that the government will fund all the research and development efforts to advance supercomputing that would not have occurred in the absence of the new government program. Given the right sort of government leadership, possibly including tough new legislation, we may anticipate combined investment by government _and_ industry in funding supercomputing advances. This is feasible provided the government changes the vendor incentive space so that computer-industry research and development decisions are no longer driven quite so exclusively by _current_ commercial market forces.

Second, to say that the government sets research priorities is _not_ to say that the government will back particular technological solutions, at least not until their merit has been demonstrated by extensive test and evaluation. Rather, the difficult task of drawing up a (constantly evolving) roadmap for the future of supercomputing consists of identifying a (constantly evolving) set of problems that need to be solved in order for supercomputing to advance, without any a priori bias as to which technological solutions best solve the identified problems. The government needs to articulate the major roadblocks that are holding supercomputing back, not specify the detailed solutions. But this leadership is not for the faint-hearted; the government will need to repeatedly redefine the nation's supercomputing research priorities. A muscular approach is required. Also, this is a _permanent_ activist role for the government.

"Getting Up to Speed" is a surprisingly comprehensive report that carefully explains why the government must assume primary responsibility for the problem of supercomputing, and then describes how it might go about doing this. In this article, we will summarize the main ideas, correcting any mis-statements that may have been slipped into the report by "retrograde forces". The most glaring mis-statement is the unsubstantiated assertion, often repeated, that the evolving supercomputer market will necessarily follow a particular pessimal path. As a side benefit, correcting this flawed prophecy makes government ownership of the problem of supercomputing infinitely more sustainable.

Do We Understand Application Diversity?

The committee refers to the main problem early in the executive summary: "The advances in mainstream computing brought about by improved processor performance have enabled some former supercomputing needs to be addressed by clusters of commodity processors. Yet important applications, some vital to our nation's security, require technology that is only available in the most advanced custom-built systems. We have been remiss in attending to the conduct of long-term research and development and to the sustenance of the industrial capabilities that will also be needed".

The familiar idea here is that some applications cannot be computed on even large-scale configurations of some high-performance computer architectures (generally speaking, on loosely-coupled systems). After much debate, there is reasonable agreement nowadays about the existence of two broad classes of high-performance computer architectures, which your correspondent refers to as high-bandwidth and low-bandwidth systems, and also about the existence of at least some important applications that require high-bandwidth systems.

What is totally absent from the report is any rational calculus to determine the intrinsic relative weights of potential high-bandwidth and low-bandwidth applications in general-purpose parallel computing. We all agree that any casual survey of which applications are being run _today_ would show the clear numerical dominance of low-bandwidth applications. But few people have asked whether this dominance is something intrinsic to the nature of general-purpose parallel computing or rather merely an artifact of the limited-capacity (i.e., low-bandwidth) machines that -- today at any rate -- dominate our shop floors.

You cannot deduce what a user community would like to compute from what it happens to compute.

Statements that low-bandwidth systems, as a general rule, will always satisfy the vast majority of parallel-computing applications are made repeatedly throughout "Getting Up to Speed" without any discernible attempt at substantiation. The authors even think the current majority of low-bandwidth applications within all supercomputer applications will necessarily increase! In your correspondent's humble opinion, this characterization of the necessary evolution of the supercomputing market is a dangerous myth. Certainly, it can't just be taken for granted, as if its truth were manifest.

Supercomputing is in trouble because a potentially significant fraction of parallel computing (namely, the set of high-bandwidth applications) risks not having the high-bandwidth systems it needs and because the current situation -- essentially, the current incentive space for vendors, together with some remarkably stable nonuniform performance-scaling trends -- in which technological advances are driven by _perceptions_ of what the commercial marketplace will reward, seems guaranteed to foreclose any possibility of meaningful innovation in supercomputing technology.

We need to understand the extent to which, in supercomputing, the attitude "The market doesn't want it; I won't offer it" is a self-fulfilling prophecy. If you only sell low-bandwidth systems, the users who make up the supercomputer market will make do -- for a while. Indeed, the report shows that, because of nonuniform performance scaling, the current situation is not sustainable. So, assuming the possibility of government leadership, why not work now to change the supercomputing market? Vendors might be _amazed_ to learn what a broad still-to-be-educated market truly wants, if only it had a better understanding of what genuine supercomputing is and/or thought there was some chance of getting it.

We can easily explain why system 's' does not compute application 'a' to the satisfaction of user community 'u' (typically, this occurs if 'a' suffers from _latency disease_ when it runs on 's'). The latency to access local memory through local interconnect is quite large when measured in processor cycles. The latency to access global memory through system interconnect is considerably larger (for decent-size configurations, anyway). Now, it may be that 'a' cannot be localized on 's' with the result that there is significant short-range or long-range communication. It may also be that the required communication cannot be parallelized on 's' with the result that some critical processing resources lie fallow while waiting for high-latency communication operations to complete. This is latency disease.

Like communication, synchronization is another source of latency disease. This is obvious. Efficient parallel computing requires that large numbers of parallel activities can share data well, i.e., cheaply, and can also synchronize well. (No machine-wide barrier synchronizations, please!). Given sufficient task variability, load balancing can be another significant issue. Bandwidth is the starting point for solving any of these problems.

Moreover, it may be that it is hard to program application 'a' on system 's' with the result that considerable time is spent getting 'a' up and running. Fragmented memory, i.e., the programming model used in MPI message passing, is the commonest cause of programming-difficulty disease. Since _time to solution_ is the sum of programming time and execution time, it may be that the utility function of user community 'u' assigns little value to a solution obtained after such a long time. It may even be that 'a' cannot be computed at all on 's' (i.e., the utility function assigns a value of zero to the solution). The last two are examples of time-to-solution disease.
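The time-to-solution argument can be made concrete with a minimal sketch. The sum of programming time and execution time comes from the text above; the linear-decay utility function, the deadline, and every number below are hypothetical assumptions chosen only for illustration.

```python
def time_to_solution(programming_hours, execution_hours):
    """Time to solution is the sum of programming time and execution time."""
    return programming_hours + execution_hours


def deadline_utility(tts_hours, deadline_hours, full_value=1.0):
    """Hypothetical utility function (an assumption, not from the report):
    value decays linearly with time to solution and is zero at or past
    the deadline that user community 'u' cares about."""
    if tts_hours >= deadline_hours:
        return 0.0
    return full_value * (1.0 - tts_hours / deadline_hours)


# Hypothetical systems: 'hard' runs fast but is painful to program,
# 'easy' runs more slowly but is far quicker to get up and running.
DEADLINE_HOURS = 2400.0
hard = time_to_solution(programming_hours=2000.0, execution_hours=100.0)
easy = time_to_solution(programming_hours=500.0, execution_hours=300.0)

for name, tts in (("hard-to-program", hard), ("easy-to-program", easy)):
    value = deadline_utility(tts, DEADLINE_HOURS)
    print(f"{name}: time to solution {tts:.0f} h, utility {value:.3f}")
```

Under these assumed numbers the system with the shorter execution time yields the less valuable solution, because programming time dominates the sum.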

Do Commodity Processors Have A Special Character?

Commodity processors, by definition, are designed for a broad market and are manufactured in large numbers. At present, because of a particular reading of the potential commercial market, commodity processors are optimized for applications that exhibit significant spatial and temporal locality. As a result, commodity processors have no good mechanisms for increasing the rate at which operands can be transferred between the processor and either the local or the global memory. Commodity processors have been optimized to work well when locality does away with the need for such communication.

Similarly, at present, because of a historical reluctance to rely on locality for performance, custom processors are optimized for applications where there is significant local or global communication. Consider communication to local memory through local interconnect. A system built from custom processors will provide high bandwidth to local memory -- this is essential -- and will also provide some parallelism mechanism that sustains high memory-reference concurrency (many outstanding loads in every cycle) in the face of different memory-access patterns. Concurrency is necessary to turn potential operand bandwidth into actual operand bandwidth. Whether this concurrency is provided by vector processors or multithreaded processors or streaming processors (or something entirely new) is secondary.

Given an application's need for significant local communication, we must have both high hardware bandwidth to local memory and high memory-reference concurrency to local memory to sustain that hardware bandwidth. A system is _locally balanced_ if it can sustain its local hardware bandwidth.
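The link between bandwidth and concurrency follows from Little's law: the number of references that must be outstanding equals bandwidth times latency. The sketch below applies that relation; the function names and all machine parameters are hypothetical, and the balance test is only one possible reading of the definition above.

```python
def required_concurrency(bandwidth_words_per_cycle, latency_cycles):
    """Little's law: outstanding references = bandwidth x latency."""
    return bandwidth_words_per_cycle * latency_cycles


def sustained_bandwidth(hw_bandwidth_words_per_cycle, max_outstanding, latency_cycles):
    """Bandwidth actually delivered when the processor can keep at most
    max_outstanding memory references in flight."""
    concurrency_limit = max_outstanding / latency_cycles
    return min(hw_bandwidth_words_per_cycle, concurrency_limit)


def is_balanced(hw_bandwidth_words_per_cycle, max_outstanding, latency_cycles):
    """One reading of 'balanced': concurrency suffices to sustain the
    full hardware bandwidth."""
    needed = required_concurrency(hw_bandwidth_words_per_cycle, latency_cycles)
    return max_outstanding >= needed


# Hypothetical node: 2 words/cycle of hardware bandwidth, 200-cycle latency.
HW_BANDWIDTH = 2.0
LATENCY = 200.0

print("outstanding loads needed:", required_concurrency(HW_BANDWIDTH, LATENCY))
for outstanding in (8, 64, 400):
    delivered = sustained_bandwidth(HW_BANDWIDTH, outstanding, LATENCY)
    balanced = is_balanced(HW_BANDWIDTH, outstanding, LATENCY)
    print(f"{outstanding:4d} outstanding -> {delivered:.2f} words/cycle, balanced={balanced}")
```

The same arithmetic applies to global memory; only the latency and bandwidth figures change.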

Now, consider communication to global memory through system interconnect. Although the abstract performance problems of local and global communication are identical, in practice the differences can be significant. First, we would like to provide high bandwidth to global memory. This is certainly possible in principle. High-speed electrical- and optical-signaling technologies enable high raw bandwidth to be provided at a reasonable cost. High-radix routers enable tens of thousands of nodes to be connected with just a few hops. However, the cost and power of providing bandwidth is decreasing more slowly than the cost and power of providing logic, with the result that the total system cost and power budgets of a large high-bandwidth system may easily be dominated by the cost and power of the system interconnect.

Also, the report claims that "it is prohibitively expensive to provide flat [network/] memory bandwidth across a supercomputer", with the resulting claimed need to accept the inevitability of severe bandwidth taper. This assertion may be criticized as needing _several_ qualifications (e.g., with respect to the targeted performance regime), but there can be no question that providing affordable scalable _reasonably uniform_ exceptional global bandwidth in the system interconnect is a formidable engineering challenge. It is likely that this problem will always be with us as we move to larger systems and higher performance regimes over time.

Second, given exceptional global bandwidth in the system interconnect, how can it be sustained? Consider a large parallel system built using vector processors. If the system has a global (possibly distributed) shared memory (i.e., if we are dealing with a true vector multiprocessor such as the Cray X1), then vector loads can provide the memory-reference concurrency to help tolerate the latency of global communication. However, if there are vector SMP nodes in the system that perform global (i.e., inter-node) communication using MPI (i.e., if we are dealing with a vector multiprocessor multicomputer such as the Earth Simulator), then the only appreciable memory-reference concurrency comes from really large messages, if the application allows them. Any application with significant global communication and small messages running on such a system will suffer from latency disease.
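To see why small messages hurt, consider the standard latency-plus-transfer-time cost model for a single message. This model is your correspondent's illustration, not something taken from the report, and the latency and peak bandwidth below are hypothetical.

```python
def effective_bandwidth(message_bytes, latency_s, peak_bytes_per_s):
    """Delivered bandwidth for one message when nothing overlaps the
    latency: bytes moved divided by (latency + transfer time)."""
    transfer_time_s = latency_s + message_bytes / peak_bytes_per_s
    return message_bytes / transfer_time_s


def half_peak_message_size(latency_s, peak_bytes_per_s):
    """Message size at which effective bandwidth reaches half of peak."""
    return latency_s * peak_bytes_per_s


# Hypothetical interconnect: 5 microseconds of latency, 1 GB/s peak.
LATENCY_S = 5e-6
PEAK_BYTES_PER_S = 1e9

print("half-peak message size (bytes):",
      half_peak_message_size(LATENCY_S, PEAK_BYTES_PER_S))
for size in (8, 1024, 1_000_000):
    fraction = effective_bandwidth(size, LATENCY_S, PEAK_BYTES_PER_S) / PEAK_BYTES_PER_S
    print(f"{size:>9d}-byte message: {fraction:.2%} of peak")
```

With these assumed parameters, a single-word message sees only a tiny fraction of the peak bandwidth, while a megabyte message sees nearly all of it.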

In the same way, a scalar uniprocessor or multiprocessor multicomputer (i.e., a parallel system that performs global communication with MPI) -- even one with exceptional global bandwidth -- still has some of the performance properties of a classical MPP. For decent performance, we must still both localize the computation as much as possible and keep inter-node messages either rare or large. People seem to forget what made the Cray T3E special: it had decent bandwidth (for its day) _and_ it had external logic (the E-registers) that allowed the Alpha processor to have a reasonable number of outstanding loads (compared to a conventional scalar processor).

Given an application's need for significant global communication, we must have both high hardware bandwidth to global memory and high memory-reference concurrency to global memory to sustain that hardware bandwidth. A system is _globally balanced_ if it can sustain its global hardware bandwidth.

Processors differ in having or not having scalable latency-hiding mechanisms to sustain reasonable performance on nonlocal applications. This doesn't concern you if all your applications are completely local. But suppose the supercomputing market, as a result of government intervention, evolves in a positive direction, and processors with latency-hiding mechanisms become attractive to a broader market and are manufactured in larger numbers. Broad appeal and high volume are what make something a "commodity" component. Logically speaking, the architectural feature of a processor's being able to sustain high memory bandwidth is _orthogonal_ to whether that processor has broad appeal and high volume. Assuming a necessary distinction between the core execution models of custom and commodity processors makes sense only if you have rigid (and pessimistic) views of how the supercomputing market must necessarily evolve, even in the best-case scenario where the government assumes primary responsibility for the problem of supercomputing.

Your correspondent suggests that a sustainable future for supercomputing will be further enabled by vigorous efforts to break down the distinctions between market-driven, low-bandwidth commercial supercomputing and government-driven, high-bandwidth national-security supercomputing -- between what today we call commodity and custom supercomputing. (Both scientific and industrial supercomputing have a broad range of requirements spanning these two extremes). In short, we seek a diverse supercomputing market, with a balanced mix of low-bandwidth and high-bandwidth applications in _each_ community, across the broadest possible range of user communities.

This goal is obtainable for two reasons: 1) forceful government action can change the supercomputing market for the better, and 2) many scientific and industrial (and even some traditional commercial) customers will find it increasingly difficult to meet their supercomputing needs with conventional systems. We can see this by projecting current technology trends. In a word, the intense pain felt by the national-security community today will become much more widely shared as nonuniform technology scaling makes today's locality, which will not scale as hoped, no longer a trustworthy source of tomorrow's needed performance.

To repeat, the dichotomy between market-driven commercial computing and government-driven national-security computing is not set in stone, but rather is a function of the value placed on high-bandwidth computing by the broader supercomputer market. This valuation itself is not set in stone or governed by necessary rules, but rather can be modified over time by appropriate government intervention, possibly including legislation that constrains what computer vendors must do. Moreover, the current addiction to locality as the only source of performance will be severely tested -- as we move forward -- by nonuniform technology scaling, even among users who are complacent today ("If it ain't broke, don't fix it").

What Do Current Trends Portend?

Most trends in high-performance computing are consequences of nonuniform performance scaling among various components. The NRC report notes: "In particular, the arithmetic performance increases much faster than the local [or] global bandwidth of the system". Both local and global latency, "when expressed in terms of the instructions [that could be] executed in the time it takes to communicate to local [or global] memory", are increasing rapidly.

Nonuniform scaling of technology "poses a number of challenges for supercomputer architecture, particularly for those applications that demand high local or global bandwidth". In particular, what is tolerable today may not be tolerable tomorrow. "For example, if processor speed increases but [system] interconnect is not improved, then global communication may become a bottleneck. At some point, parametric [i.e., just letting technology scaling happen,] evolution breaks down and qualitative changes to hardware and software are needed".

The trends are stark. "The divergence of memory speeds and computation speeds ... will ultimately force an innovation in architecture. By 2010, 170 loads will need to be in flight at the same time to keep [local-]memory bandwidth busy while waiting for memory latency, and 1,600 floating-point arithmetic operations can be performed during this time. By 2020, 780 loads must be in flight, and 94,000 arithmetic operations can be performed while waiting on memory. These numbers are not sustainable". Indeed, "it is clear that systems derived using simple parametric evolution are already greatly strained and will break down completely by 2020". Your correspondent would modify one quotation to read: "Changes in 1) processor and system architectures, and in 2) programming languages and systems, are required to hide large amounts of latency with parallelism and _also_ to enhance the locality of computations".
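The quoted projections can be turned into a back-of-the-envelope ratio. The four figures below are the report's, as quoted above; the interpretation is your correspondent's assumption, namely that each load fetches one word and that loads and arithmetic operations are counted over the same latency window, so that their ratio is the local bandwidth per flop.

```python
# Figures quoted from the report for local memory.
PROJECTIONS = {
    2010: {"loads_in_flight": 170, "flops_during_latency": 1_600},
    2020: {"loads_in_flight": 780, "flops_during_latency": 94_000},
}


def words_per_flop(year):
    """Implied local bandwidth in words per flop, assuming one word per
    load and a common latency window (an assumption, see lead-in)."""
    p = PROJECTIONS[year]
    return p["loads_in_flight"] / p["flops_during_latency"]


def growth(key):
    """Factor by which a quoted figure grows from 2010 to 2020."""
    return PROJECTIONS[2020][key] / PROJECTIONS[2010][key]


for year in sorted(PROJECTIONS):
    print(f"{year}: about {words_per_flop(year):.4f} words per flop")
print(f"loads in flight grow by a factor of {growth('loads_in_flight'):.1f}")
print(f"flops per memory latency grow by a factor of {growth('flops_during_latency'):.1f}")
```

On this reading, the arithmetic that fits inside one memory latency grows more than ten times faster than the concurrency needed to keep memory busy, which is the nonuniform scaling the report describes.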

But are commodity processors and/or interconnects, if market drivers of technological innovation continue as they are, likely to make great strides in latency-hiding and/or locality-enhancing mechanisms? (Nota bene: architectures underlie languages). If not, an even smaller fraction of all scientific applications -- compared to the fraction that "gets by" today -- will find systems optimized for low-bandwidth commercial applications suited to their needs. As the processor-memory performance gap scales, you need to scale proportionally either the amount of exploitable locality or the ability to tolerate latency. In the general case, i.e., for something other than an embarrassingly localizable application, you need to do both. A general-purpose parallel computer must be able to extract performance from an appropriate mix of both parallelism and locality to deal with applications of different types.

Scaling locality is the tougher problem. Since we need all the performance help we can get, design of new parallel programming languages that allow programmer specification of locality -- without falling into the MPI trap of forcing the programmer to specify everything -- is required. But the fundamental optimization decision underlying high-bandwidth systems is still correct: You should make latency tolerance your performance workhorse and then exploit whatever locality you can get your hands on.

What The Government Must Do

Observing evolutionary trends that make the status quo unsustainable, the committee wants the government to _force_ innovation so that scaling can continue. "The growing gap between processor performance and global bandwidth and latency is also expected to [require] innovation. By 2010, global bandwidth will fall to 0.008 words/flop and latency will require 8,700 flops to cover. These numbers are problematic for all but the most local of applications. To overcome this global communication gap requires innovation in architecture to provide more bandwidth and lower latency and in programming [languages and] systems, and applications, to improve locality".

"Significant investments in both basic and applied research are needed now to lay the groundwork for the innovations that will be required over the next 15 years to ensure the viability of high-end systems". Even low-end systems "will eventually run out of steam without such investments".

"Given that leadership in supercomputing is essential to the government, that supercomputing is expensive, and that market forces alone will not drive progress in supercomputing-directed technologies, it is the role of government to ensure that supercomputing appropriate to our needs is available both now and in the future. That entails both having the necessary activities in place in an ongoing fashion and providing the funding to support those activities". "Progress in supercomputing depends critically on a sustained investment by the government in basic research, in prototype development, in procurement, and in ensuring the economic viability of suppliers".

All this goes without saying. The only thing that requires further thought is how to implement a _supercomputing roadmap_. We quote the pertinent recommendation: "The government agencies responsible for supercomputing should underwrite a community effort to develop and maintain a roadmap that identifies [the] key obstacles and synergies in all of supercomputing".

However, when the committee speculates on some possible outcomes of the roadmap process, their thinking is dreadfully conventional. A clear roadmap is required to anchor any integrated plan for federal investment. It must identify the roadblocks rather than the solutions. Surprisingly, the committee's roadmap speculations come dangerously close to specifying the solutions.

This is nonsense. The critical first sentence of the roadmap should read: "In order to level the playing field, we hereby declare our interest in any innovative technological solution to any of the following fundamental problems, which we have identified as major roadblocks that threaten the continued viability of supercomputing". For example, we are obligated to articulate the basic communication and synchronization problems that must be solved for supercomputing to advance. However, we must _not_ specify in the roadmap any of the solutions, such as: "We probably need full-custom systems for these applications". That may be true but it is not the purpose of a roadmap. Rather, we identify and prioritize fundamental problems, requirements, and objectives, all the while keeping an open mind as to what the solutions might be. DARPA's HPCS program has been very good in doing this.

A supercomputing roadmap for 2005 might include the following problems: How do we increase local and global bandwidth? How do we increase processor and system parallelism? How do we increase processor and system locality? How do we design parallelism and locality mechanisms that do not interfere with one another? How do we program our high-performance machines? What programming model is required to increase programmer productivity? What is an integrated solution to both latency disease -- which includes locality enhancement as a subproblem -- and programming-difficulty disease? Of course, as the name implies, a roadmap is more than just a list of problems. (See the full NRC report for details).

When DARPA was funding basic research in parallel computing in the 1980s and 1990s, it certainly set research priorities (and dismissed some lines of investigation as fruitless), but it also leveled the playing field in that it allowed many different ideas to compete. We have to get DARPA back into the game of supporting basic research in supercomputing. The HPCS program is excellent, but its goal is not to provide broad support for basic research in supercomputing. Of course, the current HPCS teams are engaged in at least some research activity. Still, your correspondent wishes that _more_ basic research could be funded within the confines of the HPCS program; it would give us something now.

But we also need something like HECRTF's Joint Program Office -- to use the original IHEC term -- to assume primary responsibility for developing and maintaining the supercomputing roadmap. This may be a hard task but without a set of fundamental supercomputing problems agreed upon by all the relevant federal agencies, there can be no integrated federal investment plan to solve them. The difficulty today is that, when people focus on supercomputing at all, they address problems that are without interest -- for the most part.


The committee is basically correct in their recommendation that the government must assume primary responsibility for the problem of supercomputing. Their task in writing this report was essentially to marshal irrefutable arguments to rebut those groups who -- either through regrettable ignorance or through masked commercial interest -- deny that supercomputing is in deep trouble. Breaking the impasse in this _ideological war_ is a key prerequisite for getting both government leadership and sustained federal funding.

Reduced to its simplest terms, the committee's argument runs as follows: 1) The government has an irreducible core of high-bandwidth national-security applications that it must protect. 2) While the majority of computer vendors offer low-bandwidth, locality-dependent supercomputers, only a few offer high-bandwidth, parallelism-dependent supercomputers -- because current commercial market forces drive most computer vendors' decisions _not_ to invest in supercomputing-directed research and development. 3) Basic and applied research in supercomputer-enabling technologies is at a historic low; this research shortfall puts the long-term viability of both high-end and low-end parallel systems at risk. High-bandwidth systems are (currently) at greatest risk because the (current) market for such systems is too small. 4) Simple extrapolation of the observed nonuniform performance scaling across distinct component technologies demonstrates that the status quo is not sustainable: absent significant innovations in processor architecture, system-interconnect component technologies and topologies, system software, programming languages, etc., etc., "parametric evolution of [conventional computer systems] is unsustainable, and current machines have already moved into a problematic region of the design space".

What needs to be added to the committee's report? Basically, the argument from nonuniform performance scaling contradicts the committee's prophecy that low-bandwidth applications will inevitably dominate the supercomputing market, now and forever. Consider that national-security computing is similar to emergency surgery while scientific and industrial computing is similar to elective surgery. Elective supercomputing means that most user communities downsize their computing goals and objectives to match the systems they can afford to procure and program.

Now, consider a scientific application that is just "getting by". That is to say, by careful programming, sufficient spatial and temporal locality has been obtained so that the absence or weakness of latency-hiding mechanisms does not cause the application to suffer overmuch from latency disease.

But let the so-called "processor-memory performance gap" (where is the interconnect in this phrase?) grow sufficiently and suddenly the scientific application is no longer getting by at all. In fact, either scaling of the application or scaling of the technology tends to increase the relative performance dependence on parallelism rather than on locality (because, in general, heroic locality scaling is impossible).

What this means is that, over time, an application that was (barely) suited to a low-bandwidth, locality-dependent system becomes _more suited_ to a high-bandwidth, parallelism-dependent system (or at least to a machine that has more of the attributes of such a system).

Your correspondent sees the possibility of a virtuous circle here. Government leadership can ensure the availability of high-bandwidth machines, simply by ensuring that sufficient research and development is done on their required component technologies. This makes it possible in principle to increase the supply of high-bandwidth systems. Because of nonuniform performance scaling, low-bandwidth applications inevitably change over time to take on more of a high-bandwidth character. Any increased supply of high-bandwidth machines would cause at least some user communities to dust off their postponed (possibly unwritten) high-bandwidth applications, which they had never dreamed of being able to run. The rational economics here is not price/performance but rather total cost of ownership measured against the increased utility of faster solutions (or solutions obtained for the first time).

Low-bandwidth systems evolve to take on more of a high-bandwidth character as the supercomputer market shifts. Finally, even some traditional commercial user communities imagine and exploit high-bandwidth applications to add value to their companies. What began as a federal initiative, with planning, funding, and legislation, gains traction in the private sector as more and more user communities see definite value in high-bandwidth systems. Supercomputing for the few has become supercomputing for the many.

There is a third potential misconception here. If supercomputing isn't over today, it won't be over in 15 years, or after _any_ finite interval of time. Supercomputer design, indeed, all of computer architecture, is a never-ending story, not of incremental improvements but of major innovations. There is no point in achieving healthy supercomputer diversity -- in contrast to the uniformly stagnant status quo -- by precious government intervention only to sink back into a new unity (i.e., a new complacency). But what might easily happen is that today's exceptional supercomputing becomes tomorrow's ordinary supercomputing. It is foolish to prophesy that high-parallelism processors will never move into mid-range systems (or even low-end systems). However, if the distinction between (current) high-bandwidth systems and (current) low-bandwidth systems begins to soften -- if this means anything more than raising the bar for what constitutes a high-bandwidth system -- then we will need to reinvent this distinction (or some other).

Enough of this idealism. Take home the minimal message: if the government takes charge of the problem of supercomputing, it could do good things.

The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for these commentaries. Replies are welcome and may be sent to HPCwire editor Tim Curns at tim@hpcwire.com.
