HPCwire
 The global publication of record for High Performance Computing / June 4, 2004: Vol. 13, No. 22


Features:

REVITALIZING HECRTF: A FOCUSED PLAN FOR HIGH-END COMPUTING
Commentary from the High-End Crusader

The administration's High-End Computing Revitalization Task Force (HECRTF) recently issued the public version of its final report, entitled "Federal Plan for High-End Computing". One hesitates to criticize any plan whose HEC Research and Development component reads: "The Task Force recommends first and foremost a coordinated, sustained research program over 10 - 15 years to overcome major technology barriers that limit effective use of high-end computer systems". After all, this is a very good thing. Even so, the HECRTF plan is fundamentally ambiguous and lacks both a sense of urgency and any real commitment to significant change. We---and Congress---deserve something better.

Of course, credit should be given for achieving a consensus of sorts among a disparate group of federal agencies and for getting OSTP/OMB to support as much as they did, even if the actual funding amounts in the original task-force version were carefully excised from the public version. One can even argue that the proper role of the Task Force was merely to endorse the technical content of the CRA-managed "Workshop on: The Roadmap for the Revitalization of High-End Computing" rather than deeply assimilating it and relating it to the specific recommendations in the federal plan.

Nevertheless, it seems prudent to ask whether the vagueness of the federal HECRTF plan, at all but the highest levels of abstraction, is so pervasive as to effectively render this federal plan a _non-plan_. For this purpose, we propose to compare the federal HECRTF plan to two earlier reports: "Report on: High-Performance Computing for the National-Security Community" and "Workshop on: The Roadmap for the Revitalization of High-End Computing".

The federal HECRTF plan has many admirable elements. Among these are: its emphasis on federal funding for HEC R&D; its argument for leadership-class systems; its push to make high-end computing available to the "have nots"; its support of rational benchmarking; and its concrete suggestions for procurement reform. We all agree that these are very good things.

Parts of the executive summary are refreshingly clear. "In the early 1990s, the Federal government adopted a strategy of pursuing high-end computing capability based on systems built from commercial-off-the-shelf (COTS) components. In the absence of clear evidence against this strategy, the promise of high aggregate performance at relatively low cost made procurement of COTS-based systems a sensible and appropriate course of action. We now have evidence that there are applications of national importance that would benefit significantly from an alternative to COTS-based solutions. Therefore, research and development efforts in alternative architectures and enabling technologies are needed to ensure U.S. leadership in high-end computing". This is fine.

The federal HECRTF plan has three primary components: 1) Standing up a coordinated, sustained research and development program in these alternative architectures and enabling technologies; 2) Providing high-end computing resources across the full range of critical federal missions, including making HEC available to "have nots", dealing with the oversubscription of current resources, and standing up systems powerful enough to solve many important large-scale problems; and finally 3) Setting up several pilot projects to rationalize the federal procurement process, which variously involve more rational benchmarks, better cost models, and new approaches to sharing procurement processes across agencies. Again, at this level of abstraction, everything is fine.

  • The IHEC Plan

The fundamental goal in the "National-Security Community" report is to rebuild and sustain a strong industrial base in high-end supercomputing. To this end, the report recommended a multi-element program, called the Integrated High-End Computing (IHEC) Program, comprising the following elements: applied research, advanced development, and engineering and prototype development. (We might add test and evaluation as a separate element).

The _applied research_ element focuses on developing the fundamental concepts in high-end computing and creating a pipeline of new ideas and graduate-level expertise. The _advanced development_ element focuses on selecting and refining innovative technologies and architectures for potential integration into high-end systems. The _engineering and prototype development_ element focuses on building operational prototypes and system-level testbeds.

Furthermore, high-end computing laboratories are needed. These laboratories will fill a critical capability gap in testing system software on dedicated large-scale platforms, supporting the development of software tools and algorithms, developing and advancing benchmarking and modeling, simulating system architectures, and conducting detailed technical requirements analysis.

Underwriting only that research, development, and engineering that industry will not conduct, the program has a base option slowly reaching $110 million per year and a progressive-level program eventually requiring $280 million per year. The IHEC program is to be executed by a Joint Program Office and staffed by the participating national-security agencies. High-end supercomputing procurements were not included as an element within this program.

The IHEC report defines an HEC R&D agenda when it observes (pp. 37-38) that, driven by distinct application requirements in distinct segments of the supercomputer market, there are two very different approaches to building high-capability systems. The bulk of the high-performance market, perhaps 98%, is occupied by mid-range servers that have been optimized to suit commercial applications such as transaction processing and information retrieval. These mid-range servers are, technically speaking, balanced _low-bandwidth_ systems in which weakly parallel processors are coupled with low-bandwidth global system interconnects. These balanced low-bandwidth system architectures more than adequately match the application requirements of most commercial applications.

Weakly parallel, i.e., conventional, processors have also been called latency-intolerant processors by fans of multithreaded multiprocessors, and processors unable to sustain large numbers of outstanding memory references by fans of vector supercomputers. They were first characterized by Burton Smith in a 1990 talk "The End of Architecture".

Smith wrote, "What's wrong with [conventional] processors? What architectural shortcoming is preventing their use in general-purpose parallel machines? The answer is straightforward: [their] inability to tolerate unpredictable fine-grained latency from any source, notably from the use of shared memory or from fine-grained synchronization". (Since weakly parallel processors do not sustain a high rate of communication requests, they are easily balanced by a low-bandwidth global system interconnect).

Balanced low-bandwidth systems do not scale well to large configurations for applications that engage in significant amounts of long-range communication: lacking a thoroughgoing mechanism to tolerate the latency of long-range communication requests, weakly parallel processors spend most of their time waiting for data to operate on.
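The waiting described above can be made concrete with a deliberately simplified model. Assume a processor does a fixed amount of work per remote reference and can overlap a fixed number of outstanding references. Both the model and the numbers below are illustrative assumptions, not figures from the IHEC report.

```python
def processor_utilization(outstanding_refs, work_ns_per_ref, latency_ns):
    """Fraction of time spent computing when up to `outstanding_refs`
    remote references can overlap (simplified latency-hiding model)."""
    busy = outstanding_refs * work_ns_per_ref
    return min(1.0, busy / (work_ns_per_ref + latency_ns))


# Hypothetical values: 10 ns of useful work per reference, 990 ns remote latency.
WORK_NS = 10
LATENCY_NS = 990

weakly_parallel = processor_utilization(1, WORK_NS, LATENCY_NS)      # one reference at a time
strongly_parallel = processor_utilization(128, WORK_NS, LATENCY_NS)  # 128 references in flight

print(f"weakly parallel:   {weakly_parallel:.0%} busy")
print(f"strongly parallel: {strongly_parallel:.0%} busy")
```

Under these assumed numbers, the processor that issues one reference at a time is idle almost all of the time, while the one that keeps 128 references in flight never waits. The real picture is more complicated, but the direction of the effect is the one the report describes.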

This is the stark reality the HECRTF plan delicately refers to (p. 13) in one of its key assertions: "With industry focused on the lucrative market for servers, ..., the HEC resources provided by industry have consisted of very large collections of processors designed for smaller systems in the server market. Unfortunately, these massive multiprocessor systems have proven exceptionally difficult to program, and achieving high levels of performance for some important classes of applications has been problematic".

What _would_ be required to achieve high levels of performance on these important classes of applications are balanced _high-bandwidth_ systems, in which strongly parallel processors (think vectors or multithreading) are coupled with high-bandwidth global system interconnects. There is an important asymmetry here. The size of the mid-range server market easily supports the non-recurring engineering costs of developing large-scale balanced _low-bandwidth_ systems. The fundamental problem in high-end computing is that, taken together, the current size of the high-end server market plus the very modest current federal investments in high-end computing are _barely able_ to support the non-recurring engineering costs of developing large-scale balanced _high-bandwidth_ systems.

The IHEC report tries to provide _supercomputing leadership_ because it is deeply worried about the inadequate supply of large-scale high-bandwidth systems to the national-security community. The authors stress the immense engineering design efforts required to develop and periodically refresh appropriate strongly parallel processors and appropriate high-bandwidth global system interconnects. (In straightforward designs, high-bandwidth systems can use COTS memory components). The authors state (pp. 38-39): "[I]ndeed, for some memory and communications-bound applications, [high-bandwidth systems] are the only platforms that can execute the task. Further[more], programmers can quickly achieve a substantial fraction of the ultimate potential performance of an application on this type of machine without heroic optimization efforts".

Two vexing questions arise. What economic model would support the existence of this type of high-end machine? And will (almost) no such machines be produced unless the government shares in the non-recurring engineering costs and expands its role as a major customer?

It's not just national security that is at risk. The federal HECRTF plan rightly has a very broad-based conception of high-end computing as a strategic tool for science and technology leadership, drawing on motivating application examples in physics, nanotechnology, aerospace, life sciences, national security, earth and atmospheric sciences, and energy and the environment; it makes the valid point that high-end computing supports a very wide variety of unclassified and classified applications.

  • The Roadmap Vision

The "Roadmap" report expends considerable effort articulating the concept of a _custom-enabled architecture_. With almost all vendors comfortably settled into providing COTS-based systems, we need the full concept both to push further with next-generation custom-enabled architectures and to understand if there are intermediate points in the design space between pure COTS-based and pure custom-enabled systems, perhaps intermediate-bandwidth systems.

There are many questions. What are the key concepts in high-end computing? How do certain hardware mechanisms affect scalability and ease of programming? What does it mean to balance a processor with a given degree of parallelism? What is the fundamental difference between memory bandwidth and global system interconnect bandwidth? How do new device technologies allow one to create innovative structures with attributes that generalize conventional notions of parallelism and bandwidth? What are the true sources of performance degradation in conventional architectures, and what hardware mechanisms would be required to address each one? How do new parallel execution models affect the programmer's responsibility to perform global management of concurrent tasks and parallel resources? And finally, why are new parallel programming languages urgently required? (Answer: for productivity!).

Here are some of the fundamental opportunities enabled by custom architecture. The low spatial and power cost of VLSI floating-point arithmetic and other functional units allows the design of function-intensive structures (floating-point multipliers everywhere). This opens up two distinct design paths. On the one hand, we might explore clustered microarchitectures in which local register files are close to arithmetic functional units in order to exploit producer-consumer (intermediate-result) temporal locality. On the other hand, we might explore PIM-subsystem architectures in which portions of the computation alternate between computing on data spatially local to a thread and migrating a thread over long distances to reach new spatially local data on which to compute.

Indeed, since classical custom-enabled architectures have tended to downplay locality, we need to take a comprehensive look at various new forms of enhanced locality. New locality mechanisms could be as simple as designing a new compiler-managed cache architecture or as complex as designing the control mechanisms for shipping threads to data in PIM-subsystem architectures.

The largest single opportunity enabled by custom architecture is exceptional global bandwidth. If you accept the conventional wisdom that having a local memory is always a good thing, you probably distinguish between the _memory bandwidth_ between the processor and its local memory, and the global system _interconnect bandwidth_ between the processor and the remote memories. On the other hand, if you hash all your system memory to avoid memory contention (hot spots), then there is really only global interconnect bandwidth as a significant independent variable. Global bandwidth depends above all on the interconnect architecture and the capabilities of its component routers, but also on the speeds with which one can move data between processors and the network, and between memories and the network.
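To see why hashing memory removes hot spots, consider a minimal sketch that compares plain address interleaving with a hashed mapping for a strided access stream. The bank count, the stride, and the multiplicative hash are all assumptions chosen for illustration; they do not describe any particular machine's scheme.

```python
from collections import Counter

NUM_BANKS = 64  # hypothetical number of memory modules


def interleaved_bank(word_addr):
    """Plain low-order interleaving: bank is the address modulo the bank count."""
    return word_addr % NUM_BANKS


def hashed_bank(word_addr):
    """Illustrative 32-bit multiplicative hash; the top 6 bits select one of 64 banks."""
    return ((word_addr * 2654435761) & 0xFFFFFFFF) >> 26


def max_bank_load(bank_fn, addresses):
    """Largest number of references that land on any single bank."""
    return max(Counter(bank_fn(a) for a in addresses).values())


# A stride-64 stream: under plain interleaving every reference hits the same bank.
stream = [i * 64 for i in range(4096)]
hot = max_bank_load(interleaved_bank, stream)
spread = max_bank_load(hashed_bank, stream)

print("worst bank load, interleaved:", hot)
print("worst bank load, hashed:     ", spread)
```

With interleaving, the whole stream lands on a single bank. The hashed mapping scatters the same stream over many banks, which is why, once memory is hashed, no bank is "local" in any useful sense and global interconnect bandwidth is the variable that matters.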

For demanding applications, global bandwidth is the primary bounding condition (upper limit) on system capability. Such bandwidth is required to tolerate the latency of long-distance communication. (Ignore temporarily the very real communication needs at shorter space scales). Global bandwidth is also the system resource that is the most expensive to provide and the most difficult to engineer correctly (if you are aiming at exceptional bandwidth). For this reason, the decision on how much global bandwidth to provide for a given system is probably the single most important design decision for that system.

Of course, there is no point in providing exceptional global bandwidth unless you have some hardware mechanism that uses this bandwidth well to provide exceptional performance for demanding applications. The mechanism of choice is strongly parallel processors, which support many outstanding communication requests at the same time. Always assuming the presence of long-range communication, and assuming further that communication overhead has been minimized, any inadequate performance on an application has only _two_ possible sources: either the global bandwidth is too low, and the performance degradation is due to the fact that the system is bandwidth bound; or the global bandwidth is high but the processor parallelism is too low, and the performance degradation is due to the fact that the system is parallelism bound. Balance is matching the parallelism of your processors to the bandwidth of your network.
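The two sources of degradation can be expressed as a small classification sketch. It assumes that the communication rate a processor can sustain is capped both by the network bandwidth available to it and by the number of requests it keeps in flight divided by the latency. The function names and the sample configurations are hypothetical.

```python
def concurrency_limit(refs_in_flight, latency_ns, bytes_per_ref):
    """Bytes per ns a processor can request, given its in-flight references."""
    return refs_in_flight * bytes_per_ref / latency_ns


def sustained_rate(link_bytes_per_ns, refs_in_flight, latency_ns, bytes_per_ref):
    """Sustained rate is the smaller of the network cap and the concurrency cap."""
    return min(link_bytes_per_ns,
               concurrency_limit(refs_in_flight, latency_ns, bytes_per_ref))


def bottleneck(link_bytes_per_ns, refs_in_flight, latency_ns, bytes_per_ref):
    """Classify a configuration using the two sources named in the text."""
    demand = concurrency_limit(refs_in_flight, latency_ns, bytes_per_ref)
    if demand > link_bytes_per_ns:
        return "bandwidth bound"
    if demand < link_bytes_per_ns:
        return "parallelism bound"
    return "balanced"


# Hypothetical configurations (bytes per ns, requests in flight, ns, bytes).
cases = {
    "thin network, strongly parallel processor": bottleneck(1.0, 1000, 1000, 8),
    "fat network, weakly parallel processor": bottleneck(8.0, 16, 1000, 8),
    "fat network, strongly parallel processor": bottleneck(8.0, 1000, 1000, 8),
}
for name, verdict in cases.items():
    print(f"{name}: {verdict}")
```

In this toy model, only the third configuration is balanced: the processor issues exactly as much traffic as the network can carry, so neither resource is wasted.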

One might argue that this formulation is insufficiently general for next-generation custom-enabled architectures. The Cray Cascade supercomputer uses its global bandwidth both for frequent long-range communication (loads and stores) and for less-frequent long-range thread migration (remote thread creation). The compiler-determined choice between these two communication strategies depends on whether a subcomputation displays _high_ temporal locality, in which case its operations are marshaled into heavyweight threads for execution on "large-state" compute-intensive processors, or _little or no_ temporal locality (but possibly some spatial locality), in which case its operations are marshaled into lightweight threads for execution on PIM processors.

Any locality mechanism reduces the need for long-range communication, thereby relieving some of the pressure to tolerate its latency. In Cascade, there are two distinct locality mechanisms. In every compute-intensive ("heavyweight") processor, there are large amounts of register and cache storage to allow heavyweight threads to accumulate considerable thread state. In the standard way, temporal locality is used in heavyweight processors to relieve some of the pressure on processor parallelism ("latency-tolerance") mechanisms. In contrast, each PIM ("lightweight") processor has unusual bandwidth to on-chip memory to allow lightweight threads that have migrated there to take advantage of as much (local) computational state as happens to be nearby. In the _same way_, spatial locality is used in lightweight processors to relieve some of the pressure on _their_ processor parallelism mechanisms.

Nonetheless, processor parallelism is still required to tolerate the latency of both long-range communication and long-range thread migration. Locality, whether spatial or temporal, is at best a mitigating factor (although a very useful source of extra performance). The only asymmetry is that we readily agree to tolerate frequent long-range communication (if necessary) but limit the frequency of thread migration (for performance reasons) by compiler estimation of spatial locality and careful choice of hash-block size.

The "Roadmap" report provides examples of innovative custom architectures that exploit one or more of the potential opportunities discussed above.

The first example is spatially direct-mapped architectures that closely match the intrinsic control flow and data flow of the application kernel computation to the structures of functional units and their interconnection paths. Such "reconfigurable logic" could be generalized to create an adaptive system in which the changing needs of a computation---expressed in its program code---could help control resource management, task scheduling, and load balancing, all performed at runtime by the architecture and the runtime system, thereby removing a significant burden from the programmer.

The report views vectors and streaming as promising complementary hardware mechanisms to multithreading, sometimes exploited as parallelism mechanisms and sometimes as locality mechanisms. It also discusses processor-in-memory (PIM) architecture, explaining how the arithmetic logic units are tightly coupled to the memory row buffers, which allows exceptionally high-bandwidth access.

Like the "National-Security Community" report, the "Roadmap" report focuses on enabling and exploiting global bandwidth. Global networks for future HEC custom systems must exhibit a bisection bandwidth that is at least an order of magnitude greater than conventional systems, and will probably employ advanced technologies, such as high-speed signaling. Advanced network structures must be created. High-radix networks organized in small-buffering technologies will be deployable within a few years. A combination of processor parallelism mechanisms, including streams, vectors, and multithreading, will provide the large number of simultaneous in-flight communication requests per processor that are necessary to make good use of these enhanced global network resources.

  • The Federal HECRTF Report

The goals of this report are: 1) Make high-end computing easier and more productive to use; 2) Foster the development and innovation of new generations of high-end computing systems and technologies; 3) Effectively manage and coordinate federal high-end computing; and 4) Make high-end computing readily available to federal agencies that need it to fulfill their missions.

The core of the plan is a federally-funded HEC research and development initiative that is billed as "a coordinated, sustained research program over 10 - 15 years to overcome major technology barriers that limit effective use of high-end computer systems". Yet neither the financial commitment nor the research agenda is there. In what should have been the heart and soul of the document (pp. 13-20) on HEC research and development, all one finds is a vague sense that there are a few things here and there that need fixing, but nothing to get excited about.

The superficial overview of hardware technology challenges is embarrassing. As for software technology challenges, the Task Force flirts with the idea (p. 6) that a "common system software base [might] deliver needed improvements in sustained application performance, ...". This just isn't serious. Architectures barely deserve mention among the systems technology challenges. The hardware, software, and systems roadmaps (pp. 18-20) are hardly better.

In an appendix (p. 66), the Task Force considers establishment of a Joint Management Office (JMO) to oversee the R&D portfolio. The Task Force writes, "The Office would conduct integrated planning, solicitation preparation, project selection and execution, and progress reviews, ...". This is clearly the right governance structure for a high-end computing program. One critical question remains: How do we ensure that the Joint Management Office will provide vigorous leadership in high-end computing when the HECRTF Task Force has shown itself incapable of providing any leadership at all?

  • Towards A Focused Plan

The first observation is that DARPA's High-Productivity Computing Systems (HPCS) program enables research, development, and engineering that otherwise would not be conducted: it fills an _enormous_ need. Moreover, it is a polite fiction to claim that HPCS is only "bridge" funding until we get to quantum computing; more realistically, we should ask, who will fund the research required for the machine we need at the end of the _next_ decade? The first fundamental weakness of HECRTF is the absence of a long-term research component. The HEC R&D pipeline must be properly stoked for the foreseeable future. The federal government must make a _visible financial commitment_ to HEC R&D well beyond the expiration of the HPCS program in 2010.

The second observation is that no non-incremental research and development plan is meaningful without a clearly defined research agenda; the governance structure for a high-end computing program must include mechanisms to ensure effective leadership in the field of high-end computing. The track record of the federal government indicates that it does not speak with a consistent voice and cannot internally generate this leadership. The second fundamental weakness of HECRTF is that it neither proposes a suitable research agenda nor considers mechanisms adequate to generate such an agenda.

Who sets the agenda for HEC R&D? Presumably, the high-end users at mission agencies, government labs, large industrial corporations, major research centers, etc., articulate their high-end computing needs, i.e., the computational requirements of their most-critical applications. Then, computer vendors---the people with experience in building computer systems---look at their technology roadmaps and business plans and decide which of these needs they are willing to address. Academic experts and government officials can mediate and/or facilitate communication.

Both the IHEC report and the "Roadmap" report indicate that high-end computing needs can be articulated and that integrated planning can be performed. But the HECRTF report demonstrates that the federal government has not---apparently---agreed on a mechanism whereby an informed group of people (users, builders, experts, officials) can get together and set the agenda. In fact, HECRTF expurgates several promising detailed agendas from previous reports.

What are some of the research priorities your correspondent favors? An easy pick: explore computing substrates to replace silicon in the sunset years of Moore's law.

Perhaps the most pressing needs lie in the general area of computer systems: computer architectures, including the architectures of computer global system interconnection fabrics; programming languages and systems; and system software, including especially operating systems (and their instability in large-scale parallel machines).

Much has already been said in support of balanced high-bandwidth computer architectures. Still, your correspondent recommends a special push in the area of global system interconnects. New device technology (billions of transistors) enables new intelligent routers that in turn enable new interconnect architectures. Slightly further out, we will see new technology in place of the old (e.g., optical interconnects and switches in place of their electrical counterparts). But nobody has really pushed this area because of shortfalls in processor parallelism (or inadequate financial resources). Your correspondent would like to see a truly large-scale global system interconnect with _exceptional_ and as-uniform-as-possible bisection bandwidth, say, one suitable for use in a balanced petaflops-scale system. An interconnect with petabytes per second (PB/s) of bisection bandwidth has never been designed, much less built.
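A back-of-the-envelope sketch shows where a figure on the order of a petabyte per second comes from. It assumes a balance point expressed as global bytes moved per floating-point operation, and uniform random traffic in which roughly half of all references cross the bisection. Both assumptions, and both sample balance points, are illustrative and are not taken from the reports discussed here.

```python
PETA = 10**15


def bisection_bandwidth(peak_flops, global_bytes_per_flop, crossing_fraction=0.5):
    """Bytes per second that must cross the bisection under the stated assumptions."""
    return peak_flops * global_bytes_per_flop * crossing_fraction


# Assumed balance points for a one-petaflops machine.
high_bandwidth = bisection_bandwidth(PETA, 2.0)   # 2 global bytes per flop
low_bandwidth = bisection_bandwidth(PETA, 0.02)   # 0.02 global bytes per flop

high_pb_per_s = high_bandwidth / PETA
low_pb_per_s = low_bandwidth / PETA

print(f"high-bandwidth design: {high_pb_per_s:.2f} PB/s across the bisection")
print(f"low-bandwidth design:  {low_pb_per_s:.2f} PB/s across the bisection")
```

Under these assumptions, the high-bandwidth design needs about one petabyte per second across the bisection, two orders of magnitude more than the low-bandwidth design at the same peak rate.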

Another old problem that has suddenly become new and pressing is to design a very high-level, architecture-specific, shared-memory parallel programming language that reconciles programmability and performance. Such a language would address both barriers to productivity at once: programs that are hard to write and programs that run poorly. Why architecture-specific? Well, a language designer can conceive of new parallel programming abstractions that ease the programming burden. However, architecture and language are inextricably linked, and we need to design _specific_ hardware mechanisms in order to support these programming abstractions with high performance. Most production parallel programming languages presuppose a machine model, often an archaic one. It is time to accept the interdependence of architecture and language design. This is another area where vendor participation is essential.

Outside of high-end computing specifically, your correspondent prioritizes information-systems security, computer-aided reasoning for large-scale information management, and complexity management of computing systems architectures at all levels (from circuits to global-scale distributed systems).

It may be that high-end computing, and more generally computing itself, is going through a temporary fall-off in enthusiasm. There is less intellectual excitement because the field has ceased to focus on its real problems. The cure? Articulate the innovative visions and grand challenges in computing that alone can offer the kind of deep satisfaction enjoyed by people working in the physical (and biological) sciences. Computer-systems research has just begun.

Computing, and high-end computing, will excite people when they have regained their focus on the deep computer-science problems that really matter.


The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for the opinions here. Comments are always welcome and may be sent to HPCwire editor Tim Curns at tim@hpcwire.com.

