
REVITALIZING HECRTF: A FOCUSED PLAN FOR HIGH-END COMPUTING
Commentary from the High-End Crusader
The administration's High-End Computing Revitalization Task Force (HECRTF)
recently issued the public version of its final report, entitled "Federal
Plan for High-End Computing". One hesitates to criticize any plan whose
HEC Research and Development component reads: "The Task Force recommends
first and foremost a coordinated, sustained research program over 10 - 15
years to overcome major technology barriers that limit effective use of
high-end computer systems". After all, this is a very good thing. Even so,
the HECRTF plan is fundamentally ambiguous and lacks both a sense of urgency
and any real commitment to significant change. We---and Congress---deserve
something better.
Of course, credit should be given for achieving a consensus of sorts among a
disparate group of federal agencies and for getting OSTP/OMB to support as
much as they did, even if the actual funding amounts in the original
task-force version were carefully excised from the public version. One can
even argue that the proper role of the Task Force was merely to endorse the
technical content of the CRA-managed "Workshop on: The Roadmap for the
Revitalization of High-End Computing" rather than deeply assimilating it and
relating it to the specific recommendations in the federal plan.
Nevertheless, it seems prudent to ask whether the vagueness of the federal
HECRTF plan, at all but the highest levels of abstraction, is so pervasive as
to effectively render this federal plan a _non-plan_. For this purpose, we
propose to compare the federal HECRTF plan to two earlier reports: "Report on:
High-Performance Computing for the National-Security Community" and "Workshop
on: The Roadmap for the Revitalization of High-End Computing".
The federal HECRTF plan has many admirable elements. Among these are: its
emphasis on federal funding for HEC R&D; its argument for leadership-class
systems; its push to make high-end computing available to the "have nots"; its
support of rational benchmarking; and its concrete suggestions for procurement
reform. We all agree that these are very good things.
Parts of the executive summary are refreshingly clear. "In the early 1990s,
the Federal government adopted a strategy of pursuing high-end computing
capability based on systems built from commercial-off-the-shelf (COTS)
components. In the absence of clear evidence against this strategy, the
promise of high aggregate performance at relatively low cost made procurement
of COTS-based systems a sensible and appropriate course of action. We now
have evidence that there are applications of national importance that would
benefit significantly from an alternative to COTS-based solutions. Therefore,
research and development efforts in alternative architectures and enabling
technologies are needed to ensure U.S. leadership in high-end computing".
This is fine.
The federal HECRTF plan has three primary components: 1) Standing up a
coordinated, sustained research and development program in these alternative
architectures and enabling technologies; 2) Providing high-end computing
resources across the full range of critical federal missions, including making
HEC available to "have nots", dealing with the oversubscription of current
resources, and standing up systems powerful enough to solve many important
large-scale problems; and finally 3) Setting up several pilot projects to
rationalize the federal procurement process, which variously involve more
rational benchmarks, better cost models, and new approaches to sharing
procurement processes across agencies. Again, at this level of abstraction,
everything is fine.
The fundamental goal in the "National-Security Community" report is to rebuild
and sustain a strong industrial base in high-end supercomputing. To this end,
the report recommended a multi-element program, called the Integrated High-End
Computing (IHEC) Program, comprising the following elements: applied research,
advanced development, and engineering and prototype development. (We might
add test and evaluation as a separate element).
The _applied research_ element focuses on developing the fundamental concepts
in high-end computing and creating a pipeline of new ideas and graduate-level
expertise. The _advanced development_ element focuses on selecting and
refining innovative technologies and architectures for potential integration
into high-end systems. The _engineering and prototype development_ element
focuses on building operational prototypes and system-level testbeds.
Furthermore, high-end computing laboratories are needed. These laboratories
will fill a critical capability gap in testing system software on dedicated
large-scale platforms, supporting the development of software tools and
algorithms, developing and advancing benchmarking and modeling, simulating
system architectures, and conducting detailed technical requirements analysis.
Underwriting only that research, development, and engineering that industry
will not conduct, the program has a base option slowly reaching $110 million
per year and a progressive-level program eventually requiring $280 million
per year. The IHEC program is to be executed by a Joint Program Office and
staffed by the participating national-security agencies. High-end
supercomputing procurements were not included as an element within this
program.
The IHEC report defines an HEC R&D agenda when it observes (pp. 37-38) that,
driven by distinct application requirements in distinct segments of the
supercomputer market, there are two very different approaches to building
high-capability systems. The bulk of the high-performance market, perhaps
98%, is occupied by mid-range servers that have been optimized to suit
commercial applications such as transaction processing and information
retrieval. These mid-range servers are, technically speaking, balanced
_low-bandwidth_ systems in which weakly parallel processors are coupled with
low-bandwidth global system interconnects. These balanced low-bandwidth
system architectures more than adequately match the application requirements
of most commercial applications.
Weakly parallel, i.e., conventional, processors have also been called
latency-intolerant processors by fans of multithreaded multiprocessors, and
processors unable to sustain large numbers of outstanding memory references by
fans of vector supercomputers. They were first characterized by Burton Smith
in a 1990 talk "The End of Architecture".
Smith wrote, "What's wrong with [conventional] processors? What architectural
shortcoming is preventing their use in general-purpose parallel machines?
The answer is straightforward: [their] inability to tolerate unpredictable
fine-grained latency from any source, notably from the use of shared memory
or from fine-grained synchronization". (Since weakly parallel processors do
not sustain a high rate of communication requests, they are easily balanced
by a low-bandwidth global system interconnect).
Balanced low-bandwidth systems do not scale well to large configurations for
applications that engage in significant amounts of long-range communication:
lacking a thoroughgoing mechanism to tolerate the latency of long-range
communication requests, weakly parallel processors spend most of their time
waiting for data to operate on.
This is the stark reality the HECRTF plan delicately refers to (p. 13) in one
of its key assertions: "With industry focused on the lucrative market for
servers, ..., the HEC resources provided by industry have consisted of very
large collections of processors designed for smaller systems in the server
market. Unfortunately, these massive multiprocessor systems have proven
exceptionally difficult to program, and achieving high levels of performance
for some important classes of applications has been problematic".
What _would_ be required to achieve high levels of performance on these
important classes of applications are balanced _high-bandwidth_ systems, in
which strongly parallel processors (think vectors or multithreading) are
coupled with high-bandwidth global system interconnects. There is an
important asymmetry here. The size of the mid-range server market easily
supports the non-recurring engineering costs of developing large-scale
balanced _low-bandwidth_ systems. The fundamental problem in high-end
computing is that, taken together, the current size of the high-end server
market plus the very modest current federal investments in high-end computing
are _barely able_ to support the non-recurring engineering costs of developing
large-scale balanced _high-bandwidth_ systems.
The IHEC report tries to provide _supercomputing leadership_ because it is
deeply worried about the inadequate supply of large-scale high-bandwidth
systems to the national-security community. The authors stress the immense
engineering design efforts required to develop and periodically refresh
appropriate strongly parallel processors and appropriate high-bandwidth global
system interconnects. (In straightforward designs, high-bandwidth systems can
use COTS memory components). The authors state (pp. 38-39): "[I]ndeed, for
some memory and communications-bound applications, [high-bandwidth systems]
are the only platforms that can execute the task. Further[more], programmers
can quickly achieve a substantial fraction of the ultimate potential
performance of an application on this type of machine without heroic
optimization efforts".
Two vexing questions are, what economic model would support the existence of
this type of high-end machine, and will (almost) no such machines be produced
absent government sharing in the non-recurring engineering costs, as well as
the government expanding its role as a major customer?
It's not just national security that is at risk. The federal HECRTF plan
rightly has a very broad-based conception of high-end computing as a strategic
tool for science and technology leadership, drawing on motivating application
examples in physics, nanotechnology, aerospace, life sciences, national
security, earth and atmospheric sciences, and energy and the environment; it
makes the valid point that high-end computing supports a very wide variety of
unclassified and classified applications.
The "Roadmap" report expends considerable effort articulating the concept of
a _custom-enabled architecture_. With almost all vendors comfortably settled
into providing COTS-based systems, we need the full concept both to push
further with next-generation custom-enabled architectures and to understand if
there are intermediate points in the design space between pure COTS-based and
pure custom-enabled systems, perhaps intermediate-bandwidth systems.
There are many questions. What are the key concepts in high-end computing?
How do certain hardware mechanisms affect scalability and ease of programming?
What does it mean to balance a processor with a given degree of parallelism?
What is the fundamental difference between memory bandwidth and global system
interconnect bandwidth? How do new device technologies allow one to create
innovative structures with attributes that generalize conventional notions of
parallelism and bandwidth? What are the true sources of performance
degradation in conventional architectures, and what hardware mechanisms would
be required to address each one? How do new parallel execution models affect
the programmer's responsibility to perform global management of concurrent
tasks and parallel resources? And finally, why are new parallel programming
languages urgently required? (Answer: for productivity!).
Here are some of the fundamental opportunities enabled by custom architecture.
The low spatial and power cost of VLSI floating-point arithmetic and other
functional units allows the design of function-intensive structures
(floating-point multipliers everywhere). This opens up two distinct design
paths. On the one hand, we might explore clustered microarchitectures in
which local register files are close to arithmetic functional units in order
to exploit producer-consumer (intermediate-result) temporal locality. On the
other hand, we might explore PIM-subsystem architectures in which portions of
the computation alternate between computing on data spatially local to a
thread and migrating a thread over long distances to reach new spatially local
data on which to compute.
Indeed, since classical custom-enabled architectures have tended to downplay
locality, we need to take a comprehensive look at various new forms of
enhanced locality. New locality mechanisms could be as simple as designing a
new compiler-managed cache architecture or as complex as designing the control
mechanisms for shipping threads to data in PIM-subsystem architectures.
The largest single opportunity enabled by custom architecture is exceptional
global bandwidth. If you accept the conventional wisdom that having a local
memory is always a good thing, you probably distinguish between the _memory
bandwidth_ between the processor and its local memory, and the global system
_interconnect bandwidth_ between the processor and the remote memories. On
the other hand, if you hash all your system memory to avoid memory contention
(hot spots), then there is really only global interconnect bandwidth as a
significant independent variable. Global bandwidth depends above all on the
interconnect architecture and the capabilities of its component routers, but
also on the speeds with which one can move data between processors and the
network, and between memories and the network.
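The effect of hashing memory can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the 64 banks, the multiplicative hash, and the strided reference pattern are not taken from any of the reports or from any particular machine. It compares plain low-order interleaving with a hashed mapping when a program strides through memory with a stride equal to the number of banks.

```python
from collections import Counter

MASK64 = (1 << 64) - 1
# Odd 64-bit constant often used for multiplicative hashing (illustrative choice).
MULTIPLIER = 0x9E3779B97F4A7C15


def interleaved_bank(block, log2_banks):
    """Low-order interleaving: consecutive blocks go to consecutive banks."""
    return block & ((1 << log2_banks) - 1)


def hashed_bank(block, log2_banks):
    """Scramble the block number, then use the top bits as the bank index."""
    return ((block * MULTIPLIER) & MASK64) >> (64 - log2_banks)


def max_bank_load(blocks, bank_of, log2_banks):
    """Largest number of references that land on any single bank."""
    loads = Counter(bank_of(b, log2_banks) for b in blocks)
    return max(loads.values())


LOG2_BANKS = 6                    # 64 memory banks (hypothetical)
STRIDE = 1 << LOG2_BANKS          # worst-case stride for plain interleaving
references = [i * STRIDE for i in range(1024)]

worst_interleaved = max_bank_load(references, interleaved_bank, LOG2_BANKS)
worst_hashed = max_bank_load(references, hashed_bank, LOG2_BANKS)
print("busiest bank, interleaved:", worst_interleaved)
print("busiest bank, hashed:     ", worst_hashed)
```

With plain interleaving, every one of the 1024 references lands on the same bank, which is exactly the hot spot the text describes. The hashed mapping spreads them over several banks, at the price of giving up any notion of a "nearby" memory.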
For demanding applications, global bandwidth is the primary bounding condition
(upper limit) on system capability. Such bandwidth is required to tolerate
the latency of long-distance communication. (Ignore temporarily the very real
communication needs at shorter space scales). Global bandwidth is also the
system resource that is the most expensive to provide and the most difficult
to engineer correctly (if you are aiming at exceptional bandwidth). For this
reason, the decision on how much global bandwidth to provide for a given
system is probably the single most important design decision for that system.
Of course, there is no point in providing exceptional global bandwidth unless
you have some hardware mechanism that uses this bandwidth well to provide
exceptional performance for demanding applications. The mechanism of choice
is strongly parallel processors, which support many outstanding communication
requests at the same time. Always assuming the presence of long-range
communication, and assuming further that communication overhead has been
minimized, any inadequate performance on an application has only _two_
possible sources: either the global bandwidth is too low, and the performance
degradation is due to the fact that the system is bandwidth bound; or the
global bandwidth is high but the processor parallelism is too low, and the
performance degradation is due to the fact that the system is parallelism
bound. Balance is matching the parallelism of your processors to the
bandwidth of your network.
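One way to make this notion of balance concrete is Little's law: the data rate a processor can sustain is the number of requests it keeps in flight, times the size of each request, divided by the round-trip latency. The sketch below is a back-of-the-envelope model under that assumption. The function names and the numbers (8-byte requests, 1000 ns latency, 8 bytes/ns of network share per processor) are hypothetical and chosen only to make the arithmetic easy.

```python
def concurrency_limited_rate(outstanding, bytes_per_request, latency_ns):
    """Bytes per ns a processor can request, by Little's law."""
    return outstanding * bytes_per_request / latency_ns


def diagnose(outstanding, bytes_per_request, latency_ns, network_bytes_per_ns):
    """Classify a configuration using the two failure modes named in the text."""
    demand = concurrency_limited_rate(outstanding, bytes_per_request, latency_ns)
    if demand < network_bytes_per_ns:
        return "parallelism bound"   # network could carry more than is requested
    if demand > network_bytes_per_ns:
        return "bandwidth bound"     # processor could request more than is carried
    return "balanced"


def outstanding_for_balance(bytes_per_request, latency_ns, network_bytes_per_ns):
    """Requests in flight needed to keep the network share exactly full."""
    return network_bytes_per_ns * latency_ns / bytes_per_request


# Hypothetical per-processor figures, for illustration only.
WORD = 8          # bytes per request
LATENCY = 1000    # ns round trip across the machine
SHARE = 8         # bytes per ns of global bandwidth per processor

for in_flight in (8, 1000, 4000):
    print(in_flight, diagnose(in_flight, WORD, LATENCY, SHARE))
print("needed for balance:", outstanding_for_balance(WORD, LATENCY, SHARE))
```

Under these assumed figures, a processor that sustains only a handful of outstanding references leaves almost all of an expensive network idle, while a processor that sustains thousands is throttled by the network. Neither extreme is a balanced design.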
One might argue that this formulation is insufficiently general for
next-generation custom-enabled architectures. The Cray Cascade supercomputer
uses its global bandwidth both for frequent long-range communication (loads
and stores) and for less-frequent long-range thread migration (remote thread
creation). The compiler-determined choice between these two communication
strategies depends on whether a subcomputation displays _high_ temporal
locality, in which case its operations are marshaled into heavyweight threads
for execution on "large-state" compute-intensive processors, or _little or
no_ temporal locality (but possibly some spatial locality), in which case its
operations are marshaled into lightweight threads for execution on PIM
processors.
Any locality mechanism reduces the need for long-range communication, thereby
relieving some of the pressure to tolerate its latency. In Cascade, there are
two distinct locality mechanisms. In every compute-intensive ("heavyweight")
processor, there are large amounts of register and cache storage to allow
heavyweight threads to accumulate considerable thread state. In the standard
way, temporal locality is used in heavyweight processors to relieve some of
the pressure on processor parallelism ("latency-tolerance") mechanisms. In
contrast, each PIM ("lightweight") processor has unusual bandwidth to on-chip
memory to allow lightweight threads that have migrated there to take advantage
of as much (local) computational state as happens to be nearby. In the _same
way_, spatial locality is used in lightweight processors to relieve some of
the pressure on _their_ processor parallelism mechanisms.
Nonetheless, processor parallelism is still required to tolerate the latency
of both long-range communication and long-range thread migration. Locality,
whether spatial or temporal, is at best a mitigating factor (although a very
useful source of extra performance). The only asymmetry is that we readily
agree to tolerate frequent long-range communication (if necessary) but limit
the frequency of thread migration (for performance reasons) by compiler
estimation of spatial locality and careful choice of hash-block size.
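The trade-off between the two communication strategies can be sketched as a simple traffic comparison. This is not Cascade's actual compiler heuristic, which is not described here. It is a toy model that assumes the only cost that matters is bytes moved over the global network, and every size in it is hypothetical.

```python
def remote_access_traffic(num_accesses, request_bytes, reply_bytes):
    """Bytes moved if the thread stays put and issues remote loads and stores."""
    return num_accesses * (request_bytes + reply_bytes)


def migration_traffic(thread_state_bytes):
    """Bytes moved if the thread's state is shipped once to where the data lives."""
    return thread_state_bytes


def prefer_migration(num_accesses, request_bytes, reply_bytes, thread_state_bytes):
    """Migrate only when shipping the thread moves fewer bytes than the accesses would."""
    return migration_traffic(thread_state_bytes) < remote_access_traffic(
        num_accesses, request_bytes, reply_bytes
    )


# Hypothetical sizes: 16-byte requests and replies, 256 bytes of thread state.
REQUEST, REPLY, STATE = 16, 16, 256

# num_accesses stands in for a compiler's estimate of how many references
# the thread would make to data near its destination (spatial locality).
for estimated_accesses in (4, 8, 64):
    print(estimated_accesses, prefer_migration(estimated_accesses, REQUEST, REPLY, STATE))
```

In this model, migration pays off only when the estimated number of nearby references is large enough to amortize the cost of moving the thread, and the larger the thread state, the higher that threshold. This is one reason for reserving migration for lightweight threads.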
The "Roadmap" report provides examples of innovative custom architectures that
exploit one or more of the potential opportunities discussed above.
The first example is spatially direct-mapped architectures that closely match
the intrinsic control flow and data flow of the application kernel computation
to the structures of functional units and their interconnection paths. Such
"reconfigurable logic" could be generalized to create an adaptive system in
which the changing needs of a computation---expressed in its program code---
could help control resource management, task scheduling, and load balancing,
all performed at runtime by the architecture and the runtime system, thereby
removing a significant burden from the programmer.
The report views vectors and streaming as promising complementary hardware
mechanisms to multithreading, sometimes exploited as parallelism mechanisms
and sometimes as locality mechanisms. It also discusses processor-in-memory
(PIM) architecture, explaining how the arithmetic logic units are tightly
coupled to the memory row buffers, which allows exceptionally high-bandwidth
access.
Like the "National-Security Community" report, the "Roadmap" report focuses
on enabling and exploiting global bandwidth. Global networks for future HEC
custom systems must exhibit a bisection bandwidth that is at least an order of
magnitude greater than conventional systems, and will probably employ advanced
technologies, such as high-speed signaling. Advanced network structures must
be created. High-radix networks organized in small-buffering technologies
will be deployable within a few years. A combination of processor parallelism
mechanisms, including streams, vectors, and multithreading, will provide the
large number of simultaneous in-flight communication requests per processor
that are necessary to make good use of these enhanced global network
resources.
- The Federal HECRTF Report
The goals of this report are: 1) Make high-end computing easier and more
productive to use; 2) Foster the development and innovation of new generations
of high-end computing systems and technologies; 3) Effectively manage and
coordinate federal high-end computing; and 4) Make high-end computing readily
available to federal agencies that need it to fulfill their missions.
The core of the plan is a federally-funded HEC research and development
initiative that is billed as "a coordinated, sustained research program over
10 - 15 years to overcome major technology barriers that limit effective use
of high-end computer systems". Yet neither the financial commitment nor the
research agenda is there. In what should have been the heart and soul of the
document (pp. 13-20) on HEC research and development, all one finds is a vague
sense that there are a few things here and there that need fixing, but nothing
to get excited about.
The superficial overview of hardware technology challenges is embarrassing.
As for software technology challenges, the Task Force flirts with the idea
(p. 6) that a "common system software base [might] deliver needed improvements
in sustained application performance, ...". This just isn't serious.
Architectures barely deserve mention among the systems technology challenges.
The hardware, software, and systems roadmaps (pp. 18-20) are hardly better.
In an appendix (p. 66), the Task Force considers establishment of a Joint
Management Office (JMO) to oversee the R&D portfolio. The Task Force writes,
"The Office would conduct integrated planning, solicitation preparation,
project selection and execution, and progress reviews, ...". This is clearly
the right governance structure for a high-end computing program. One critical
question remains: How do we ensure that the Joint Management Office will
provide vigorous leadership in high-end computing when the Task Force
has shown itself incapable of providing any leadership at all?
The first observation is that DARPA's High-Productivity Computing Systems
(HPCS) program enables research, development, and engineering that otherwise
would not be conducted: it comes to fill an _enormous_ need. Moreover, it is
a polite fiction to claim that HPCS is only "bridge" funding until we get to
quantum computing; more realistically, we should ask, who will fund the
research required for the machine we need at the end of the _next_ decade?
The first fundamental weakness of HECRTF is the absence of a long-term
research component. The HEC R&D pipeline must be properly stoked for the
foreseeable future. The federal government must make a _visible financial
commitment_ to HEC R&D well beyond the expiration of the HPCS program in 2010.
The second observation is that no non-incremental research and development
plan is meaningful without a clearly defined research agenda; the governance
structure for a high-end computing program must include mechanisms to ensure
effective leadership in the field of high-end computing. The track record of
the federal government indicates that it does not speak with a consistent
voice and cannot internally generate this leadership. The second fundamental
weakness of HECRTF is that it neither proposes a suitable research agenda nor
considers mechanisms adequate to generate such an agenda.
Who sets the agenda for HEC R&D? Presumably, the high-end users at mission
agencies, government labs, large industrial corporations, major research
centers, etc., articulate their high-end computing needs, i.e., the
computational requirements of their most-critical applications. Then,
computer vendors---the people with experience in building computer systems---
look at their technology roadmaps and business plans and decide which of these
needs they are willing to address. Academic experts and government officials
can mediate and/or facilitate communication.
Both the IHEC report and the "Roadmap" report indicate that high-end computing
needs can be articulated and that integrated planning can be performed. But
the HECRTF report demonstrates that the federal government has not---
apparently---agreed on a mechanism whereby an informed group of people (users,
builders, experts, officials) can get together and set the agenda. In fact,
HECRTF expurgates several promising detailed agendas from previous reports.
What are some of the research priorities your correspondent favors? An easy
pick is, explore computing substrates to replace silicon in the sunset years
of Moore's law.
Perhaps the most pressing needs lie in the general area of computer systems:
computer architectures, including the architectures of computer global system
interconnection fabrics; programming languages and systems; and system
software, including especially operating systems (and their instability in
large-scale parallel machines).
Much has already been said in support of balanced high-bandwidth computer
architectures. Still, your correspondent recommends a special push in the
area of global system interconnects. New device technology (billions of
transistors) enables new intelligent routers that in turn enable new
interconnect architectures. Slightly further out, we will see new technology
in place of the old (e.g., optical interconnects and switches in place of
their electrical counterparts). But nobody has really pushed this area
because of shortfalls in processor parallelism (or inadequate financial
resources). Your correspondent would like to see a truly large-scale global
system interconnect with _exceptional_ bisection bandwidth that is as uniform
as possible, say, one suitable for use in a balanced petaflops-scale system.
An interconnect with PBs/s of bisection bandwidth has never been designed,
much less built.
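A rough calculation shows where a figure on the order of PBs/s comes from. The sketch below rests on two assumptions that are not in any of the reports: a chosen ratio of global-memory bytes moved per floating-point operation, and uniformly random traffic, under which about half of all references cross the bisection of the machine.

```python
from fractions import Fraction

PETA = 10**15


def required_bisection_bandwidth(peak_flops, global_bytes_per_flop,
                                 crossing_fraction=Fraction(1, 2)):
    """Bytes/s that must cross the bisection to keep the processors fed.

    crossing_fraction is the share of global traffic that crosses the
    bisection; 1/2 corresponds to uniformly random destinations.
    """
    return peak_flops * global_bytes_per_flop * crossing_fraction


# Hypothetical balance points, in bytes of global traffic per flop.
for ratio in (Fraction(1, 4), Fraction(1), Fraction(4)):
    need = required_bisection_bandwidth(PETA, ratio)
    print(f"{float(ratio):>5} B/flop -> {float(need / PETA)} PB/s")
```

Under these assumptions, one byte of global traffic per flop on a petaflops machine calls for half a petabyte per second across the bisection, and more demanding ratios push the requirement to several PB/s.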
Another old problem that has suddenly become new and pressing is, design a
very high-level, architecture-specific, shared-memory parallel programming
language that reconciles programmability and performance, thereby overcoming
the two barriers to productivity. Why architecture specific? Well, a
language designer can conceive of new parallel programming abstractions that
ease the programming burden. However, architecture and language are
inextricably linked, and we need to design _specific_ hardware mechanisms in
order to support these programming abstractions with high performance. Most
production parallel programming languages presuppose a machine model, often an
archaic one. It is time to accept the interdependence of architecture and
language design. This is another area where vendor participation is
essential.
Outside of high-end computing specifically, your correspondent prioritizes
information-systems security, computer-aided reasoning for large-scale
information management, and complexity management of computing systems
architectures at all levels (from circuits to global-scale distributed
systems).
It may be that high-end computing, and more generally computing itself, is
passing through a temporary lull in enthusiasm. The intellectual excitement
has waned because the field has ceased to focus on its real problems. The cure?
Articulate the innovative visions and grand challenges in computing that alone
can give some of the deep satisfactions had by people working in the physical
(and biological) sciences. Computer-systems research has just begun.
Computing, and high-end computing, will excite people when they have regained
their focus on the deep computer-science problems that really matter.
The High-End Crusader, a noted expert in high-performance computing and
communications, shall remain anonymous. He alone bears responsibility for the
opinions here. Comments are always welcome and may be sent to HPCwire editor
Tim Curns at tim@hpcwire.com.