
Features:
USA SUPERCOMPUTING: INTERVIEW WITH THE JASON STUDY DIRECTOR
by Alan Beck, Editor-in-Chief
Background: To better understand the status of supercomputing in the United
States, three government studies are underway and some reports are now
becoming available. It is expected that these three reports will influence
federally-funded research in high-end computing.
The National Research Council's Computer Science and Telecommunications Board
convened the Committee on the Future of Supercomputing to conduct a 2-year
study. This study is jointly sponsored by the Department of Energy's Office of
Science and by its National Nuclear Security Administration's (NNSA) Advanced
Simulation and Computing (ASC) Program. An interim report, "The Future of
Supercomputing," has been released.
A separate interagency report is expected to include plans for a fresh source
of funding for R&D in petaFLOP scale computing and custom architectures. The
High-End Computing Revitalization Task Force (HECRTF) will provide a 5-year
plan for funding supercomputing research beginning in fiscal year 2005.
NNSA also commissioned a JASON Program summer study to look at the "Computing
Requirements for Stockpile Stewardship." [The JASONs represent a group of
distinguished scientists chartered by the DoD to advise the agencies of the US
government on scientific issues.] The NNSA's Advanced Simulation and Computing
Program, historically known as ASCI, is funded to create world-class
simulation capabilities in support of the nation's Stockpile Stewardship
Program (SSP), and as such was the primary focus of the study.
Roy F. Schwitters is the S.W. Richardson Foundation Regental Professor of
Physics for the University of Texas at Austin; currently, he chairs the UT
Department of Physics and directs its Center for Particle Physics. He served
as the JASON Program study leader.
HPCwire: WHY WAS THE JASON STUDY COMMISSIONED?
Schwitters: This study was chartered to explore the linkage between the
requirements of the stockpile mission and the acquisition of computer hardware
capability and capacity. In particular, we were tasked to evaluate the
increased risk to the nuclear weapons (NW) stockpile and the scientific
program of the Stockpile Stewardship Program (SSP) if we were to further delay
computer acquisitions intended to advance computing capability. We were also
charged to consider the degree of confidence the program should have in our NW
simulation capability and then assess the appropriate balance between near-,
intermediate- and long-term SSP needs for acquiring new hardware intended to
increase computing capability.
HPCwire: THIS SOUNDS LIKE A LOT MORE THAN COMPUTER HARDWARE. IS THAT
CORRECT?
Schwitters: That's true. This is really about confidence in our ability to
assess the safety and reliability of our aging nuclear weapons stockpile,
clearly a national security issue. ASCI has developed requisite tools and
methods to the point where these have become an essential part of stockpile
stewardship. So what are these tools? These include the development of weapons
codes and physics models built on a validated scientific/engineering base, the
scientific resources necessary to develop better models, the acquisition of
powerful computing platforms and the creation of the supporting hardware and
software infrastructure. These components need to be balanced in order for the
program to be successful. Thus, the platform costs represent only about 20%
of the overall ASCI budget. The greatest existing investment is in the
scientific applications and the people. It is also important to note that ASCI
is successful because it comprises both experienced scientists and a very
capable cadre of younger scientists who, working closely with the nuclear
weapons designers and engineers, have acquired invaluable expertise in
developing and optimizing ASCI tools and in establishing and improving the
scientific credibility of nuclear weapon simulations. Some notable ASCI
accomplishments are described in the JASON report.
HPCwire: WHY HAS ASCI PURSUED MPPs?
Schwitters: I know there is a debate in some circles about whether the move
from custom chips in vector architectures to commodity chips in massively
parallel processors (MPP) should be reconsidered. Yet, even with all the
pioneering work done by weapons laboratories working with vector systems in
the 70's and 80's, the large, multi-physics applications that dominate the
weapons workload still displayed a relatively large scalar fraction since the
algorithms that provided the shortest time to solution were often not the ones
most amenable to vectorization. By the early nineties, the convergence of
three factors supported the move towards MPPs: (1) low-cost commodity
processors emerged that rivaled the speed of the custom processors used in
the vector supercomputers; (2) parallel computing technology matured to the
point that it was possible to contemplate its use on complex, multi-physics
weapons codes; and (3) the end of underground testing drove a requirement for
enhanced-fidelity weapon simulation capability, which in turn drove massively
parallel computing solutions, since requirements for memory and sheer
computing power vastly exceeded anything available on a single node. For
similar reasons related to their own missions, other laboratories and agencies,
numerous university departments, and many commercial computer science R&D
researchers all arrived at roughly the same conclusion, and turned their
attention to MPP architectures featuring commodity processors, not vector
processors. Commodity-based MPPs leverage the substantial investment the
computer industry makes in meeting the demands of its larger market. This
approach is
considered by the SSP to be the most cost-effective and efficient means of
meeting simulation needs. I should point out that our JASON study considered
these questions, but we were not asked to make recommendations on them.
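The scalar-fraction argument above can be illustrated with Amdahl's law. The
following is a minimal sketch, assuming Amdahl's law is an adequate model; the
function name, the scalar fractions and the tenfold vector speedup are
hypothetical values chosen for illustration and do not come from the JASON
study.

```python
def amdahl_speedup(scalar_fraction: float, vector_speedup: float) -> float:
    """Overall speedup when only the vectorizable part runs faster.

    scalar_fraction: share of run time that cannot be vectorized (0 to 1).
    vector_speedup:  speedup applied to the remaining, vectorizable share.
    """
    if not 0.0 <= scalar_fraction <= 1.0:
        raise ValueError("scalar_fraction must be between 0 and 1")
    if vector_speedup <= 0.0:
        raise ValueError("vector_speedup must be positive")
    vector_fraction = 1.0 - scalar_fraction
    return 1.0 / (scalar_fraction + vector_fraction / vector_speedup)


# Hypothetical inputs (assumptions, not figures from the interview).
ASSUMED_VECTOR_SPEEDUP = 10.0
for assumed_scalar_fraction in (0.05, 0.20, 0.50):
    speedup = amdahl_speedup(assumed_scalar_fraction, ASSUMED_VECTOR_SPEEDUP)
    print(f"scalar fraction {assumed_scalar_fraction:.2f}: "
          f"overall speedup {speedup:.2f}x")
```

Under these assumed numbers, a code that spends half its time in scalar work
gains less than a factor of two from a tenfold vector unit, which is the sense
in which a large scalar fraction limits the benefit of vector hardware.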
HPCwire: ARE THERE ALTERNATIVE ARCHITECTURES THAT ASCI SHOULD PURSUE?
Schwitters: This question cannot be separated from the long-term requirements
coming from the program. A first step, to reduce numerical errors to the point
where they no longer mask inadequacies in physical models, requires ~100
teraFLOPS. There is no question that this is attainable with current
technologies. However, adding improved physics models and using these
routinely on stockpile issues easily requires petaFLOPS. Scaling to petaFLOPS
using present machine architectures implies that a very large number of
processors - of order 100,000 perhaps - might be needed. Such large numbers
raise questions of scalability of code performance and of machine reliability.
So, while there is a requirement for petaFLOPS within a decade, it appears
that there is no clear path to a petaFLOPS architecture. As a step towards a
solution, the program should lay the groundwork for future capability
machines. As one approach, the program may consider the acquisition of
"capability-exploration" machines focused on optimizing efficiency in
computation for ASCI problems in order to gain experience with architectures
that might plausibly be extended to the petaFLOPS level. As a supplementary
approach, ASCI could continue its internal effort to identify barriers to
achieving its stated goals and then focus investments with interested
vendors on optimizing commodity solutions. Identifying plausible alternative
architectures is within the purview of the National Research Council committee
which is also examining the ASCI program, but mainly from the computer-science
perspective.
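The processor-count and reliability concerns above can be made concrete with
some back-of-the-envelope arithmetic. This is a rough sketch only: the
per-processor peak of 10 gigaFLOPS is a hypothetical value picked so the count
lands near the "of order 100,000" mentioned above, the per-processor mean time
between failures (MTBF) is invented for illustration, and the reliability
model assumes independent failures at a constant rate where any single failure
interrupts the whole job.

```python
def processors_needed(target_flops: int, per_processor_flops: int) -> int:
    """Smallest processor count whose combined peak reaches target_flops."""
    if target_flops <= 0 or per_processor_flops <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-target_flops // per_processor_flops)


def system_mtbf_hours(processor_mtbf_hours: float, processor_count: int) -> float:
    """System MTBF under the independent, constant-failure-rate assumption."""
    if processor_mtbf_hours <= 0 or processor_count <= 0:
        raise ValueError("both arguments must be positive")
    return processor_mtbf_hours / processor_count


PETAFLOPS = 10**15
ASSUMED_PER_PROCESSOR_FLOPS = 10**10      # hypothetical 10 gigaFLOPS peak
ASSUMED_PROCESSOR_MTBF_HOURS = 500_000.0  # hypothetical, for illustration

count = processors_needed(PETAFLOPS, ASSUMED_PER_PROCESSOR_FLOPS)
mtbf = system_mtbf_hours(ASSUMED_PROCESSOR_MTBF_HOURS, count)
print(f"processors needed: {count}")
print(f"system MTBF: {mtbf:.1f} hours")
```

With these assumed inputs, the machine as a whole would see a failure every
few hours even though each processor is individually very reliable, which is
one way to read the reliability question raised above.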
HPCwire: HOW MUCH COMPUTING POWER DOES SSP NEED?
Schwitters: Two commonly used measures of the overall productivity of ASCI
platforms are capability and capacity. The first measure, capability, refers
to the maximum processing power (in peak teraFLOPS) that can be brought to
bear on any one job. The second, capacity, represents the total combined
processing power of all the machines capable of running ASCI codes. A given
amount of capability implies capacity in two ways: by its direct contribution
to capacity and because a high capability machine can be used in capacity mode
by processing multiple less demanding jobs simultaneously. Today, the ASCI
platforms of highest capability are LLNL's "White" at 12.3 TF and LANL's "Q"
at 20 TF. The next planned acquisitions are SNL's "Red Storm" projected to be
40 TF and LLNL's "Purple C" at 100 TF. SSP studies typically call for a mix of
a few large jobs, which need the largest available capability, and many
smaller jobs. An important issue we identified is a factor-of-two
over-subscription in ASCI capacity. In other words, demand outstrips supply by
a factor of two. This conclusion is well supported by distinct technical
requirements. Consequently, the study calls for emphasis on acquiring
additional capacity.
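The distinction between capability and capacity, and the meaning of a
factor-of-two over-subscription, can be sketched as follows. This is an
illustration under stated assumptions: only the two platforms named above are
included, whereas real capacity covers every machine able to run ASCI codes,
and the demand figure is simply set to twice the computed capacity to mirror
the factor of two. The function names are hypothetical.

```python
def capability_tf(platforms_tf: dict[str, float]) -> float:
    """Capability: the largest peak power available to any one job."""
    return max(platforms_tf.values())


def capacity_tf(platforms_tf: dict[str, float]) -> float:
    """Capacity: the combined peak power of all listed platforms."""
    return sum(platforms_tf.values())


def oversubscription(demand_tf: float, supply_tf: float) -> float:
    """Ratio of demand to supply; 2.0 means demand is twice the supply."""
    if supply_tf <= 0:
        raise ValueError("supply_tf must be positive")
    return demand_tf / supply_tf


# Peak teraFLOPS quoted in the interview; the list is deliberately incomplete.
platforms = {"White": 12.3, "Q": 20.0}

capability = capability_tf(platforms)
capacity = capacity_tf(platforms)
assumed_demand = 2.0 * capacity  # assumption mirroring the factor of two

print(f"capability: {capability:.1f} TF")
print(f"capacity:   {capacity:.1f} TF")
print(f"over-subscription: {oversubscription(assumed_demand, capacity):.1f}x")
```

Adding a new platform always raises capacity, but it raises capability only if
the new machine is larger than the current largest one.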
HPCwire: IS THERE A RISK TO SSP IF ASCI PLATFORMS ARE DELAYED?
Schwitters: To assess this risk, an acquisition scenario with reduced funding
was constructed. To make a long story short, it assumed a modest decrease from
planned platform budgets, and the net effect was an acquisition stretch-out
that reduced overall capacity to below 1/3 of demand during critical program
years this decade. The risk to SSP of such a delay is high, not so much from
the delayed capability, but from the very serious reduction in overall
capacity. In addition, purchasing the proposed large platforms - Purple C and
Red Storm - on a stretched-out, sub-optimal schedule, where their CPUs and
other components are bought after their performance/cost prime, appears unwise
both in terms of delivered capability and capacity.
HPCwire: WHAT WERE THE RECOMMENDATIONS MADE?
Schwitters: As mentioned earlier, with regard to NNSA's current ASCI
platform acquisition strategy, it is critical that the program boost platform
capacity acquisition now, balanced by future capability increases. The program
can cope with a factor-of-two over-subscription by prioritizing its
requirements and carefully managing its resources; however, going beyond this
becomes unmanageable. The likely result would be a delay in meeting SSP
technical milestones and/or expensive mitigation programs from elsewhere in
the SSP. The second area of risk is the lack of a credible "road map" to
attaining petaFLOPS capability.
We also looked at the overall efficiency of the applications. The ASCI program
has made a strong and valuable effort to increase efficiency of performance.
Similar efficiencies are found in many commercial, engineering and scientific
applications. I applaud these efforts, and a continuing investment in
improving efficiency remains important.
With regard to the rest of SSP, the report has some other recommendations that
may not be of much interest to your audience. In particular, the report
encourages advancing nuclear weapons science at every opportunity, and I am
pleased to see some excellent new science emerging in association with ASCI.
Better
science is the best way to reduce risk and the only possible way to achieve
sufficiency in the modeling and understanding of these complex systems.
In summary, the study validated ASCI as essential to stockpile stewardship for
contributing to achieving technical milestones, enabling new capabilities with
better science, and training a cadre of new weapons experts.