HPCwire
 The global publication of record for High Performance Computing / September 19, 2003: Vol. 12, No. 37


Features:

USA SUPERCOMPUTING: INTERVIEW WITH THE JASON STUDY DIRECTOR
by Alan Beck, Editor-in-Chief

Background: To better understand the status of supercomputing in the United States, three government studies are underway and some reports are now becoming available. It is expected that these three reports will influence federally-funded research in high-end computing.

The National Research Council's Computer Science and Telecommunications Board convened the Committee on the Future of Supercomputing to conduct a 2-year study. This study is jointly sponsored by the Department of Energy's Office of Science and by its National Nuclear Security Administration's (NNSA) Advanced Simulation and Computing (ASC) Program. An Interim Report: The Future of Supercomputing has been released.

A separate interagency report is expected to include plans for a fresh source of funding for R&D in petaFLOP scale computing and custom architectures. The High-End Computing Revitalization Task Force (HECRTF) will provide a 5-year plan for funding supercomputing research beginning in fiscal year 2005.

NNSA also commissioned a JASON Program summer study to look at the "Computing Requirements for Stockpile Stewardship." [The JASONs represent a group of distinguished scientists chartered by the DoD to advise the agencies of the US government on scientific issues.] The NNSA's Advanced Simulation and Computing Program, historically known as ASCI, is funded to create world-class simulation capabilities in support of the nation's Stockpile Stewardship Program (SSP), and as such was the primary focus of the study.

Roy F. Schwitters is the S.W. Richardson Foundation Regental Professor of Physics for the University of Texas at Austin; currently, he chairs the UT Department of Physics and directs its Center for Particle Physics. He served as the JASON Program study leader.

HPCwire: WHY WAS THE JASON STUDY COMMISSIONED?

Schwitters: This study was chartered to explore the linkage between the requirements of the stockpile mission and the acquisition of computer hardware capability and capacity. In particular, we were tasked to evaluate the increased risk to the nuclear weapons (NW) stockpile and the scientific program of the Stockpile Stewardship Program (SSP) if we were to further delay computer acquisitions intended to advance computing capability. We were also charged to consider the degree of confidence the program should have in our NW simulation capability and then assess the appropriate balance between near-, intermediate- and long-term SSP needs for acquiring new hardware intended to increase computing capability.

HPCwire: THIS SOUNDS LIKE A LOT MORE THAN COMPUTER HARDWARE. IS THAT CORRECT?

Schwitters: That's true. This is really about confidence in our ability to assess the safety and reliability of our aging nuclear weapons stockpile, clearly a national security issue. ASCI has developed requisite tools and methods to the point where these have become an essential part of stockpile stewardship. So what are these tools? These include the development of weapons codes and physics models built on a validated scientific/engineering base, the scientific resources necessary to develop better models, the acquisition of powerful computing platforms and the creation of the supporting hardware and software infrastructure. These components need to be balanced in order for the program to be successful. Accordingly, the platform costs represent only about 20% of the overall ASCI budget. The greatest existing investment is in the scientific applications and the people. It is also important to note that ASCI is successful because it combines experienced scientists with a very capable cadre of younger scientists who, working closely with the nuclear weapons designers and engineers, have acquired invaluable expertise in developing and optimizing ASCI tools and in establishing and improving the scientific credibility of nuclear weapon simulations. Some notable ASCI accomplishments are described in the JASON report.

HPCwire: WHY HAS ASCI PURSUED MPPs?

Schwitters: I know there is a debate in some circles about whether the move from custom chips in vector architectures to commodity chips in massively parallel processors (MPP) should be reconsidered. Yet, even with all the pioneering work done by weapons laboratories working with vector systems in the '70s and '80s, the large, multi-physics applications that dominate the weapons workload still displayed a relatively large scalar fraction, since the algorithms that provided the shortest time to solution were often not the ones most amenable to vectorization. By the early nineties, the convergence of three factors supported the move towards MPPs: (1) low-cost commodity processors emerged that rivaled in speed the custom processors used in the vector supercomputers; (2) parallel computing technology matured to the point that it was possible to contemplate its use on complex, multi-physics weapons codes; and (3) the end of underground testing drove a requirement for enhanced-fidelity weapon simulation capability, which in turn drove massively parallel computing solutions, since requirements for memory and sheer computing power vastly exceeded anything available on a single node. For similar reasons related to their own missions, other laboratories and agencies, numerous university departments, and many commercial computer science R&D researchers all arrived at roughly the same conclusion, and turned their attention to MPP architectures featuring commodity processors, not vector processors. Commodity-based MPPs leverage the substantial investment the computer industry makes in meeting the demands of its larger market. This approach is considered by the SSP to be the most cost-effective and efficient means of meeting simulation needs. I should point out that our JASON study considered these questions, but we were not asked to make recommendations on them.

HPCwire: ARE THERE ALTERNATIVE ARCHITECTURES THAT ASCI SHOULD PURSUE?

Schwitters: This question cannot be separated from the long-term requirements coming from the program. A first step, to reduce numerical errors to the point where they no longer mask inadequacies in physical models, requires ~100 teraFLOPS. There is no question that this is attainable with current technologies. However, adding improved physics models and using these routinely on stockpile issues easily requires petaFLOPS. Scaling to petaFLOPS using present machine architectures implies that a very large number of processors - of order 100,000 perhaps - might be needed. Such large numbers raise questions of scalability of code performance and of machine reliability. So, while there is a requirement for petaFLOPS within a decade, it appears that there is no clear path to a petaFLOPS architecture. As a step towards a solution, the program should lay the groundwork for future capability machines. As one approach, the program may consider the acquisition of "capability-exploration" machines focused on optimizing efficiency in computation for ASCI problems in order to gain experience with architectures that might plausibly be extended to the petaFLOPS level. As a supplementary approach, ASCI could continue its internal effort to identify barriers to achieving its stated goals and then focus investments with interested vendors on optimizing commodity solutions. Identifying plausible alternative architectures is within the purview of the National Research Council committee, which is also examining the ASCI program, but mainly from the computer-science perspective.

HPCwire: HOW MUCH COMPUTING POWER DOES SSP NEED?

Schwitters: Two commonly used measures of the overall productivity of ASCI platforms are capability and capacity. The first measure, capability, refers to the maximum processing power (in peak teraFLOPS) that can be brought to bear on any one job. The second, capacity, represents the total combined processing power of all the machines capable of running ASCI codes. A given amount of capability implies capacity in two ways: by its direct contribution to capacity and because a high-capability machine can be used in capacity mode by processing multiple less demanding jobs simultaneously. Today, the ASCI platforms of highest capability are LLNL's "White" at 12.3 TF and LANL's "Q" at 20 TF. The next planned acquisitions are SNL's "Red Storm," projected to be 40 TF, and LLNL's "Purple C" at 100 TF. SSP studies typically call for a mix of a few large jobs, which need the largest available capability, and many smaller jobs. An important issue identified is a factor-of-two over-subscription in ASCI capacity. In other words, demand outstrips supply by a factor of two. This conclusion is well supported by distinct technical requirements. Consequently, the report calls for emphasis on the acquisition of additional capacity.
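
[Editor's illustration: The distinction between capability and capacity can be sketched in a few lines of Python. This is only a minimal sketch, not part of the JASON study. It uses the two peak ratings quoted above for "White" and "Q"; the dictionary layout, the function names, and the assumed demand figure are hypothetical, and the real ASCI capacity includes machines not listed here.]

```python
# Peak ratings in teraFLOPS as quoted in the interview.
# Only the two machines named there are listed; this is NOT the full ASCI inventory.
platforms_tf = {"White (LLNL)": 12.3, "Q (LANL)": 20.0}


def capability(platforms):
    """Capability: the largest peak rating available to any one job."""
    return max(platforms.values())


def capacity(platforms):
    """Capacity: the combined peak rating of all listed machines."""
    return sum(platforms.values())


def oversubscription(demand_tf, platforms):
    """Ratio of demand to available capacity (2.0 means factor-of-two)."""
    return demand_tf / capacity(platforms)


total_tf = capacity(platforms_tf)

# Hypothetical demand, chosen only to reproduce the factor-of-two
# over-subscription described in the interview.
assumed_demand_tf = 2 * total_tf

print("capability (TF):", capability(platforms_tf))
print("capacity (TF):", round(total_tf, 1))
print("over-subscription factor:", oversubscription(assumed_demand_tf, platforms_tf))
```

[Adding a machine raises capacity by its full rating, but raises capability only if the new machine is larger than every existing one, which is why capacity and capability acquisitions are weighed separately.]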

HPCwire: IS THERE A RISK TO SSP IF ASCI PLATFORMS ARE DELAYED?

Schwitters: To assess this risk an acquisition scenario with reduced funding was constructed. To make a long story short, it assumed a modest decrease from planned platform budgets, and the net effect was an acquisition stretch-out that reduced overall capacity to below 1/3 of demand during critical program years this decade. The risk to SSP of such a delay is high, not so much from the delayed capability, but from the very serious reduction in overall capacity. In addition, purchasing the proposed large platforms - Purple C and Red Storm - on a stretched-out, sub-optimal schedule, where their CPUs and other components are bought after their performance/cost prime, appears unwise both in terms of delivered capability and capacity.

HPCwire: WHAT WERE THE RECOMMENDATIONS MADE?

Schwitters: As was mentioned earlier, with regard to NNSA's current ASCI platform acquisition strategy, it is critical that the program boost platform capacity acquisition now, balanced by future capability increases. The program can cope with a factor-of-two over-subscription by prioritizing its requirements and carefully managing its resources; however, going beyond this becomes unmanageable. The likely result would be a delay in meeting SSP technical milestones and/or expensive mitigation programs from elsewhere in the SSP. The second area of risk is the lack of a credible "road map" to attaining petaFLOPS (PF) capability.

We also looked at the overall efficiency of the applications. The ASCI program has made a strong and valuable effort to increase performance efficiency, and similar efficiencies are found in many commercial, engineering and scientific applications. I applaud the efforts made so far, and a continuing investment in improving efficiency remains important.

With regard to the rest of SSP, the report has some other recommendations that may not be of much interest to your audience. In particular, the program is encouraged to advance nuclear weapons science at every opportunity, and I am pleased to see some excellent new science emerging in association with ASCI. Better science is the best way to reduce risk and the only possible way to achieve sufficiency in the modeling and understanding of these complex systems.

In summary, the study validated ASCI as essential to stockpile stewardship for contributing to achieving technical milestones, enabling new capabilities with better science, and training a cadre of new weapons experts.

