HPCwire
 The global publication of record for High Performance Computing - LIVEwire Edition / November 20, 2003: Vol. 10, No. 3

Features:

AN INTERVIEW WITH DAN REED, DIRECTOR, NCSA
By Tim Curns, Assistant Editor, HPCwire

HPCwire: What do you see as the biggest obstacles or hindrances to performance optimization techniques for large-scale parallel, distributed and Grid-based computing systems? What do you feel are some options for overcoming these obstacles?

DAN REED: The largest short-term obstacle to optimizing Grid applications is undoubtedly the evolving state of Grid software infrastructure and, concomitantly, the paucity of analysis tools. Similarly, today's parallel systems suffer from a dearth of robust, easy-to-use, portable tuning tools.

There is no silver bullet that will improve the performance of parallel and Grid applications. Instead, we need sustained investment and support for tools matched to application needs and system characteristics. This is not cheap, nor will it yield "magic solutions" quickly. Hence, the Alliance and NCSA, via the PACI Alliance expeditions, are developing and hardening performance tuning tools for both Grids and Linux clusters.

Concurrently, raising awareness of the tools and offering training opportunities for researchers are an ongoing national emphasis. NCSA and the Alliance continue to offer workshops, Access Grid tutorials and online training materials to engage faculty, post-doctoral research associates and graduate students and to help them apply the latest technologies in their own research initiatives.

Finally, as I testified to the House Science Committee this summer (see http://www.house.gov/science/hearings/full03/jul16/reed.pdf), I believe we must take a long-term, strategic approach to solving these problems. Reducing the gap between peak hardware performance and achieved performance for a broad range of applications will require a long-term strategy that couples academic research (both systems and applications) with industrial prototyping and assessment and with a cycle of procurement that enables strategic planning and system revision based on scientific application experiences. These are 10-20 year challenges -- we need to start now.

HPC: What technological advancements or breakthroughs in research are on the horizon for the National Computational Science Alliance and/or the NCSA? What are your projections on the development of the NSF TeraGrid Project?

DR: Exciting things are happening on many fronts. Many of the societal challenges of the 21st century will require the collaborative skills of researchers in a diverse set of disciplines. NCSA and its Alliance partners are building the infrastructure and scientific collaborations to address these challenges. Let me cite just two examples: HASTAC and LEAD.

The newly launched HASTAC (Humanities, Arts, Science, and Technology Advanced Collaboratory) is an alliance of scientists, humanists, artists, social theorists, legal specialists and information technology specialists. HASTAC was founded on the belief that the future of cyberinfrastructure must be driven by creative discovery across disciplinary divides, given the profound impact of new technologies on individuals and society.

On the scientific front, the new LEAD (Linked Environments for Atmospheric Discovery) NSF ITR award couples Alliance researchers at NCSA, Oklahoma, Alabama and Indiana with partners at the National Center for Atmospheric Research, Colorado State, Millersville and Howard. Given the billions of dollars of annual damage and loss of life from severe storms, LEAD's goal is to create a Grid framework for assimilating, predicting, managing, mining/analyzing and displaying meteorological data.

NCSA and its partners also continue to deploy advanced computing infrastructure. NCSA's 17.7 teraflop Xeon cluster is now being deployed and will enter production this spring. By allocating 3 teraflop sub-clusters to research groups for days, weeks or even months, we hope the system will eliminate one of the most common barriers to shared use of large-scale computing resources: long queue wait times. For a peek at the NCSA hardware deployments, see http://clustercam.ncsa.uiuc.edu.

We are also very excited about the status and the future of the TeraGrid. After two years of planning and development, the first phase of the TeraGrid will enter production at the beginning of 2004, and users have already been allocated time on the TeraGrid's distributed resources. Friendly users have been running applications on the TeraGrid for the past several months, and research results are already being published based on these computations. The hardware for phase two TeraGrid deployment is already arriving at NCSA, where it will be assembled to create a 10 teraflop Itanium family Linux system. We are soliciting additional Grid applications, both from within the U.S. and from international collaborations, for TeraGrid deployment.

HPC: How important are input/output characterizations and parallel file systems in developing high-performance implementations of parallel applications?

DR: Optimizing I/O activity is increasingly critical. The explosive growth of experimental data, from a new generation of scientific instruments, and of computational data, from high fidelity simulations, means that large-scale data management and mining are central to gaining scientific insights. Many sites now have multiple petabyte archives and have, or are deploying, petabyte secondary storage systems. This infrastructure supports such projects as the National Virtual Observatory, LIGO and the upcoming Large Hadron Collider, as well as biological genomics and proteomics data.

On the technology front, however, disk storage capacities are rising far more rapidly than disk bandwidths, often leading to "write only" data storage. I/O has long been the "poor stepchild" of high performance computing, and we are seeing the effects of this in I/O systems poorly matched to application needs. We glibly speak of teraflops, but we rarely speak of terabytes/second, most often because current systems are not architected or procured to sustain such I/O bandwidths.
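To make the capacity/bandwidth mismatch concrete, the short Python sketch below computes how long it takes to read an entire device once at its sustained bandwidth. The function name and both device profiles are hypothetical, illustrative figures, not measurements from any NCSA system; the point is only that when capacity grows faster than bandwidth, the time to re-read stored data grows too.

```python
def full_scan_hours(capacity_gb: float, bandwidth_mb_s: float) -> float:
    """Hours needed to read a whole device once at its sustained bandwidth.

    Uses decimal units (1 GB = 1000 MB), as storage vendors typically do.
    """
    seconds = (capacity_gb * 1000.0) / bandwidth_mb_s
    return seconds / 3600.0


# Hypothetical device generations: (label, capacity in GB, bandwidth in MB/s).
# These numbers are assumptions chosen for illustration only.
generations = [
    ("older device", 10.0, 20.0),
    ("newer device", 200.0, 50.0),
]

for label, capacity_gb, bandwidth_mb_s in generations:
    hours = full_scan_hours(capacity_gb, bandwidth_mb_s)
    print(f"{label}: {hours:.2f} hours to read once")
```

In this made-up case capacity grows 20x while bandwidth grows 2.5x, so a full read takes 8x longer on the newer device. Data that is that expensive to re-read tends to be written and never revisited.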

We need a deeper understanding of the I/O patterns that occur in parallel and Grid applications to guide the design of parallel I/O libraries and file systems. Understanding I/O behavior at scale is an ongoing research topic, both at NCSA and in my own research group. We are characterizing the I/O behavior of applications on HPC systems, looking at the effects of multilevel mediation by I/O libraries and file systems. In turn, we are using these insights to investigate I/O policies that exploit the temporal and spatial I/O patterns.
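As a rough illustration of what characterizing spatial I/O patterns can mean in practice, the Python sketch below labels the transitions in a request trace as sequential, strided or other. The trace format (a list of byte offset and size pairs for one file) and the function name are assumptions made for this example; it is not the tooling used at NCSA, and real characterization would also consider timing, request sizes and multiple processes.

```python
from collections import Counter


def classify_accesses(trace):
    """Count transition types between consecutive requests to one file.

    trace: list of (offset, size) tuples in bytes, in issue order.
    A transition is "sequential" if the next request starts where the
    previous one ended, "strided" if it repeats the previous offset
    delta, and "other" otherwise. The first hop of a strided run has no
    earlier delta to match, so it is counted as "other".
    """
    labels = []
    prev_stride = None
    for (off_a, size_a), (off_b, _) in zip(trace, trace[1:]):
        stride = off_b - off_a
        if off_b == off_a + size_a:
            labels.append("sequential")
        elif stride == prev_stride:
            labels.append("strided")
        else:
            labels.append("other")
        prev_stride = stride
    return Counter(labels)


# Hypothetical trace: a sequential run, a strided run, then a jump back.
trace = [(0, 4), (4, 4), (8, 4), (100, 4), (200, 4), (300, 4), (7, 4)]
print(classify_accesses(trace))
```

A summary like this is the kind of input an I/O policy could use, for example to prefetch aggressively when most transitions are sequential or strided and to back off when they are not.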

HPC: Why is exploring the utility and performance of game systems (specifically, Sony PlayStation2 clusters) important? How can research of these systems benefit both scientific computing and high-resolution visualization?

DR: The history of computing shows that each computing generation has been partially or totally supplanted by systems that occupy a different point on the price/performance curve, expanding the base of possible owners and users. Mainframes and computer families like the IBM S/360 replaced "one of a kind" research systems and made computing part of the corporate culture. In turn, DEC's introduction of the minicomputer gave laboratory groups direct access to affordable computing. Workstations and PCs, driven by the emergence of powerful microprocessors, made computing broadly available to individual researchers and consumers. The common theme across these computing generations has been a dramatic decrease in price, an associated increase in performance, emergence of new market niches and a consequent expansion of the number of units sold.

Game consoles, with price points below $300, performance rivaling or exceeding that of PCs and graphics capabilities until recently found only on high-end visualization supercomputers, are the vanguard of yet another computing generation. Moreover, market forces and fierce vendor competition continue to fuel technical innovation and performance improvements on these game platforms, creating research and development incentives and deployment opportunities in new scientific domains.

NCSA's mission is to track technology trends and deploy new infrastructure that can catalyze scientific discovery. As an early test vehicle, NCSA has assembled a 0.6 teraflop Linux PlayStation2 cluster. Using this cluster, we are vectorizing key numerical library kernels and investigating the partitioning of applications across game platform interactive and vector processors. We expect insights from these experiments to inform development and acquisition plans multiple years in the future.

HPC: Feel free to offer any other comments on HPC or related topics!

DR: Several recent developments in high-end computing have stimulated a re-examination of current U.S. policies and approaches. These developments include the deployment of Japan's Earth Simulator, concerns about the difficulty in achieving substantial fractions of peak hardware performance on high-end systems, and the ongoing complexity of developing, debugging and optimizing applications for high-end systems. In addition, there is growing recognition that a new set of scientific and engineering discoveries could be catalyzed by access to very large-scale computer systems -- leadership computing systems in the 100 teraflop to petaflop range. Finally, the need for high-end systems in support of national defense has led to new interest in high-end computing research, development and procurement.

This summer, in response to a request from the interagency High-End Computing Revitalization Task Force (HECRTF), several of us helped organize a community workshop to provide suggestions on strategic directions for high-end computing. The slides from the community workshop are available at http://www.cra.org/Activities/workshops/nitrd and copies of the final report will be available at SC2003.

In brief, the common theme of the workshop report is the need for sustained investment in research, development and system acquisition. This sustained approach also requires deep collaboration among academic researchers, government laboratories, industrial laboratories and computer vendors. Short-term strategies and one-time programs are unlikely to develop the technology pipelines and new approaches needed to realize the petascale computing systems required by a range of scientific, defense and national security applications. Rather, multiple cycles of advanced research and development, followed by large-scale prototyping and product development, will be required to develop systems that can consistently achieve a high fraction of their peak performance on critical applications, while also being easier to program and operate reliably.

