HPCwire
 The global publication of record for High Performance Computing / November 5, 2004: Vol. 13, No. 44


Features:

ECMWF WORKSHOP -- TERAFLOP/S CHALLENGES FOR NWP, METEOROLOGY
by Christopher Lazou, HiPerCom Consultants

"...One tiny fluctuation of a Top500 list position may forever change the future course of a computer vendor's fortunes...." (In the spirit of Lorenz, 1963)

ECMWF, Reading, 25-29 October 2004: Over 150 meteorology experts, computer practitioners and vendor representatives spent a week exchanging experiences about the latest results in meteorology and the computer infrastructure that goes with it. This excellent, relatively small and friendly workshop provided a forum for the crème de la crème of HPC users. What followed was a tour de force in meteorological and computing techniques by active practitioners striving to get the most out of the latest HPC technology to refine and improve their weather and climate forecasting models. They presented today's practical reality, followed by their aspirations and vision for Teraflop/s computing and beyond. To give some idea of the scale, there were a total of 56 presentations and a discussion panel. Most of these presentations were by experts from major meteorological centres in the USA, Canada, Europe, India, Japan, Australia and China. The rest were from HPC vendors (CRAY, FUJITSU, HP, IBM, INTEL, NEC, QUADRICS, SGI and TERASCALE). Friday was devoted to a brainstorming debate, which aimed to identify solutions to the many pressing needs of this increasingly important field of science.

With almost every presentation meriting an article of its own, the selection of material in this article is technology-biased and, although somewhat arbitrary, hopefully captures the essence of what was presented and highlights some news items.

In recent years meteorology has evolved from its esoteric weather prediction role into a high-profile e-business with enormous commercial clout. With climate change manifesting itself in extreme weather patterns, be it droughts, floods or more destructive hurricanes, the economic stakes are high. The field of meteorology can marshal the large budgets needed for data collection, assimilation, and the purchase of large-scale computer systems for numerical modelling, so naturally computer vendors are keen to participate in the deliberations and offer previews of their future products.

In weather and climate numerical modelling, the debate on price/performance of commodity systems versus special-purpose systems is somewhat irrelevant. When a hurricane is heading for land, capability computing at the level required to deliver advance predictions in time to implement protection procedures is the only measure worth considering. The imperative is for the ocean model to run on a fine mesh, in hundreds of metres, say, not tens of kilometres, and to deliver results on time.

As Suresh Shukla, Boeing, so aptly put it at the recent IDC European User Forum: "HPC industry is focusing more on price and cost than on capability and cost-benefit. Boeing understands the benefits of HPC and in the future needs balanced systems with latency and bandwidth improvements of at least two orders of magnitude; efficient implementation of parallelism in a reliable manner; applications able to exploit hardware enhancements; and reasonable quality enhancements per cost to justify replacing experimentation with simulation".

Toshiyuki Furui, of NEC HPC Japan, gave an informative presentation on the whole range of NEC's product lines, from PC clusters, IPF blades and Itanium2-based TX7 clusters to the vector parallel SX series. The diamond in NEC's pack is the new NEC SX-8, with the potential to deliver the world's highest computing performance, 65Tflop/s, when configured with a total of 4,096 processors in 512 nodes. The 16Gflop/s SX-8 vector processor (vector and scalar units) is implemented on a single chip using 90-nanometer copper interconnects. The pipelines of the vector unit operate at a 2GHz clock frequency, double the speed of NEC's previous system, the SX-6. In addition, the vector square root is implemented in hardware, enhancing overall sustained performance; SQRT on the SX-8 is six times faster than its software counterpart on the SX-6. The SX-8 footprint is reduced by 25% and power consumption by 50% compared with previous models. This was achieved by applying high-density packaging technology, with processor(s) and memory implemented on a single module. The 262TB/s peak data transfer rate between memory and CPU(s) and a memory capacity of up to 64TB deliver an overall improvement of more than three times over the SX-6 and make this system well suited as a high-productivity workhorse. A maximum configuration should be capable of routinely delivering 20 to 25Tflop/s of sustained performance for large-scale applications. This is likely to be an order of magnitude more powerful than commodity-based systems, i.e. it is a real supercomputer. Incidentally, each vector processor also has a 2Gflop/s scalar processor on the same chip, an additional 8Teraflop/s of scalar performance in the maximum configuration.
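For readers who want to check the headline SX-8 numbers, the short Python sketch below redoes the arithmetic from the figures quoted above. It is only a back-of-the-envelope check, not vendor data: the variable names are mine, decimal units (1 Tflop/s = 1000 Gflop/s) are assumed, and the per-processor memory bandwidth is derived here from the quoted 262TB/s total rather than taken from the presentation.

```python
# Back-of-the-envelope check of the maximum SX-8 configuration.
# All inputs are figures quoted in the article; names are illustrative.
PROCESSORS = 4096            # total vector processors
NODES = 512                  # nodes in the maximum configuration
VECTOR_PEAK_GFLOPS = 16.0    # peak per vector processor
SCALAR_PEAK_GFLOPS = 2.0     # peak per on-chip scalar unit
TOTAL_BANDWIDTH_TBS = 262.0  # quoted memory-to-CPU transfer rate

cpus_per_node = PROCESSORS // NODES
vector_peak_tflops = PROCESSORS * VECTOR_PEAK_GFLOPS / 1000.0
scalar_peak_tflops = PROCESSORS * SCALAR_PEAK_GFLOPS / 1000.0

# Derived (assumption: the total is spread evenly over all processors).
bandwidth_per_cpu_gbs = TOTAL_BANDWIDTH_TBS * 1000.0 / PROCESSORS

# Fraction of vector peak implied by the 20 to 25 Tflop/s sustained estimate.
sustained_fraction_low = 20.0 / vector_peak_tflops
sustained_fraction_high = 25.0 / vector_peak_tflops

print(f"processors per node : {cpus_per_node}")
print(f"vector peak         : {vector_peak_tflops:.1f} Tflop/s")
print(f"scalar peak         : {scalar_peak_tflops:.1f} Tflop/s")
print(f"bandwidth per CPU   : {bandwidth_per_cpu_gbs:.0f} GB/s")
print(f"sustained fraction  : {sustained_fraction_low:.0%} to {sustained_fraction_high:.0%}")
```

Under these assumptions the sustained estimate corresponds to roughly 30 to 38% of vector peak, in line with the Earth Simulator experience mentioned later in this article.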

Of course, one is aware that supporting the whole range of applications requires a range of systems, including scalar ones. Listening to the vendor presentations, it became clear to me that in the next three to five years a number of supercomputers will have heterogeneous architectures, with both scalar and vector processors tightly integrated through low-latency, high-bandwidth memory and a communications switch. These will reside within a single system, enabling different sections of an application code to run on different components.

In his vendor presentation, Per Nyberg briefly described Cray's current product lines, namely the Cray X1, XT3 and XD1, and then continued in the same vein as the NEC presenter, highlighting productivity. Nyberg went on to present the future Cray Rainier system, Cray's version of integrated scalar/vector computing. Rainier is a single system containing both successors to the X1 vector processors and successors to the AMD Opteron scalar processors, integrated with a common hardware infrastructure (cabinet, power, cooling and so on) and a common high-speed network. It will have a common global address space, a common operating system, and common storage and administration. According to Nyberg, first customer shipment of this heterogeneous compute system is expected in 2006.

Don Grice described the two-pronged development at IBM, namely the Power 5/6 and the Blue Gene/L. He went on to say that IBM has listened to the scientific community and that its new CPU designs are science driven. He reviewed the physical limits of device technology and explained why scaling breaks down. His explanation went like this: Consider the gate oxide in a CMOS transistor (the smallest dimension today). Assume "defects" only one atom high on each surrounding silicon layer. For a modern "scaled" oxide, 6 atoms thick, this induces 33% variability in thickness (2 atoms in 6). The bad news is that single-atom defects can cause local current leakage 10-100X higher than average. The really bad news is that such "non-statistical behaviours" are appearing elsewhere in technology. He went on to say: "Integration, the creation of systems rather than just "chips", will become the means by which past trajectories (Moore's Law) for computing performance are maintained. Only the simultaneous optimisation of materials, devices, circuits, cores, chips, system architecture, and system software provides an effective means to optimise for both performance and power." This means that future IBM products will have not only homogeneous symmetric multi-cores with specialised instructions, but also heterogeneous cores with specialised architectures. In short, expect asymmetric cores, with scalar, vector (ViVA) and graphics processors for image recognition. These are likely to be introduced first in games consoles.

On the Itanium front, Herbert Cornelius gave a brief overview of Intel's plans, culminating in Montecito, a multi-core, multi-threading chip with 1.7 billion transistors. He mentioned in passing that most of the world's fastest systems use Intel chips. Both Gerardo Cisneros from SGI and Herbert Cornelius from Intel described the benefits of clusters of SMP systems and briefly described NASA's Columbia project, a 10,240-processor Intel Itanium2 system made up of 20 SGI Altix nodes of 512 processors each. The real test is whether scientists at NASA can predict extreme events such as hurricane paths well before they hit land, and so save human life and property. Or shorten aeroplane design times, as Suresh Shukla claims: "Boeing's Cray X1 is doing work, such as on new aeroplane design, in three hours that took three days in the past."

Let us indulge in a thought experiment to estimate sustained performance on vector (C-Type) and scalar (T-Type) systems. A vector-based system such as the NEC SX-8, with a 16Gflop/s peak per processor, delivers about 90% of peak on LINPACK, and we know from measured results on the Earth Simulator (E.S.), which uses the SX-6, that it delivers about 30-35% of peak as sustained performance on large applications. For the SX-8 this translates to 4.8 to 5.6Gflop/s sustained per processor; thus a 26-node (208-processor) system will deliver about 1 to 1.2Teraflop/s of sustained performance. Now consider the ECMWF IBM P690+ configuration with the Federation switch. It delivers about 60% of peak on LINPACK and around 5-8% sustained performance on a large application. Let's take ECMWF as a typical case study. They have 136 nodes (of 32 processors per shared memory node), arranged as 2 clusters of 2176 processors each. Each processor is rated at 7.6Gflop/s peak, giving a total of 33.075Teraflop/s peak performance. ECMWF contracted with IBM for a system with 1Teraflop/s of sustained performance. This translates to 3% of the peak performance of the above configuration, which I suspect is well within the performance range of the installed system. Deborah Salmond from ECMWF, in her presentation titled "Early experience with the IBM P690+ at ECMWF", reported a measured 6.89% sustained using 2048 processors on their IFS (T799 L91) model for a 10-day forecast. This translates to a measured 1.073Teraflop/s of sustained performance. For 4D-Var calculations (T799/T95/T255 L91) employing 128x8 (1024) processors, the efficiency is 5%.
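The arithmetic of this thought experiment is easy to reproduce. The Python sketch below is a minimal illustration using only the figures quoted above; the helper name sustained_tflops is mine, decimal units (1 Tflop/s = 1000 Gflop/s) are assumed, and the efficiencies are the article's estimates rather than new measurements.

```python
def sustained_tflops(processors, peak_gflops_per_cpu, efficiency):
    """Sustained Tflop/s = processors x peak per processor x fraction of peak."""
    return processors * peak_gflops_per_cpu * efficiency / 1000.0

# NEC SX-8: 26 nodes of 8 processors, 16 Gflop/s peak each, at the
# 30-35% sustained efficiency observed on the Earth Simulator.
sx8_cpus = 26 * 8
sx8_low = sustained_tflops(sx8_cpus, 16.0, 0.30)   # about 1.00 Tflop/s
sx8_high = sustained_tflops(sx8_cpus, 16.0, 0.35)  # about 1.16 Tflop/s

# ECMWF IBM P690+: 136 nodes of 32 processors, 7.6 Gflop/s peak each.
ibm_cpus = 136 * 32
ibm_peak = sustained_tflops(ibm_cpus, 7.6, 1.0)    # about 33.08 Tflop/s peak
contract_fraction = 1.0 / ibm_peak                 # 1 Tflop/s contract, about 3%

# Measured IFS run: 6.89% of peak on 2048 processors.
measured = sustained_tflops(2048, 7.6, 0.0689)     # about 1.07 Tflop/s

print(f"SX-8, {sx8_cpus} CPUs : {sx8_low:.2f} to {sx8_high:.2f} Tflop/s sustained")
print(f"P690+, {ibm_cpus} CPUs: {ibm_peak:.3f} Tflop/s peak")
print(f"1 Tflop/s contract   : {contract_fraction:.1%} of peak")
print(f"IFS on 2048 CPUs     : {measured:.3f} Tflop/s sustained")
```

With these inputs the 208-processor SX-8 estimate spans roughly 1.0 to 1.16Teraflop/s, and the measured IFS figure agrees with the quoted 1.073Teraflop/s to within rounding of the 6.89% efficiency.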

The SX-8 is likely to perform even better than the above thought experiment suggests, because although the vector speed is double that of the SX-6, other features such as data transfer rates between memory and CPUs have increased fourfold. The NEC presenter claimed that the SX-8 delivers more than three times the sustained performance of the SX-6. Using this in our exercise, one can calculate that around 22-25 nodes, no more than 200 SX-8 CPUs, are needed to deliver 1Teraflop/s when running IFS. This is roughly equivalent to 68 nodes (2176 processors) of the IBM P690+ when comparing sustained performance.

Remember this is just a thought experiment, as I have not seen any results from the SX-8. As for the cost/performance of these two systems, the combinations are endless. The price charged by vendors is very flexible, depending on marketing decisions, and, barring political interference, customers should run their own benchmarks, include factors such as electric power costs and ease of use to arrive at a total cost of ownership, and make their own informed choice.

The next article will report on the progress in providing a unified Earth System Model (ESM) and efforts for standardising interfaces to allow coupling of various models.

(Brands and names are the property of their respective owners) Copyright: Christopher Lazou, HiPerCom Consultants, Ltd., UK. November 2004.

