
Features:
ECMWF WORKSHOP -- TERAFLOP/S CHALLENGES FOR NWP, METEOROLOGY
by Christopher Lazou, HiPerCom Consultants
"...One tiny fluctuation of a Top500 list position may forever change the
future course of a computer vendor's fortunes...." (In the spirit of Lorenz,
1963)
ECMWF, Reading, 25-29 October 2004: Over 150 meteorology experts, computer
practitioners and vendor representatives spent a week exchanging experiences
about the latest results in meteorology and the computer infrastructure that
goes with it. This excellent, relatively small and friendly workshop provided a
forum for the creme-de-la-creme of HPC users. What followed was a tour de
force in meteorological and computing techniques by active practitioners
striving to exploit the latest HPC technology to refine and improve their
weather and climate forecasting models. They presented today's practical
reality, followed by their aspirations and vision for Teraflop/s computing and
beyond. To give some idea of the scale, there were 56 presentations and a
discussion panel. Most of these presentations were by experts from major
meteorological centres in the USA, Canada, Europe, India, Japan, Australia
and China. The rest were from HPC vendors (CRAY, FUJITSU, HP, IBM, INTEL, NEC,
QUADRICS, SGI and TERASCALE). Friday was devoted to a brainstorming debate,
hoping to identify solutions to the many pressing needs of this increasingly
important field of science.
With almost every presentation meriting an article of its own, the selection
of material in this article is technology-biased and, although somewhat
arbitrary, hopefully captures the essence of what was presented and also
highlights some news items.
In recent years meteorology has evolved from its esoteric weather prediction
role and become a high profile e-business, with enormous commercial clout. With
climate change manifesting itself in extreme weather patterns, be it droughts,
floods or more destructive hurricanes, the economic stakes are high.
The field of meteorology can marshal large budgets needed for data collection,
assimilation, and the purchase of large-scale computer systems for numerical
modelling, so naturally computer vendors are keen to participate in the
deliberations and offer previews of their future products.
In weather and climate numerical modelling, the debate on price/performance of
commodity systems versus special-purpose systems is somewhat irrelevant. When
a hurricane is heading for land, capability computing at the level required to
deliver advance predictions in time to implement protection procedures is the
only measure worth considering. The imperative is for the ocean model to run
on a fine mesh, of hundreds of metres say, not tens of kilometres, and
deliver results on time.
As Suresh Shukla, Boeing, so aptly put it at the recent IDC European User
Forum: "HPC industry is focusing more on price and cost than on capability and
cost-benefit. Boeing understands the benefits of HPC and in the future needs
balanced systems with latency and bandwidth improvements of at least two
orders of magnitude; efficient implementation of parallelism in a reliable
manner; applications able to exploit hardware enhancements; and reasonable
quality enhancements per cost to justify replacing experimentation with
simulation".
Toshiyuki Furui, of NEC HPC Japan, gave an informative presentation covering
the whole range of NEC's product lines, from PC clusters, IPF-blades and
Itanium2-based TX7 clusters to the vector parallel SX series. The diamond product in
NEC's pack is the new NEC SX-8, with the potential to deliver the world's
highest computing performance, 65Tflop/s when configured using a total of
4,096 processors in 512 nodes. The 16Gflop/s SX-8 vector processor (vector and
scalar units) is implemented on a single chip using 90-nanometer copper
interconnects. Pipelines of the vector unit operate at a 2GHz clock frequency,
double the speed of NEC's previous system, the SX-6. In addition the vector
square root is implemented in hardware enhancing overall sustained
performance; SQRT on the SX-8 is six times faster than its software
counterpart on the SX-6. The SX-8 footprint is reduced by 25% and power
consumption by 50% compared with previous models. This was achieved by
applying high-density packaging technology, with processor(s) and memory
implemented on a single module. The 262TB/s peak data transfer rate between
memory and CPU(s) and a memory capacity of up to 64TB deliver an overall
improvement of more than three times over the SX-6 and make this system well
suited to the role of high-productivity workhorse. A maximum configuration
should be capable of routinely delivering 20 to 25Tflop/s of sustained
performance for large-scale applications. This is likely to be an order of
magnitude more powerful than commodity-based systems, i.e. it is a real
supercomputer. Incidentally, each vector processor also has a 2Gflop/s scalar
processor on the same chip, giving an additional 8Teraflop/s of scalar
performance in the maximum configuration.
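As a rough check on the SX-8 figures quoted above, the aggregate peak numbers can be reproduced from the per-processor rates. The short Python sketch below does only that arithmetic; the function and variable names are illustrative, and the 8 processors per node is simply 4,096 processors divided by 512 nodes.

```python
# Figures quoted in the article for a maximum SX-8 configuration.
NODES = 512
CPUS_PER_NODE = 8            # derived: 4,096 processors / 512 nodes
VECTOR_PEAK_GFLOPS = 16.0    # per-processor vector peak
SCALAR_PEAK_GFLOPS = 2.0     # per-processor scalar peak


def aggregate_tflops(per_cpu_gflops, cpus):
    """Aggregate rate in Tflop/s from a per-CPU rate in Gflop/s."""
    return per_cpu_gflops * cpus / 1000.0


cpus = NODES * CPUS_PER_NODE
vector_peak = aggregate_tflops(VECTOR_PEAK_GFLOPS, cpus)
scalar_peak = aggregate_tflops(SCALAR_PEAK_GFLOPS, cpus)

# The quoted 20 to 25 Tflop/s sustained, expressed as a fraction of vector peak.
sustained_fraction = [s / vector_peak for s in (20.0, 25.0)]

print(f"{cpus} CPUs: vector peak {vector_peak:.1f} Tflop/s, "
      f"scalar peak {scalar_peak:.1f} Tflop/s")
print("sustained fraction of vector peak: "
      + " to ".join(f"{f:.0%}" for f in sustained_fraction))
```

On these inputs the quoted 20 to 25Tflop/s sustained corresponds to roughly a third of the vector peak.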
Of course, to support the whole range of applications a range of systems is
needed, including scalar ones. Listening to the vendor presentations, it
became clear to me that in the next three to five years a number of
supercomputers will have heterogeneous architectures, with both scalar and
vector processors tightly integrated, with low latency, high memory bandwidth
and a communications switch. These will reside within a single system,
enabling different sections of an application code to run on different
components.
In his vendor presentation, Per Nyberg briefly described Cray's current
product lines, namely the Cray X1, XT-3 and XD1 and then continued in the same
vein as the NEC presenter, highlighting productivity. Nyberg went on to
present the future Cray Rainier system, Cray's version of integrated
scalar/vector computing. Rainier is an integrated single system containing
both successor X1 vector processors and successor AMD Opteron scalar
processors integrated with a common hardware infrastructure (cabinet, power,
cooling and so on) and common high speed network. It will have a common global
address space, a common Operating System, common storage and administration.
According to Nyberg, first customer shipment of this heterogeneous compute
system is expected in 2006.
Don Grice described the two-pronged development effort at IBM, namely the
Power 5/6 and the Blue Gene/L. He went on to say that IBM has listened to the
scientific community and that its new CPU designs are science-driven. He
reviewed the physical limits of device technology and explained why scaling
breaks down. His explanation went like this: Consider the gate oxide in a CMOS
transistor (the smallest dimension today). Assume only one-atom-high "defects"
on each surrounding silicon layer. For a modern "scaled" oxide, 6 atoms thick,
33% variability is induced. The bad news is that single-atom defects can cause
local current leakage 10-100X higher than average. The really bad news is that
such "non-statistical behaviours" are appearing elsewhere in technology. He
went on to say: "Integration, the creation of systems rather than just 'chips',
will become the means by which past trajectories (Moore's Law) for computing
performance are maintained. Only the simultaneous optimisation of materials,
devices, circuits, cores, chips, system architecture, and system software
provides an effective means to optimise for both performance and power". This
means that future IBM products will have not only homogeneous symmetric
multi-cores with specialised instructions, but also heterogeneous cores with
specialised architectures. In short, expect asymmetric cores, with scalar,
vector (ViVA) and graphics processors for image recognition. These are likely
to be first introduced in games consoles.
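The 33% figure in the gate-oxide argument above can be read as simple arithmetic: two surrounding silicon interfaces, each off by one atomic layer, against an oxide 6 atoms thick. The sketch below makes that reading explicit. The two-interface interpretation and the thicknesses other than 6 atoms are assumptions added for illustration, not figures from the presentation.

```python
def thickness_variability(oxide_atoms, defect_atoms_per_interface=1, interfaces=2):
    """Fractional thickness variation when each interface may be off
    by the given number of atomic layers (assumed model, see lead-in)."""
    return defect_atoms_per_interface * interfaces / oxide_atoms


# Only the 6-atom case is from the article; the others show the trend
# as the oxide is scaled thinner or left thicker.
for atoms in (12, 8, 6, 4):
    print(f"{atoms:2d} atoms thick: {thickness_variability(atoms):.0%} variability")
```

Under this reading the relative variation grows as the oxide thins, which is the sense in which scaling "breaks down".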
On the Itanium front, Herbert Cornelius gave a brief overview of Intel's plans,
culminating in the Montecito, a multi-core, multi-threading chip with 1.7
billion transistors. He mentioned in passing that most of the world's fastest
systems use Intel chips. Both Gerardo Cisneros from SGI and Herbert Cornelius
from Intel described the benefits of clusters of SMP systems and briefly
described the NASA project Columbia, the 10,240-processor Intel Itanium2
system consisting of 20 SGI Altix nodes of 512 processors each. The real test
is whether scientists at NASA can predict extreme events such as hurricane
paths far enough ahead of landfall to save human life and property. Or shorten
aeroplane design times, as Suresh Shukla claims: "Boeing's Cray X1 is doing
work, such as on new aeroplane design, in three hours that took three days in
the past".
Let us indulge in a thought experiment to estimate sustained performance on
vector (C-Type) and scalar (T-Type) systems. A vector-based system such as
the NEC SX-8, with a 16Gflop/s peak, achieves about 90% of peak on LINPACK,
and we know from measured results on the Earth Simulator, which uses the SX-6,
that it delivers about 30-35% sustained performance on large applications.
This translates to 4.8 to 5.6Gflop/s sustained per processor; thus a 26-node
(208-processor) system would deliver roughly 1 to 1.2Teraflop/s of sustained
performance. Now consider the ECMWF IBM P690+ configuration with the
Federation switch. It achieves about 60% of peak on LINPACK and around 5-8%
sustained performance on a large application. Let's take ECMWF as a typical
case study. They have 136 nodes (of 32 processors per shared memory node),
arranged in 2 clusters of 2176 processors each. Each processor is rated at
7.6Gflop/s peak, a total of 33.075Teraflop/s peak performance. ECMWF
contracted with IBM for a system with 1Teraflop/s of sustained performance.
This translates to 3% of the peak performance of the above configuration,
which I suspect is well within the performance range of the installed system.
Deborah Salmond from ECMWF, in her presentation titled "Early experience with
the IBM P690+ at ECMWF", reported a measured 6.89% sustained using 2048
processors on their IFS (T799 L91) model for a 10-day forecast. This
translates to a measured 1.073Teraflop/s of sustained performance. For 4D-Var
calculations (T799/T95/T255 L91), employing 128x8 (1024) processors, the
efficiency is 5%.
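The arithmetic of this thought experiment is easy to reproduce. The following Python sketch multiplies processor count, per-processor peak and sustained efficiency, using only the figures quoted above; the helper name `sustained_tflops` is illustrative, and 8 processors per SX-8 node is inferred from 26 nodes holding 208 processors.

```python
def sustained_tflops(cpus, peak_gflops_per_cpu, efficiency):
    """Sustained rate in Tflop/s = CPUs x per-CPU peak (Gflop/s) x efficiency."""
    return cpus * peak_gflops_per_cpu * efficiency / 1000.0


# Vector case: 26 SX-8 nodes of 8 CPUs, 16 Gflop/s peak, 30-35% sustained.
sx8_cpus = 26 * 8
sx8_low = sustained_tflops(sx8_cpus, 16.0, 0.30)
sx8_high = sustained_tflops(sx8_cpus, 16.0, 0.35)

# Scalar case: ECMWF's IBM P690+, 136 nodes of 32 CPUs, 7.6 Gflop/s peak.
p690_cpus = 136 * 32
p690_peak = sustained_tflops(p690_cpus, 7.6, 1.0)
contract_fraction = 1.0 / p690_peak          # 1 Tflop/s contracted
ifs_measured = sustained_tflops(2048, 7.6, 0.0689)

print(f"SX-8, {sx8_cpus} CPUs: {sx8_low:.2f} to {sx8_high:.2f} Tflop/s sustained")
print(f"P690+, {p690_cpus} CPUs: {p690_peak:.3f} Tflop/s peak")
print(f"1 Tflop/s contracted = {contract_fraction:.1%} of peak")
print(f"IFS on 2048 CPUs at 6.89%: {ifs_measured:.3f} Tflop/s")
```

With these inputs the 208-processor SX-8 estimate comes out at about 1.0 to 1.16Tflop/s, and the P690+ peak, contract fraction and IFS figures agree with the values in the text to within rounding.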
The SX-8 is likely to perform even better than the above thought experiment
suggests, because although the vector speed is only twice that of the SX-6,
other improvements, such as data transfer rates between memory and CPUs, are
fourfold. The NEC presenter claimed that the SX-8 delivers more than three
times the sustained performance of the SX-6. Using this in our exercise, one
can calculate that around 22-25 nodes, fewer than 200 SX-8 CPUs, are needed to
deliver 1Teraflop/s when running IFS. This is roughly equivalent to 68 nodes
(2176 processors) of the IBM P690+ when comparing sustained performance.
Remember this is just a thought experiment, as I have not seen any results
from the SX-8. As for cost/performance of these two systems, the combinations
are endless. The price charged by vendors is very flexible, depending on
marketing decisions, and, barring political interference, customers should run
their own benchmarks, include factors such as electric power costs and ease of
use to arrive at a total cost of ownership, and make their own informed choice.
The next article will report on the progress in providing a unified Earth
System Model (ESM) and efforts for standardising interfaces to allow coupling
of various models.
(Brands and names are the property of their respective owners) Copyright:
Christopher Lazou, HiPerCom Consultants, Ltd., UK. November 2004.