
Features:
NEC SX-6 - TWO TIMES MORE COST EFFICIENT THAN IBM P4?
By Christopher Lazou
September 7-11, 2003: As stated in my previous article, some 90 meteorologists
and HPC experts from 15 countries and 4 continents attended the bi-annual
CAS2K3 workshop on the use of HPC in meteorology, held at the idyllic Imperial
Palace Hotel, Annecy, France, organised by the National Centre for Atmospheric
Research (NCAR), USA. There were 43 presentations in 4 days and a live
demonstration of the Grid's enabling potential for international collaboration
within the community of climate system modelling (CCSM). Most presenters came
from sites in the USA with large IBM P3/4 systems, while the European
contingent included a strong representation from sites with large NEC SX-6
systems.
The first article concentrated on Earth System Modelling and how researchers
try to do the best with available computer resources to predict climate
change. This second article highlights a few of the many system performance
issues raised by presentations given at the CAS2K3 workshop.
Dr. Richard Loft from NCAR, a Gordon Bell 2001 award winner, opened the
debate by addressing the supercomputing challenges at NCAR, reviewing
supercomputing trends and constraints. He looked at microprocessor efficiency
and assessed what improvements are possible using the IBM P690 as a compute
engine. He then showed some RISC/Vector cluster comparisons and came up with
some stark conclusions concerning the myths surrounding commodity chip
processors.
Climate scientists project a need for 150 times more computing power over the
next five years. Doubling the horizontal resolution increases computational
costs eightfold. Introducing super-parameterization of moisture processes
would increase computing power needs and costs dramatically. He then went on
to detail how cooling plant and electrical power as well as computer facility
space act as limiting factors at NCAR. For example, NCAR's cooling capacity is
450 tons, requiring 1.58MW electrical power of which 1.2MW is used by the
current set of computers. Another major growth area is data generation and the
cost of managing and handling data. NCAR has 1.3PetaBytes of data, increasing
at the rate of 3TeraBytes a day. Both data size and overall costs of handling
are on the increase, setting an alarming trend.
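The eightfold figure is consistent with a common rule of thumb, which is an assumption here rather than something spelled out in the talk: halving the grid spacing doubles the number of points in each of the two horizontal directions and roughly halves the time step. A minimal Python sketch of that scaling follows; the helper name cost_factor is hypothetical.

```python
import math

def cost_factor(refinement, horizontal_dims=2, time_step_scales=True):
    """Relative compute cost after refining horizontal grid spacing.

    Assumes cost is proportional to (grid points) x (time steps) and that
    the time step shrinks in proportion to the grid spacing. Both are
    rule-of-thumb assumptions, not figures from the presentation.
    """
    exponent = horizontal_dims + (1 if time_step_scales else 0)
    return refinement ** exponent

doubling = cost_factor(2)  # 2 * 2 * 2 = 8

# How many resolution doublings would a 150-fold increase in power cover?
doublings_for_150x = math.log(150) / math.log(doubling)

print(f"doubling horizontal resolution costs {doubling}x")
print(f"150x more power covers about {doublings_for_150x:.1f} doublings")
```

Under these assumptions, the projected 150-fold growth in computing power buys only a little over two doublings of horizontal resolution, before any extra physics is added.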
The computing power at NCAR consists of a 1024 IBM 690 (1.3GHz) Power 4
cluster, rated at 5.2Teraflop/s peak performance, plus a 283 IBM (375MHz)
Power 3 cluster of 1.7Teraflop/s peak.
NCAR analyzed their workload performance and found that newer IBM systems,
i.e. the 8-way IBM P4 (at 4.1% efficiency), are less efficient than older
ones, i.e. the IBM P3 (at 5.7% efficiency). This probably explains why NERSC
chose to upgrade using IBM P3 processors instead of the newer IBM P4. NCAR
also found that larger 32-way P4 nodes are more efficient, delivering 4.5%.
When they delved under the covers it was found that applications in their
workload were memory bandwidth limited. A simple bandwidth model predicts 5.5%
efficiency, but the imbalance of the colony switch with the P4 processor speed
reduces efficiency further. Latency considerations should allow P processors
to scale as log(P). In actuality the P4 scales linearly, which is poor. The
NCAR people then did some nifty hand tuning of 3-D FFT kernels and managed to
improve their code efficiency from 3.5% to 4.7% when using 64 IBM P4
processors. More detailed charts were presented dealing with various internal
aspects of the codes within their workload. What was observed is that the
efficiency ranged between 4.1% and 4.5% of peak and the maximum sustained
performance on the NCAR workload was 213.5 Gigaflop/s out of 5.234Teraflop/s
peak.
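Efficiency here means sustained performance as a fraction of theoretical peak. As a quick check of the quoted figures, the short Python sketch below recomputes it; the function and variable names are illustrative only.

```python
def efficiency_percent(sustained_gflops, peak_gflops):
    """Sustained performance as a percentage of theoretical peak."""
    return 100.0 * sustained_gflops / peak_gflops

# Figures quoted above for the NCAR workload on the IBM P4 cluster.
sustained = 213.5   # Gigaflop/s, maximum sustained on the workload
peak = 5234.0       # Gigaflop/s, i.e. 5.234 Teraflop/s

ncar_p4 = efficiency_percent(sustained, peak)

# Relative gain from hand tuning the 3-D FFT kernels (3.5% -> 4.7%).
fft_tuning_gain = 4.7 / 3.5

print(f"NCAR P4 workload efficiency: {ncar_p4:.1f}% of peak")
print(f"FFT hand tuning improved efficiency by a factor of {fft_tuning_gain:.2f}")
```

The result sits at the lower end of the 4.1% to 4.5% range reported in the talk.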
Rich Loft went on to calculate the estimated peak price performance on the IBM
P4 system at NCAR as $2.6 per Megaflop/s and sustained price performance as
$59 per Megaflop/s. The sustained electrical power performance was
0.7Gigaflop/s per kilowatt. He then compared it with the Earth Simulator using
Dr. Sato's results and came up with an estimated peak price performance of
$8.5 per Megaflop/s and estimated sustained price performance of $28 per
Megaflop/s. The sustained electrical power performance was 1.525Gigaflop/s per
kilowatt.
He concluded: "at this point in time the vector based Earth Simulator (NEC SX-
6) appears to be twice more cost efficient (dollars/Gigaflop/s) in both price
and electrical power performance than the IBM P690 P4, when using sustained
performance as a measure". (This is before the 30% price/performance
improvement to the SX-6 announced on the 18th September. See NEC Press Release
in this publication).
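The "twice" in that conclusion can be checked directly from the numbers quoted above. The Python sketch below does the arithmetic; the dictionary layout and names are illustrative, and the implied efficiency assumes that the peak and sustained prices refer to the same system cost.

```python
# Figures quoted in Dr. Loft's talk. Prices are dollars per Megaflop/s;
# power efficiency is sustained Gigaflop/s per kilowatt.
systems = {
    "IBM P690 P4": {
        "peak_usd_per_mflops": 2.6,
        "sustained_usd_per_mflops": 59.0,
        "sustained_gflops_per_kw": 0.7,
    },
    "Earth Simulator": {
        "peak_usd_per_mflops": 8.5,
        "sustained_usd_per_mflops": 28.0,
        "sustained_gflops_per_kw": 1.525,
    },
}

def implied_efficiency(system):
    """Fraction of peak implied by the ratio of peak to sustained price."""
    return system["peak_usd_per_mflops"] / system["sustained_usd_per_mflops"]

p4 = systems["IBM P690 P4"]
es = systems["Earth Simulator"]

# How much more each sustained Megaflop/s costs on the P4.
price_ratio = p4["sustained_usd_per_mflops"] / es["sustained_usd_per_mflops"]
# How much more sustained work the Earth Simulator does per kilowatt.
power_ratio = es["sustained_gflops_per_kw"] / p4["sustained_gflops_per_kw"]

for name, system in systems.items():
    pct = 100 * implied_efficiency(system)
    print(f"{name}: implied efficiency {pct:.1f}% of peak")
print(f"sustained price advantage of the Earth Simulator: {price_ratio:.2f}x")
print(f"sustained power advantage of the Earth Simulator: {power_ratio:.2f}x")
```

Both ratios come out a little above two. The P4 looks more than three times cheaper on peak price, and the advantage reverses only because the implied sustained efficiency is about 4% on the P4 against about 30% on the Earth Simulator.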
He explained that this is because the NCAR workload is bandwidth limited and
the RISC cluster (IBM P4) interconnect is not great. He also indicated that
infrastructure (power, cooling, space) is becoming a critical constraint for
NCAR. The IBM contract at NCAR hit the buffers and had to be renegotiated, to
avoid interfering with the IPCC work schedules. For this reason another 448
(1.3GHz) IBM P4 processors, an additional 2.3Tflop/s peak, are being installed
(September'03) and the Federation switch is now delayed to 2H04. Thus, the
myth that commodity chip computers are cheaper has once again been debunked.
The message that capacity computers such as the IBM P4+ systems are
unsuitable for high resolution Earth System Modelling was reinforced by many
of the speakers. For example, as stated in my last article, the main thrust
of the talks by Dr. Bert Semtner, Naval Postgraduate School, Monterey, and
Dr. Bill Collins, NCAR, was that only systems of multiple Teraflop/s sustained
performance can be used to project climatic conditions out for many centuries
with high resolution Earth System Models. In one case, using a model of
~6.5Km spacing over the ocean, on a 500-processor IBM P3, it took eight days
to simulate fifteen years. This same model simulated 300 years in just eight
hours on 960 processors of the Earth Simulator (NEC SX-6).
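To put the two runs on a common footing, one can convert each to simulated years per wall-clock day. The Python sketch below takes the quoted figures at face value and assumes the two runs are directly comparable; the function name is illustrative.

```python
def simulated_years_per_day(simulated_years, wall_clock_hours):
    """Model throughput in simulated years per wall-clock day."""
    return simulated_years * 24.0 / wall_clock_hours

# 500-processor IBM P3: fifteen years simulated in eight days.
p3_rate = simulated_years_per_day(15, 8 * 24)
# 960 Earth Simulator processors: 300 years simulated in eight hours.
es_rate = simulated_years_per_day(300, 8)

speedup = es_rate / p3_rate
# Normalise by processor count to compare individual processors.
per_processor_speedup = speedup * 500 / 960

print(f"IBM P3:          {p3_rate:.3f} simulated years per day")
print(f"Earth Simulator: {es_rate:.0f} simulated years per day")
print(f"overall speedup: {speedup:.0f}x, per processor: {per_processor_speedup:.0f}x")
```

On these figures, a century-scale run that takes hours on the Earth Simulator would occupy the IBM P3 for months.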
As far as silicon use is concerned, the size of the super-scalar processor chip on the IBM
P4 and the vector NEC SX-6 processor chip is about the same, but the IBM P4
has 3 times higher transistor density (170M) as compared to the vector SX-6
chip (57M). This is an instance where more density (with its inherent
complexity) means less value.
As reported in my article from CAS2K3 last week, in ESM applications the IBM
P4 falls short in capability and, from the results reported above, is also poor
in cost/performance measures. Utilizing thousands of IBM type processors would
not help, according to results from the presentation by Patrick Worley, Oak
Ridge National Laboratory. Scaling would act as a major limiting constraint.
This is where commodity capacity chip systems are getting problematic. The
moral is, keep the architecture simple, with fewer things to get in the way
and impede application performance.
To temper what is said above, Bill Kramer from NERSC reported that for some
applications the IBM P3 delivers good efficiency. Below is the list of
examples given:
1. Tera-scale simulation of Supernova explosions: 35% efficiency on 2048 CPUs.
2. Accelerator science and simulations: 25% efficiency on 4096 CPUs.
3. Electromagnetic wave-plasma interactions: 68% efficiency on 2048 CPUs.
4. Quantum Chromodynamics at high temperatures: 13% efficiency on 1024 CPUs.
5. Cosmic Microwave background data analysis: 50% efficiency on 4096 CPUs.
There are, however, no climate codes on his list yet.
Most sites represented at CAS2K3 were using two contrasting types of platform,
namely the vector parallel NEC SX and the IBM scalar P3/4 for weather/climate
predictions and data management. A few from SGI, Fujitsu and Cray were also
present.
The two main vendors have contrasting business models. The NEC product line as
presented by Dr. Joerg Stadler consists of a pyramid with the powerful SX
series at its apex, suitable for demanding applications such as high
resolution ESM. The NEC TX7 (based on the Intel Itanium line) is offered for
servers used for less demanding applications, including data management, as
the middle of the pyramid. Clusters of PCs are offered for less demanding
applications as a broad base. This allows NEC to deliver suitable systems to
match the computational needs of the whole range of applications. Joerg then
described how NEC HPC Europe has become a total solutions provider. He used
the example of the recent successful installation of the DKRZ system. The
solution offered by NEC HPC Europe was to take control of the total system
integration process and deliver the service within the agreed budget. This
involved using its long-established hardware and software engineering skills
to select, install and maintain the total system, for delivering the services
to fulfil the DKRZ mission. This included planning support, site
specification, capacity analysis, air conditioning, machine and cabling
layout, manpower requirements for operation, security strategy.
NEC used its own products: the NEC SX series high compute servers for
numerical calculations and the AsAmA servers for data handling. It also
incorporated elements from other hardware and software vendors, including
StorageTek, the Legato hierarchical file system (GFS) and the ORACLE database
running on top of Linux, to deliver an optimal solution.
In contrast the IBM Power 4 product line is used across most application
domains. IBM argues that this approach allows it to leverage developments from
the large commercial side of its business to benefit the much smaller
scientific/technical market.
Dr. Jamshed Mirza, Chief Systems Architect from IBM, in his presentation
titled: "The path to Petaflop/s", acknowledged that Earth System Modelling
required higher memory bandwidth than currently offered in the IBM P4+ product
line. He said the NEC SX-6 has a memory bandwidth ratio of 4Bytes/Flop, the
Cray X1 3Bytes/Flop and the IBM P4 1.3Bytes/Flop. He then claimed that super-
scalar architectures retrieve results at close to vector rates out of cache.
The primary speed differentiation between mainstream RISC super-scalar and
vector processors is the memory subsystem. IBM can increase memory bandwidth,
but it does not intend to do so. To achieve a vector type balance, both the
memory subsystem and the communication switch have to be improved to address
bandwidth and latency issues. This is an expensive exercise and IBM was not
inclined to go down this road, because it did not make commercial sense when
the cost/benefit integral was mapped across the whole of its customer base.
So, when push comes to shove, the benefits to the scientific/technical
community from the IBM business model are illusory. This was poignantly
illustrated in the NCAR presentation described above.
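The Bytes/Flop figures can be read as an upper bound on efficiency for code that streams its data from main memory. The Python sketch below is a rough roofline-style estimate, not the bandwidth model NCAR used: it ignores cache reuse, latency and the interconnect, and the demand figure CODE_BYTES_PER_FLOP is a hypothetical value chosen only for illustration.

```python
# Machine balance quoted by Dr. Mirza: Bytes of memory traffic per Flop.
machine_balance = {"NEC SX-6": 4.0, "Cray X1": 3.0, "IBM P4": 1.3}

def bandwidth_bound_fraction(machine_bytes_per_flop, code_bytes_per_flop):
    """Upper bound on the fraction of peak for memory-streaming code.

    If the code needs more Bytes per Flop than the machine can supply,
    the processor waits on memory and the bound drops proportionally.
    """
    return min(1.0, machine_bytes_per_flop / code_bytes_per_flop)

# Hypothetical demand of a bandwidth-hungry kernel (an assumption, not
# a measurement from any of the codes discussed at CAS2K3).
CODE_BYTES_PER_FLOP = 4.0

bounds = {
    name: bandwidth_bound_fraction(balance, CODE_BYTES_PER_FLOP)
    for name, balance in machine_balance.items()
}

for name, bound in bounds.items():
    print(f"{name}: at most {100 * bound:.0f}% of peak")
```

The measured efficiencies reported by NCAR are far below such a bound, which is in line with the point that the switch and latency take a further toll beyond raw memory bandwidth.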
The bottom line is that high resolution Earth System Modelling is severely
constrained (some say handicapped) on IBM P4 systems, as it takes far too long
to simulate centuries or thousands of years ahead. In addition, from the NCAR
results, the IBM solution also appears to be about twice as expensive compared
to using the Earth Simulator, which is based on the NEC SX-6.
That U.S. ESM research has ended up relying on IBM systems is a testament to
how political interference, banning the import of vector parallel systems from
Japan, can be unhelpful, distorting the market. Although this policy was
recently formally revoked, it still persists in practice. With the NEC path
more or less blocked, the hope for the U.S. ESM community to remedy this
situation is to buy parallel vector systems, possibly the newly developed Cray
X1 (with some 260 units sold). A study done at the Army High Performance
Computing Research Centre, comparing a 1024-CPU Cray X1 with a 5760-CPU
(2.8GHz Pentium 4, IA-32) cluster, concludes that the Cray X1, at
$58.65/Mflop/s, compares favourably on a price/performance basis with the
least expensive cluster over a 5-year life cycle (i.e. is cheaper), without
any capability benefits taken into consideration. Benchmark measurements from
work done at Oak Ridge also show the Cray X1 in a good light, performing well
with efficiencies similar to the Fujitsu VPP5000 for up to 64 processors. Its
memory bandwidth (24GB/s), as measured on Stream Triad, appears to be slightly
lower than that of the NEC SX-6 (31.8GB/s) and about an order of magnitude
higher than that of the IBM P4, confirming the Bytes/Flop analysis by Jamshed
Mirza above. More important than cost/Megaflop/s, the Cray X1 would provide
capability, enabling them to perform high-resolution ESM simulations for long-
term predictions.
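For readers unfamiliar with the Stream Triad measurement mentioned above, the kernel is a[i] = b[i] + q * c[i], and the reported bandwidth is the number of bytes moved divided by the elapsed time. The Python sketch below shows only the definition. It is not the STREAM benchmark itself, the array size is arbitrary, and plain Python lists will give numbers that say nothing about real memory bandwidth.

```python
import time

def triad(a, b, c, q):
    """STREAM Triad kernel: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]

def triad_bytes(n, word_bytes=8):
    """Bytes counted for Triad: two loads and one store per element."""
    return 3 * word_bytes * n

n = 100_000  # arbitrary size, chosen for illustration only
a, b, c = [0.0] * n, [1.0] * n, [2.0] * n

start = time.perf_counter()
triad(a, b, c, 3.0)
elapsed = time.perf_counter() - start

# Bandwidth as the benchmark defines it: bytes moved per second.
gb_per_s = triad_bytes(n) / elapsed / 1e9
print(f"moved {triad_bytes(n)} bytes in {elapsed:.4f}s ({gb_per_s:.3f} GB/s)")
```

With 8-byte words the kernel moves 24 bytes for every two floating point operations, which is why it exposes the memory subsystem rather than the arithmetic units.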
The Canadian weather centre reported that NEC lost their site to IBM during
the recent procurement. About the same time the UK Met Office and the
Australian Bureau of Meteorology chose NEC systems. I asked if NEC Japan was
involved in bidding in Canada or whether it was Cray, under the NEC/Cray
agreement for Cray to sell NEC systems in North America. The presenter from
Canada refused to answer, the only unanswered question at CAS2K3. I leave it
to the reader to speculate. The answer may of course be present in the
question.
The views expressed in this article are those of the author alone and do
not necessarily reflect those of HPCwire, its publisher, or its staff.
(Brands and names are the property of their respective owners) Copyright:
Christopher Lazou, HiPerCom Consultants, Ltd., UK. Email:
Chris@lazou.demon.co.uk September 2003.