
Features:
TAKING MEASURE OF AMD OPTERON, INTEL ITANIUM2, IBM P690 P4+
by Christopher Lazou
About 200 people attended the 14th machine evaluation workshop at EPSRC
Daresbury Laboratories, UK. As in previous years, of great interest were the
Daresbury Benchmark results and “tailored system” solutions, where clusters
can be cobbled from favoured commodity chips and interconnect networks.
There were exhibitions and presentations from seventeen companies, keen to
promote their readymade products including those based on the Intel Itanium2
processor. Examples of these include the HP Superdome, the SGI Altix 3000 and
the Bull NovaScale 4000. AMD Opteron and Intel Xeon systems had a strong
presence, and various models of Blade products from IBM were on display
and indeed available for demonstrations to visitors.
This established two-day workshop provided a plethora of distributed memory
benchmark results, compiled by Martyn Guest and his team from Daresbury, from
more than 200 systems including the latest products from vendors using their
latest chips. The Daresbury benchmark suite, used to obtain these results,
consists of many computational chemistry kernel codes, molecular dynamics,
Quantum Monte Carlo, a Jacobi solver, STREAM (which measures sustainable
memory bandwidth in HPC, here via the TRIAD kernel), ab initio molecular
electronic structure codes and the DL_POLY parallel molecular dynamics
benchmark. The results from
SPECfp2000 and other well-known benchmarks were also presented.
Martyn Guest described their benchmark findings, comparing the performance of
many available PC-based systems. Looking at SPECfp_Rate, one finds that it
differs from the SPECfp value. While on the SPECfp value both the HP Superdome
and SGI Altix 3000 (1.5GHz) Itanium 2 systems are 30% faster than the IBM P690
P4+ (1.7GHz), on SPECfp_Rate the performance of the SGI Altix 3000
rises to 75% higher than the IBM P4+ (1.7GHz). This improvement in
performance is presumably due to the SGI chipset, which provides the
integration of the memory subsystem as a shared single image and the
inter-processor communication network. The IBM P4+ outperforms both the AMD
Opteron 148/2.2GHz and the Dell PW360/3.2GHz P4 Extreme by 11.66% on
SPECfp2000, but this reduces to 6.4% when SPECfp_Rate is used as the metric.
The benchmark results compiled by the Daresbury team indicate that the Intel
Itanium2 systems are the bright stars of today, faring well as far as
performance is concerned compared to other super-scalar chips. Performance of
a particular chip tends to vary on different benchmarks, but one can see a
pattern emerging. To give you a flavour, the summary index relative to the IBM
p-series 690/Power 4 (1.3GHz) for the Matrix-97, Chemistry Kernels, GAMESS-UK
and DL_POLY benchmarks was found to be as follows:
The IBM P4/1.3GHz (100), the HP Alpha ES45/1.25GHz (96), the Pentium 4
Xeon/3066 (108), the AMD Opteron 848/2.2GHz (128), the IBM P4+/1.7GHz (133),
the SGI Altix3700 Itanium2/1.3GHz (140), Intel Tiger Itanium2/1.5GHz (159),
and HP RX5670 Itanium2/1.5GHz-H (171).
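As a rough illustration of how such a relative index can be read, the Python sketch below re-expresses the figures quoted above against a different baseline system. The values are those listed in the text; the helper names `relative_to` and `ranked` are hypothetical, introduced only for this example, and are not part of the Daresbury suite.

```python
# Summary index values as quoted in the text (IBM P4/1.3GHz = 100).
SUMMARY_INDEX = {
    "IBM P4/1.3GHz": 100,
    "HP Alpha ES45/1.25GHz": 96,
    "Pentium 4 Xeon/3066": 108,
    "AMD Opteron 848/2.2GHz": 128,
    "IBM P4+/1.7GHz": 133,
    "SGI Altix3700 Itanium2/1.3GHz": 140,
    "Intel Tiger Itanium2/1.5GHz": 159,
    "HP RX5670 Itanium2/1.5GHz-H": 171,
}


def relative_to(baseline, index=SUMMARY_INDEX):
    """Return each system's index as a ratio to the chosen baseline system."""
    base = index[baseline]
    return {name: value / base for name, value in index.items()}


def ranked(index=SUMMARY_INDEX):
    """Return (system, index) pairs sorted from highest to lowest index."""
    return sorted(index.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Example: how do the other systems compare with the IBM P4+/1.7GHz?
    for name, ratio in relative_to("IBM P4+/1.7GHz").items():
        print(f"{name:32s} {ratio:5.2f}x")
```

Because the index is a simple ratio, changing the baseline rescales every entry by the same factor and leaves the ranking unchanged.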
Note that SGI has a 1.5GHz Altix Madison, which should give proportionately
better performance, but no results were presented on Martyn’s summary slide at
this meeting. Also, many other Itanium 2 systems, such as the Bull NovaScale
and the NEC TX7, were not tested, hence they are not included in the results.
Martyn explained that for benchmarks to be realistic, only fully populated
systems should be measured, i.e. with all the CPUs in use; in the case of
the 32-way IBM P4 this normalises the availability of L3 memory, which reduces
performance compared to using all L3 memory for one CPU. Note also that the
results above reflect performance in computational chemistry and this does not
necessarily reflect the performance of these computers in all application
domains. Furthermore, the above results are not normalised on
price/performance, so no specific value-for-money comparisons are made or
implied.
The rest of the workshop consisted of twenty-five presentations from vendors,
users sharing their experience in building “tailored systems” and
presentations from a number of companies, specialising in providing tailored
system solutions from commodity components on demand. Instead of buying pre-
packaged products from traditional vendors, a cluster can be cobbled from
favoured chips and an interconnect network, such as Gigabit, QsNet or Myrinet,
to fit one’s pocket and presumably satisfy computational needs. Below are some
examples of these presentations.
Vendor presentations included a brief, enthusiastic description of the IBM Blue
Gene Lite by James Sexton. He reviewed the hardware and detailed how, in the
space of a few weeks, the 512-processor prototype was constructed from scratch,
was up and running Linux, and was delivering 70% of peak performance on
Linpack, achieving 73rd place on the TOP500 list. The final system will have 65
thousand processors and is scheduled to be delivered to the three Federal Labs,
Los Alamos, Livermore and Sandia, in 2005. He then projected how this new
technology would enable IBM to offer a personal desktop machine using one Blue
Gene card of 32 processors delivering 180Gflop/s peak performance. A research
group can have a Blue Gene rack of 1024 processors with 5.7Tflop/s peak.
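The card and rack figures quoted above are consistent with simple linear scaling of peak performance, which the following Python sketch checks. It assumes that peak scales exactly with processor count; the constant and function names are hypothetical, chosen only for this illustration.

```python
# Figures quoted in the text for one Blue Gene card.
CARD_PROCESSORS = 32
CARD_PEAK_GFLOPS = 180.0


def per_processor_peak_gflops():
    """Peak Gflop/s per processor implied by the quoted card figures."""
    return CARD_PEAK_GFLOPS / CARD_PROCESSORS


def scaled_peak_tflops(processors):
    """Peak Tflop/s for a given processor count, assuming linear scaling."""
    return processors * per_processor_peak_gflops() / 1000.0


if __name__ == "__main__":
    print(f"per processor: {per_processor_peak_gflops():.3f} Gflop/s")
    print(f"1024-processor rack: {scaled_peak_tflops(1024):.2f} Tflop/s")
```

Under that assumption a 1024-processor rack comes out at 5.76 Tflop/s peak, in line with the roughly 5.7Tflop/s quoted.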
Crispin Keable from SGI described their Altix 3700 system, based on the
Itanium2 chip, projecting developments of the Altix product line until
2005. He claimed that by then their SGI system would be capable of connecting
up to 16,384 processors of the expected Intel Montecito chip in a single image
shared memory system using Linux.
Both Cray and NEC gave presentations on parallel vector processor (PVP)
supercomputers. David Tanquery from Cray presented some impressive results
from the Cray X1 with substantive performance improvements when using co-array
Fortran instead of MPI for scatter-gather operations. He then described the
roadmap to deliver 1 Petaflop/s of sustained performance by the year 2010. David
also said that the Red Storm system built by Cray for Sandia Labs, consisting
of a fast Cray memory subsystem and inter-processor network, but using the AMD
Opteron chip rather than the Cray X1 chip, is to be offered as a Cray product
named the Cray RS.
The DARPA High Productivity programme was touched upon by all three vendors
(IBM, Cray and Sun Microsystems), each of which was awarded $49 million in R&D
funds for the second phase. SCALI, QUADRICS and FORCE10 (Ethernet specialists)
also presented their latest interconnect products.
Kim Petersen of NEC HPC Europe gave a brief history of the NEC parallel vector SX
series systems including their current SX-6/7 and promised follow-up systems
with even higher performance, claiming that while in the late 20th century
memory bandwidth was the key engineering challenge, now it is latency hiding.
He also briefly mentioned the Earth Simulator made from NEC SX technologies,
with its 40 Teraflop/s peak performance, the fastest system in the current
TOP500 list.
Kim went on to say that NEC HPC Europe is not only focusing on selling the NEC
SX series capability computers, but is also offering total solutions for
capacity computers in the commercial server market. The marketing of the NEC
SX-6i, a deskside departmental vector system, and the NEC TX7 series, a 32-way
scalar server based on the Intel Itanium2 processor, is testament to this
new approach. The NEC TX7 series is the product line tracking the Intel
Itanium chip developments and IA-64 Linux. It currently uses the Itanium2
(Madison 64-bit architecture processor) but can be upgraded to incorporate
future Itanium processor families.
Tailored Beowulf systems built for High Performance Computing
In the last few
years, Beowulf systems have been built with some success, aiming to replace
readymade large-scale supercomputers with “cheap” off-the-shelf microchips.
These include very large systems in production at Cornell University, the
Pittsburgh Supercomputing Centre with its 5Teraflop/s system and several in
the Federal Labs.
This “tailored system” paradigm has also been used to build departmental
systems from off-the-shelf components of choice. These typically consist of
several hundred, or in a few cases thousands, of processors such as the AMD
Opteron, Intel Xeon, IBM Power 4 or Intel Itanium, cobbled together with a
network interconnect, such as a Gigabit switch, Myrinet or QsNet from
Quadrics. But are these really cheap supercomputing alternatives? With
very few exceptions, these clusters are in reality capacity
computers. Even the Pittsburgh system has to be seen in context. As Ralph
Roskies, scientific director of PSC, said in his keynote: “With 750 compute
nodes, it needs a node failure once a year to reduce the Mean Time Between
Failure of the total system to 12 hours and without checkpoint restart
software, large applications requiring the whole system have difficulty in
reaching completion”. In addition, the cost figures for the educational
sites include only semiconductor components and exclude personnel costs.
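The arithmetic behind the quoted 12-hour figure can be sketched as follows. This is a minimal Python illustration, assuming that nodes fail independently at a constant rate and that any single node failure interrupts a whole-system job; the function names are hypothetical and the model is a textbook simplification, not PSC's own analysis.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours


def system_mtbf_hours(nodes, node_mtbf_hours):
    """MTBF of the whole system when any one node failure counts as a failure.

    With independent nodes and constant failure rates, the rates add up,
    so the system MTBF is the node MTBF divided by the number of nodes.
    """
    return node_mtbf_hours / nodes


def completion_probability(job_hours, mtbf_hours):
    """Chance that a job runs to completion with no failure and no restart."""
    return math.exp(-job_hours / mtbf_hours)


if __name__ == "__main__":
    # 750 nodes, each failing on average once a year.
    mtbf = system_mtbf_hours(750, HOURS_PER_YEAR)
    print(f"system MTBF: {mtbf:.2f} hours")
    print(f"24-hour job completes: {completion_probability(24, mtbf):.1%}")
```

With 750 nodes each failing once a year, the system MTBF is 8760 / 750 = 11.68 hours, which matches the roughly 12 hours quoted and shows why long whole-system runs struggle without checkpoint restart.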
The tailored system paradigm is nevertheless spawning a number of small
companies, providing build and maintenance services for made-to-order systems.
For example, the company ClusterVision started up a year ago and now
delivers fully functioning systems with all hardware and software integrated
and configured for immediate deployment. The customer can choose the
system component mix: for example, the installed Beowulf system at Utrecht
University consists of Intel Xeon processors and InfiniBand from Fabric
Networks Inc., whilst the cluster for computational chemistry at the
University of Manchester uses 240 AMD Opteron processors with their built-in
HyperTransport technology. ClusterVision also has a number of partners, which
can provide it with high-end components and contractual backing. These include
ECL Computers, one of the Netherlands’ largest PC component distributors, and
NEC HPC Europe.
In summary, the Beowulf tailored system paradigm is moving into the mainstream
for a niche market, but there are still some issues to be resolved. Crucially,
DIY scientists inexperienced with computer installation are often unaware of,
and do not factor in, the costs of delivering a robust computing environment,
e.g. strong computer floor space, adequate cooling and power supply. With
larger clusters, maintenance procedures and a fast interconnect network are
often needed and these issues cumulatively exacerbate costs. It would appear
that readymade products from vendors, or systems from specialist companies
such as
ClusterVision, developed under strict engineering regimes, could still be
better value for money if reliability and integration costs are taken into
account.
With many clusters gaining prominent positions in the TOP500 list, those of
you who have a real need for high productivity capability supercomputing but
are short of funds might be tempted to take this path. I leave you with a
musing from another age:
All that glisters is not gold.
(Shakespeare: Merchant of Venice, ii, 7).
Wishing all my readers Season’s Greetings and a Peaceful, Happy New Year.