HPCwire
 The global publication of record for High Performance Computing / January 9, 2004: Vol. 13, No. 1

Features:

TAKING MEASURE OF AMD OPTERON, INTEL ITANIUM2, IBM P690 P4+
by Christopher Lazou

About 200 people attended the 14th machine evaluation workshop at EPSRC Daresbury Laboratories, UK. As in previous years, of great interest were the Daresbury Benchmark results and “tailored system” solutions, where clusters can be cobbled from favoured commodity chips and interconnect networks.

There were exhibitions and presentations from seventeen companies keen to promote their readymade products, including those based on the Intel Itanium2 processor. Examples include the HP Superdome, the SGI Altix 3000 and the Bull NovaScale 4000. AMD Opteron and Intel Xeon systems had a strong presence, and various models of Blade products from IBM were on display and available for demonstrations to visitors.

This established two-day workshop provided a plethora of distributed memory benchmark results, compiled by Martyn Guest and his team from Daresbury, from some 200+ systems, including the latest products from vendors using their latest chips. The Daresbury benchmark suite used to obtain these results consists of many computational chemistry kernel codes, molecular dynamics, Quantum Monte Carlo, a Jacobi solver, STREAM (which measures sustainable memory bandwidth in HPC, here via TRIAD), ab initio molecular electronic structure, DL_POLY and the parallel molecular dynamics benchmark. Results from SPECfp2000 and other well-known benchmarks were also presented.
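For readers unfamiliar with STREAM, the TRIAD kernel is a simple vector update, a[i] = b[i] + q*c[i], and sustainable bandwidth is conventionally reported as bytes moved divided by elapsed time. The following is a minimal Python sketch of that idea only. It is not the Daresbury harness or the official STREAM code, and the function names, array size and scalar are illustrative assumptions. Timings from pure Python say more about the interpreter than about the memory system.

```python
import time
from array import array

def triad(b, c, q):
    """Return a new array a with a[i] = b[i] + q * c[i] (the TRIAD kernel)."""
    return array("d", (bi + q * ci for bi, ci in zip(b, c)))

def triad_bytes(n, word_bytes=8):
    """Bytes counted per TRIAD pass: two reads (b, c) and one write (a)."""
    return 3 * n * word_bytes

def measure_triad(n=1_000_000, q=3.0, repeats=3):
    """Time TRIAD a few times and return the best rate in MB/s (10**6 bytes/s)."""
    b = array("d", [1.0]) * n   # hypothetical test data
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a = triad(b, c, q)
        best = min(best, time.perf_counter() - start)
    assert a[0] == 1.0 + q * 2.0  # sanity check on the kernel result
    return triad_bytes(n) / best / 1e6
```

Calling `measure_triad()` returns a single figure in MB/s. A real measurement would use a compiled kernel and arrays much larger than the caches.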

Martyn Guest described their benchmark findings, comparing the performance of many available PC-based systems. The SPECfp_Rate figures tell a different story from the SPECfp values. While on SPECfp both the HP Superdome and SGI Altix 3000 (1.5GHz) Itanium 2 systems are 30% faster than the IBM P690 P4+ (1.7GHz), on SPECfp_Rate the SGI Altix 3000 is 75% faster than the IBM P4+ (1.7GHz). This improvement is presumably due to the SGI chipset, which integrates the memory subsystem as a single shared image together with the inter-processor communication network. The IBM P4+ outperforms both the AMD Opteron 148/2.2GHz and the Dell PW360/3.2GHz P4 Extreme by 11.66% on SPECfp2000, but this reduces to 6.4% when SPECfp_Rate is used as the metric.

The benchmark results compiled by the Daresbury team indicate that the Intel Itanium2 systems are the bright stars of today, faring well on performance compared to other super-scalar chips. The performance of a particular chip tends to vary across benchmarks, but one can see a pattern emerging. To give you a flavour, the summary index relative to the IBM pSeries 690/Power4 (1.3GHz) for the Matrix-97, Chemistry Kernels, GAMESS-UK and DL_POLY benchmarks was found to be as follows:

The IBM P4/1.3GHz (100), the HP Alpha ES45/1.25GHz (96), the Pentium 4 Xeon/3066 (108), the AMD Opteron 848/2.2GHz (128), the IBM P4+/1.7GHz (133), the SGI Altix3700 Itanium2/1.3GHz (140), the Intel Tiger Itanium2/1.5GHz (159), and the HP RX5670 Itanium2/1.5GHz-H (171).
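Because the index is a ratio against one baseline machine, it can be re-expressed against any other machine in the list. The sketch below uses only the figures quoted above. The helper names `rebase` and `ranked` are illustrative assumptions, and nothing here reproduces how the Daresbury team derived the index from the underlying benchmark timings.

```python
# Summary index values as quoted in the article (IBM P4/1.3GHz = 100).
INDEX = {
    "IBM P4/1.3GHz": 100,
    "HP Alpha ES45/1.25GHz": 96,
    "Pentium 4 Xeon/3066": 108,
    "AMD Opteron 848/2.2GHz": 128,
    "IBM P4+/1.7GHz": 133,
    "SGI Altix3700 Itanium2/1.3GHz": 140,
    "Intel Tiger Itanium2/1.5GHz": 159,
    "HP RX5670 Itanium2/1.5GHz-H": 171,
}

def rebase(index, baseline):
    """Re-express every index value with `baseline` set to 100."""
    ref = index[baseline]
    return {name: round(100 * value / ref, 1) for name, value in index.items()}

def ranked(index):
    """Return (system, index) pairs sorted from fastest to slowest."""
    return sorted(index.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for name, value in ranked(rebase(INDEX, "IBM P4+/1.7GHz")):
        print(f"{value:6.1f}  {name}")
```

Rebased on the IBM P4+/1.7GHz, for example, the HP RX5670 comes out at roughly 129, that is, about 29% ahead of it on this chemistry-oriented index.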

Note that SGI has a 1.5GHz Altix Madison, which should give proportionately better performance, but no results for it were presented on Martyn’s summary slide at this meeting. Many other Itanium 2 systems, for example the Bull NovaScale and the NEC TX7, were not tested and hence are not included in the results.

Martyn explained that for benchmarks to be realistic, performance should be measured only on fully populated systems, i.e. when all the CPUs are in use. In the case of the 32-way IBM P4, this normalises the availability of L3 memory, which reduces performance compared with giving all the L3 memory to one CPU. Note also that the results above reflect performance in computational chemistry, which does not necessarily reflect the performance of these computers in all application domains. Furthermore, the above results are not normalised on price/performance, so no specific value-for-money comparisons are made or implied.

The rest of the workshop consisted of twenty-five presentations: from vendors, from users sharing their experience in building “tailored systems”, and from a number of companies specialising in providing tailored system solutions from commodity components on demand. Instead of buying pre-packaged products from traditional vendors, a cluster can be cobbled from favoured chips and an interconnect network, such as Gigabit Ethernet, QsNet or Myrinet, to fit one’s pocket and presumably satisfy computational needs. Below are some examples of these presentations.

Vendor presentations included a brief, enthusiastic description of the IBM Blue Gene Lite by James Sexton. He reviewed the hardware and detailed how, in the space of a few weeks, the 512-processor prototype was constructed from scratch, was up and running Linux, and was delivering 70% of peak performance on Linpack, achieving 73rd place on the TOP500 list. The final system will have 65 thousand processors and is scheduled to be delivered to the three Federal Labs (Los Alamos, Livermore and Sandia) in 2005. He then projected how this new technology would enable IBM to offer a personal desktop machine using one Blue Gene card of 32 processors delivering 180Gflop/s peak performance. A research group could have a Blue Gene rack of 1024 processors with 5.7Tflop/s peak.
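The card and rack figures quoted above are consistent with simple linear scaling of peak performance. The short check below is a sketch under that assumption. The function names are illustrative, and peak figures say nothing about sustained performance.

```python
def per_processor_gflops(card_peak_gflops=180.0, processors_per_card=32):
    """Peak Gflop/s per processor implied by the quoted card figure."""
    return card_peak_gflops / processors_per_card

def peak_tflops(processors, gflops_per_processor):
    """Peak Tflop/s for a given processor count, assuming linear scaling."""
    return processors * gflops_per_processor / 1000.0

per_proc = per_processor_gflops()        # 180 / 32 = 5.625 Gflop/s
rack = peak_tflops(1024, per_proc)       # 1024 * 5.625 / 1000 = 5.76 Tflop/s

if __name__ == "__main__":
    print(f"per processor: {per_proc} Gflop/s, rack of 1024: {rack} Tflop/s")
```

The computed 5.76 Tflop/s for a 1024-processor rack matches the quoted 5.7Tflop/s once truncated to one decimal place.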

Crispin Keable from SGI described their Altix 3700 system, based on the Itanium2 chip, and projected developments of the Altix product line up to 2005. He claimed that by then the SGI system would be capable of connecting up to 16,384 processors of the expected Intel Montecito chip in a single-image shared memory system running Linux.

Both Cray and NEC gave presentations on parallel vector processor (PVP) supercomputers. David Tanquery from Cray presented some impressive results from the Cray X1, with substantial performance improvements when using co-array Fortran instead of MPI for scatter/gather operations. He then described the roadmap to deliver 1 Petaflop/s of sustained performance by 2010. David also said that the Red Storm system built by Cray for Sandia Labs, consisting of a fast Cray memory subsystem and inter-processor network but using the AMD Opteron chip rather than the Cray X1 chip, is to be offered as a Cray product named the Cray RS.

The DARPA High Productivity programme was touched upon by all three vendors involved (IBM, Cray and Sun Microsystems), each of which was awarded $49 million in R&D funds for the second phase. SCALI, QUADRICS and FORCE10 (Ethernet specialists) also presented their latest interconnect products.

Kim Petersen of NEC HPC Europe gave a brief history of the NEC parallel vector SX series systems, including the current SX-6/7, and promised follow-up systems with even higher performance. He claimed that while memory bandwidth was the key engineering challenge in the late 20th century, it is now latency hiding. He also briefly mentioned the Earth Simulator, made from NEC SX technologies, which with its 40 Teraflop/s peak performance is the fastest system in the current TOP500 list.

Kim went on to say that NEC HPC Europe is not only focusing on selling the NEC SX series capability computers, but is also offering total solutions for capacity computers in the commercial server market. The marketing of the NEC SX-6i, a deskside departmental vector system, and the NEC TX7 series, a 32-way scalar server based on the Intel Itanium2 processor, is a testament to this new approach. The NEC TX7 series is the product line tracking Intel Itanium chip developments and IA-64 Linux. It currently uses the Itanium2 (the Madison 64-bit architecture processor) but can be upgraded to incorporate future Itanium processor families.

Tailored Beowulf systems built for High Performance Computing. In the last few years, Beowulf systems have been built with some success, aiming to replace readymade large-scale supercomputers with systems based on “cheap” off-the-shelf microchips. These include very large systems in production at Cornell University, at the Pittsburgh Supercomputing Centre with its 5 Teraflop/s system, and at several of the Federal Labs.

This “tailored system” paradigm has also been used to build departmental systems from off-the-shelf components of choice. These typically consist of several hundred, or in a few cases thousands, of processors such as the AMD Opteron, Intel Xeon, IBM Power 4 or Intel Itanium, cobbled together with a network interconnect such as a Gigabit switch, Myrinet or QsNet from Quadrics. But are these really cheap supercomputing alternatives? With very few exceptions, these clusters are in reality capacity computers. Even the Pittsburgh system has to be seen in context. As Ralph Roskies, scientific director of PSC, said in his keynote: “With 750 compute nodes, it needs a node failure once a year to reduce the Mean Time Between Failure of the total system to 12 hours and without checkpoint restart software, large applications requiring the whole system have difficulty in reaching completion”. In addition, the costs quoted by educational sites include only the semiconductor components and exclude personnel costs.
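The arithmetic behind the 12-hour figure is straightforward. The sketch below assumes that node failures are independent and occur at a constant rate, and that any single node failure interrupts a job using the whole machine. Both are simplifying assumptions, and the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def system_mtbf_hours(nodes, node_mtbf_hours):
    """System MTBF when any one of `nodes` independent nodes failing counts
    as a system failure: failure rates add, so MTBF divides by node count."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_mtbf_hours / nodes

# 750 nodes, each failing on average once a year.
mtbf = system_mtbf_hours(750, HOURS_PER_YEAR)  # 8760 / 750 = 11.68 hours

if __name__ == "__main__":
    print(f"system MTBF: {mtbf:.2f} hours")
```

Under these assumptions, 750 nodes that each fail once a year give a system MTBF of about 11.7 hours, in line with the 12 hours quoted.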

The tailored system paradigm is nevertheless spawning a number of small companies providing build and maintenance services for made-to-order systems. For example, the company ClusterVision started up a year ago and now delivers fully functioning systems with all hardware and software integrated and configured for immediate deployment. The customer can choose the system component mix: for example, the installed Beowulf system at Utrecht University consists of Intel Xeon processors and InfiniBand from Fabric Networks Inc., whilst the cluster for computational chemistry at the University of Manchester uses 240 AMD Opteron processors with their built-in Hyper-Transport technology. ClusterVision also has a number of partners, which can provide it with high-end components and contractual backing. These include ECL Computers, one of the Netherlands’ largest PC component distributors, and NEC HPC Europe.

In summary, the Beowulf tailored system paradigm is moving into the mainstream for a niche market, but there are still some issues to be resolved. Crucially, DIY scientists inexperienced with computer installation are often unaware of, and fail to factor in, the costs of delivering a robust computing environment, e.g. sufficiently strong computer floor space, adequate cooling and power supply. With larger clusters, maintenance procedures and a fast interconnect network are often needed, and these issues cumulatively add to the costs. It would appear that readymade products from vendors, or systems from specialist companies such as ClusterVision, developed under strict engineering regimes, could still be better value for money if reliability and integration costs are taken into account.

With many clusters gaining prominent positions in the TOP500 list, those of you who have a real need for high productivity capability supercomputing, but are short of funds, might be tempted to take this path. I leave you with a musing from another age:

All that glisters is not gold.
(Shakespeare: Merchant of Venice, ii, 7).

Wishing all my readers Season’s Greetings and a Peaceful, Happy New Year.

