HPCwire
 The global publication of record for High Performance Computing / September 26, 2003: Vol. 12, No. 38


Features:

NEC SX-6 - TWO TIMES MORE COST EFFICIENT THAN IBM P4?
By Christopher Lazou

September 7-11, 2003: As stated in my previous article, some 90 meteorologists and HPC experts from 15 countries and 4 continents attended the bi-annual CAS2K3 workshop on the use of HPC in meteorology, held at the idyllic Imperial Palace Hotel, Annecy, France, and organised by the National Centre for Atmospheric Research (NCAR), USA. There were 43 presentations in 4 days and a live demonstration of the Grid's potential to enable international collaboration within the Community Climate System Model (CCSM) community. Most presenters came from sites in the USA with large IBM P3/4 systems, while the European contingent included a strong representation from sites with large NEC SX-6 systems.

The first article concentrated on Earth System Modelling and how researchers try to do the best with available computer resources to predict climate change. This second article highlights a few of the many system performance issues raised by presentations given at the CAS2K3 workshop.

Dr. Richard Loft of NCAR, a 2001 Gordon Bell award winner, opened the debate by addressing the supercomputing challenges at NCAR and reviewing supercomputing trends and constraints. He looked at microprocessor efficiency and assessed what improvements are possible using the IBM P690 as a compute engine. He then showed some RISC/Vector cluster comparisons and came up with some stark conclusions concerning the myths surrounding commodity chip processors.

Climate scientists project a need for 150 times more computing power over the next five years. Doubling the horizontal resolution increases computational costs eightfold. Introducing super-parameterization of moisture processes would increase computing power needs and costs dramatically. Loft then went on to detail how cooling plant, electrical power and computer facility space act as limiting factors at NCAR. For example, NCAR's cooling capacity is 450 tons, requiring 1.58MW electrical power of which 1.2MW is used by the current set of computers. Another major growth area is data generation and the cost of managing and handling that data. NCAR has 1.3PetaBytes of data, increasing at the rate of 3TeraBytes a day. Both data size and overall handling costs are on the increase, setting an alarming trend.
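The eightfold figure is consistent with a common rule of thumb, assumed here rather than taken from the talk: refining the horizontal grid by a factor r multiplies the number of grid columns by r squared, and numerical stability usually forces the time step down by the same factor, so cost grows roughly as r cubed. The 1.58MW figure can be checked in the same spirit, using the standard conversion of one ton of refrigeration to about 3.517kW. A minimal Python sketch (the function names are illustrative only):

```python
# Assumption: cost scales as r**3 (two horizontal dimensions plus a
# proportionally shorter time step). This is a rule of thumb, not NCAR's model.
def resolution_cost_factor(refinement):
    return refinement ** 3

# Standard conversion: 1 ton of refrigeration is about 3.517 kW of heat removal.
TON_OF_REFRIGERATION_KW = 3.517

def cooling_capacity_mw(tons):
    return tons * TON_OF_REFRIGERATION_KW / 1000.0

# Assumption: linear growth and decimal units (1.3 PB = 1300 TB).
def days_to_double(archive_tb, growth_tb_per_day):
    return archive_tb / growth_tb_per_day

print(resolution_cost_factor(2))            # doubling the resolution
print(round(cooling_capacity_mw(450), 2))   # 450 tons expressed in MW
print(round(days_to_double(1300, 3)))       # days for the archive to double
```

Under these assumptions the archive quoted above would double in a little over 14 months if the current growth rate merely held steady.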

The computing power at NCAR consists of a 1024 IBM P690 (1.3GHz) Power 4 cluster, rated at 5.2Teraflop/s peak performance, plus a 283 IBM (375MHz) Power 3 cluster of 1.7Teraflop/s peak.

NCAR analyzed their workload performance and found that newer IBM systems, i.e. the 8-way IBM P4 (at 4.1% efficiency), are less efficient than older ones, i.e. the IBM P3 (at 5.7% efficiency). This probably explains why NERSC chose to upgrade using IBM P3 processors instead of the newer IBM P4. NCAR also found that larger 32-way P4 nodes are more efficient, delivering 4.5%. When they delved under the covers, they found that the applications in their workload were memory bandwidth limited. A simple bandwidth model predicts 5.5% efficiency, but the imbalance between the Colony switch and the P4 processor speed reduces efficiency further. Latency considerations should allow P processors to scale as log(P). In actuality the P4 scales linearly, which is poor. The NCAR people then did some nifty hand tuning of 3-D FFT kernels and managed to improve their code efficiency from 3.5% to 4.7% when using 64 IBM P4 processors. More detailed charts were presented dealing with various internal aspects of the codes within their workload. What was observed is that the efficiency ranged between 4.1% and 4.5% of peak and the maximum sustained performance on the NCAR workload was 213.5Gigaflop/s out of 5.234Teraflop/s peak.
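Efficiency here means sustained performance as a fraction of peak. As a quick check of the figures quoted above, the following Python sketch (function names are illustrative) recomputes the workload efficiency and the relative gain from the FFT tuning:

```python
def efficiency_percent(sustained_gflops, peak_gflops):
    """Sustained performance as a percentage of peak."""
    return 100.0 * sustained_gflops / peak_gflops

def relative_gain_percent(before, after):
    """Relative improvement between two efficiency values."""
    return 100.0 * (after - before) / before

# Figures quoted in the article: 213.5 Gigaflop/s sustained, 5.234 Teraflop/s peak.
workload = efficiency_percent(213.5, 5234.0)
# Hand-tuned 3-D FFT kernels: 3.5% to 4.7% of peak on 64 processors.
fft_gain = relative_gain_percent(3.5, 4.7)

print(f"workload efficiency: {workload:.2f}% of peak")
print(f"FFT tuning gain:     {fft_gain:.0f}% relative improvement")
```

The workload figure comes out at roughly 4.1% of peak, the lower end of the range reported, while the hand tuning is worth about a third more delivered performance on that kernel.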

Rich Loft went on to calculate the estimated peak price performance on the IBM P4 system at NCAR as $2.6 per Megaflop/s and sustained price performance as $59 per Megaflop/s. The sustained electrical power performance was 0.7Gigaflop/s per kilowatt. He then compared it with the Earth Simulator using Dr. Sato's results and came up with an estimated peak price performance of $8.5 per Megaflop/s and estimated sustained price performance of $28 per Megaflop/s. The sustained electrical power performance was 1.525Gigaflop/s per kilowatt.

He concluded: "at this point in time the vector based Earth Simulator (NEC SX-6) appears to be twice more cost efficient (dollars/Gigaflop/s) in both price and electrical power performance than the IBM P690 P4, when using sustained performance as a measure". (This is before the 30% price/performance improvement to the SX-6 announced on the 18th September. See the NEC Press Release in this publication.)
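The "twice" in that conclusion can be reproduced from the four sustained figures quoted above. The Python sketch below does the arithmetic; the class and field names are illustrative, and the implied efficiency is an assumption of mine (it presumes the peak and sustained prices refer to the same system cost), not a number given in the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystemFigures:
    name: str
    peak_usd_per_mflops: float       # estimated peak price performance
    sustained_usd_per_mflops: float  # estimated sustained price performance
    sustained_gflops_per_kw: float   # sustained electrical power performance

    def implied_efficiency(self):
        # Assumption: same system cost behind both prices, so the ratio
        # of the two is the sustained fraction of peak.
        return self.peak_usd_per_mflops / self.sustained_usd_per_mflops

ncar_p4 = SystemFigures("IBM P690 P4 (NCAR)", 2.6, 59.0, 0.7)
earth_sim = SystemFigures("Earth Simulator (NEC SX-6)", 8.5, 28.0, 1.525)

# How many times more the P4 costs per sustained Megaflop/s.
price_ratio = ncar_p4.sustained_usd_per_mflops / earth_sim.sustained_usd_per_mflops
# How many times more sustained Gigaflop/s the Earth Simulator gets per kilowatt.
power_ratio = earth_sim.sustained_gflops_per_kw / ncar_p4.sustained_gflops_per_kw

print(f"price ratio: {price_ratio:.2f}")
print(f"power ratio: {power_ratio:.2f}")
for system in (ncar_p4, earth_sim):
    print(f"{system.name}: implied efficiency {system.implied_efficiency():.1%}")
```

Both ratios land a little above 2. The implied efficiencies, roughly 4% against 30%, show why the ranking flips between peak and sustained pricing.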

He explained that this is because the NCAR workload is bandwidth limited and the RISC cluster (IBM P4) interconnect is not great. He also indicated that infrastructure (power, cooling, space) is becoming a critical constraint for NCAR. The IBM contract at NCAR hit the buffers and had to be renegotiated, to avoid interfering with the IPCC work schedules. For this reason another 448 (1.3GHz) IBM P4 processors, an additional 2.3Tflop/s peak, are being installed (September'03) and the Federation switch is now delayed to 2H04. Thus, the myth that commodity chip computers are cheaper has once again been debunked.

The message that capacity computers such as the IBM P4+ systems are unsuitable for high resolution Earth System Modelling was reinforced by many of the speakers. For example, as stated in my last article, the main thrust of the talks by Dr. Bert Semtner, Naval Postgraduate School, Monterey, and Dr. Bill Collins, NCAR, was that only systems of multiple Teraflop/s sustained performance can be used to project climatic conditions out for many centuries with high resolution Earth System Models. Using a model of ~6.5km spacing over the ocean on a 500-processor IBM P3, it took eight days to simulate fifteen years. This same model simulated 300 years in just eight hours on 960 processors of the Earth Simulator (NEC SX-6).
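To put those two runs on a common footing, the Python sketch below converts each into simulated years per wall-clock day, using only the figures quoted above. The per-processor comparison is a crude normalisation of my own that ignores all scaling effects.

```python
def simulated_years_per_day(simulated_years, wall_clock_hours):
    """Model throughput in simulated years per wall-clock day."""
    return simulated_years * 24.0 / wall_clock_hours

# 500-processor IBM P3: fifteen years simulated in eight days.
p3_rate = simulated_years_per_day(15, 8 * 24)
# 960 Earth Simulator processors: 300 years simulated in eight hours.
es_rate = simulated_years_per_day(300, 8)

speedup = es_rate / p3_rate
# Crude per-processor ratio: divide each rate by its processor count.
per_processor = (es_rate / 960) / (p3_rate / 500)

print(f"IBM P3:          {p3_rate:.3f} simulated years/day")
print(f"Earth Simulator: {es_rate:.0f} simulated years/day")
print(f"overall speedup: {speedup:.0f}x, per processor about {per_processor:.0f}x")
```

At under two simulated years per day, a multi-century projection on the P3 configuration would take months of wall-clock time, which is the practical point the speakers were making.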

As far as silicon use is concerned, the super-scalar processor chip on the IBM P4 and the vector NEC SX-6 processor chip are about the same size, but the IBM P4 has 3 times the transistor density (170M transistors) of the vector SX-6 chip (57M). This is an instance where more density (with its inherent complexity) means less value.

As reported in last week's article from CAS2K3, in ESM applications the IBM P4 falls short in capability and, from the results reported above, is also poor in cost/performance measures. Utilizing thousands of IBM type processors would not help, according to results from the presentation by Patrick Worley, Oak Ridge National Laboratory. Scaling would act as a major limiting constraint. This is where commodity capacity chip systems become problematic. The moral is: keep the architecture simple, with fewer things to get in the way and impede application performance.

To temper what is said above, Bill Kramer from NERSC reported that for some applications the IBM P3 delivers good efficiency. Below is a list of the examples given:

1. Tera-scale simulation of Supernova explosions: 35% efficiency on 2048 CPUs.
2. Accelerator science and simulations: 25% efficiency on 4096 CPUs.
3. Electromagnetic wave-plasma interactions: 68% efficiency on 2048 CPUs.
4. Quantum Chromodynamics at high temperatures: 13% efficiency on 1024 CPUs.
5. Cosmic Microwave background data analysis: 50% efficiency on 4096 CPUs.

There are, however, no climate codes on his list yet.

Most sites represented at CAS2K3 were using two contrasting types of platform, namely the vector parallel NEC SX and the IBM scalar P3/4 for weather/climate predictions and data management. A few from SGI, Fujitsu and Cray were also present.

The two main vendors have contrasting business models. The NEC product line, as presented by Dr. Joerg Stadler, consists of a pyramid with the powerful SX series at its apex, suitable for demanding applications such as high resolution ESM. The NEC TX7 (based on the Intel Itanium line) is offered for servers used for less demanding applications, including data management, as the middle of the pyramid. Clusters of PCs are offered for less demanding applications as a broad base. This allows NEC to deliver suitable systems to match the computational needs of the whole range of applications. Joerg then described how NEC HPC Europe has become a total solutions provider. He used the example of the recent successful installation of the DKRZ system. The solution offered by NEC HPC Europe was to take control of the total system integration process and deliver the service within the agreed budget. This involved using its long-established hardware and software engineering skills to select, install and maintain the total system, for delivering the services to fulfil the DKRZ mission. This included planning support, site specification, capacity analysis, air conditioning, machine and cabling layout, manpower requirements for operation, and security strategy.

NEC used its own products: the NEC SX series high compute servers for numerical calculations and the AsAmA servers for data handling. It also incorporated elements from other hardware and software vendors (StorageTek, the Legato hierarchical file system (GFS) and the ORACLE database running on top of Linux) to deliver an optimal solution.

In contrast the IBM Power 4 product line is used across most application domains. IBM argues that this approach allows it to leverage developments from the large commercial side of its business to benefit the much smaller scientific/technical market.

Dr. Jamshed Mirza, Chief Systems Architect from IBM, in his presentation titled "The path to Petaflop/s", acknowledged that Earth System Modelling requires higher memory bandwidth than is currently offered in the IBM P4+ product line. He said the NEC SX-6 has a memory bandwidth ratio of 4Bytes/Flop, the Cray X1 3Bytes/Flop and the IBM P4 1.3Bytes/Flop. He then claimed that super-scalar architectures retrieve results at close to vector rates out of cache. The primary speed differentiation between mainstream RISC super-scalar and vector processors is the memory subsystem. IBM can increase memory bandwidth, but it does not intend to do so. To achieve a vector type balance, both the memory subsystem and the communication switch have to be improved to address bandwidth and latency issues. This is an expensive exercise and IBM was not inclined to go down this road, because it did not make commercial sense when the cost/benefit integral was mapped across the whole of its customer base. So, when push comes to shove, the benefits to the scientific/technical community from the IBM business model are illusory. This was poignantly illustrated in the NCAR presentation described above.
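Why Bytes/Flop matters can be illustrated with a roofline-style bound. The sketch below is my own illustration, not the model used by IBM or NCAR: it assumes a kernel that streams all its operands from main memory with no cache reuse, taking a triad of the form a[i] = b[i] + s*c[i] on 8-byte reals as the example (24 bytes moved for 2 flops). Only the three machine ratios come from the presentation.

```python
# Memory bandwidth ratios in bytes per flop, as quoted by Dr. Mirza.
MACHINE_BYTES_PER_FLOP = {
    "NEC SX-6": 4.0,
    "Cray X1": 3.0,
    "IBM P4": 1.3,
}

# Assumption: triad a[i] = b[i] + s*c[i] on 8-byte reals, no cache reuse:
# two loads and one store (24 bytes) for one multiply and one add (2 flops).
TRIAD_BYTES_PER_FLOP = 24 / 2

def bandwidth_bound_fraction(machine_balance, kernel_demand):
    """Upper bound on the fraction of peak a memory-bound kernel can reach."""
    return min(1.0, machine_balance / kernel_demand)

for name, balance in MACHINE_BYTES_PER_FLOP.items():
    bound = bandwidth_bound_fraction(balance, TRIAD_BYTES_PER_FLOP)
    print(f"{name}: at most {bound:.0%} of peak on a streaming triad")
```

Under these assumptions the bound is about a third of peak on the SX-6, a quarter on the X1 and roughly a tenth on the P4. Codes that reuse data in cache do better than this bound suggests, which is the substance of the claim about near-vector rates out of cache.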

The bottom line is that high resolution Earth System Modelling is severely constrained (some say handicapped) on IBM P4 systems, as it takes far too long to simulate centuries or thousands of years ahead. In addition, from the NCAR results, the IBM solution also appears to be about twice as expensive compared to using the Earth Simulator, which is based on the NEC SX-6.

That U.S. ESM research has ended up relying on IBM systems is a testament to how political interference, in the form of banning the import of vector parallel systems from Japan, can be unhelpful and distort the market. Although this policy was recently formally revoked, it still persists in practice. With the NEC path more or less blocked, the hope for the U.S. ESM community to remedy this situation is to buy parallel vector systems, possibly the newly developed Cray X1 (with some 260 units sold). A study done at the Army High Performance Computing Research Centre, comparing a 1024-CPU Cray X1 with a 5760-CPU (2.8GHz Pentium 4, IA-32) cluster, concludes that the Cray X1, at $58.65/Mflop/s, compares favourably on a price/performance basis with the least expensive cluster over a 5-year life cycle (i.e. is cheaper), without any capability benefits taken into consideration. Benchmark measurements from work done at Oak Ridge also show the Cray X1 in a good light, performing well with efficiencies similar to the Fujitsu VPP5000 for up to 64 processors. As measured on a Stream Triad, it appears to have slightly lower memory bandwidth (24GB/s) than the NEC SX-6 (31.8GB/s) and about an order of magnitude higher than the IBM P4, confirming the Bytes/Flop analysis by Jamshed Mirza above. More important than cost/Megaflop/s, the Cray X1 would provide capability, enabling them to perform high-resolution ESM simulations for long-term predictions.

The Canadian weather centre reported that NEC lost their site to IBM during the recent procurement. At about the same time the UK Met Office and the Australian Bureau of Meteorology chose NEC systems. I asked if NEC Japan was involved in bidding in Canada or whether it was Cray, under the NEC/Cray agreement for Cray to sell NEC systems in North America. The presenter from Canada refused to answer; this was the only unanswered question at CAS2K3. I leave it to the reader to speculate. The answer may of course be present in the question.


The views expressed in this article are those of the author alone and do not necessarily reflect those of HPCwire, its publisher, or its staff.

(Brands and names are the property of their respective owners) Copyright: Christopher Lazou, HiPerCom Consultants, Ltd., UK. Email: Chris@lazou.demon.co.uk September 2003.

