HPCwire
 The global publication of record for High Performance Computing / December 17, 2004: Vol. 13, No. 50

Features:

MACHINE EVALUATIONS WORKSHOP: CHALLENGES OF MANAGING CHANGE
by Christopher Lazou, HiPerCom Consultants, Ltd.

About 250 people attended the 15th machine evaluation workshop at EPSRC Daresbury Laboratories, UK. This excellent Workshop is now established as a leading UK national event dedicated to distributed high performance scientific computing. The principal objective is to encourage close contact between the research communities from the Mathematics, Chemistry, Physics, Engineering and Materials Programmes of EPSRC and the major vendors of mid-range computing systems, workstations, servers, software and peripherals.

Most of the 39 presentations were from vendors, describing their own products, on topics such as hardware, compilers, graphics, storage and networking. They focused on cluster solutions, based on commodity chips, interconnect networks and associated file storage systems. An important component of the workshop is the availability of systems for benchmarking evaluation purposes.

There were exhibitions and presentations from more than twenty companies keen to promote their ready-made products, including those based on the Intel Itanium2 processor. AMD Opteron and Intel Xeon systems had a strong presence, and various models of Blade products were on display and available for demonstrations.

The first presentation, given by Rick Kufrin (NCSA), was titled "On The Trail Of Performance: Recent Developments At NCSA". This was an interesting talk in that it discussed NCSA's experiences in attempting to provide a user service on a number of large clusters (with a total peak performance of more than 35Tflop/s) built from a variety of off-the-shelf compute processors and interconnect components.

The fact that these clusters were built from the newest components arriving on the market makes them somewhat experimental, and some teething problems were to be expected, but the severity of the problems encountered was an eye opener. Rick listed a number of challenges as rated by their users when surveyed. Broken micro-code (in the CPU) received the highest severity rating, 5++, for widespread chaotic node instability. For at least one year nothing seemed to work, and attempts to get anything to work, or to troubleshoot it when it didn't, were futile. High performance parallel file systems attached to the cluster received a severity 4 rating, with general instability having the potential to take down the entire cluster. High performance bandwidth networking was yet another, with a severity 3 rating. Hardware failure, cable breakage, inability to scale and constant bug reporting (3 major upgrades in 6 months) are just a few of the many problems encountered. They started with revision 2.0.4 and are now running on 2.0.12 pre-release.

During the last two years NCSA made the transition from traditional shared memory supercomputers to cluster technology, a major paradigm shift. They are now looking at measurements to ascertain the aggregate performance of all user applications on Linux clusters including the IBM 690 and assess whether they are really getting value for their Total Cost of Ownership (TCO) money. The initial results showed that only 12% of jobs achieved 10% or more of peak performance. One wonders if the project takes account of the wasted time of hundreds of users, over the year of instability, in their TCO integral.

As in previous years, the Daresbury Benchmark results were of great interest. These consisted of a plethora of distributed memory benchmark results, compiled by Martyn Guest and his team from Daresbury, from many systems including the latest products from vendors using their latest chips. The Daresbury benchmark suite used to obtain these results consists of many computational chemistry kernel codes, molecular dynamics, Quantum Monte Carlo, a Jacobi solver, STREAM (which measures sustainable memory bandwidth in HPC, here with the TRIAD kernel), the ab initio molecular electronic structure code GAMESS-UK and the parallel molecular dynamics benchmark DL_POLY. The results from SPECfp2000 and other well-known benchmarks were also presented.
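For readers unfamiliar with STREAM, the TRIAD kernel computes a[i] = b[i] + q * c[i] and counts two reads and one write of an 8-byte double per element. The sketch below is only an illustration of that kernel and its byte accounting, not the Daresbury harness: the function names, array size and repeat count are assumptions, and a pure-Python loop measures interpreter overhead rather than real memory bandwidth.

```python
import time
from array import array


def triad(a, b, c, q):
    """STREAM TRIAD kernel: a[i] = b[i] + q * c[i] for every element."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bandwidth(n=200_000, q=3.0, repeats=3):
    """Return the best observed TRIAD rate in MB/s over `repeats` runs.

    The values of n, q and repeats are illustrative assumptions.
    """
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    # Two reads (b, c) and one write (a) of an 8-byte double per element.
    bytes_moved = 3 * 8 * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, q)
        best = min(best, time.perf_counter() - start)
    # Sanity check on the result: 1.0 + 3.0 * 2.0 when q is 3.0.
    assert a[0] == 1.0 + q * 2.0
    return bytes_moved / best / 1e6


if __name__ == "__main__":
    rate = triad_bandwidth()
    print(f"TRIAD: {rate:.1f} MB/s (interpreter-bound, illustrative only)")
```

A real STREAM run uses compiled code and arrays much larger than the caches, which is what makes the figure a measure of sustainable memory bandwidth.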

Martyn Guest gave a talk similar to last year's, using this year's performance results normalised to last year's best PC based system, the HP RX5670 Itanium2/1.5GHz. Martyn emphasised that single processors are complex and often provide misleading results, as they are almost always used in n-way nodes. If they are dual core, use both processors; if four-way, use all four, so that the interactions of cache memory and communications are accounted for in the performance measure. For example, in the case of the 32-way IBM P4 this normalises availability of L3 memory, which reduces performance compared to using all L3 memory for one CPU. Note also that the results reported here reflect performance in computational chemistry, and this does not necessarily reflect the performance of these computers in all application domains. Furthermore, the results given are not normalised on price/performance, so no specific value-for-money comparisons are made or implied.

Looking at SPECfp 2000 relative to the HP RX5670 Itanium2/1.5GHz, the SGI Altix 3700 Bx2/1600-9M Itanium2 system is 26% faster and the IBM e-server P5 570/1900 is 28% faster, while the HP RX4640 Itanium2/1600-9M is 29% faster.

The benchmark results compiled by the Daresbury team indicate that the Intel Itanium2 systems are still the bright stars of today, but the IBM e-server P5 570/1900 is not far behind. They fare well as far as performance is concerned compared to other super-scalar chips, although one can detect a trend towards convergence. Performance of a particular chip tends to vary with the benchmark and the compiler version used, but one can see a pattern emerging. To give you a flavour, the summary index for the Matrix-97, Chemistry Kernels, GAMESS-UK and DL_POLY benchmarks, normalised to last year's best PC based system, i.e. the HP RX5670 Itanium2 (1.5GHz), was found to be as follows:

  • the HP RX5670 Itanium2/1.5GHz (100)
  • the Pentium 4 Xeon(EM64T)/3600 (91)
  • the AMD Opteron 850/2400 (92)
  • the IBM p-series P4+/1.7GHz (82)
  • the SGI Altix3700 Itanium2/1.5GHz (99)
  • Intel Tiger Itanium2/1500 (93)
  • the e-server IBM P5 570/1.9GHz (103) and
  • HP RX1620 Itanium2/1600-H (119)
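As a rough illustration of how such a normalised index can be read, the sketch below rescales scores so that a chosen baseline system scores 100 and then ranks the systems. The index values are those quoted above; the function names are assumptions, and the article does not say how the individual benchmark results were combined into the summary figure, so no averaging step is shown.

```python
# Summary index values as quoted in the article (baseline system = 100).
SUMMARY_INDEX = {
    "HP RX5670 Itanium2/1.5GHz": 100,
    "Pentium 4 Xeon(EM64T)/3600": 91,
    "AMD Opteron 850/2400": 92,
    "IBM p-series P4+/1.7GHz": 82,
    "SGI Altix3700 Itanium2/1.5GHz": 99,
    "Intel Tiger Itanium2/1500": 93,
    "IBM e-server P5 570/1.9GHz": 103,
    "HP RX1620 Itanium2/1600-H": 119,
}
BASELINE = "HP RX5670 Itanium2/1.5GHz"


def normalise(scores, baseline):
    """Rescale scores so that the baseline system scores exactly 100."""
    ref = scores[baseline]
    return {name: 100.0 * value / ref for name, value in scores.items()}


def ranked(index):
    """Return (system, index) pairs sorted from fastest to slowest."""
    return sorted(index.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    index = normalise(SUMMARY_INDEX, BASELINE)
    for name, value in ranked(index):
        # Difference from the baseline, in index points (percent of baseline).
        print(f"{name:32s} {value:6.1f} ({value - 100.0:+.1f}%)")
```

Because the index is already normalised to the HP RX5670, the rescaling leaves the quoted values unchanged; passing a different baseline name re-expresses every system relative to that machine instead.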

The rest of the workshop consisted of presentations from vendors, users sharing their experience and presentations from a number of companies specialising in providing tailored system solutions from commodity components on demand. Instead of buying pre-packaged products from traditional vendors, a contract is placed with a small computer integration company to build a cluster from favoured chips and an interconnect network, such as Gigabit, QsNet, Infiniband or Myrinet.

The presentations focused on cluster systems affordable by academic departments and on the associated components that make up the departmental computing environment. For example, John Watts from IBM concentrated on visualization and how it is perceived as a catalyst for growth. He described their new Deep Computing Visualization (DCV) product, to be released in 1H2005. This product integrates realisation of geometry, raster and graphic display in one graphic card. The oil industry is seen as a major market, as visualization is the process that transforms seismic data into insights. In the end John did not resist the temptation to gloat about IBM having the top spot on the TOP500 list.

Crispin Keable from SGI also emphasised visualizing terascale datasets and described how SGI is planning to move to heterogeneous re-configurable computing. The SGI low-latency interconnect fabric allows incorporation of commodity CPUs, graphic cards, FPGAs and so on. The SGI PRISM is one such example: it introduces a graphic card in an SGI node, allowing visualization to scale in the same way as CPUs have in the past. Integrating several graphic cards into the computing system enables visualization of petascale data. FPGAs are also becoming popular, but software libraries for seamless use are not yet available. The SGI roadmap envisages a heterogeneous system with globally addressable space, low latencies, high bandwidth and fast communication interconnects.

Simon McIntosh-Smith from ClearSpeed presented their new co-processor, the CSX600, suitable for offloading compute-intensive math library functions from serial CPUs. He claimed that "each CSX600 co-processor can sustain 25Gflop/s on DGEMM, while consuming only 5Watts of power". The CSX600 is a SIMD array of processors. Each PE is a VLIW processor with a multiple execution floating point adder and floating point multiplier, supporting both 32bit and 64bit IEEE754 arithmetic. It is expected to be released in 1H2005.

A two-chip board delivers 50Gflop/s peak and uses 10Watts in total. It has up to 1GB shared DRAM for local processing. A dual CSX600 accelerator board using PCI-X, at 20Watts per card, gives 100Gflop/s peak on DGEMM, BLAS, LAPACK, GROMACS, CHARMM, NAMD and so on. It is suitable for computation in finance, Monte Carlo and generic math, providing transparent acceleration for packages such as Matlab, Mathematica, NAG and Maple, and for entire applications in biochemistry.

Another product line gaining in popularity consists of systems using commodity chips and a proprietary chipset with unique features to achieve tighter internal integration. These Blade-based systems deliver much higher bandwidth and lower latency than typical cluster systems using the off-the-shelf interconnects available today. One such example, presented at this workshop by Amar Shan from Cray and described below, is the Cray XD1.

The Cray XD1 is one of Cray's three product lines and is priced from under US$100K, a new milestone for Cray. It is built in a modular fashion, with 12 AMD Opteron (2.2GHz) processors in a chassis. These are organised as six 2-way Blade SMPs and rated at 53Gflop/s peak performance per chassis. Up to 12 chassis (633Gflop/s peak) can be installed in a rack. Multirack configurations integrate hundreds of processors into a single system running Linux.
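The peak figures quoted above follow from simple arithmetic, sketched below. The figure of two floating point operations per clock cycle per processor is an assumption inferred from the quoted numbers, not something stated in the talk, and the function and constant names are illustrative.

```python
def peak_gflops(processors, clock_ghz, flops_per_cycle):
    """Theoretical peak in Gflop/s: processors x clock (GHz) x flops per cycle."""
    return processors * clock_ghz * flops_per_cycle


# Figures quoted in the article.
PROCESSORS_PER_CHASSIS = 12
CLOCK_GHZ = 2.2
CHASSIS_PER_RACK = 12

# Assumption: each Opteron completes two floating point operations per cycle.
FLOPS_PER_CYCLE = 2

chassis_peak = peak_gflops(PROCESSORS_PER_CHASSIS, CLOCK_GHZ, FLOPS_PER_CYCLE)
rack_peak = CHASSIS_PER_RACK * chassis_peak

if __name__ == "__main__":
    print(f"Per chassis: {chassis_peak:.1f} Gflop/s")  # prints 52.8, quoted as 53
    print(f"Per rack:    {rack_peak:.1f} Gflop/s")     # prints 633.6, quoted as 633
```

The per-chassis value rounds to the 53Gflop/s quoted, and the per-rack value truncates to the 633Gflop/s quoted.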

The Cray XD1 is purpose-built and optimised for high performance workloads with system-wide process synchronisation. Its Opteron processors are directly connected via its own RapidArray Interconnect (1TB/s Cross Bar switch), which consists of 12 custom communication processors with a 96GB/s non-blocking switching fabric per chassis. This delivers 8GB/s bandwidth between SMPs and, according to Cray, 1.6 microsecond MPI latency. Each chassis presents 24 RapidArray inter-chassis links with an aggregate 48GB/s bandwidth.

The high bandwidth and low latency features were historically associated with high productivity vector systems, but the Cray XD1 has a couple of other innovative features. It provides six Xilinx Virtex-II Pro Field Programmable Gate Arrays (FPGAs) per chassis attached to the RapidArray fabric for massively parallel execution of critical algorithm components.

FPGAs are part of a class of devices known as PLDs (Programmable Logic Devices), which can be programmed in the field after manufacture. In the Cray XD1, the FPGA is tightly coupled to the Opteron, acts like a programmable co-processor and performs vector operations. It is well suited for searching and sorting, signal processing, audio/video/image manipulation, encryption, error correction, code/decode, packet processing, random number generation and so on. According to Amar Shan, it promises orders of magnitude performance improvement for target applications. For example, an FPGA implementation of RC5 Cipher Breaking is 1000x faster than on a 2.4GHz Pentium4. For Elliptic Curve Cryptography, it is 895-1300x faster than a 1GHz Pentium3. Vehicular traffic simulation is 300x faster on a Xilinx Virtex II (XC2V6000) and 650x faster on a Virtex II Pro (XC2VP100), both relative to a 1.7GHz Xeon.

A note of caution: at present this requires the user to program at a very low level, as there are no high level libraries to provide transparency to the user application.

The other innovation in the Cray XD1 panoply is its active management subsystem. This provides a single system command and control mechanism, for functions such as system configuration, monitoring and fault tolerance, software upgrades, storage management, network management, user management, security and resource and queue management. It also provides extensive fault detection, isolation and prediction capabilities, coupled with automated proactive and reactive self-healing intelligence.

Proactive management measures improve system resilience and Mean Time Between Failures (MTBF). The redundancy features allow fast recovery from hardware failure, enabling seamless online replacement of an SMP and restoring full capacity to the affected partition. Jobs are automatically rescheduled from the last checkpoint.

In summary, the Cray XD1's main features are high bandwidth, low latency compute servers, application accelerators (FPGAs) and active management functions that provide system flexibility and self-configuring, self-monitoring, self-healing resilience.

As can be seen from the above Cray XD1 description, a journey need not take the painful cluster road, full of potholes and severity challenges. There are other more productive roads to travel.

Wishing all my readers Season's Greetings and a Peaceful, Happy New Year.

(Brands and names are the property of their respective owners) Copyright: Christopher Lazou, HiPerCom Consultants, Ltd., UK. December 2004.

