
Features:
HPCX INDUSTRY DAY AT CCLRC DARESBURY LABORATORY
by Christopher Lazou
CCLRC Daresbury Laboratory hosted a one-day Forum in HPC for Industry. Over 50
people attended this event, made up of potential users of the HPCx service
from the industrial sector, a selection of software and hardware vendors,
academic researchers and Research Council officials with high-performance
computing interests. The talks covered areas of computational engineering,
life sciences, environment, materials and chemistry simulations. The focus was
on reviewing the productivity impact that systems with sustained performance
of 1 Teraflop/s, 10 Teraflop/s and even 100 Teraflop/s would have on
industrial R&D applications.
The lofty objectives of the meeting were: to raise awareness of, and
demonstrate, how Terascale-class systems such as HPCx and its successors can
meet the challenges of industrial R&D; to promote the skill base available in
HPCx for exploiting HPC systems efficiently and effectively and for
developing new scientific methodologies and simulation technologies; and to
explore the scale and scope of potential commercial interest in the HPCx
service and successor facilities.
Daresbury houses a large-scale system, named HPCx, owned by a consortium
consisting of the University of Edinburgh, EPCC, CCLRC and IBM. It is the
UK's premier academic research computing service. Its mission is to create a
world-class organisation for leading edge capability class simulations
requiring access to the highest levels of computational performance.
This is not an idle boast. An extensive study simulating amphiphilic fluids
uncovered some very novel behaviour, including the self-assembly of the
beautiful liquid crystalline gyroid phase. This was run on the Reality in
Phoenix, Arizona, using 1,024 CPUs on the HPCx system plus 2,048 CPUs at PSC,
Pittsburgh, and other resources. It was the biggest Lattice-Boltzmann
simulation in the world and, for its innovative results, was awarded a Gold
Star at SC2003 last autumn.
To give a flavour of the computer resources needed for this type of
application, here is an example from another field using the NAMD code. This
is a large-scale molecular dynamics simulation of the interactions between a
T-cell and an antigen-presenting cell, the TCR-peptide-MHC complex, using
96,796 atoms. A 2 ps simulation requires 20,000 CPU seconds using 1,024 IBM P4
processors. A 1 nanosecond simulation can be performed in 10 hours elapsed
time instead of months, which means that progress on tackling large-scale
Molecular Dynamics problems has been much faster than originally anticipated.
The HPCx system comprises 1,280 IBM Power 4 p690 processors with 1Terabyte of
memory, the "Colony" switch and 18Terabytes of high-speed disk.
The hardware is being replaced with 1,536 IBM p690+ processors running at
1.7 GHz, 1.5 Terabytes of memory and the "Federation" High Performance Switch,
HPS. This enhanced system, with a peak performance of 10.44 Teraflop/s, is
expected to be fully operational this June and promises to deliver about
6 Teraflop/s on Linpack, placing it, for the moment, in the top ten most
powerful academic research supercomputers in the world. How much of this
potential performance is delivered to the user application as productive work
is the biggest challenge facing computing service providers.
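The quoted peak figure can be reproduced from the processor count and clock
rate. The sketch below assumes that each processor completes 4 floating-point
operations per cycle (two fused multiply-add units), a figure the article does
not state; the function and variable names are illustrative only.

```python
def peak_teraflops(processors: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in Teraflop/s: processors x clock (GHz) x flops per cycle."""
    return processors * clock_ghz * flops_per_cycle / 1000.0


# ASSUMPTION: 4 flops per cycle per processor (not stated in the article).
peak = peak_teraflops(processors=1536, clock_ghz=1.7, flops_per_cycle=4)
linpack = 6.0  # Teraflop/s, the expected Linpack figure quoted above

print(f"peak: {peak:.2f} Teraflop/s")
print(f"Linpack as a fraction of peak: {linpack / peak:.0%}")
```

Under that assumption the expected Linpack result is a little under 60% of
peak, far above what real application codes typically sustain.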
The final phase for this 85 million US dollar system is an upgrade in 2006 to
provide a peak performance of 22 Teraflop/s. By 2006, a new system should
also be in place, purchased under a new competitive tender, codenamed
"HECTOR" and costing an estimated 200-300 million US dollars, amortised over
six years.
In the Daresbury upgrade, the system software was also upgraded; for example,
PSSP had to be replaced by CSM to run the GPFS file system on the new
p690+/HPS hardware, and the internal frame partitioning had to be changed
from 4x8-way LPARs to 1x32-way LPAR. Incidentally, the replacement of the
"Colony" switch with HPS reduces latency from 10 usec to 8 usec, so the system
should be more balanced than in the past.
However, to make full use of scalar parallel computers with thousands of
processors, computational scientists and engineers face several daunting
challenges: managing the memory hierarchy, as in the IBM p690 family of
processors; expressing and managing concurrency in their application codes;
and using optimisation techniques to achieve efficient sequential execution.
With complex systems a small mismatch can easily be magnified into a massive
bottleneck, which substantially reduces efficiency.
As in many other supercomputer centres, Daresbury set up an HPCx Terascaling
team of computer scientists to tackle the "performance gap" problem. The team,
led by Martyn Guest, collaborates with consortia developing large application
codes, targeting modifications that enable these codes to use 1,000 or more
processors "efficiently".
Martyn started his talk by listing some of the existing large-scale parallel
computers, mainly in the Federal Labs and NSF sites in the USA, built mostly
from IBM p-series commodity compute servers tied together by a relatively
high-speed communication fabric. According to Martyn, the planned
100 Teraflop/s ASCI Purple, based on the IBM Power 5 processor, seems to be a
limiting plateau in the evolution of parallel scalar computer architectures of
this type. Hence, the new PetaOPs architecture projects are looking at
alternative paradigms. For example, the IBM BG/L project envisages a cellular
architecture with 100,000 or more CPUs and is intended for general-purpose
use. Other approaches, with R&D funding from DARPA, include the Cascade
project at Cray and new hardware developments by Sun Microsystems.
In the past 10 years, peak performance on supercomputers increased
hundredfold; in the next 5+ years it is likely to increase by another 1000
times. Efficiency, however, has declined from 30-40%, common in the 1990s and
still common today on vector supercomputers from NEC, to as little as 5-10% on
the parallel scalar supercomputers of today, and may decrease even further, to
1-3%, on future scalar machines. This is the so-called "performance gap"
crisis.
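To see what the gap means in delivered performance, the sketch below combines
the hundredfold growth in peak with the quoted fall in efficiency. Pairing the
ends of the two efficiency ranges to obtain a worst and a best case is an
assumption made here for illustration, not a calculation from the talk.

```python
def sustained_gain(peak_gain: float, eff_before: float, eff_after: float) -> float:
    """Growth in sustained performance = growth in peak x ratio of efficiencies."""
    return peak_gain * eff_after / eff_before


PEAK_GAIN = 100  # "hundredfold" increase in peak over the past 10 years

# ASSUMPTION: worst case pairs 40% then with 5% now; best case pairs 30% with 10%.
worst = sustained_gain(PEAK_GAIN, eff_before=0.40, eff_after=0.05)
best = sustained_gain(PEAK_GAIN, eff_before=0.30, eff_after=0.10)

print(f"sustained performance grew roughly {worst:.1f}x to {best:.1f}x")
```

On these figures a hundredfold rise in peak translates into a rise of only
about 12 to 33 times in sustained performance.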
The biggest conundrum in large-scale computing is that, for an increase by
O(N) in the size of the science/engineering problem to be studied, classical
algorithms require O(N^2) or even O(N^3) computational resources to perform
the simulation.
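A short illustration of why this matters: the sketch below computes how much
more work is needed when the problem size grows tenfold, for linear, quadratic
and cubic algorithms. The tenfold factor is an arbitrary example value.

```python
def cost_growth(size_factor: int, exponent: int) -> int:
    """Factor by which the work grows for an O(N**exponent) algorithm
    when the problem size N grows by size_factor."""
    return size_factor ** exponent


for exponent, label in [(1, "O(N)"), (2, "O(N^2)"), (3, "O(N^3)")]:
    growth = cost_growth(10, exponent)
    print(f"{label}: a 10x larger problem needs {growth:,}x the work")
```

A cubic algorithm therefore absorbs a thousandfold increase in machine power
just to handle a problem ten times larger.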
The research challenge is therefore for new software, implementing new
algorithms matching present and future hardware, enabling scientific codes to
model and simulate physical processes and systems at near linear scaling. This
is a continuous challenge; as computer architectures undergo fundamental
changes, numerical algorithms need to track them and scale linearly, to enable
the use of thousands and even millions of processors.
Martyn went on to say that, on present scalar IBM hardware, one is faced with
managing thousands of CPUs and a memory hierarchy, using commodity DRAM and a
communication fabric far slower than access to local memory. Each CPU has
registers, on- and off-chip caches, main memory and "virtual" memory (disks).
Each successive level takes longer to access (higher latency) and has a slower
transfer rate (lower bandwidth). A parallel computer adds an additional level,
that of remote main memory.
A programming model based on non-uniform memory access (NUMA) explicitly
recognises this hierarchy and allows performance improvements on sequential
algorithms, e.g. data blocking, to be applied directly to parallel algorithms.
NUMA algorithms are typically more efficient and easier to design than those
based on MPI and/or OpenMP.
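Data blocking can be illustrated with a matrix multiplication that works on
small tiles, so that each tile is reused while it is still in a fast level of
the memory hierarchy. This is a minimal sketch, not code from any of the
applications discussed: the function names and the block size are
hypothetical, and in Python the loops show only the access pattern, since the
speed-up appears in compiled code.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same arithmetic, but traversed tile by tile (block x block)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):          # tile row of c
        for j0 in range(0, p, block):      # tile column of c
            for k0 in range(0, m, block):  # tile along the shared dimension
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, p)):
                        for k in range(k0, min(k0 + block, m)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_blocked(a, b) == matmul_naive(a, b))
```

The same tiling also suggests a natural unit of work to hand to each
processor, which is the sense in which a sequential optimisation carries over
to the parallel case.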
Another challenge is the expression and management of concurrency in the
application. How much parallelism is needed and at what level of granularity?
David Bailey (as early as 1997) showed that the minimum level of concurrency
(Lc) needed to sustain a given level of performance, P, on a single processor
is: Lc = P x Lm, where Lm is the memory latency on the processor node. For a
TeraOps computer built with commodity (100 ns) DRAM memory, Lc = 100,000.
A more detailed analysis of the computer determines how much of this
concurrency must be coarse or fine grain. For example, if each processor in
the TeraOps computer has a peak speed of 1 GigaOp, then within each processor
the algorithm must provide a fine-grain concurrency of at least 100 to support
1 GigaOp. The additional factor of 1,000 must come from coarse-grain
parallelism.
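The arithmetic behind these figures follows directly from Lc = P x Lm. The
sketch below uses only the numbers quoted in the text; the function and
variable names are illustrative.

```python
def required_concurrency(ops_per_second: float, latency_seconds: float) -> float:
    """Lc = P x Lm: operations that must be in flight to hide the latency."""
    return ops_per_second * latency_seconds


DRAM_LATENCY = 100e-9  # 100 ns commodity DRAM, as quoted in the text

total = required_concurrency(1e12, DRAM_LATENCY)  # whole TeraOps machine
fine = required_concurrency(1e9, DRAM_LATENCY)    # one 1 GigaOp processor
coarse = total / fine                             # left for coarse grain

print(f"total concurrency:  {total:,.0f}")
print(f"fine grain (per processor): {fine:,.0f}")
print(f"coarse grain factor: {coarse:,.0f}")
```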
Another question is how coarse is coarse-grain? Apply the same formula, where
the latency, Ln, is now the latency for accessing remote memory. Using 10 us
for the latency of current networks and 256 MB/sec for bandwidth, short
messages experiencing the full latency require a concurrency of 100,000. For
large messages the average latency per item falls to 1/bandwidth, which brings
the coarse-grain concurrency down to ~312 on the IBM P4+, so on this system
large messages must be the rule.
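The article does not state the processor speed or the item size behind these
two figures. The sketch below shows that both are reproduced if one assumes a
speed of 10 GigaOp/s and 8-byte data items; these two values are assumptions
chosen here because they match the quoted numbers, not figures from the talk.

```python
def required_concurrency(ops_per_second: float, latency_seconds: float) -> float:
    """Same formula as for local memory: concurrency = P x latency."""
    return ops_per_second * latency_seconds


SPEED = 10e9         # ASSUMPTION: 10 GigaOp/s (not stated in the article)
ITEM_BYTES = 8       # ASSUMPTION: 8-byte data items (not stated in the article)
NET_LATENCY = 10e-6  # 10 us network latency, as quoted
BANDWIDTH = 256e6    # 256 MB/sec, taken as 256 x 10^6 bytes per second

# Short messages pay the full network latency for every item.
short_messages = required_concurrency(SPEED, NET_LATENCY)

# Large messages amortise the latency; each item costs only its transfer time.
time_per_item = ITEM_BYTES / BANDWIDTH
large_messages = required_concurrency(SPEED, time_per_item)

print(f"short messages: {short_messages:,.0f}")
print(f"large messages: {large_messages:,.1f}")
```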
The above analysis tells us that fine-grain parallelism is essential to obtain
efficient serial execution and to avoid a mismatch between the speed of the
processor and that of the memory subsystem. The granularity of coarse-grain
parallelism is determined by the ratio: "single processor speed divided by the
average latency of remote memory references". Latency is determined by the
characteristics of both the communication hardware and the algorithm used.
Given the different objectives of fine- and coarse-grain parallelism, is it
reasonable to combine them into a single programming model? The above
discussion was dominated by considering different levels of the memory
hierarchy, so the same mechanisms (e.g. blocking of data to increase reuse)
apply to both. Currently, portable parallel environments do not provide this
unity. Manually expressed coarse-grain parallelism (e.g. with MPI or Global
Arrays) relies upon compilers or libraries (e.g. BLAS) to take care of
fine-grain parallelism.
Using the theoretical background described above, Martyn and his team
identified and selected a number of large-scale application codes for the full
treatment. These included: codes that are insensitive to the communication
fabric, e.g. computational engineering and DNS methods, environmental
modelling (POLCOMS); codes requiring concurrency management, i.e. migration
from replicated to distributed data, such as DL-POLY (molecular simulation)
and CRYSTAL (electronic structure); optimisation of communication collectives
(e.g. MPI_ALLTOALLV and CASTEP); memory-driven approaches; scientific drivers
suited to capability computing; enhanced sampling and replica methods; and so
on.
Their preliminary results show that they were able to improve the codes to
deliver 8-16% of the peak performance. The maximum performance achieved was
1 Teraflop/s for a 1,000-atom SiC super-cell (256x256x256 mesh) run with CPMD,
developed at IBM Zurich from the original Car-Parrinello code in 1993, using
all 1,280 IBM P4 processors of the HPCx system. The mixed approach was
instrumental in obtaining these results; larger SMPs and faster switches
should deliver better results.
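The CPMD result can be set against the 8-16% range with a rough calculation.
The article does not quote the peak of the original 1,280-processor system, so
the sketch below assumes a 1.3 GHz clock and 4 floating-point operations per
cycle per processor; both values are assumptions, and the names are
illustrative.

```python
PROCESSORS = 1280      # IBM P4 processors used for the run, as quoted
CLOCK_GHZ = 1.3        # ASSUMPTION: clock rate of the original system
FLOPS_PER_CYCLE = 4    # ASSUMPTION: two fused multiply-add units per processor
SUSTAINED_TFLOPS = 1.0 # the 1 Teraflop/s quoted for the CPMD run

peak_tflops = PROCESSORS * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0
efficiency = SUSTAINED_TFLOPS / peak_tflops

print(f"assumed peak: {peak_tflops:.2f} Teraflop/s")
print(f"CPMD efficiency: {efficiency:.0%}")
```

Under these assumptions the run sustains about 15% of peak, at the upper end
of the range reported for the improved codes.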
Apart from the examples mentioned above, the speakers presented results from
CFD simulations, enzyme and cell membrane protein interactions, and some of
the work done in developing new software technologies, reformulating the
classical quantum mechanical methods to achieve linear scaling. This new
method, incorporated in the Cambridge Serial Total Energy Package (CASTEP),
claims to enable simulation of 20,000 atoms in less than a day on an HPCx
Phase 2 size system.
The results presented in this forum demonstrate the importance of capability
computing, which enables scientific simulation practitioners to tackle larger,
more realistic problems and reduces the time to completion by at least an
order of magnitude (from say a year to one month), bringing it within a
"reasonable"
timeframe. From the evidence, vector parallel systems are way ahead in the
productivity stakes, at least an order of magnitude more productive than their
scalar cousins.
(Brands and names are the property of their respective owners) Copyright:
Christopher Lazou, HiPerCom Consultants, Ltd., UK. April 2004.