
Features:
IS WINDOWS NT SERVER AN OS FOR HPC?
by Christopher Lazou
In the technology field, vendors are naturally inclined to enthuse about the
merits of their new products. The usual suspects are expected to claim that
their transistor-based systems deliver optimal solutions for various
application domains. Similarly, vendors of high-capability systems based on
high memory bandwidth and inter-processor communication, such as NEC and Cray,
would highlight productivity and higher sustained performance for large,
complex applications as their forte.
Others would argue that the HPC business battlefield is shifting to the design
of specialised chipsets using the Intel Itanium family of chips, or to
building clusters from other commodity chips and third-party networks, and
that the capability top end is of marginal importance.
In the application areas, many scientific and technical presentations are
grouped into sessions. One of these sessions deals with requirements for HPC
operating systems. There are some new developments concerning operating
systems for HPC that are challenging conventional wisdom. In the late 1990s,
the Cornell Theory Center (CTC) embarked upon a unique enterprise in the area
of high performance computing. Instead of following the mainstream and
adopting some form of Unix/Linux as its operating platform, CTC made a
conscious decision and chose Windows NT Server.
Dr. Gerd Heber, senior research associate in the Cornell Fracture Group at the
Cornell Theory Center, Ithaca, New York, USA, is giving a talk about their
experiences. In an interview, I asked Gerd a series of questions. Below is an
extract from that interview: a few of his answers, to give you a flavour of
his talk.
Question 1: The conventional wisdom is to use highly tuned operating systems
with rich functionality, but focused on scientific and technical applications,
as a vehicle for future developments, yet you at Cornell chose Windows NT
Server. Would you briefly explain the rationale behind this decision?
Unlike operating systems for real-time processing or embedded systems, the
operating systems used in scientific and technical computing are fairly
general purpose. Except for, say, I/O operations or multithreading, any OS is
more or less "in the way" of an application. The systems in the Windows Server
family are tuned well enough to get a cluster onto the Top500 list or to the
top of the TPC benchmarks. Tests performed by us in house or by others have
shown that, on identical hardware, applications running on a Windows Server
based platform perform as well as or better than on other operating systems.
Question 2: As I remember, in the mid-nineties Cornell was one of the NSF
supercomputing centers, operating the largest IBM SP system in the world. A
switch to Windows could not have been easy. How did your users react to this
radical departure from the mainstream?
The switch was indeed not easy, and the prevailing initial reaction from users
(including myself) was skepticism and, in some cases, hostility. From a user’s
perspective, what made it difficult was the different development environment
and the lack of certain libraries and tools. For example, we had a
considerable investment in project management using make and CVS, and we had
no intention of changing this. The C++ compiler that shipped with Visual
Studio 6.0 five years ago was a decent C compiler, but did not deserve to be
called a C++ compiler. After years of development with Kuck & Associates’ KCC
compiler, we spent several months looking around for and testing all kinds of
C++ compilers, just to get our code built for Windows. Fortunately, the code
had been tested extensively on the SP and performed correctly once built.
Tools like Cygwin and GCC were indispensable in those early days.
Question 3: As one is aware, multi-scale systems throw up many problems
different from those of medium-sized systems. What, for example, were the key
challenges while developing Windows NT for its new role?
Making sure that all the necessary things were in place for users, developers,
and administrators was perhaps the key challenge. There were quite large
Windows installations (Windows domains of thousands of servers) out there when
we made the transition, but running them as HPC clusters, with all their
specific requirements, had not been attempted yet. On the other hand, the
administrative staff at CTC consisted of fabulous AIX administrators, who had
only a vague idea of doing things the "Windows way". What followed was, for
both users and administrators, a "soul searching process" at the end of which
both arrived at the same conclusion: "You must change your life." (Rilke)
Emulating or mimicking UNIX on Windows gets you only so far.
Question 4: To successfully solve multi-scale, multi-physics application
problems, system robustness, fault tolerance and computation integrity
(checkpoint restart) are essential. For example, in a system with 2,000 nodes,
if each node fails once every two years, the system fails about three times a
day, and most calculations run for much longer than the interval between
failures. How are you tackling these challenges?
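The arithmetic behind that estimate can be checked with a short sketch. It
assumes (an assumption of this sketch, not something stated in the interview)
that node failures are independent and occur at a constant average rate; the
variable names are illustrative only.

```python
# Back-of-the-envelope check of the failure estimate in Question 4.
# Assumption: node failures are independent, with a constant average rate.
NODES = 2000
NODE_MTBF_DAYS = 2 * 365  # each node fails about once every two years

# With independent nodes, the failure rates simply add up.
failures_per_day = NODES / NODE_MTBF_DAYS
system_mtbf_hours = 24 / failures_per_day

print(f"expected failures per day: {failures_per_day:.2f}")
print(f"mean time between failures: {system_mtbf_hours:.1f} hours")
```

Under these assumptions the system sees a little under three failures a day,
that is, one roughly every nine hours, so any calculation that runs for days
needs some way of resuming from saved state.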
All our work is done using industry standard software and hardware. Despite
the lack of out-of-the-box support for checkpointing, we think
application-level checkpointing is the most promising approach. Of course,
this requires some degree of intervention from the user, but at the same time
it minimizes the amount of data associated with a checkpoint. The Intelligent
Software Systems project at the Cornell CS department implemented a very
convenient compiler-based approach for MPI and OpenMP programs. In my own
work, I use databases to store sufficient application state and history
information, which also allows restarting on a different number of nodes.
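To make the idea of database-backed, application-level checkpointing more
concrete, here is a minimal sketch. It is not Dr. Heber's implementation: the
interview names neither the database nor the schema, so SQLite, the table
layout, and every function name below are hypothetical stand-ins. The point it
illustrates is that state keyed by a global index, rather than by node, can be
reloaded under a different partitioning.

```python
import sqlite3

# Hypothetical schema: one row per (step, global cell index).
def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS checkpoint ("
               "step INTEGER, cell INTEGER, value REAL, "
               "PRIMARY KEY (step, cell))")
    return db

def save_checkpoint(db, step, cells):
    """Store a worker's cells for one step, keyed by global cell index."""
    with db:  # each call is committed (or rolled back) as one transaction
        db.executemany("INSERT OR REPLACE INTO checkpoint VALUES (?, ?, ?)",
                       [(step, i, v) for i, v in cells.items()])

def latest_step(db):
    """Return the highest saved step, or None if nothing was saved."""
    return db.execute("SELECT MAX(step) FROM checkpoint").fetchone()[0]

def load_partition(db, step, rank, n_workers):
    """Return the cells owned by `rank` under an n_workers-way split."""
    rows = db.execute("SELECT cell, value FROM checkpoint "
                      "WHERE step = ? AND cell % ? = ?",
                      (step, n_workers, rank))
    return dict(rows)

# Usage: four workers save at step 10, then the run restarts on two.
db = open_store()
state = {i: i * 0.5 for i in range(8)}
for rank in range(4):
    mine = {i: v for i, v in state.items() if i % 4 == rank}
    save_checkpoint(db, 10, mine)

step = latest_step(db)
parts = [load_partition(db, step, r, 2) for r in range(2)]
```

Only the values the application needs in order to resume are written, which is
the sense in which an application-level checkpoint keeps the data volume small
compared with saving the whole process image.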
Answers to questions 5 onwards will be published in the near future.