
THE ROAD TO 99.99% AVAILABILITY, MANAGEMENT FROM THE GROUND UP
By Dr. Paul Terry, CTO, OctigaBay Systems
HPC application uptime is critical for users. Whether testing the structural
integrity of automobiles in crash simulations or replicating DNA strands in
cancer research, HPC users look for application results in times ranging from
milliseconds to months. So it’s no wonder that pushing product reliability
beyond 99 percent uptime is a constant endeavor for designers of HPC systems.
Think about it: 99 percent uptime over the course of a week leaves an unsavory
100 minutes of downtime impacting an HPC application. Imagine facing the
equivalent of a “blue screen of death” when picking up a telephone handset
or pressing the accelerator in a car. Very few of us would accept
this standard in any other product or service. It’s time to look at strategies
to bring those roughly 5,200 minutes of downtime per year down to a more
acceptable figure: less than 1 hour.
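The downtime figures above follow from simple arithmetic, sketched here in Python. The function name and constants are illustrative, and the calculation assumes a 365-day year with downtime spread evenly over it.

```python
MINUTES_PER_WEEK = 7 * 24 * 60      # 10,080 minutes
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes (365-day year assumed)


def downtime_minutes(availability_percent: float, period_minutes: int) -> float:
    """Return the minutes of downtime a given availability allows in a period."""
    return (1 - availability_percent / 100) * period_minutes


# Compare two "nines" with four "nines" over a week and over a year.
for pct in (99.0, 99.99):
    weekly = downtime_minutes(pct, MINUTES_PER_WEEK)
    yearly = downtime_minutes(pct, MINUTES_PER_YEAR)
    print(f"{pct}% uptime: {weekly:.1f} min/week, {yearly:.1f} min/year")
```

At 99 percent the weekly budget works out to about 100 minutes, while at 99.99 percent the budget for an entire year is about 53 minutes, which is the "less than 1 hour" target.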
In a recent University of California, Berkeley study, failures of HPC systems
were attributed to the three usual suspects—hardware, software and operator
error. Hardware accounts for 15 percent of system outages, with disk failure
as the most frequent culprit, followed by power supplies and fans. Software
is responsible for 34 percent of system outages, led by driver problems,
version incompatibilities, and corrupt or stale software. Operator errors
account for 51 percent of HPC outages, mainly due to the complexity of the
system, the routine incompatibilities among software and components, and the
manual nature of many HPC programming and maintenance tasks. Outages occur
most often during installation, configuration, upgrades and problem
diagnosis, and these are nearly all manual processes. Whether the outage
is caused by one or all of these factors, one thing is clear: the numbers
warrant improvement.
Ultimately, decreasing downtime is about preventing failures whenever
possible, and about recovering quickly when failures do occur. Strategies for
HPC system design that increase reliability and resiliency include:
Thermal design: HPC systems need to be designed to handle the heat generated by
thousands of processors running in parallel. While keeping the processors
cool is important, it is even more important to prevent thermal cycling
(repeated swings in temperature) of components, since thermal cycling can
drastically reduce a component’s lifespan. Maintaining a constant
temperature within the system avoids thermal cycling and the attendant
thermal stress, which improves MTTF.
Variability reduction: Variability is the enemy of reliability. Clusters today
can comprise hardware from more than four different vendors and
software from more than a dozen different vendors, leading to obvious, and
often painful, integration issues. This makes it difficult to test the
different components and packages and to isolate problems between vendor
packages, and it creates version control issues and a plethora of other
administration issues. Manufacturers that ensure tight integration between the
hardware and management software achieve significantly higher levels of
system reliability.
Single system command & control: Today, administrators must log into hundreds,
even thousands, of processors to view job status, check system health or update
operating system software. Imagine, instead, logging into one processor and
issuing one command to roll out a new release across multiple processors with
an absolute guarantee of no misconfigurations. And imagine also being able
to roll back the installation of that release automatically on all or
some of the processors. These capabilities require intelligent, topology-aware
software that provides a single system view. This software must be
underpinned by sophisticated transaction processing techniques. It must
automate and simplify many common administrative functions, including
configuration, software upgrades, network, storage and user management, as
well as security, resource and queue management. Single system command
and control is even better if it lets administrators choose between
an intuitive graphical user interface and a command
line interface.
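The all-or-nothing rollout described in the last item can be sketched as a simple transaction. This is a minimal illustration in Python, not any vendor's management software: the node names, the `install` callback and the `RolloutError` class are all hypothetical, and a real system would also have to handle a rollback step that itself fails.

```python
from typing import Callable, Dict, List


class RolloutError(Exception):
    """Raised when a rollout fails after the changes have been undone."""


def rollout(nodes: Dict[str, str], new_version: str,
            install: Callable[[str, str], None]) -> None:
    """Install new_version on every node, or leave every node unchanged.

    `nodes` maps a (hypothetical) node name to its installed version.
    `install(node, version)` stands in for the real installer and is
    expected to raise an exception on failure.
    """
    previous = dict(nodes)        # snapshot of versions, used for rollback
    updated: List[str] = []       # nodes changed so far, in order
    try:
        for node in nodes:
            install(node, new_version)
            nodes[node] = new_version
            updated.append(node)
    except Exception as exc:
        # Undo in reverse order so the cluster returns to its prior state.
        for node in reversed(updated):
            install(node, previous[node])
            nodes[node] = previous[node]
        raise RolloutError(f"rollout of {new_version} failed: {exc}") from exc
```

One command either upgrades every node or leaves the cluster exactly as it was, so a failure partway through never leaves a mixed-version configuration.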
SELF-MONITORING & SELF-HEALING SYSTEMS
The first requirement for system
resiliency is to monitor all aspects of the HPC system -- hardware and
software. For an HPC system to be self-monitoring, it needs an
independent supervisory network, a dedicated management processor with its own
robust operating system, and fault management software, all working together
to monitor the system constantly and maximize reliability and availability. The
dedicated management network runs full background diagnostics to monitor the
system’s health (including temperature, air velocity, fan speed, voltages
and currents, among others) without placing overhead on processor performance.
The health of the OS and other key internal services, such as DNS, LDAP and
NIS, can also be self-monitored. Self-monitoring leads to self-healing: the
system proactively takes actions according to established policies, isolating,
correcting and/or routing around problems to achieve complete system
resilience. For example, if a CPU is overheating, the system can temporarily
stop the job running on that CPU, take that CPU out of service, and add an
unallocated CPU into that partition to get the job up and running again. In
the meantime, the fans would automatically increase speed in an attempt to
cool the overheating CPU. Continuous job operation is particularly
significant for those HPC applications that take days, weeks, and even months
for completion. Between jobs, it’s also important for software to
automatically refresh, further preventing system failures due to stale
software.
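The overheating example above can be expressed as a small policy function. The Python sketch below is illustrative only: the `Partition` structure, the CPU names and the 85 °C threshold are assumptions made for the example, not details from any particular system.

```python
from dataclasses import dataclass, field
from typing import List, Set

MAX_CPU_TEMP_C = 85.0   # hypothetical policy threshold


@dataclass
class Partition:
    active: Set[str]                                      # CPUs running the job
    spares: List[str] = field(default_factory=list)       # unallocated CPUs
    out_of_service: Set[str] = field(default_factory=set)
    fan_boost: bool = False
    log: List[str] = field(default_factory=list)


def heal_overheating(partition: Partition, cpu: str, temp_c: float) -> bool:
    """Apply the overheating policy; return True if any action was taken."""
    if cpu not in partition.active or temp_c <= MAX_CPU_TEMP_C:
        return False
    partition.log.append(f"suspend job on {cpu}")
    partition.active.remove(cpu)         # take the hot CPU out of the partition
    partition.out_of_service.add(cpu)
    partition.fan_boost = True           # raise fan speed to cool the CPU
    if partition.spares:
        replacement = partition.spares.pop(0)
        partition.active.add(replacement)
        partition.log.append(f"resume job on {replacement}")
    else:
        partition.log.append("no spare CPU available; job stays suspended")
    return True
```

The function only reacts to a reading; in the design described here the readings would arrive over the independent supervisory network, so the policy runs without stealing cycles from the compute processors.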
Reliability, availability and serviceability in high performance computing is
the area with the lowest customer satisfaction. There are significant cost
and performance implications to outages, driving the need to increase system
reliability and resiliency. Management is more than a piece of software; it
requires conscious thought and design in the hardware, firmware and software
of a system. When an HPC system is viewed as a single entity, as opposed to a
conglomerate of individual pieces, and designed from the ground up to achieve
reliability and resiliency, system availability can more than triple. As
manufacturers implement these strategies, four “nines” (99.99 percent) of
reliability will change from something HPC users only imagine to a must-have
requirement.
About the Author
Dr. Paul Terry is the Chief Technology Officer and co-founder of OctigaBay
Systems, an innovative new company in high performance computing. Dr. Terry
oversees strategic direction for OctigaBay products and technologies and is
the chief visionary behind a new HPC architecture, the Direct Connected
Processor architecture. Dr. Terry holds a First Class Honors BSc
degree in Physics with Electronics, and a PhD in Electronics and Electrical
Engineering from the University of Liverpool in England. He earned his MBA
from Cranfield University in England.
Sources
Dave Patterson and Aaron Brown, University of California at Berkeley in
cooperation with Armando Fox, Stanford University, Recovery-Oriented Computing