HPCwire
 The global publication of record for High Performance Computing / February 6, 2004: Vol. 13, No. 5


Features:

THE ROAD TO 99.99% AVAILABILITY: MANAGEMENT FROM THE GROUND UP
By Dr. Paul Terry, CTO, OctigaBay Systems

HPC application uptime is critical for users. Whether testing the structural integrity of automobiles in crash simulations or replicating DNA strands in cancer research, HPC users wait for application results for anywhere from milliseconds to months. So it’s no wonder that pushing product reliability beyond 99 percent uptime is a constant endeavor for designers of HPC systems. Think about it: 99 percent uptime over the course of a week leaves an unsavory 100 minutes of downtime affecting an HPC application. Imagine facing the equivalent of the “blue screen of death” when picking up a telephone handset or pressing the accelerator in a car. Very few of us would accept this standard in any other product or service. It’s time to look at strategies to bring the roughly 5,250 minutes of downtime per year that 99 percent uptime allows down to a more acceptable figure: less than 1 hour.
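The downtime figures behind each availability level are simple arithmetic, and a short sketch makes them easy to check. The function name and the choice of a 365-day year are assumptions made for illustration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60      # 10,080 minutes
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes (non-leap year, an assumption)


def downtime_minutes(availability_percent: float, period_minutes: int) -> float:
    """Return the minutes of downtime a given availability allows in a period."""
    return period_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        weekly = downtime_minutes(pct, MINUTES_PER_WEEK)
        yearly = downtime_minutes(pct, MINUTES_PER_YEAR)
        print(f"{pct}% uptime: {weekly:.1f} min/week, {yearly:.1f} min/year")
```

At 99 percent the sketch gives about 100.8 minutes per week, and at 99.99 percent about 52.6 minutes per year, which is where the "less than 1 hour" target comes from.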

In a recent University of California, Berkeley study, failures of HPC systems were attributed to the three usual suspects: hardware, software and operator error. Hardware accounts for 15 percent of system outages, with disk failure as the most frequent culprit, followed by power supplies and fans. Software is responsible for 34 percent of system outages, led by driver problems, version incompatibilities, and corrupt or stale software. Operator errors account for 51 percent of HPC outages, mainly because of the complexity of the system, the routine incompatibilities between software and components, and the manual nature of many HPC programming and maintenance tasks. Outages occur most often during installation, configuration, upgrades and problem diagnosis, and these are nearly all manual processes. Whether an outage is caused by one or all of these factors, one thing is clear: the numbers warrant improvement.

Ultimately, decreasing downtime is about preventing failures whenever possible, and about recovering quickly when failures do occur. Strategies for HPC system design that increase reliability and resiliency include:

Thermal design: HPC systems must be designed to handle the heat generated by thousands of processors running in parallel. Keeping the processors cool is important, but preventing thermal cycling (repeated swings in component temperature) matters even more, because thermal cycling can drastically shorten a component’s lifespan. Holding the temperature inside the system constant avoids thermal cycling and the attendant thermal stress, which improves mean time to failure (MTTF).

Variability reduction: Variability is the enemy of reliability. Clusters today can be composed of hardware from more than four different vendors and software from more than a dozen different vendors, leading to obvious, and often painful, integration issues. The result is difficulty testing the different components and packages, trouble isolating problems between vendor packages, version control issues and a plethora of other administration issues. Manufacturers that ensure tight integration between the hardware and management software achieve significantly higher levels of system reliability.

Single system command & control: Today administrators must log into hundreds, even thousands, of processors to view job status, check system health or update operating system software. Imagine, instead, logging into one processor and issuing one command to roll out a new release across multiple processors with an absolute guarantee of no misconfigurations. And imagine also having the ability to automatically roll back the installation of that release on all or some of the processors. These capabilities require intelligent, topology-aware software that provides a single system view. This software must be underpinned by sophisticated transaction processing techniques. It must automate and simplify many common administrative functions, including configuration, software upgrades, network, storage and user management, as well as security, resource and queue management. Single system command and control is even better if it lets administrators choose between an intuitive graphical user interface and a command line interface.
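The all-or-nothing rollout described here can be sketched in a few lines. This is a minimal illustration of the transactional idea, not any vendor's implementation: the `rollout` function, the `install` callback and the node names are all hypothetical, and a real system would also have to handle a failure during the rollback itself.

```python
from typing import Callable, Dict, List


def rollout(nodes: Dict[str, str], new_version: str,
            install: Callable[[str, str], None]) -> bool:
    """Install new_version on every node, or restore every node's old version.

    nodes maps a (hypothetical) node name to its installed version.
    install(node, version) is assumed to raise an exception on failure.
    """
    previous = dict(nodes)        # snapshot of versions, used for rollback
    updated: List[str] = []       # nodes already switched, in order
    try:
        for name in nodes:
            install(name, new_version)
            nodes[name] = new_version
            updated.append(name)
        return True
    except Exception:
        # Undo in reverse order so the cluster returns to its prior state.
        for name in reversed(updated):
            install(name, previous[name])
            nodes[name] = previous[name]
        return False


def flaky_install(node: str, version: str) -> None:
    """Stand-in installer that fails on one node for the new release."""
    if node == "node3" and version == "2.0":
        raise RuntimeError(f"install of {version} failed on {node}")


if __name__ == "__main__":
    cluster = {"node1": "1.0", "node2": "1.0", "node3": "1.0"}
    ok = rollout(cluster, "2.0", flaky_install)
    print(ok, cluster)
```

Because the third node rejects the release, the first two are returned to their earlier version and the cluster ends up exactly as it started, rather than in a mixed state.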

SELF-MONITORING & SELF-HEALING SYSTEMS

The first requirement for system resiliency is to monitor all aspects of the HPC system, both hardware and software. For an HPC system to be self-monitoring, it needs an independent supervisory network, a dedicated management processor with its own robust operating system, and fault management software, all working together to constantly monitor the system and maximize reliability and availability. The dedicated management network runs full background diagnostics to monitor the system’s health, including temperature, air velocity, fan speed, voltages and currents, without imposing overhead on processor performance. The sanity of the OS and of other key internal services, such as DNS, LDAP and NIS, can also be self-monitored.

Self-monitoring leads to self-healing: the system proactively takes action according to established policies, isolating, correcting and/or routing around problems to achieve complete system resilience. For example, if a CPU is overheating, the system can temporarily stop the job running on that CPU, take that CPU out of service, and add an unallocated CPU to that partition to get the job up and running again. In the meantime, the fans would automatically increase speed in an attempt to cool the overheating CPU. Continuous job operation is particularly significant for HPC applications that take days, weeks or even months to complete. Between jobs, it’s also important for software to refresh automatically, which further prevents system failures caused by stale software.
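The overheating example can be expressed as a small policy routine. What follows is a sketch under stated assumptions, not a description of any shipping product: the temperature threshold, the `Partition` structure, the fan speed step and the CPU names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_CPU_TEMP_C = 85.0   # hypothetical policy threshold
FAN_STEP_PCT = 25       # hypothetical fan speed increase per incident


@dataclass
class Partition:
    active: List[str]                 # CPUs currently running the job
    spares: List[str]                 # unallocated CPUs available as replacements
    out_of_service: List[str] = field(default_factory=list)
    fan_speed_pct: int = 50


def heal_overheating(partition: Partition, temps_c: Dict[str, float]) -> List[str]:
    """Apply the overheating policy and return the actions taken, in order."""
    actions: List[str] = []
    for cpu in list(partition.active):          # copy: the list is edited below
        if temps_c.get(cpu, 0.0) <= MAX_CPU_TEMP_C:
            continue
        actions.append(f"pause job on {cpu}")
        partition.active.remove(cpu)
        partition.out_of_service.append(cpu)
        actions.append(f"take {cpu} out of service")
        if partition.spares:
            spare = partition.spares.pop(0)
            partition.active.append(spare)
            actions.append(f"add {spare} to partition and resume job")
        else:
            actions.append("no spare CPU available; job stays paused")
        partition.fan_speed_pct = min(100, partition.fan_speed_pct + FAN_STEP_PCT)
        actions.append(f"raise fan speed to {partition.fan_speed_pct}%")
    return actions


if __name__ == "__main__":
    part = Partition(active=["cpu0", "cpu1"], spares=["cpu9"])
    for step in heal_overheating(part, {"cpu0": 70.0, "cpu1": 92.5}):
        print(step)
```

Keeping the policy in one routine that returns a log of its actions makes it straightforward to audit what the system did on its own and why.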

Reliability, availability and serviceability is the area of high performance computing with the lowest customer satisfaction. Outages carry significant cost and performance implications, driving the need to increase system reliability and resiliency. Management is more than a piece of software; it requires conscious thought and design in the hardware, firmware and software of a system. When an HPC system is viewed as a single entity, as opposed to a conglomerate of individual pieces, and designed from the ground up to achieve reliability and resiliency, system availability can more than triple. As manufacturers implement these strategies, four “nines” (99.99 percent) of availability will change from something HPC users only imagine to a must-have requirement.

About the Author

Dr. Paul Terry is the Chief Technology Officer and co-founder of OctigaBay Systems, an innovative new company in high performance computing. Dr. Terry oversees strategic direction for OctigaBay products and technologies and is the chief visionary behind a new HPC architecture, the Direct Connected Processor architecture. Dr. Terry holds a First Class Honors BSc in Physics with Electronics and a PhD in Electronics and Electrical Engineering from the University of Liverpool in England. He earned his MBA from Cranfield University in England.

Sources

Dave Patterson and Aaron Brown, University of California at Berkeley, in cooperation with Armando Fox, Stanford University: Recovery-Oriented Computing

