HPCwire
 The global publication of record for High Performance Computing / April 2, 2004: Vol. 13, No. 13


Features:

SGI’S DAVE PARRY SPEAKS OUT ON PRODUCTION READY LINUX
by Mike Bernhardt

While many in the HPC industry have championed Linux for some time, the mainstream IT market has been slow to adopt Linux into business-critical environments. Even many within the HPC community perceive Linux as only suitable for lower-end departmental servers and small-node clusters.

Why do these differing views exist? What accounts for the impressive performance results and scaling achieved by some organizations that readily refer to their Linux installations as "production ready?"

HPCwire interviewed SGI’s Senior Vice President, Dave Parry, to get a better understanding of the adoption of Linux in the HPC community and his views on rolling Linux into "Prime Time."

HPCwire: Why is it that so many people still feel Linux is not ready for prime time, when SGI refers to Linux as being "production ready?"

Parry: It’s primarily an attitude thing, a natural backwards-looking bias. Certainly to date most of the Linux adoption in the enterprise, in anything but a pure research or university climate, has been in the low end of the desktop or initial Beowulf cluster sort of usage, and certainly not in hardcore production technical environments. The attitude that a lot of people seem to have when considering a production performance Linux solution like Altix is, "OK, this is SGI making Linux systems that are something a little more than what everyone else is offering." We think of it differently. This isn’t an outgrowth of the existing, 32-bit low-end Linux installations that have been deployed so far. Rather, it is an attack directly from the side, straight into the production HPC market with a set of both hardware and software technologies that we’ve leveraged from our existing production-ready environments (the Origin system and IRIX operating system) and from the Linux Community. It’s less about incrementally growing Linux into the production environment than bringing Linux capability into a production-ready platform and using it in HPC environments.

There is still the perception that Linux can’t be scaled past 8 or 16 processors because, frankly, Red Hat is the leader in Linux and they’re not scaling beyond that, nor are they currently interested in scalability past that. Their strategy is more around pervasiveness of Linux in the low-end rather than Linux for capability solutions. We have been doing our own standard build of Red Hat-compatible Linux going back to the original shipment of Altix 14 months ago, and now offer SUSE as well, and all through that time we have been building in the capabilities that we and others in the Open Source community have been putting together to make Linux work well and scale beyond 16 processors. For the vast majority of what people are buying, namely Red Hat Linux, they’re right, it doesn’t scale above 8 or 16p.

And it’s not just scalability of processors, it’s bandwidth, it’s memory sizes, it’s I/O performance as well, and a lot of the work we’ve done and contributed back into the Open Source base is built around improving the I/O performance. There’s a reason for this. In HPC environments, you don’t have to push very deep before talk of just the raw processor performance melts away and the I/O or memory capability becomes an issue. The other consideration is that just having the ability for an application to run across a bunch of processors is only part of the equation. You really need a system that’s usable in an HPC environment in a scalable way. You need run-time support to make use of the system. So the things we include in our ProPack, which adds enhanced features to the base Linux distribution, such as our MemSets and our high-performance math and MPI libraries, allow an application and/or a job mix to scale reasonably onto a large system.

HPCwire: What vertical market segments do you see as being the strongest in terms of Linux adoption?

Parry: We’ve seen the strongest adoption to date certainly in the sciences -- both in the national labs for doing basic research as well as in life sciences -- and we have a number of large customers using our Linux systems for earth science as well, such as climate modeling, ocean and weather prediction. Beyond that, manufacturing is next in line with the automotive industry moving quickly to Linux, especially in Europe and Asia. Some of our existing automotive customers are strongly embracing Linux.

And in the defense and government space, there’s a lot of interest in Linux. There are a lot of initiatives in the government to drive standards implementations, and Open Source plays into that. So there are some places in the government where there’s huge interest in buying and deploying Linux in their production environments. But there are others where they are still concerned about issues of security and ‘trustability’ of the code base, and those sorts of things—all, by the way, being addressed by the Community and key vendors like SGI and IBM.

HPCwire: What vertical markets are visibly slow to adopt Linux, and why?

Parry: Within the government there have been pockets of defense and intelligence that have been slow to adopt Linux because of the lack of multi-level security and security certifications. In areas like visual simulation, it’s been slow because of the requirement for real-time capability. And if you look at energy, ISV adoption for 64-bit Linux has also been slow. On the other hand, all the pre-stack migration and seismic processing stuff is almost across-the-board running on 32-bit Linux now. So there’s the question of Linux adoption in the energy industry for their throughput processing and then there’s the question of their willingness to move capability processing, and that ends up getting us all tied up in these questions of 32 vs. 64-bit ports as well.

HPCwire: There's no doubt that Linux has a significant world-wide developer infrastructure, but once applications are developed and/or ported, is Linux really stable enough to enable production-class productivity?

Parry: Absolutely. We now have on the order of 15,000 processors of our Linux systems installed, many of them in production-oriented environments. We believe that these customers are having very solid stability on their systems. The hardware story for us is a strong one -- our system reliability is every bit as good, and in fact a little better, with our Linux systems. They’re newer and have some new hardware capabilities in them. And on the software side, we’re seeing Linux being fairly robust in these operating environments. I would absolutely say it’s stable enough for the production HPC environments. Obviously, someone’s not going to do their 24x7 mission-critical fault tolerant application, but for production HPC environments, these systems are definitely production ready.

HPCwire: What about concerns that support problems often get passed off from the hardware vendor to a Linux distribution house such as Red Hat/SUSE, or to the systems integrator, creating a "hot potato" support environment that is unacceptable for production environments?

Parry: For the legacy low-end IA-32 market, that’s a very real problem. Particularly in HPC, with Beowulf clusters, the style du jour is "roll your own" -- go buy your own hardware, buy the software and bundle it all together. And that’s a recipe for a lot of finger pointing whenever anything goes wrong. For the systems we ship there is no question about who is accountable for ensuring that the system starts working and stays working, and that’s SGI. If someone has a problem with one of our systems it doesn’t matter whether it’s hardware or software. They call us, and we will fix it. If it’s a system running our Advanced Linux Environment (ALE), which is Red Hat-compatible Linux, then all levels of support run through SGI. If it’s a system running shrink-wrap SUSE 8 Service Pack 3, they still see a single point of contact and accountability, SGI, since we take advantage of the service agreement that we have with SUSE. Under this agreement SGI still provides first-line and second-line technical support, with third-line engineering back-up from SUSE to SGI for any issues that fall specifically to their distribution. For anyone to be successful with Linux in a production environment, they clearly must have an unambiguous story for how customer-support issues will be addressed, and it can’t be a story that involves the customer having to call two or three people.

HPCwire: Let's talk about something that is critical to the HPC community -- scalability. How far can Linux really scale at this point?

Parry: Scalability is a tough question because the word means different things to different people. We have seen right from the beginning that for batch-oriented, small job-count environments that are pure processing sorts of environments, things work great. And we’ve demonstrated some really great SPECfp numbers, as well as some great performance numbers from our customers and partners, such as NASA Ames, on their applications.

But the other piece you have to think about is scalability of the I/O subsystem. To date, we have had very strong I/O scalability in our Linux systems, but the standard version of Linux has been more constrained. With the 2.6 Linux kernel coming along, there are fixes in the basic I/O infrastructure that will provide better I/O scalability there as well.

And then lastly there’s work where you have a richer job mix, and scalability is really about scalability of the kernel as well as kernel facilities. And that’s an area where there’s been a lot of work put in by us and others on things to make the scheduler actually scalable -- to improve the memory management capability and to reduce the number of areas in the kernel and the operating system facilities that make use of large overlying locks. There’s a big kernel lock in the Linux kernel that’s used by a number of different facilities. And with shared effort across SGI and others in the community, a lot of that’s been reduced to improve the overall scalability of Linux.

We have been shipping 64-processor Single System Image systems since we started shipping Altix back in January of 2003. We’ve had systems installed since last summer with a handful of customers running 128 processors in beta, and with a few select customers doing early access work with us at 256 processors and even 512 processors at NASA Ames. And in all cases we’re finding that these customers are achieving strong scalability on the systems. We announced results back in November at the Supercomputing show for the STREAM benchmark with a terabyte per second of Triad bandwidth on a 512-processor SSI (single-kernel system image) system at NASA, showing terrific scalability of system bandwidth. So we will be, as of this quarter, well ahead of the industry by offering 256 processors in an Altix SSI, and we believe that we can deliver solid scalability on that system.
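
For readers unfamiliar with the Triad figure quoted above, the following is a minimal Python sketch of what that number measures. It is an editorial illustration, not SGI's or NASA's benchmark code: the function names are hypothetical, the byte count assumes 8-byte floating-point words and the usual convention of counting two reads and one write per element, and the per-processor figure is simple division of the aggregate quoted in the interview (taking one terabyte as 1,000 gigabytes), not a measured result.

```python
def triad(a, b, c, q):
    """Triad-style kernel: a[i] = b[i] + q * c[i] for every element."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bytes(n, word_bytes=8):
    """Bytes counted per Triad pass: read b, read c, write a."""
    return 3 * word_bytes * n


def bandwidth_gb_per_s(n, seconds, word_bytes=8):
    """Sustained bandwidth in GB/s for one Triad pass over n elements."""
    return triad_bytes(n, word_bytes) / seconds / 1e9


def per_processor_gb_per_s(total_tb_per_s, processors):
    """Average share of an aggregate TB/s figure, in GB/s per processor."""
    return total_tb_per_s * 1000.0 / processors


if __name__ == "__main__":
    a, b, c = [0.0] * 3, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    triad(a, b, c, 2.0)
    print(a)
    # Aggregate figure from the interview: about 1 TB/s on 512 processors.
    print(per_processor_gb_per_s(1.0, 512))
```

Under these assumptions, a terabyte per second spread across 512 processors averages roughly 2 GB/s of sustained memory bandwidth per processor. An interpreted loop like this one only illustrates the arithmetic; it would not itself come close to the memory bandwidth of the hardware.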

HPCwire: Can you tell us about a few SGI customers who are using Linux in production environments?

Parry: We have a number of customers in the automotive industry: Mazda and Toyota in Japan, as well as Nissan, and in Europe Skoda Automotive and Daimler-Chrysler. In the life sciences area we now have a large number of Linux systems including three of the major commercial pharmaceutical players in addition to places like the US National Cancer Institute and Memorial Sloan Kettering. In defense/intelligence, we have multiple large installations for doing homeland security work.

HPCwire: Can you describe exactly what SGI delivers and supports in terms of Linux?

Parry: Sure. SGI offers two versions of Linux on our platform. One is a strip-and-ship Red Hat-compatible version, which is called the Advanced Linux Environment, with ProPack. ALE refers to our build of the Linux kernel, where we take the same modules that are used for a given Red Hat version. As of today, we are compatible with Red Hat AS2.1 and we’re moving to compatibility with Red Hat AS3.0 in the May timeframe. So we take the same modules that Red Hat uses, take out the things that aren’t relevant for our platform, and add in the things from the Open Source base that we and others have contributed that are necessary for having a strong, production-ready 64-bit system and which Red Hat hasn’t yet put into their base distribution. And then we build that, and we form this advanced Linux kernel. And then on top of that we layer ProPack, which provides a bunch of advanced services targeted specifically at HPC environments. Things like our clustered file system; the data migration facility, which is our hierarchical storage management solution; our high-performance message passing tool kit; specialized math and scientific libraries; support for job management and management of memory and CPU placement. All of that bundles up to what is really our most HPC-oriented, performance-optimized version of Linux.

For customers who have a keen need for the highest levels of certification (like Oracle’s Unbreakable Linux) with a standard distribution, we have a redistribution agreement with SUSE in which you can buy a standard shrink-wrap copy of SUSE and it will boot and run on our system and support up to 64 processors in that system.

So, on the one hand we have a maximal performance system for pure HPC systems and on the other we have maximal certified compatibility with proximity applications to the HPC environment.

HPCwire: It seems that production shops would be thrilled with the "open" aspect of Linux and the fact that their future computing roadmaps do not have to be tied to the fate of a hardware vendor. What is SGI's perception of this and do you use this as a selling point when approaching a non-Linux shop?

Parry: Absolutely. The fact that Linux allows for a hardware-agnostic environment is a key selling point for us. With a number of HPC customers having already adopted Linux either in low-end clusters for throughput needs or on their desktops, it allows us to have a compelling story for the IT manager. We can deliver a solution with SGI all the way from our midrange departmental servers with our Altix 350 through our high-end Altix 3700 servers, thus providing a single environment and full binary compatibility across that environment. But that can also be integrated into environments that have lower end Linux-based clusters or desktops where users can run a single operating system environment across virtually their entire HPC enterprise.

HPCwire: What about Linux PC-based clusters -- is that an effective, low-cost solution for technical users?

Parry: There is a class of applications which are entirely throughput oriented or highly parallelizable and not especially demanding on the architectural and bandwidth capabilities of the systems. For such homogeneous throughput workflows, 32-bit Linux clusters can be a fine solution and I think we’ll continue to see them adopted, particularly now that Intel has announced 64-bit extensions to their Xeon architecture. And those fit very well as integrated Linux hybrid solutions with our scalable 64-bit solutions for more capability and heterogeneous throughput oriented workflows.

HPCwire: SGI seems to have figured out how to push the envelope when it comes to Linux performance. Where does that performance edge come from?

Parry: It comes from 20 years of experience doing high performance computing. It is embodied all the way up and down our hardware and software value chain. From the base NUMAflex architecture that we build our systems around, providing high bandwidth and low latency and great system scalability in memory, I/O and processing, to scores of software capabilities and facilities that we have ported and migrated from our proprietary IRIX environment to make them available in our Linux environment.

HPCwire: What do you see as the biggest technical barrier to implementing Linux in a production environment?

Parry: It’s application availability. It’s having all of the standard ISV applications and middleware as well as proximity applications ported, running and optimized. Plus, it’s having a bullet-proof development environment for roll-your-own code that customers may want to deploy. As we move from more research environments to more production environments, the need to be able to use these systems for more than one or two applications grows, and that puts pressure on the need for a broader set of ISV capture. That’s something both we and Intel have had a large focus on, everything from co-marketing programs with ISVs, to loaner and demo systems, to assistance with application porting, to the application porting centers that Intel has around the world. So at SGI today, we have over 140 64-bit Linux applications that are ported and running, specifically HPC applications on Altix, of which over half have been specifically optimized to take advantage of the Altix architecture. And Intel, SUSE, and Red Hat have literally thousands of applications that they have worked on with ISVs across HPC and the enterprise for Itanium and Linux.

Another barrier, for customers who view migration to Linux as a bottom-up approach, is that the Linux applications they run today are running on low-end Linux clusters. So adopting Linux in their production environments means moving up the food chain to make their applications 64-bit friendly and make them fully capable of taking advantage of an Itanium-based system.

The other strong technical barrier is that proprietary RISC vendors have done, and continue to do, whatever they can to drive customer lock-in on their platforms. There is a need for porting of applications and capabilities from proprietary UNIX platforms and for most vendors they’re going to do whatever they can to make that more difficult.

HPCwire: After many years of growing the MIPS/IRIX technology, what drove the decision for SGI to move to a Linux-based system?

Parry: It was a combination of factors. One was an acknowledgment on the microprocessor side that Intel is investing billions of dollars in microprocessor development and focusing with the Itanium line on developing very high performance microprocessors. That’s an investment level SGI couldn’t possibly hope to match, so we saw a lot of benefit in taking our world class system architecture and coupling it with the world-class microprocessor architecture in Itanium.

On the question of the move from IRIX to Linux, it’s all about applications. And it’s becoming increasingly difficult to maintain proprietary ports of applications. ISVs want to move to a model of supporting a single source base, and Linux has a great wealth of possibilities for driving application capture.

The other factor for us was a practical matter: we seek to provide solutions for technical and scientific users and in providing an operating system there’s a pile of things you have to do for basic compatibility with things like Java and networking stacks. Developing absolutely everything for both the kernel and the operating environment, as one has to do with a proprietary UNIX strategy, requires an enormous amount of effort. By adopting a Linux strategy, we’re able to rely on the effort of the Open Source community for producing all the baseline capability that needs to be there for enterprise implementations of Linux. This approach lets us focus our efforts where we can really add value, which is in optimization and capabilities specifically targeted at the HPC market, and at technical and scientific users.

HPCwire: What channel partners do you work with to add value to your Linux-based systems?

Parry: We’re working with a number of different kinds of channel partners in adding value to our Linux systems. We continue to work with virtually all of the major federal systems integrators for systems integration capabilities into the defense and homeland security space. We also are working with many of the leading ISVs and middleware vendors to provide full capability on our systems from a software perspective, for example companies like Platform Computing and Altair for scheduling. On the hardware side, it’s companies like Voltaire, who we recently announced a partnership with to provide InfiniBand on our Altix platform. And there are many VARs who are thrilled to have a standards-based midrange solution (the Altix 350) to sell into markets that could only be served until now by proprietary solutions like the IBM p655 or Sun V-series. It’s obviously a long list… but we are always looking for new partners!

HPCwire: You recently announced Altix 350, which takes aim at UNIX SMP midrange technical servers from HP, IBM and Sun. Do you foresee a Linux server making inroads into that midrange market over the next year? Next five years?

Parry: We absolutely expect to see Linux make inroads into the departmental and technical database portions of the HPC market, which previously has been largely the sole purview of proprietary UNIX-based platforms. The Altix 350 is a solution that serves that market very well. It has great performance, can be offered at more aggressive price points than proprietary UNIX solutions, and will play well into the HPC IT manager’s desire to move to Open Source Linux solutions for their users.

HPCwire: What about graphics and visualization - don't UNIX systems deliver a major advantage in that area?

Parry: Today, that is the case, and I would argue SGI is at the front of the pack in delivering high levels of value-add for graphics and visualization with our Onyx products. On the other hand, there is a huge industry around graphics at the lower end for PC gaming and in commodity graphics clusters for low-end visual simulation applications. I expect we will see Linux begin to pervade the graphics and visualization space as well. SGI has started offering key ISVs in the graphics and visualization space a bundled developer kit for the upcoming migration of our graphics capability from the Onyx product line, which will make all those same capabilities available over time on an Itanium/Linux-based product.

HPCwire: So Dave, can you tell us what's next for Linux and what's next for SGI?

Parry: For Linux, what’s next is bringing a lot more core capability to really all of the Linux offerings. With the move to a 2.6 kernel base coming up later this year, many features and new capabilities that have been there as patches on the 2.4 base - and in some cases new features for things like preemptive job scheduling - are going to enable base Linux distributions to do a lot more. This will enable companies like SGI to do more for our specific markets for bringing things like real-time capability and security capability into Linux.

For SGI, what’s next is more pervasive penetration of our Linux solutions across all of our customers and markets. We are going to continue to push the environment of scalability and performance at the high end with the Altix 3700 platform as well as driving down into departmental midrange solutions with the Altix 350 product. We’ll be pushing across a broader swath of the HPC marketplace.

