
Features:
SGI’S DAVE PARRY SPEAKS OUT ON PRODUCTION READY LINUX
by Mike Bernhardt
While many in the HPC industry have championed Linux for some time, the
mainstream IT market has been slow to adopt Linux into business-critical
environments. Even many within the HPC community perceive Linux as only
suitable for lower-end departmental servers and small-node clusters.
Why do these differing views exist? What accounts for the impressive
performance results and scaling achieved by some organizations that readily
refer to their Linux installations as "production ready?"
HPCwire interviewed SGI’s Senior Vice President, Dave Parry, to get a better
understanding of the adoption of Linux in the HPC community and his views on
rolling Linux into "Prime Time."
HPCwire: Why is it that so many people still feel Linux is not ready for
prime time, when SGI refers to Linux as being "production ready?"
Parry: It’s primarily an attitude thing—a natural backwards-looking bias.
Certainly to date most of the Linux adoption in the enterprise - in anything
but a pure research or university climate - has been in the low end of the
desktop or initial Beowulf cluster sort of usage – certainly not in hardcore
production technical environments. The attitude that a lot of people seem to
have when considering a production performance Linux solution like Altix is,
"OK this is SGI making Linux systems that are something a little more than
what everyone else is offering." We think of it differently. This isn’t an
outgrowth of the existing, 32-bit low-end Linux installations that have been
deployed so far. Rather, it is an attack directly from the side, straight
into the production HPC market with a set of both hardware and software
technologies that we’ve leveraged from our existing production ready
environments – with the Origin system and IRIX operating system - and the
Linux Community. It’s less about incrementally growing Linux into the
production environment than bringing Linux capability into a production ready
platform and using it in HPC environments.
There is still the perception that Linux can’t be scaled past 8 or 16
processors because, frankly, Red Hat is the leader in Linux and they’re not
scaling beyond that, nor are they currently interested in scalability past
that. Their strategy is more around pervasiveness of Linux in the low-end
rather than Linux for capability solutions. We have been doing our own
standard build of Red Hat-compatible Linux going back to the original shipment
of Altix 14 months ago, and now offer SUSE as well, and all through that time
we have been building in the capabilities that we and others in the Open
Source community have been putting together to make Linux work well and scale
beyond 16 processors. For the vast majority of what people are buying, namely
Red Hat Linux, they’re right, it doesn’t scale above 8 or 16p.
And it’s not just scalability of processors, it’s bandwidth, it’s memory
sizes, it’s I/O performance as well, and a lot of the work we’ve done and
contributed back into the Open Source base is built around improving the I/O
performance. There’s a reason for this. In HPC environments, you don’t have
to push very deep before talk of just the raw processor performance melts away
and the I/O or memory capability becomes an issue. The other consideration is
that just having the ability for an application to run across a bunch of
processors is only part of the equation. You really need a system that’s
usable in an HPC environment in a scalable way. You need run-time support to
make use of the system. So the things we include in our ProPack, which adds
enhanced features to the base Linux distribution, such as our MemSets and our
high-performance math and MPI libraries, allow an application and/or a job mix
to scale reasonably onto a large system.
HPCwire: What vertical market segments do you see as being the strongest in
terms of Linux adoption?
Parry: We’ve seen the strongest adoption to date certainly in the sciences --
both in the national labs for doing basic research as well as in life sciences
-- and we have a number of large customers using our Linux systems for earth
science as well, such as climate modeling, ocean and weather prediction.
Beyond that, manufacturing is next in line with the automotive industry moving
quickly to Linux, especially in Europe and Asia. Some of our existing
automotive customers are strongly embracing Linux.
And in the defense and government space, there’s a lot of interest in Linux.
There are a lot of initiatives in the government to drive standards
implementations, and Open Source plays into that. So there are some places in
the government where there’s huge interest in buying and deploying Linux in
their production environments. But there are others where they are still
concerned about issues of security and ‘trustability’ of the code base, and
those sorts of things—all, by the way, being addressed by the Community and
key vendors like SGI and IBM.
HPCwire: What vertical markets are visibly slow to adopt Linux, and why?
Parry: Within the government there have been pockets of defense and
intelligence that have been slow to adopt Linux because of the lack of multi-
level security and security certifications. In areas like visual simulation,
it’s been slow because of the requirement for real-time capability. And if
you look at energy, ISV adoption for 64-bit Linux has also been slow. On the
other hand, all the pre-stack migration and seismic processing stuff is almost
across-the-board running on 32-bit Linux now. So there’s the question of
Linux adoption in the energy industry for their throughput processing and then
there’s the question of their willingness to move capability processing, and
that ends up getting us all tied up in these questions of 32 vs. 64-bit ports
as well.
HPCwire: There's no doubt that Linux has a significant world-wide developer
infrastructure, but once applications are developed and/or ported, is Linux
really stable enough to enable production-class productivity?
Parry: Absolutely. We now have on the order of 15,000 processors of our
Linux systems installed, many of them in production-oriented environments. We
believe that these customers are having very solid stability on their systems.
The hardware story for us is a strong one -- our system reliability is every
bit as good, and in fact a little better, with our Linux systems. They’re
newer and have some new hardware capabilities in them. And on the software
side, we’re seeing Linux being fairly robust in these operating environments.
I would absolutely say it’s stable enough for the production HPC environments.
Obviously, someone’s not going to run their 24x7 mission-critical,
fault-tolerant application on them, but for production HPC environments, these
systems are definitely production ready.
HPCwire: What about concerns that support problems often get passed off from
the hardware vendor to a Linux distribution house such as Red Hat/SUSE, or to
the systems integrator, creating a "hot potato" support environment that is
unacceptable for production environments?
Parry: For the legacy low-end IA-32 market, that’s a very real problem.
Particularly in HPC, with Beowulf clusters, the style du jour is "roll your
own" -- go buy your own hardware, buy the software and bundle it all together.
And that’s a recipe for a lot of finger pointing whenever anything goes wrong.
For the systems we ship there is no question about who is accountable for
ensuring that the system starts working and stays working – and that’s SGI.
If someone has a problem with one of our systems it doesn’t matter whether
it’s hardware or software. They call us, and we will fix it. If it’s a
system running our Advanced Linux Environment (ALE), which is Red Hat-compatible
Linux, then all levels of support run through SGI. If it’s a
system running shrink-wrap SUSE 8 Service Pack 3, they still see a single
point of contact and accountability – SGI - since we take advantage of the
service agreement that we have with SUSE. This agreement provides for SGI to
still be the first line and second line technical support; and for third line
engineering back-up from SUSE to SGI for any issues that fall specifically to
their distribution. For anyone to be successful with Linux in a production
environment, they clearly must have an unambiguous story for how customer-
support issues will be addressed - and it can’t be a story that involves the
customer having to call two or three people.
HPCwire: Let's talk about something that is critical to the HPC community --
scalability. How far can Linux really scale at this point?
Parry: Scalability is a tough question because the word means different things
to different people. We have seen right from the beginning that for batch-
oriented, small job-count environments that are pure processing sorts of
environments, things work great. And we’ve demonstrated some really great
SPECfp numbers, as well as some great performance numbers from our customers
and partners, such as NASA Ames, on their applications.
But the other piece you have to think about is scalability of the I/O
subsystem. To date, we have very strong scalability in our Linux systems.
But the standard version has been more constrained. With the 2.6 Linux kernel
coming along, there are fixes in the basic I/O infrastructure that will
provide better I/O scalability there as well. And then lastly there’s work
where you have a richer job mix, and scalability is really about scalability
of the kernel as well as kernel facilities. And that’s an area where there’s
been a lot of work put in by us and others on things to make the scheduler
actually scalable -- to improve the memory management capability and to reduce
the number of areas in the kernel and the operating system facilities that
make use of large overlying locks. There’s a big kernel lock in the Linux
kernel that’s called by a number of different facilities. And with shared
effort across SGI and others in the community, a lot of that’s been reduced to
improve the overall scalability of Linux. We have been shipping 64-processor
Single System Image systems since we started shipping Altix back in January of
2003. We’ve had systems installed since last summer with a handful of
customers running 128 processors in beta, and with a few select customers
doing early access work with us at 256 processors and even 512 processors at
NASA Ames. And in all cases we’re finding that these customers are achieving
strong scalability on the systems. We announced results back in November at
the Supercomputing show for the STREAM benchmark with a terabyte per second
of Triad bandwidth on a 512-processor SSI (single-kernel system image) system
at NASA showing terrific scalability of system bandwidth. So, as of this
quarter, we will be well ahead of the industry by offering 256 processors in an
Altix SSI, and we believe that we can deliver solid scalability on that
system.
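[Editor's note: For readers unfamiliar with the benchmark Parry cites, the
Triad kernel computes a[i] = b[i] + scalar * c[i] over large arrays and reports
the bytes moved per second. The sketch below is an editorial illustration in
Python, not SGI's code or the benchmark's own source; the function names and
array size are hypothetical, and interpreted Python will run far slower than
the compiled benchmark. It also works through the arithmetic behind the quoted
figure: assuming a terabyte means 10^12 bytes, a terabyte per second spread
evenly over 512 processors is a little under 2 GB/s per processor.]

```python
import time
from array import array

BYTES_PER_DOUBLE = 8


def triad(a, b, c, scalar):
    """STREAM-style Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bandwidth(n=200_000, scalar=3.0):
    """Time one Triad pass and return bytes moved per second.

    Triad touches three arrays per element (two reads, one write),
    so one pass moves 3 * n * 8 bytes. The array size n is an
    arbitrary illustrative value.
    """
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, scalar)
    elapsed = time.perf_counter() - start
    return 3 * n * BYTES_PER_DOUBLE / elapsed


def per_processor_bandwidth(total_bytes_per_s, processors):
    """Average share of an aggregate bandwidth figure per processor."""
    return total_bytes_per_s / processors


if __name__ == "__main__":
    print(f"Measured Triad rate: {triad_bandwidth() / 1e6:.1f} MB/s")
    # Assumption: 1 TB/s is taken as 1e12 bytes per second.
    share = per_processor_bandwidth(1e12, 512)
    print(f"1 TB/s over 512 processors: {share / 1e9:.2f} GB/s each")
```

[The per-processor figure is only an average; it says nothing about how the
bandwidth was actually distributed across the NASA system.]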
HPCwire: Can you tell us about a few SGI customers who are using Linux in
production environments?
Parry: We have a number of customers in the automotive industry: Mazda and
Toyota in Japan, as well as Nissan, and in Europe Skoda Automotive and
Daimler-Chrysler. In the life sciences area we now have a large number of
Linux systems including three of the major commercial pharmaceutical players
in addition to places like the US National Cancer Institute and Memorial Sloan
Kettering. In defense/intelligence, we have multiple large installations for
doing homeland security work.
HPCwire: Can you describe exactly what SGI delivers and supports in terms of
Linux?
Parry: Sure. SGI offers two versions of Linux on our platform. One is a
strip-and-ship Red Hat-compatible version, which is called the Advanced Linux
Environment, with ProPack. ALE refers to our build of the Linux kernel, where
we take the same modules that are used for a given Red Hat version. As of
today, we are compatible with Red Hat AS2.1 and we’re moving to compatibility
with Red Hat AS3.0 in the May timeframe. So we take the same modules that Red
Hat uses, take out the things that aren’t relevant for our platform, add in
the things from the Open Source base that we and others have contributed that
are necessary for having a strong, production ready 64-bit system and which
Red Hat hasn’t yet put into their base distribution. And then we build that,
and we form this advanced Linux kernel. And then on top of that we layer
ProPack, which provides a bunch of advanced services targeted specifically at
HPC environments. Things like our clustered file system; the data migration
facility, which is our hierarchical storage management solution; our high-
performance message passing tool kit, specialized math and scientific
libraries; support for job management and management of memory and CPU
placement. All of that bundles up to what is really our most HPC-oriented,
performance optimized version of Linux.
For customers who have a keen need for the highest levels of certification
(like Oracle’s Unbreakable Linux) with a standard distribution, we have a re-
distribution agreement with SUSE in which you can buy a standard shrink-wrap
copy of SUSE and it will boot and run on our system and support up to 64
processors in that system.
So, on the one hand we have a maximal performance system for pure HPC systems
and on the other we have maximal certified compatibility with proximity
applications to the HPC environment.
HPCwire: It seems that production shops would be thrilled with the "open"
aspect of Linux and the fact that their future computing roadmaps do not have
to be tied to the fate of a hardware vendor. What is SGI's perception of this
and do you use this as a selling point when approaching a non-Linux shop?
Parry: Absolutely. The fact that Linux allows for a hardware-agnostic
environment is a key selling point for us. With a number of HPC customers
having already adopted Linux either in low-end clusters for throughput needs
or on their desktops, it allows us to have a compelling story for the IT
manager. We can deliver a solution with SGI all the way from our midrange
departmental servers with our Altix 350 through our high-end Altix 3700
servers, thus providing a single environment and full binary compatibility
across that environment. But that can also be integrated into environments
that have lower end Linux-based clusters or desktops where users can run a
single operating system environment across virtually their entire HPC
enterprise.
HPCwire: What about Linux PC-based clusters -- is that an effective, low-cost
solution for technical users?
Parry: There is a class of applications which are entirely throughput
oriented or highly parallelizable and not especially demanding on the
architectural and bandwidth capabilities of the systems. For such homogeneous
throughput workflows, 32-bit Linux clusters can be a fine solution and I think
we’ll continue to see them adopted, particularly now that Intel has announced
64-bit extensions to their Xeon architecture. And those fit very well as
integrated Linux hybrid solutions with our scalable 64-bit solutions for more
capability and heterogeneous throughput oriented workflows.
HPCwire: SGI seems to have figured out how to push the envelope when it comes
to Linux performance. Where does that performance edge come from?
Parry: It comes from 20 years of experience doing high performance computing.
It is embodied all the way up and down our hardware and software value chain.
From the base NUMAflex architecture that we build our systems around,
providing high bandwidth and low latency and great system scalability in
memory, I/O and processing, to scores of software capabilities and facilities
that we have ported and migrated from our proprietary IRIX environment to make
them available in our Linux environment.
HPCwire: What do you see as the biggest technical barrier to implementing
Linux in a production environment?
Parry: It’s application availability. It’s having all of the standard ISV
applications and middleware as well as proximity applications ported, running
and optimized. Plus, it’s having a bullet-proof development environment for
roll-your-own code that customers may want to deploy. As we move from more
research environments to more production environments, the need to be able to
use these systems for more than one or two applications grows, and that puts
pressure on the need for a broader set of ISV capture. That’s something both
we and Intel have had a large focus on, everything from co-marketing programs
with ISVs, to loaner and demo systems to assisting with application porting,
to application porting centers that Intel has around the world. So in SGI
today, we have over 140 64-bit Linux applications that are ported and running
– specifically HPC applications on Altix - of which over half have been
specifically optimized to take advantage of the Altix architecture. And
Intel, SUSE, and Red Hat have literally thousands of applications that they
have worked on with ISVs across HPC and the enterprise for Itanium and Linux.
Another barrier, for customers who view migration to Linux as a bottom-up
approach, is that the Linux applications they run today are running on low-end
Linux clusters. So adopting Linux in their production environments means
moving up the food chain to make their applications 64-bit friendly and make
them fully capable of taking advantage of an Itanium-based system.
The other strong technical barrier is that proprietary RISC vendors have done,
and continue to do, whatever they can to drive customer lock-in on their
platforms. There is a need for porting of applications and capabilities from
proprietary UNIX platforms and for most vendors they’re going to do whatever
they can to make that more difficult.
HPCwire: After many years of growing the MIPS/IRIX technology, what drove the
decision for SGI to move to a Linux-based system?
Parry: It was a combination of factors. One was an acknowledgment on the
microprocessor side that Intel is investing billions of dollars in
microprocessor development and focusing with the Itanium line on developing
very high performance microprocessors. That’s an investment level SGI
couldn’t possibly hope to match, so we saw a lot of benefit in taking our
world class system architecture and coupling it with the world-class
microprocessor architecture in Itanium.
On the question of the move from IRIX to Linux, it’s all about applications.
And it’s becoming increasingly difficult to maintain proprietary ports of
applications. ISVs want to move to a model of supporting a single source
base, and Linux has a great wealth of possibilities for driving application
capture.
The other factor for us was a practical matter: we seek to provide solutions
for technical and scientific users and in providing an operating system
there’s a pile of things you have to do for basic compatibility with things
like Java and networking stacks. Developing absolutely everything for both the
kernel and the operating environment, as one has to do with a proprietary UNIX
strategy, requires an enormous amount of effort. By adopting a Linux
strategy, we’re able to rely on the effort of the Open Source community for
producing all the baseline capability that needs to be there for enterprise
implementations of Linux. This approach lets us focus our efforts where we
can really add value, which is in optimization and capabilities specifically
targeted at the HPC market, and at technical and scientific users.
HPCwire: What channel partners do you work with to add value to your
Linux-based systems?
Parry: We’re working with a number of different kinds of channel partners in
adding value to our Linux systems. We continue to work with virtually all of
the major federal systems integrators for systems integration capabilities
into the defense and homeland security space. We also are working with many
of the leading ISVs and middleware vendors to provide full capability on our
systems from a software perspective, for example companies like Platform
Computing and Altair for scheduling. On the hardware side, it’s companies like
Voltaire, with whom we recently announced a partnership to provide InfiniBand
on our Altix platform. And there are many VARs who are thrilled to have a
standards-based midrange solution (the Altix 350) to sell into markets that
could only be served until now by proprietary solutions like the IBM p655 or
Sun V-series. It’s obviously a long list… but we are always looking for new
partners!
HPCwire: You recently announced Altix 350, which takes aim at UNIX SMP
midrange technical servers from HP, IBM and Sun. Do you foresee a Linux
server making inroads into that midrange market over the next year? Next five
years?
Parry: We absolutely expect to see Linux make inroads into the departmental
and technical database portions of the HPC market, which previously has been
largely the sole purview of proprietary UNIX-based platforms. The Altix 350
is a solution that serves that market very well. It has great performance,
can be offered at more aggressive price points than proprietary UNIX
solutions, and will play well into the HPC IT manager’s desire to move to Open
Source Linux solutions for their users.
HPCwire: What about graphics and visualization - don't UNIX systems deliver
a major advantage in that area?
Parry: Today, that is the case, and I would argue SGI is at the front of the
pack in delivering high levels of value-add for graphics and visualization
with our Onyx products. On the other hand, there is a huge industry around
graphics at the lower end for PC gaming and in commodity graphics clusters for
low-end visual simulation applications. I expect we will see Linux begin to
pervade the graphics and visualization space as well. SGI has started
offering a developer kit bundled to key ISVs in the graphics and visualization
space of what will be our upcoming migration of our graphics capability from
the Onyx product line to make all those same capabilities available over time
on an Itanium/Linux-based product.
HPCwire: So Dave, can you tell us what's next for Linux and what's next for
SGI?
Parry: For Linux, what’s next is bringing a lot more core capability to
really all of the Linux offerings. With the move to a 2.6 kernel base coming
up later this year, many features and new capabilities that have been there as
patches on the 2.4 base - and in some cases new features for things like
preemptive job scheduling - are going to enable base Linux distributions to do
a lot more. This will enable companies like SGI to do more for our specific
markets for bringing things like real-time capability and security capability
into Linux.
For SGI, what’s next is more pervasive penetration of our Linux solutions
across all of our customers and markets. We are going to continue to push the
environment of scalability and performance at the high end with the Altix 3700
platform as well as driving down into departmental midrange solutions with the
Altix 350 product. We’ll be pushing across a broader swath of the HPC
marketplace.