
Features - Enterprise Data Insights:
THINK YOUR DATA INFRASTRUCTURE IS GETTING OUT OF HAND?
by Todd Spangler
Think your data infrastructure is getting out of hand? Try telling that to
Michelle Butler, who oversees a storage infrastructure that grows one to two
Tbytes per day.
Butler is technical program manager of the Storage Enabling Technologies
Group at the National Center for Supercomputing Applications (NCSA) at the
University of Illinois in Champaign-Urbana, Ill. NCSA, which is probably best
known as the place where the graphical Web browser was invented, provides
high-performance computing systems for a wide variety of science and
engineering programs, everything from earthquake modeling to biochemical
research.
NCSA's storage group, which comprises seven full-time staffers including
Butler, is in charge of managing a 500-Tbyte (and growing) mass storage
system; backup and recovery of that data; all of the SAN production and
research; and research into parallel and clustered file systems. NCSA's mass
storage system has between 2,000 and 3,000 active users at any given time.
Yet even with such huge storage capacity requirements, NCSA had resisted
implementing a SAN until March 2002. Butler says that until recently she felt
SAN technology wasn't reliable enough. "SAN just wasn't as safe as we needed
it to be for our production environment," she says. "We really needed
high-availability switches to make sure they didn't go down."
The final decision to go to a SAN architecture was sparked by the fact that
the NCSA's Windows NT, Unix, and mass storage groups were each getting ready
to purchase vast amounts of new disk storage at the same time. Click! The
light bulb flicked on. "With the SAN, we wanted to bring a large amount of
disk in here so that multiple systems could access it," Butler says. After
some lab testing, Butler felt assured that Fibre Channel infrastructure was
reliable enough to run NCSA's storage on.
In the first phase of the SAN rollout, NCSA deployed 60 Tbytes of DataDirect
Networks Inc storage connected to eight 16-port Brocade Communications
Systems Inc SilkWorm 3800 switches and one 64-port SilkWorm 12000. The SAN
connects more than 200 host servers via QLogic Corp host bus adapters.
Butler says NCSA selected Brocade because the company was willing to engage
NCSA as a development partner rather than as an ordinary customer. "We are
trying to break the 12000 so they can build a better switch," Butler says.
Plus, she says, she got "really great pricing." NCSA also wanted to use
Brocade's Fabric Access API to pull data into its proprietary management
system to monitor the health of each switch down to the port level.
NCSA's storage group tested the 12000's processor-failover capability by
throwing corrupt data at the switch. Butler says that feature worked as
advertised.
However, the group did encounter an issue in trying to upgrade the 12000's
firmware to fix a date-related bug in the switch. Since the switch doesn't
support nondisruptive code activation, it must be taken offline while the
update occurs. "I would say it's a drawback that you have to bring the whole
switch down to do a code load," she says. "Even though it's just for 10
minutes, it brings the whole center down." Brocade has promised to deliver hot
code-load activation for the 12000 early next year.
But that shortcoming hasn't stopped NCSA from buying three more 12000s (with a
fourth on the way), which it will use in server clusters for TeraGrid, a large
computing network sponsored by the National Science Foundation (NSF) that will
be distributed among five research facilities -- NCSA, Argonne National
Laboratory, the California Institute of Technology, the Pittsburgh
Supercomputing Center, and the San Diego Supercomputer Center.
In January, NCSA will receive the first machines that will be part of
TeraGrid: 256 IBM Corp Linux servers based on Itanium, Intel Corp's 64-bit
microprocessor. That will be followed by 700 servers in June. NCSA's TeraGrid
cluster will include 230 Tbytes of spinning disk initially, running in IBM
FastT 700 arrays. Butler says it may add another 200 Tbytes later in the year.
However, Butler says, the TeraGrid SAN is not yet ready for prime time,
primarily because of the immaturity of Linux. "In a Linux environment, it's
hard to build a bulletproof SAN," she says. "Right now the Linux OS can't
fail over to an alternate path to their system disk. I don't have support from
the file system, so if the Linux systems go down they're dead in the
water."
Older Unix operating systems, such as those from Sun Microsystems Inc and
Hewlett-Packard Co, already include multipath I/O. Butler says Red Hat Inc and
other vendors are building enterprise features into Linux.
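For readers unfamiliar with the term, the idea behind multipath I/O can be sketched in a few lines of Python. This is an illustrative toy, not how any vendor's driver or NCSA's systems work: the `Path` and `MultipathDevice` classes, the `PathDown` exception, and the path names are all hypothetical. It only shows the behavior Butler describes as missing, where a request that fails on one route to a disk is retried on an alternate route.

```python
class PathDown(Exception):
    """Raised when a route to the storage is unavailable (hypothetical)."""


class Path:
    """One hypothetical route (adapter -> switch -> array) to the same disk."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def read(self, block):
        if not self.healthy:
            raise PathDown(self.name)
        return f"block {block} via {self.name}"


class MultipathDevice:
    """Presents several paths to one disk as a single device."""

    def __init__(self, paths):
        if not paths:
            raise ValueError("at least one path is required")
        self.paths = list(paths)
        self.active = 0  # index of the path currently in use

    def read(self, block):
        # Try the active path first, then each alternate in turn.
        for offset in range(len(self.paths)):
            index = (self.active + offset) % len(self.paths)
            try:
                result = self.paths[index].read(block)
            except PathDown:
                continue  # this path is dead; try the next one
            self.active = index  # remember the path that worked
            return result
        raise PathDown("all paths to the device are down")


primary = Path("hba0-switchA")
alternate = Path("hba1-switchB")
disk = MultipathDevice([primary, alternate])

print(disk.read(7))      # served by the primary path
primary.healthy = False  # simulate a switch or cable failure
print(disk.read(7))      # served by the alternate path instead
```

Without the retry loop in `MultipathDevice.read`, the second request would simply fail, which is the "dead in the water" case for a server whose only path to its system disk goes down.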
"Linux is new," Butler says. "This is part of its evolution, and we're pushing
the technology as fast as it can go."