
A SCIENCE OF BIG NUMBERS
by J. William Bell, NCSA Access Magazine Managing Editor
Natalia Maltsev calls bioinformatics a "science of big numbers." Instead of
focusing on a single cell, protein, or organism in the lab or using computer
simulation, bioinformatics looks at numerous organisms computationally. It is
the search for similarities and differences in hundreds of thousands of genome
sequences, protein structures, and other features of biological systems and
particles. When properly understood, these variations can reveal the function
of a given piece of the system.
Labs around the globe constantly churn out new sequences, running the gamut
from the simplest virus to hugely complex organisms like ourselves.
According to GOLD, an online guide to published genomic data, some 140
species' genomes have been completed and nearly 600 other organisms are
currently being sequenced. In August 2002, GenBank, which is one of the main
sequence databases, contained maps for some 22 billion nucleotide bases, the
individual building blocks that make up a gene sequence.
These data are a boon for bioinformatics experts like Maltsev; they're the big
numbers that make a science of big numbers possible. But working with them
requires tedious search sessions and cumbersome analysis procedures.
A new "analysis pipeline" that relies on grid computing promises to automate
the genome-analysis process and make it much easier. The Genome Analysis and
Database Update system, or GADU, is being developed by Argonne National
Laboratory's computational biology group. The team includes Maltsev, Dinanath
Sulakhe, and Alex Rodriguez, a PhD student in the University of Illinois at
Chicago's bioengineering department, and is part of the Alliance's data quest
expedition. The data quest expedition builds tools for data-intensive
applications, like those in bioinformatics, that run on Alliance and TeraGrid
resources. Further help, support, and guidance come from throughout the
Alliance by way of the scientific workspaces for the future expedition and the
scientific portals expedition.
"The amount of data is increasing exponentially," says Maltsev. "It dictates a
need to really be able to scale up the analysis capabilities…The Grid and
distributed computing provide an ideal match for the type of problems that
bioinformatics is facing."
After a series of test runs on the Alliance's Condor system at the University
of Wisconsin and the Chiba City cluster at Argonne, GADU was fully put through
its paces in April 2003. The application analyzed 59 microbial genomes in
about a day. This process required that more than 10,000 jobs be submitted and
represented a five-fold improvement in turnaround time, according to Maltsev.
The runs were completed on the Department of Energy's Science Grid using NCSA
network bandwidth and storage space. Another run in June further solidified
the system's value. The team compared 1.8 million protein sequences to one
another using 200 processors on a cluster at Argonne. What would have
taken about seven years to complete on a single desktop system was finished in
about three days.
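The scale of the June run can be checked with a little arithmetic. The sketch below assumes an all-against-all comparison in which each unordered pair of sequences is counted once, and it treats the article's rounded figures (1.8 million sequences, 200 processors, seven years, three days) as exact. The function names are illustrative and are not part of GADU.

```python
def pairwise_comparisons(n_sequences: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


def speedup(serial_days: float, parallel_days: float) -> float:
    """Ratio of single-machine run time to cluster run time."""
    return serial_days / parallel_days


# Figures quoted in the article, taken at face value (assumption).
n = 1_800_000
processors = 200
pairs = pairwise_comparisons(n)
ratio = speedup(7 * 365, 3)

print(f"sequence pairs: {pairs:,}")
print(f"speedup: about {ratio:.0f}x, or {ratio / processors:.1f}x per processor")
```

Under these assumptions the run covers roughly 1.6 trillion sequence pairs, and the quoted times imply a speedup of about 850. That is more than the 200 processors used, which suggests either that the cluster nodes were faster than the reference desktop or simply that the quoted times are approximations.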
"This is a great success for the field of bioinformatics--one of the first
examples of the discipline taking full advantage of a Grid-based system," says
Dan Reed, director of NCSA and the Alliance. "One of the things we have
learned over the life of the Alliance is that we're at our best when we form
multidisciplinary teams and give those teams a clear mission as they focus on
the deployment of technology. So this is also a great example of what the
Alliance expeditions teams can do and how those teams can make contributions
to a project from end to end."
Tedious and difficult becomes automatic and reliable
Before analysis can start, GADU has to cull the sequence data from specialized
databases such as GenBank. Via a Web-based interface, GADU users select the
databases that they are interested in, the bioinformatics tools they want to
employ, and the frequency with which they want their data analyzed. On that
schedule and with that list of goals, GADU compares the content of the
selected databases to the data that are already stored on the user's local
system or a designated public server. Any new content is downloaded, old
genome records are updated, and files for any new genomes are added to the
user's library. Alternatively, the user may choose to be notified by email of
new content and manually select data to be taken from the databases.
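The comparison step described above can be sketched as a diff between two catalogs. This is a minimal illustration, not GADU's actual code: the catalog layout (genome ID mapped to a version string), the `UpdatePlan` type, and the `plan_update` function are all hypothetical, as are the sample IDs.

```python
from dataclasses import dataclass, field


@dataclass
class UpdatePlan:
    new: list = field(default_factory=list)        # genomes absent from the local library
    updated: list = field(default_factory=list)    # genomes whose remote version differs
    unchanged: list = field(default_factory=list)  # genomes already up to date


def plan_update(remote: dict, local: dict) -> UpdatePlan:
    """Compare a remote catalog with a local one (genome ID -> version string)."""
    plan = UpdatePlan()
    for genome_id, remote_version in sorted(remote.items()):
        if genome_id not in local:
            plan.new.append(genome_id)
        elif local[genome_id] != remote_version:
            plan.updated.append(genome_id)
        else:
            plan.unchanged.append(genome_id)
    return plan


# Hypothetical catalogs: one genome changed remotely, one is new.
remote_catalog = {"genome_A": "v2", "genome_B": "v1", "genome_C": "v1"}
local_catalog = {"genome_A": "v1", "genome_B": "v1"}

plan = plan_update(remote_catalog, local_catalog)
print("download:", plan.new)
print("refresh: ", plan.updated)
```

A scheduler could run such a comparison at the frequency the user selected and either fetch the `new` and `updated` entries automatically or email the list for manual selection.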
Using Argonne's Globus Toolkit, the sequence data are formatted so they can be
easily passed to Grid-based machines for analysis. Annotation data--notes on
what other scientists have learned about the sequence--are parsed and stored
separately.
"What was tedious and difficult becomes automatic and reliable," says Maltsev.
Once the new sequence data are found and taken from the public databases, GADU
submits them to the Grid's computational resources for analysis. GADU
currently supports three of the most common bioinformatics tools that reside
on machines around the world: BLAST, Blocks, and Pfam. These tools, which are
variations on a theme, do the painstaking work of assessing the sequences and
finding variations. Biological sequences for which the functions are unknown
are compared to those with known functions using computationally intensive
algorithms. The analysis is automatically checked at various points in the
process. If any of these checks fails, the analysis of that segment is aborted
and restarted.
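The check-and-restart behavior can be sketched as a retry loop. The version below is a simplification under stated assumptions: it validates the finished result of each attempt rather than aborting mid-run, and it caps the number of attempts, a limit the article does not mention. The names `analyze_with_checks` and `flaky_analyze` are hypothetical.

```python
from typing import Callable, Iterable


def analyze_with_checks(
    segment: str,
    analyze: Callable[[str], str],
    checks: Iterable[Callable[[str], bool]],
    max_attempts: int = 3,  # assumed cap, not taken from the article
) -> str:
    """Run analyze(segment); if any check rejects the result, discard it and rerun."""
    checks = list(checks)
    for _ in range(max_attempts):
        result = analyze(segment)
        if all(check(result) for check in checks):
            return result
    raise RuntimeError(
        f"segment {segment!r} failed its checks after {max_attempts} attempts"
    )


# Demonstration with a stand-in analyzer that returns nothing on its first call.
attempts = []


def flaky_analyze(segment: str) -> str:
    attempts.append(segment)
    return "" if len(attempts) == 1 else f"hits for {segment}"


result = analyze_with_checks("segment_1", flaky_analyze, [lambda r: bool(r)])
print(result, "after", len(attempts), "attempts")
```

Restarting only the failed segment, rather than the whole genome, keeps the cost of a single bad job small when a run involves thousands of them.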
"We're talking about a humongous number of comparisons," says Maltsev. "An
average bacteria genome has 4,000 genes that encode proteins." Compounding
that, "six or seven genomes are acquired every month, and the pace is ever
increasing."
Storage represents GADU's third function. The system places the annotations
that accompany the public data in permanent storage, holds the files that are
to be analyzed, and formats the results so they can be searched easily. An
archive is also kept as GADU's acquisition module collects updated versions of
the same gene sequence over time. Finally, the system needs temporary storage
where intermediate versions of the data are kept during analysis. This space
was provided by NCSA during the Argonne team's April run. Chimera, a data-
management tool developed as part of NSF's Grid Physics Network (GriPhyN),
controls the flow of these data and properly labels and catalogs them for
future search.
A change in the sociology of science
A prototype of a GADU-based genome analysis server, including a public portal
that will provide access to genome data that have already been analyzed, will
be released by early fall. But the tool's implications are already fully
realized in the minds of its creators. Most obviously, it makes the systems
that bioinformaticists need for their massive calculations easy to use.
The system's public portal, which is being developed by the Alliance's
scientific portals expedition and scientific workspaces for the future
expedition, expands GADU's benefits.
Bioinformatics has historically been something of a cottage industry. Research
groups build their own in-house computers to crunch their data, which leaves
smaller universities and less well-funded groups struggling.
With GADU's portal, however, "anyone can use precomputed results [that are in
GADU's database] or they can use the secure facilities of the public server to
process their data. It will eliminate huge amounts of redundant work that
people are doing," according to Maltsev. "I personally believe [this sort of
thinking] will completely change the sociology of science."
"It will provide access for a lot of scientists who couldn't even dream about
this to the capabilities of large computations. For the sciences that rely on
the computation of huge amounts of data or complicated simulations or huge
models, it will provide the framework. People will be free to quit thinking
about the framework and implement the scientific part."
Ian Foster--associate director of Argonne's mathematics and computer science
division, co-leader of the data quest expedition, and a member of the
Alliance's Executive Committee--concurs, pointing out that a framework like
GADU has the ability to impact other disciplines. In a recent issue of
BioInform, a bioinformatics newsletter published by GenomeWeb, he said,
"Everyone thinks they're special [when they're moving their applications to
grid-based systems]…In some sense they're not because the basic technology
requirements are the same."
Funding information
This research is supported by the National Science Foundation, the National
Institutes of Health, the National Computational Science Alliance, and the
University of Chicago.
Access Online URL
http://access.ncsa.uiuc.edu/CoverStories/GADU/
Team members
Susan Coghlan
Terry Disz
Natalia Maltsev
Zach Miller
Michael Milligan
Nika Nefedova
Alex Rodriguez
Dinanath Sulakhe
Jens Voeckler
Gregor von Laszewski
Von Welch
Mike Wilde