HPCwire
 The global publication of record for High Performance Computing / November 14, 2003: Vol. 12, No. 45


Features:

A SCIENCE OF BIG NUMBERS
by J. William Bell, NCSA Access Magazine Managing Editor

Natalia Maltsev calls bioinformatics a "science of big numbers." Instead of focusing on a single cell, protein, or organism in the lab or using computer simulation, bioinformatics looks at numerous organisms computationally. It is the search for similarities and differences in hundreds of thousands of genome sequences, protein structures, and other features of biological systems and particles. When properly understood, these variations can show researchers a given piece of the system's function.

Labs around the globe constantly churn out new sequences, running the gamut from the simplest viruses to hugely complex organisms like ourselves. According to GOLD, an online guide to published genomic data, some 140 species' genomes have been completed and nearly 600 other organisms are currently being sequenced. In August 2002, GenBank, one of the main sequence databases, contained maps for some 22 billion nucleotide bases, the individual building blocks that make up a gene sequence.

These data are a boon for bioinformatics experts like Maltsev; they're the big numbers that make a science of big numbers possible. But working with them requires tedious search sessions and cumbersome analysis procedures.

A new "analysis pipeline" that relies on grid computing promises to automate the genome-analysis process and make it much easier. The Genome Analysis and Database Update system, or GADU, is being developed by Argonne National Laboratory's computational biology group. The team, which includes Maltsev, Dinanath Sulakhe, and Alex Rodriguez (a PhD student in the University of Illinois at Chicago's bioengineering department), is part of the Alliance's data quest expedition. That expedition builds tools for data-intensive applications, like those in bioinformatics, that run on Alliance and TeraGrid resources. Further help, support, and guidance come from throughout the Alliance by way of the scientific workspaces for the future expedition and the scientific portals expedition.

"The amount of data is increasing exponentially," says Maltsev. "It dictates a need to really be able to scale up the analysis capabilities… The Grid and distributed computing provide an ideal match for the type of problems that bioinformatics is facing."

After a series of test runs on the Alliance's Condor system at the University of Wisconsin and the Chiba City cluster at Argonne, GADU was fully put through its paces in April 2003. The application analyzed 59 microbial genomes in about a day. This process required that more than 10,000 jobs be submitted and represented a five-fold improvement in turnaround time, according to Maltsev. The runs were completed on the Department of Energy's Science Grid using NCSA network bandwidth and storage space. Another run in June further solidified the system's value. The team compared 1.8 million protein sequences to one another using 200 processors on a cluster at Argonne. What would have taken about seven years to complete on a single desktop system was finished in about three days.
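To give a sense of the scale of the June run, the number of sequence pairs in an all-against-all comparison can be worked out directly. The short Python sketch below is a back-of-the-envelope illustration, not GADU code. It assumes each unordered pair of sequences is counted once and uses the rounded figures quoted above (1.8 million sequences, about seven years versus about three days). The function names are hypothetical.

```python
def pairwise_comparisons(n_sequences: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


def turnaround_ratio(serial_days: float, parallel_days: float) -> float:
    """How many times shorter the parallel run is than the serial estimate."""
    return serial_days / parallel_days


if __name__ == "__main__":
    # Figures below are the rounded numbers quoted in the article.
    pairs = pairwise_comparisons(1_800_000)
    ratio = turnaround_ratio(7 * 365, 3)
    print(f"{pairs:,} sequence pairs")
    print(f"turnaround roughly {ratio:.0f} times shorter")
```

Under those assumptions the run covers on the order of 1.6 trillion pairs, which is why a single desktop is not a realistic option.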

"This is a great success for the field of bioinformatics--one of the first examples of the discipline taking full advantage of a Grid-based system," says Dan Reed, director of NCSA and the Alliance. "One of the things we have learned over the life of the Alliance is that we're at our best when we form multidisciplinary teams and give those teams a clear mission as they focus on the deployment of technology. So this is also a great example of what the Alliance expeditions teams can do and how those teams can make contributions to a project from end to end."

Tedious and difficult becomes automatic and reliable

Before analysis can start, GADU has to cull the sequence data from specialized databases such as GenBank. Via a Web-based interface, GADU users select the databases that they are interested in, the bioinformatics tools they want to employ, and the frequency with which they want their data analyzed. On that schedule and with that list of goals, GADU compares the content of the selected databases to the data that are already stored on the user's local system or a designated public server. Any new content is downloaded, old genome records are updated, and files for any new genomes are added to the user's library. Alternatively, the user may choose to be notified by email of new content and manually select data to be taken from the databases.
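The comparison step described above can be pictured as a simple diff between what a public database offers and what the user already holds. The Python sketch below is only an illustration under stated assumptions: the article does not say how GADU tracks versions, so the record identifiers, the version markers, and the names `SyncPlan` and `plan_update` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    """Which genome records to add, refresh, or leave alone."""
    to_add: list = field(default_factory=list)
    to_update: list = field(default_factory=list)
    unchanged: list = field(default_factory=list)


def plan_update(remote: dict, local: dict) -> SyncPlan:
    """Compare remote records against the local library.

    Both arguments map a record identifier to a version marker
    (an assumption; a release number or checksum would serve).
    """
    plan = SyncPlan()
    for record_id, remote_version in sorted(remote.items()):
        if record_id not in local:
            plan.to_add.append(record_id)        # new genome: download it
        elif local[record_id] != remote_version:
            plan.to_update.append(record_id)     # stale record: refresh it
        else:
            plan.unchanged.append(record_id)     # already current
    return plan


if __name__ == "__main__":
    remote = {"genome_A": 3, "genome_B": 1, "genome_C": 2}
    local = {"genome_A": 2, "genome_B": 1}
    plan = plan_update(remote, local)
    print("add:", plan.to_add)
    print("update:", plan.to_update)
```

In the email-notification mode the article mentions, the same plan could be sent to the user for manual selection instead of being acted on automatically.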

Using Argonne's Globus toolkit, the sequence data are formatted so they can be easily passed to Grid-based machines for analysis. Annotation data--notes on what other scientists have learned about the sequence--are parsed and stored separately.

"What was tedious and difficult becomes automatic and reliable," says Maltsev.

Once the new sequence data are found and taken from the public databases, GADU submits them to the Grid's computational resources for analysis. GADU currently supports three of the most common bioinformatics tools that reside on machines around the world: BLAST, Blocks, and Pfam. These tools, which are variations on a theme, do the painstaking work of assessing the sequences and finding variations. Biological sequences for which the functions are unknown are compared to those with known functions using computationally intensive algorithms. The analysis is automatically checked at various points in the process. If any of these checks fails, the analysis of that segment is aborted and restarted.
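The check-and-restart behavior described above follows a familiar pattern: run a segment, validate the result, and start over if validation fails. The Python sketch below shows that pattern in its simplest form. It is not GADU's implementation; the names `run_with_restart` and `CheckFailed` are hypothetical, and the cap on attempts is an added assumption, since the article does not say how many restarts are allowed.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class CheckFailed(Exception):
    """Raised when a segment keeps failing its check."""


def run_with_restart(
    analyze: Callable[[], T],
    check: Callable[[T], bool],
    max_attempts: int = 3,  # assumed limit, not taken from the article
) -> T:
    """Run analyze(); if check(result) fails, discard it and start over."""
    for _ in range(max_attempts):
        result = analyze()
        if check(result):
            return result
        # Check failed: abandon this attempt and restart the segment.
    raise CheckFailed(f"segment failed its check {max_attempts} times")


if __name__ == "__main__":
    attempts = []

    def flaky_analysis():
        # Stand-in for a job that fails once, then succeeds.
        attempts.append(1)
        return None if len(attempts) < 2 else ["hit_1", "hit_2"]

    hits = run_with_restart(flaky_analysis, check=lambda r: r is not None)
    print(hits, "after", len(attempts), "attempts")
```

With more than 10,000 jobs in a single run, isolating failures to one segment means a bad result costs only that segment's work rather than the whole analysis.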

"We're talking about a humongous number of comparisons," says Maltsev. "An average bacteria genome has 4,000 genes that encode proteins." Compounding that, "six or seven genomes are acquired every month, and the pace is ever increasing."

Storage represents GADU's third function. The system places the annotations that accompany the public data in permanent storage, holds the files that are to be analyzed, and formats the results so they can be searched easily. An archive is also kept as GADU's acquisition module collects updated versions of the same gene sequence over time. Finally, the system needs temporary storage where intermediate versions of the data are kept during analysis. This space was provided by NCSA during the Argonne team's April run. Chimera, a data-management tool developed as part of NSF's Grid Physics Network (GriPhyN), controls the flow of these data and properly labels and catalogs them for future search.

A change in the sociology of science

A prototype of a GADU-based genome analysis server, including a public portal that will provide access to genome data that have already been analyzed, will be released by early fall. But the tool's implications are already fully realized in the minds of its creators. Most obviously, it makes the systems that bioinformaticists need for their massive calculations easy to use.

The system's public portal, which is being developed by the Alliance's scientific portals expedition and scientific workspaces for the future expedition, expands GADU's benefits.

Bioinformatics has historically been something of a cottage industry. Research groups build their own in-house computers to crunch their data. This situation leaves those at smaller universities and groups with less funding to struggle. With GADU's portal, however, "anyone can use precomputed results [that are in GADU's database] or they can use the secure facilities of the public server to process their data. It will eliminate huge amounts of redundant work that people are doing," according to Maltsev. "I personally believe [this sort of thinking] will completely change the sociology of science."

"It will provide access for a lot of scientists who couldn't even dream about this to the capabilities of large computations. For the sciences that rely on the computation of huge amounts of data or complicated simulations or huge models, it will provide the framework. People will be free to quit thinking about the framework and implement the scientific part."

Ian Foster--associate director of Argonne's mathematics and computer science division, co-leader of the data quest expedition, and a member of the Alliance's Executive Committee--concurs, pointing out that a framework like GADU has the ability to impact other disciplines. In a recent issue of BioInform, a bioinformatics newsletter published by GenomeWeb, he said, "Everyone thinks they're special [when they're moving their applications to grid-based systems]… In some sense they're not because the basic technology requirements are the same."

Funding information

This research is supported by the National Science Foundation, the National Institutes of Health, the National Computational Science Alliance, and the University of Chicago.

Access Online URL

http://access.ncsa.uiuc.edu/CoverStories/GADU/

Team members

Susan Coghlan, Terry Disz, Natalia Maltsev, Zach Miller, Michael Milligan, Nika Nefedova, Alex Rodriguez, Dinanath Sulakhe, Jens Voeckler, Gregor von Laszewski, Von Welch, Mike Wilde

