HPCwire
 The global publication of record for High Performance Computing - LIVEwire Edition / November 9, 2004: Vol. 13, No. 45A

LIVEwire News Briefs:

PNNL's StorCloud Challenge Application Debuts At SC

The quantity of data generated by high-throughput proteomics work at Pacific Northwest National Laboratory (PNNL) threatens to overwhelm current-generation storage technologies. PNNL has partnered with SGI to use commodity storage and the Lustre file system as building blocks for its next-generation archive, which includes Active Storage.

The Lustre file system is a scalable object-based file system designed to serve very large clusters. The Lustre architecture involves a client node, a MetaData Server (MDS) to store file location and attribute information, and Object Storage Targets (OSTs) that store and serve the contents of the file itself.

Active Storage is a technology being developed by PNNL to exploit the otherwise under-utilized computational resources of Lustre OST servers.

StorCloud Challenge

As part of the StorCloud Challenge, PNNL is demonstrating a single Lustre file system striped across 40 OST servers, each containing twenty-four 400GB disk drives configured as two 12-disk RAID0 sets. Formatted, this appears to client nodes as a 348TB file system.
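The capacity figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the drives are rated in decimal gigabytes, and it relies on RAID0 being striping without parity, so no capacity is lost to redundancy. The article does not say which unit its 348TB figure uses or how much space formatting consumes, so the binary conversion is shown for comparison only.

```python
# Back-of-the-envelope capacity check for the demonstrated file system.
OST_SERVERS = 40
DRIVES_PER_SERVER = 24
RAID_SETS_PER_SERVER = 2
DRIVE_BYTES = 400 * 10**9  # assumption: "400GB" means decimal gigabytes

total_drives = OST_SERVERS * DRIVES_PER_SERVER
total_raid_sets = OST_SERVERS * RAID_SETS_PER_SERVER

# RAID0 stripes data without parity, so raw capacity is the sum of all drives.
raw_bytes = total_drives * DRIVE_BYTES
raw_decimal_tb = raw_bytes / 10**12  # decimal terabytes
raw_binary_tb = raw_bytes / 2**40    # binary terabytes (TiB)

print(f"{total_drives} drives in {total_raid_sets} RAID0 sets")
print(f"raw capacity: {raw_decimal_tb:.0f} TB decimal, {raw_binary_tb:.1f} TB binary")
```

The raw total comes to 384 decimal terabytes, or a little over 349 when counted in binary units, which is in the same range as the 348TB that clients see.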

This Lustre file system is based on a beta version of Lustre 1.4 that has been modified with Active Storage technology.

The application running within Active Storage calculates and stores all possible amino acid sequences that match an input mass and tolerance observed in an experiment run at the Environmental Molecular Sciences Laboratory (EMSL) at PNNL.

For the StorCloud Challenge, a master application running on a workstation in the PNNL booth writes files to the Lustre file system, each containing a target mass and tolerance. The Active Storage module reads this information, then calculates the resulting protein mass and sequence data and stores it in a second file within the file system. Because the Active Storage component runs on the OST server where the disks physically reside, very little data needs to cross the network.
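The request-and-result pattern described above can be sketched in ordinary user-space Python. This is not how the Active Storage module itself is implemented, which the article does not describe. The file names, the one-line "mass tolerance" request format, the `handle_request` function and the stand-in `compute` callback are all hypothetical, chosen only to show a small request file going in and a larger result file being written next to it.

```python
import tempfile
from pathlib import Path


def handle_request(request_path, result_path, compute):
    """Read 'mass tolerance' from a request file and write results to a second file.

    `compute` is a hypothetical callback standing in for the real calculation.
    Returns the number of result lines written.
    """
    mass_text, tolerance_text = Path(request_path).read_text().split()
    results = compute(float(mass_text), float(tolerance_text))
    Path(result_path).write_text("".join(f"{line}\n" for line in results))
    return len(results)


# Demonstration with a trivial stand-in computation (assumption: the real
# module would emit candidate sequences rather than the two mass bounds).
with tempfile.TemporaryDirectory() as workdir:
    request = Path(workdir) / "request.txt"
    result = Path(workdir) / "result.txt"
    request.write_text("1000.0 1.0\n")  # what the master application would write
    written = handle_request(request, result, lambda m, t: [f"{m - t}", f"{m + t}"])
    output = result.read_text()

print(written, repr(output))
```

In the demonstrated system only the small request file has to travel from the workstation; the bulky result file is produced where the disks are.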

The progress of the StorCloud challenge application will be monitored from PNNL's booth at SC04 using a 3-dimensional cluster visualization application.

Storage Science Application

Proteomics is the branch of genetics that studies the full set of proteins encoded by a genome.

Currently, much effort is being devoted to deciphering the composition of proteins in living systems by decoding the amino acid sequences found in cells. This information can be used to unravel the complex interactions between the components of living systems, for instance for drug discovery, curing diseases or engineering new biosystems for cleaning the environment. One widely used approach for obtaining information about proteins in biosystems is mass spectrometry, a process that measures the mass-to-charge ratio of peptide fragments. Although functional proteins or peptides often adopt complex three-dimensional shapes, they are denatured, or "unraveled," before being put through the mass spectrometry process. For this reason, they can be represented simply as strings of characters denoting the order and types of amino acids.

One key aspect of mass spectrometry is the ability to relate the mass of a fragment to its composition. For the purposes of the StorCloud Challenge, our intention was to generate a "candidate list" of possible peptides or fragments corresponding to a given mass, the sort of information that emerges from a mass spectrometry measurement. The extra compute power and storage capabilities afforded by Active Storage provide an ideal context to manage the combinatorial explosion of data resulting from this sort of calculation.

There are 22 unique amino acids, each typically represented by a letter of the alphabet. Their masses range from 75.07 to 204.23 mass units, depending on chemical composition. A peptide or protein is simply a sequence of amino acids joined end-to-end.

When a mass of, say, 1000.0 mass units is observed, there are 272 protein 'recipes' (amino acid combinations) that add up to exactly 1000.0. Arranged in all possible orders, these 272 recipes yield 2,634,924 unique sequences, and the resulting file is 41 MB. With a tolerance of 1.0 mass units, these numbers grow to 36,533 recipes, 437,800,841 permutations and 7,162,410,187 bytes (6.7 GB).
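The relationship between recipes and sequences can be illustrated with a toy enumeration. This is a minimal sketch, not PNNL's application. It assumes a three-letter alphabet (glycine, alanine and serine, with masses stored as integer hundredths of a mass unit) and assumes a recipe's mass is the plain sum of its amino acid masses. It will not reproduce the 272-recipe figure, which depends on the full mass table the application used. The names `find_recipes` and `sequence_count` are illustrative.

```python
from math import factorial

# Toy alphabet (assumption): masses in hundredths of a mass unit, kept as
# integers so that "exactly equal" comparisons are not affected by rounding.
MASSES = {"G": 7507, "A": 8909, "S": 10509}


def find_recipes(target, tolerance, masses):
    """Return every recipe (letter -> count) whose summed mass is within tolerance."""
    letters = sorted(masses)
    low, high = target - tolerance, target + tolerance
    found = []

    def walk(index, total, counts):
        if index == len(letters):
            if total >= low and total > 0:
                found.append(dict(counts))
            return
        letter = letters[index]
        n = 0
        # Try 0, 1, 2, ... copies of this letter while staying under the upper bound.
        while total + n * masses[letter] <= high:
            if n:
                counts[letter] = n
            walk(index + 1, total + n * masses[letter], counts)
            n += 1
        counts.pop(letter, None)

    walk(0, 0, {})
    return found


def sequence_count(recipe):
    """Number of distinct orderings of a recipe (a multinomial coefficient)."""
    total = factorial(sum(recipe.values()))
    for count in recipe.values():
        total //= factorial(count)
    return total


exact = find_recipes(26925, 0, MASSES)    # 269.25 mass units, no tolerance
wide = find_recipes(26925, 200, MASSES)   # same target, tolerance of 2.00
print(len(exact), sum(sequence_count(r) for r in exact))  # 1 6
print(len(wide), sum(sequence_count(r) for r in wide))    # 2 7
```

Even in this tiny case, widening the tolerance admits an extra recipe, and each recipe expands into as many sequences as it has distinct orderings, which is the growth pattern behind the figures quoted above.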

The goal of this work is to build a reference table that can quickly identify a set of possible protein sequences based on observed experimental data. To do that, additional work will be required to filter out impossible protein sequences and calculate expected elution times for the remaining sequence data. All of this additional pre-processing could be implemented as Active Storage modules and distributed to the file system. Ultimately, even the table scans to locate and summarize possible protein sequences could be implemented as Active Storage modules.

