HPCwire
 The global publication of record for High Performance Computing - LIVEwire Edition / November 18, 2003: Vol. 10, No. 1

  |  Table of Contents  |  

GRIDS@SC:

TerraGrid: UNLEASHING CLUSTERED COMPUTING

Terrascale recently announced TerraGrid: a revolutionary new I/O delivery platform for Linux clusters and heterogeneous distributed computing environments. TerraGrid eliminates the requirement for custom-built distributed/clustered file systems currently required to facilitate a unified namespace and provides the highest performing, most scalable and easiest to manage I/O delivery platform currently available. TerraGrid has been designed from the ground up to increase performance, deliver linear scalability and provide ease of manageability of huge datasets, while enabling clients to leverage the vast body of open-source tools, utilities and applications currently available for Linux.

Implemented as an intelligent network driver, TerraGrid enables existing "standalone" open source file systems to be deployed on thousands of cluster nodes while simultaneously accessing a unified namespace at tens of GB per second of bandwidth and sustaining millions of I/O operations per second. TerraGrid delivers block-level parallelism, thereby ensuring that all meta-data and data accesses are striped across the installed pool of storage servers. Because TerraGrid is compatible with existing Linux tools and utilities, it can be installed and managed by Linux administrators with little or no additional training.

High availibility is built in at the core of the design when existing Linux software RAID facilities are appropriately configured. In most cases during failure, applications proceed normally to completion without disruption. Deployment is extremely flexible and employs low cost commodity off-the-shelf hardware components. No specalized box products, adaptors or equipment are required for deployment. Since TerraGrid is purely a software software product, clients are free to work with their hardware vendor of choice, thereby avoiding being locked into proprietary architectures or paying non-commodity prices for commodity parts. This approach ensures that TerraGrid deployments have the lowest TCO of any storage delivery platform. As CPU horsepower continues to accelerate up the ramp defined by Moore's Law and network performance continues to grow in log-scale leaps, a TerraGrid investment delivers even higher performance and actually provides higher returns year after year.

TerraGrid provides parallel data paths between the compute nodes and I/O nodes in a Linux cluster and does away with the requirement for metadata controllers, thereby eliminating the inherent performance and capacity bottlenecks found in other vendors' solutions. The result is that the even sub-entry TerraGrid deployments deliver multiple Gigabytes/sec of sustained bandwidth while competing vendors with products that cost a multiple of TerraGrid licenses continue to quote performance numbers in Gigabits/sec. Initial performance measurements show that each TerraGrid -enabled compute node can sustain 100 MByte/sec of sustained single stream I/O and can scale linearly to hundreds of nodes until either the network and/or the pool of I/O servers is completely saturated. The consequence of linear scaling is that TerraGrid, on the high end, delivers more than 100X the bandwidth of the alternatives.

TerraGrid also dramatically lowers the cost of managing data storage by supporting Gigabytes to Petabytes of data capacity growth within a single, easily managed namespace without requiring any dowtime on the client nodes. From a connectivity perspective, TerraGrid fully supports native file systems on Linux, NFS for Unix client access and CIFS for Windows client access. This enables seamless growth of a TerraGrid installation within all user environments while consolidating TerraGrid connected storage into a unified, fast and resilient namespace.

Architecture

The following figure provides an overview of a TerraGrid deployment:

  • TerraGrid Clients: Each client/compute node runs a Linux block device driver that, in conjunction with standard Linux utilities, presents the collection of I/O servers as a huge, scalable multi-ported virtual disk drive. The network switch shown in the above figure can provide connectivity to either a LAN or a WAN and must support TCP/IP. TerraGrid uses the iSCSI protocol for transmitting block-level requests to the pool of I/O servers, and in the context of iSCSI, each client is an initiator. An open-source Linux file system such as ext2 can be mounted on this virtual disk drive and will function as massively parallel and distributed file system. All meta-data and data is striped across the pool of servers by utilizing standard Linux software RAID facilities such as md, lvm and evms. If RAID5 is chosen on the clients, the file system is parity-protected and will continue to operate normally even if an I/O server completely fails. It is important to note that, in a TerraGrid deployment, all clients see exactly the same namespace. The size of the file system seen by each client is equivalent to the aggregate size of all the virtual containers defined on the I/O servers. It is therefore possible to configure, for example, a 100 TByte file system even if only 10 TBytes of space is physically attached to the pool of I/O servers. Assuming that there is an adequate number of I/O servers, TerraGrid clients will perform at wire speed while maintaining cache consistency across all clients accessing the pool of I/O servers.
  • TerraGrid Servers: Each I/O server in the above figure is a Linux server running Terrascale's TerraGrid iSCSI target device driver. Unlike any other product in this space, the I/O servers are directly addressed in parallel by all the client nodes without the need for meta-data controllers. From a client node's perspective, each I/O server appears to client-side volume management utilities as a disk drive. The TerraGrid iSCSI target device driver supports byte-range locking and ensures complience with Posix read/write semantics. Each I/O server is connected to physical storage that is mounted locally as a file system. A sparse file of arbitrary size is then defined as a block container and exported to the pool of clients. Consequently, it is possible to define a block container that is substantially larger than the aggregate storage capacity available for export, enabling huge file systems to be defined on the client side. As physical storage utilization reaches capacity, additional disk can be added to each server without having to disrupt any of the clients. One of the significant benefits derived from the fact that the block container on the server side resides on a file system is that data blocks are cached not only on the client side, but also on the target side, effectively delivering a huge multi-level cache to the entire cluster. By virtue of the fact that each client node is able to run software RAID5 when accessing the pool of I/O servers, it is no longer necessary to deploy expensive Fiberchannel switches or multi-ported RAID hardware in order to achieve a high degree of fault tolerance -- enormous cost-saving can therefore be achieved by deploying the most cost-effective storage available without compromising on enterprise-class availability and performance.

It bears mention that, in a TerraGrid deployment, the same node can act as a client and a server.

Architectural Benefits

Performance: TerraGrid eliminates the requirement for serial meta-data controllers and provides block-level striping of all meta-data and data accesses. Consequently, linearly scalable wire-speed bandwidth is delivered to all client nodes until the either the network or the aggregated physical storage are completely saturated. In preliminary testing, a testbed consisting of 12 single-CPU clients and 11 single-CPU servers connected to a Gigabit Ethernet switch, it was possible to attain 1,214 MBytes per second of sustained throughput even though each client was generating only a single stream of I/O requests. This translates to over 100 MBytes/sec of sustained throughput per client over a link with a theoretical limit of 125 MBytes/sec. The following figure provides a synopsis of performance measurements taken on a cluster with 12 servers:

Management: The architecture of TerraGrid provides opportunities to minimize or eliminate time-consuming management tasks. For instance, the core tasks associated with managing and optimizing data layout are eliminated. The use of software striping of all clients ensures that data is distributed evenly across I/O servers. Also, when additional physical storage is incorporated into the storage system, it can be incorporated into the existing namespace by merely creating a new volume and dynamically expanding the file system to encompass this volume. Traditional Linux administration tools are utilized to manage TerraGrid, elminating the need for specilized training of systems administrators.

Security: Storage has typically relied on private networks to guarantee the security of the system and authentication of the clients. Terrascale's TerraGrid enhanced Linux filesystem suports and enforces the standard Linux/Unix security features for read write and execute. In addition TerraGrid can deploy standards based authentication and encryption including Kerberos and IPSec. This comprehensive security gives you the confidence to use more easily accessible networks, such as Ethernet for storage transactions.

Standards: iSCSI, approved as a standard in early 2003 and is in the process of becoming a mass-marketed technology. Terrascale is committed to drive the deployment of standards for high performance storage platforms. Currently we are tracking the iSCSI working group and interacting with industry vendors to ensure that open-source implementations of iSCSI are available and driven through an open standards process. Block level protocols, and in particular iSCSI, are frequently hailed as the genesis of IP-based storage networking. While protocol developments are encouraging, they are only a piece of what is required to deliver the true benefits of cost-effective storage networking. A true IP SAN can deliver the benefits of shared storage-server consolidation, increased capacity utilization, increased performance, more efficient data protection, and high availability -- but much more is required than the base protocol.

Although there are many switch and fabric vendors, applications have not been able to effectively exploit fabric capabilities due to the absence of fabric enablers that permit real-world applications to perform efficiently and scalably. Terrascale's TerraGrid product is the first fabric enabler that bridges the gap between network hardware capabilities and achievable I/O throughput -- it turns your commodity TCP/IP switch into a massively scalable IO fabric and allows existing applications to instantly exploit the full power of the network with out any modifications to the application or the network.

Flexibility: TerraGrid net utilizes widely available TCP/IP networking technologies, commodity servers and serves as an enabler that unleashes the full power of open source Linux. Given that there are no proprietary bits and pieces required, a number of deployment scenarios are made possible, enabling the most optimal configuration to be chosen for a given site.

The three major deployment scenarios are:

Hierarchical Cluster: Consists of diskless client nodes and dedicated I/O server nodes. This configuration is best suited for larger clusters (>64 nodes) wherein management complexity is reduced by separating I/O nodes from compute nodes. The compute nodes access a common namespace via open-source Linux file systems such as ext2. A subset of the "compute" nodes can run NFS/CIFS as the application, providing connectivity to non-Linux clients running Unix or Windows. Fault tolerance is achieved by configuring each compute node to run RAID5 across the I/O servers, thereby allowing continued access to data even if an entire I/O server fails or is deliberately taken offline for upgrades.

Flat Cluster: This deployment scenario is ideal for smaller clusters (<64 nodes) wherein cost is of paramount importance but throughput and high availability are still required. In this scenario, each compute node runs the TerraGrid client and server -- locally attached storage (internal disk drives or external direct attached storage) is exported to all other nodes. The storage that is locally attached to each node becomes part of a global storage pool with a unified nodes. If software RAID5 is configured on each compute node, the failure of a single compute node will not bring down the entire cluster.

Scalable NAS: For environments wherein a scalable NAS platform is desired, a variant of the flat cluster, can be deployed -- "compute" nodes now function exclusively as NAS servers. In this scenario, clients and "plug and scale" NAS bandwidth by merely adding additional NAS servers with low-cost direct attached storage. All NAS clients will have access to the same unified namespace via NFS and CIFS protocols.

Summary Of TerraGrid Benefits

  • Unified global name space, scalable to PBytes of capacity, accessible from an open source Linux file system
  • 10s of GBytes/sec of sustained throughput, millions of IOPs per second and linear scaling
  • Fully parallel block and file access from clients to storage resources connected to the storage server pool
  • Automatic, transparent recovery from the failure of single, and optionally double storage nodes
  • Easy management using the suite of standard Linux tools and utilities (fsck, mkfs, lvm, md, evms, etc.)
  • Applications run unchanged by access through the standard POSIX semantics and library interfaces
  • Access to all of the features of the file system usually available under Linux (ACLs, permissions, etc)
  • Data security and integrity are delivered through the use of standard Linux methods, such as RAID, Kerberos and IPSec.
  • Transparent access to NFS, SMB and all other existing Linux services like SCP, FTP and RCP
  • Easy deployment through the use of standard Linux tools and administration methods
  • Multiple deployment models for optimum infrastructure efficiency and productivity
  • Runs on commodity servers, workstations and networks
  • Cost of ownership is typically dramatically lower than other solutions.

Top of Page

  |  Table of Contents  |