[ PREVIOUS ARTICLE | Table of Contents | NEXT ARTICLE ]

DATA WAREHOUSING SYSTEM ADMINISTRATION, A BUSINESS-CRITICAL SYSTEM, PART IV
by Nigel Chang


Disk Striping and Disk Space Management

Disk striping is a must in data warehousing to improve I/O performance. However, striping would result in lower data availability. The systems administrator, prior to disk volume grouping and configuration, has to understand the trade-off and figure out the best striping set. The following are some test results obtained from a major telecom data warehouse environment:

22-Disk Stripe Set, 4-Disk Stripe Set, and No Striping
Table 1:revenue data, 150MB, 1 million rows
Table 2:revenue data, 3,500MB, 20 million rows
SQL execution elapsed time
 Striping

 Table 1

 Table 2
 22-Disk  Select count(*)  Parallel 4  144 secs  7 secs
 4-Disk  Select count(*)  Parallel 4  178 secs  7 secs
 NO  Select count(*)  Parallel 4  31 mins  92 secs
 22-Disk  Select sum(amt)  Parallel 4  202 secs  9 secs
 4-Disk  Select sum(amt)  Parallel 4  198 secs  9 secs
 NO
 Select sum(amt)  Parallel 4  19 mins  65 secs
 22-Disk  Select count(*)  Parallel 1  10 mins  29 secs
 4-Disk  Select count(*)  Parallel 1  10 mins  20 secs
 NO  Select count(*)  Parallel 1  19 mins  65 secs
 22-Disk  Select sum(amt)  Parallel 1  15 mins  30 secs
 4-Disk  Select sum(amt)  Parallel 1  13 mins  30 secs
 NO  Select sum(amt)  Parallel 1  23 mins  43 secs
 22-Disk  CTAS  Parallel 4  N/A  N/A
 4-Disk  CTAS  Parallel 4  26 mins  1 mins
 NO
 CTAS  Parallel 4  41 mins  7 mins
 22-Disk  CTAS  Parallel 1  N/A  N/A
 4-Disk  CTAS  Parallel 1  100 mins  5 mins
 NO  CTAS  Parallel 1  154 mins  2 mins
Performance and Availability Comparison
22-Disk Stripe Set 4-Disk Stripe Set

No Striping

 Disk I/O Performance
(Response Time)
 9 mins (Best)  10 mins (Good)

 30 mins (Poor)

 Disk Availability  0  18/22  21/22
 (When 1 disk fails)  1-(22/22)  1-(4/22)  1-(1/22)
It is not uncommon for a data warehouse systems administrator to spend a substantial amount of time each day creating many and various raw devices or logical volumes. This is an area that could certainly be improved. In fact, data warehouse data are mostly partitioned by the time dimension. Data is also produced in a predictable pattern. It is easy to preallocate raw devices based on the data already stored in the Data Warehouse.

With intelligent scripts, raw devices can be estimated and created adequately - daily, weekly, monthly, quarterly, or yearly. We often utilize the RDBMS data dictionary (or system catalog) as a decision support system of the disk space management system.

File System Management
Data warehousing is not only about raw devices but also about cooked devices. Many file systems would be created to store data, other than tables and indexes. The so-called staging area, ftp file system, or loading zone is required as data is extracted from the OLTP system and is transferred to the data warehouse. This extracted data is usually in ASCII flat files and placed in UNIX file systems. Depending on the daily data volume, the staging area is usually 10 to 20 times bigger than the daily volume.

As a data warehouse project involves many different groups, duplicate file system requests from different organizations may waste valuable disk resources. A centralized control should be useful in file system creation.

Network Administration
Data warehouse environments not only store huge amounts of data but also transfer many big files. Thus, the network is easily jammed due to the never-ending data transfer. Typically, data extracted from the business operating system (OLTP) are scheduled to be transferred to the data warehouse during the night. This process can take all of the network bandwidth and last for several hours. Any other jobs relying on the network would virtually shut down.

Watch out for those fail-over features provided by certain hardware vendors. When the network is jammed, as the data warehouse will do, these features false alarm, due to slower response over the network. They often fail over when both primary and secondary machines are running well. Although the fail over is transparent to users, the batch jobs must be rerun and will cost more resources. To reduce the chance of false alarms, some systems administrators can only increase the "waiting period" or "response interval" and virtually shutdown the fail over feature.

Backup and Recovery
Data warehouse backup and recovery is a tough job. The systems administrator is in charge of OS level backup and recovery. It must be done with full and incremental backup. The DBA is responsible for RDBMS level backup and recovery. As many UNIX RDBMS vendors are struggling to make their backup-and-recovery systems match those of the mainframe system, the DBA has a hard time building his/her backup-and-recovery system.

---

For more information, see http://www.metaedge.com


[ PREVIOUS ARTICLE | Table of Contents | NEXT ARTICLE ]