|
Disk Striping and Disk Space Management
Disk striping is a must in data warehousing to improve I/O performance. However, striping
would result in lower data availability. The systems administrator, prior to disk volume
grouping and configuration, has to understand the trade-off and figure out the best
striping set.
The following are some test results obtained from a major telecom data warehouse
environment:
|
| 22-Disk Stripe Set, 4-Disk
Stripe Set, and No Striping |
Table 1:revenue data, 150MB, 1
million rows
Table 2:revenue data, 3,500MB, 20 million rows |
| SQL execution elapsed time |
| Striping |
|
|
Table 1 |
Table 2 |
| 22-Disk |
Select count(*) |
Parallel 4 |
144 secs |
7 secs |
| 4-Disk |
Select count(*) |
Parallel 4 |
178 secs |
7 secs |
| NO |
Select count(*) |
Parallel 4 |
31 mins |
92 secs |
| 22-Disk |
Select sum(amt) |
Parallel 4 |
202 secs |
9 secs |
| 4-Disk |
Select sum(amt) |
Parallel 4 |
198 secs |
9 secs |
NO
|
Select sum(amt) |
Parallel 4 |
19 mins |
65 secs |
| 22-Disk |
Select count(*) |
Parallel 1 |
10 mins |
29 secs |
| 4-Disk |
Select count(*) |
Parallel 1 |
10 mins |
20 secs |
| NO |
Select count(*) |
Parallel 1 |
19 mins |
65 secs |
| 22-Disk |
Select sum(amt) |
Parallel 1 |
15 mins |
30 secs |
| 4-Disk |
Select sum(amt) |
Parallel 1 |
13 mins |
30 secs |
| NO |
Select sum(amt) |
Parallel 1 |
23 mins |
43 secs |
| 22-Disk |
CTAS |
Parallel 4 |
N/A |
N/A |
| 4-Disk |
CTAS |
Parallel 4 |
26 mins |
1 mins |
NO
|
CTAS |
Parallel 4 |
41 mins |
7 mins |
| 22-Disk |
CTAS |
Parallel 1 |
N/A |
N/A |
| 4-Disk |
CTAS |
Parallel 1 |
100 mins |
5 mins |
| NO |
CTAS |
Parallel 1 |
154 mins |
2 mins |
| Performance and Availability
Comparison |
|
22-Disk Stripe Set |
4-Disk Stripe Set |
No Striping |
Disk I/O Performance
(Response Time) |
9 mins (Best) |
10 mins (Good) |
30 mins (Poor) |
| Disk Availability |
0 |
18/22 |
21/22 |
| (When 1 disk fails) |
1-(22/22) |
1-(4/22) |
1-(1/22) |
|
It is not uncommon for a data warehouse systems administrator to spend a
substantial amount of time each day creating many and various raw devices or logical
volumes. This is an area that could certainly be improved. In fact, data warehouse data
are mostly partitioned by the time dimension. Data is also produced in a predictable
pattern. It is easy to preallocate raw devices based on the data already stored in the
Data Warehouse.
With intelligent scripts, raw devices can be estimated and created adequately - daily,
weekly, monthly, quarterly, or yearly. We often utilize the RDBMS data dictionary (or
system catalog) as a decision support system of the disk space management system.
File System Management
Data warehousing is not only about raw devices but also about cooked devices. Many file
systems would be created to store data, other than tables and indexes. The so-called
staging area, ftp file system, or loading zone is required as data is extracted from the
OLTP system and is transferred to the data warehouse. This extracted data is usually in
ASCII flat files and placed in UNIX file systems. Depending on the daily data volume, the
staging area is usually 10 to 20 times bigger than the daily volume.
As a data warehouse project involves many different groups, duplicate file system requests
from different organizations may waste valuable disk resources. A centralized control
should be useful in file system creation.
Network Administration
Data warehouse environments not only store huge amounts of data but also transfer many big
files. Thus, the network is easily jammed due to the never-ending data transfer.
Typically, data extracted from the business operating system (OLTP) are scheduled to be
transferred to the data warehouse during the night. This process can take all of the
network bandwidth and last for several hours. Any other jobs relying on the network would
virtually shut down.
Watch out for those fail-over features provided by certain hardware vendors. When the
network is jammed, as the data warehouse will do, these features false alarm, due to
slower response over the network. They often fail over when both primary and secondary
machines are running well. Although the fail over is transparent to users, the batch jobs
must be rerun and will cost more resources. To reduce the chance of false alarms, some
systems administrators can only increase the "waiting period" or "response
interval" and virtually shutdown the fail over feature.
Backup and Recovery
Data warehouse backup and recovery is a tough job. The systems administrator is in charge
of OS level backup and recovery. It must be done with full and incremental backup. The DBA
is responsible for RDBMS level backup and recovery. As many UNIX RDBMS vendors are
struggling to make their backup-and-recovery systems match those of the mainframe system,
the DBA has a hard time building his/her backup-and-recovery system.
|