THE DATA WAREHOUSE DATABASE EXPLOSION, PART I
by Sid Adelman
Data Warehouse Explosion
At a recent data warehouse (DW) conference, the discussion centered on the incredible explosion of the size of DW databases. Many companies are already implementing in the terabyte (1000 gigabytes) range and it's almost a macho boast that "my DW database is bigger than yours." There seem to be few who question the unbridled growth or ask if all that data is really needed, justified or if the data will even be accessed.
Conferences of Very Large Databases (VLDB) and tracks on VLDB at DW seminars attract speakers from vendors who promote the ability of their products to handle multiple terabytes. Attendees busily take notes and collect brochures that will help them with their technical evaluations of the products. These conferences provide little to help the attendees determine if their databases should be growing so large and so fast. The focus is on how to establish the VLDBs, not on how to control the growth or whether the VLDBs should be created at all.
There is the old joke of God telling a skeptical Moses about the commandments. When Moses asked the price, God told him they were free. Moses then responded, "I'll take ten." In the same way, when there is no cost to the users of storing data in the DW, when they are asked what data they want, they respond, "I want all the data." In most DWs today, the users are not charged for their use of the data or for the data they requested to be in the DW.
Users have a requirement to perform trend analysis and so historical data must be maintained. The question then is "How much data should be stored?" If it costs the users the same to keep five years of data as it does to keep two years, the users will ask for five. However, when they are charged the cost of the additional data, users will more carefully weigh the benefits and not automatically ask for the additional years.
The frequency of keeping historical data is also determined by the user. It may be necessary (or cost effective) to keep daily or weekly data but less frequent capture may be adequate. Keeping data weekly or monthly rather than daily or weekly can significantly reduce the data storage requirement.
For certain analysis, instance data is required. For other analyses, summarized data is adequate and instance data may not be necessary. Monitoring usage provides information on which mode of data is accessed.
In many DW implementations, there appears to be little governance of the data being migrated, the frequency of the updates, the storage of instance vs summarized data or any controls on the number of years being stored. Without clear standards the data will proliferate at a rate that will warm the hearts of the hardware vendors.
Reasons for the Explosion
The explosion has many causes. There often exist extract programs which pull the data from existing databases. These programs can be used without much modification to the extract portion of the process to migrate data to the DW. This means that all the data is delivered including data that will never be accessed.
In their haste to get something up and running, the process of accurately determining user requirements is either bypassed or done so rapidly and with such little user orientation and involvement that the analysts feel it is less risky to store all data rather than risk excluding some data that may be needed.
Some users play hard to get. It's difficult to schedule enough time with them to determine their requirements... Some analysts don't know how to ask about which data is required. Analysts are sometimes intimidated by users and bypass the requirements gathering to avoid talking to users. To be on the safe side, data analysts may choose to dump all the data from the operational systems to the DW.
Data may be stored in multiple places because new requesters may not know that the DW data is already there. This often comes from the absence of a populated and maintained repository of metadata or a lack of a methodology and standards that includes research into existing DW data. The responsibility for controlling what goes into the DW may not have been defined or the position may not have been staffed. Even when such a role has been identified and staffed -- usually the role is that of a Data Administrator -- the authority may not have been given to provide stewardship to the contents of the DW.
Another cause of explosive growth is the transformation of codes to values that are more easily understood by the users. For example, the code "M" in the Human Resource database would be transformed to "Male," the procedure 0402 in a doctor's office would be transformed to "Flu immunization." This transformation has resulted from the mistaken notion that the only way the user can see "Flu immunization" is for the words, not just the code to be stored in the database. The transformation need not reside in the data, it can be accomplished readily by many of the front-end query tools. Few organizations have standards that guide analysts in their decision as to where the transformations should be made.
A further cause of the explosion is the replication of the DW in more than one database or replication in a data distribution process. The data may have been legitimately replicated as part of a data mart strategy. The replicated data may be justifiable and well-considered or may be the result of a lack of a data architecture that includes guidelines for replicating data. Data may be replicated because of the availability of software that provides replication (Unused software is a thorn in the side of a purchaser.).
The wrong question may have been asked. "What data do you want?" Because the user is not sure and may feel that if data is not requested now, it may never become available. They will ask for all the data in the operational databases. However, if the question is phrased "Which data do you want first?" the user is more likely to carefully consider the answer and not be so frightened about a partial request foreclosing future data requests. The second phrasing also reinforces the recommended approach of not attempting to deliver all the data in the first phase.
Problems/Costs of the Explosion
Why should an organization spend their valuable time trying to control the size of their databases since it is so inexpensive to store the data? The hardware is getting faster and cheaper. The RDBMS are getting smarter in supporting VLDBs and performance is improving all the time. The answer is that the data explosion does result in increased cost, poor performance, unavailability, unhappy users and greater risk.
While disks are cheap, they still have some cost. The hardware to migrate the data and run the queries increases as the size of the database increases. As costs increase, systems may no longer be cost justified. The knowledge and size of the staff to support the VLDBs also increases as the databases get larger. It takes smart DBAs to design and manage a VLDB. Knowledgeable people are always in short supply and may be unavailable to the DW project.
Large databases almost always cause performance problems. The impact is on queries and on the loads/refreshes/updates. Smaller databases will take less time to load/refresh/update. Long load/refresh/update times may result in the DW being unavailable to the users in a timely fashion. They may not see yesterday's results until 11:00 AM. They wanted the information when they walked in at 8:30 AM.
Depending on the type of query and the indexing scheme, the query will almost always take longer with a large database and users may be unhappy with poor response times. If there is a chargeback process, the users will see much higher bills.
VLDBs with their performance problems, unhappy users, increased cost and availability problems magnify the risk of the project failing.
---
Next week: Part II -- What's an Organization to do?
For more information, see http://www.planxpert.com