DATA WAREHOUSING: AN OVERVIEW
Data warehousing is the technology trend most often associated with enterprise computing today. The term conjures up images of vast data banks fed from systems all over the globe, with legions of corporate analysts mining them for golden nuggets of information that will make their companies more profitable.
All of the developments in database technology over the past 20 years have culminated in the data warehouse. Entity-relationship modeling, heuristic searches, mass data storage, neural networks, multiprocessing, and natural-language interfaces have all found their niches in the data warehouse. But aside from being a database engineer's dream, what practical benefits does a data warehouse offer the enterprise?
When asked, corporate executives often say that having a data warehouse gives them a competitive advantage, because it gives them a better understanding of their data and a better understanding of their business in relation to their competitors, and it lets them provide better customer service.
So what exactly is a data warehouse? Should your company have one, and if so, what should it look like?
Essentially, a data warehouse provides historical data for decisionsupport applications. Such applications include reporting, online analytical processing (OLAP), executive information systems (EIS), and data mining.
According to W.H. Inmon, the man who originally came up with the term, a data warehouse is a centralized, integrated repository of information. Here integrated means cleaned up, merged, and redesigned. This may be more or less complicated depending on how many systems feed into a warehouse and how widely they differ in handling similar information.
But most companies already have repositories of information in their production systems, and many of them are centralized. Aren't these data warehouses? Not really.
Data warehouses differ from production databases, or online transactionprocessing (OLTP) systems, in their purpose and design. An OLTP system is designed and optimized for data entry and updates, whereas a data warehouse is optimized for data retrieval and reporting, and it is usually a read-only system. An OLTP system contains data needed for running the day-to-day operations of a business, but a data warehouse contains data used for analyzing the business. The data in an OLTP system is current and highly volatile, with data elements that may be incomplete or unknown at the time of entry. A data warehouse contains historical, nonvolatile data that has been adjusted for transaction errors. Finally, since their purposes are so different, OLTP systems and data warehouses use different data-modeling strategies. Redundancy is almost nonexistent in OLTP systems, since redundant data complicates updates. So OLTP systems are highly normalized and are usually based on a relational model. But redundancy is desirable in a data warehouse, since it simplifies user access and enhances performance by minimizing the number of tables that have to be joined. Some data warehouses don't use a relational model at all, preferring a multidimensional design instead.
Types of Data Warehouses
The type of data warehouse an organization adopts should depend on the way the business operates and the types of decision support it needs.
One of the simplest types of data warehouse, an operational data store (ODS) is a replicated production database that has been adjusted for errors. An ODS is used primarily to generate standard operations reports and to provide transaction detail for summary-level analysis. (Since an ODS replicates an OLTP system, some experts don't consider it a true data warehouse type. We've included it in our discussion, however, because ODSs fit the broad definition and many data warehouses contain them.)
Depending on an organization's reporting needs, an ODS may be updated monthly, weekly, or more frequently, sometimes almost in real time. Its main advantage is that it enhances production system performance, since reporting and query functions are off-loaded from the OLTP system to the ODS.
If your organization just generates operational reports and does little analysis or marketing research, an ODS may well suit your needs. On the other hand, if your company handles relatively few transactions per day, an ODS may be overkill. You can use the production system to generate reports instead.
Another type of data warehouse is the data mart. Data marts are limited in scope, usually taking their information from a single department or business process. They may be used for analyzing sales information in a specific region or for a particular product line, for example. Data marts usually contain only summary data, but they can be linked to operational data stores for drilling down to transaction details if necessary. They are sometimes managed by IT departments, but just as often they are managed directly by users in a department or work group.
While many OLAP applications can be performed on data marts, crossdepartmental analysis, executive information systems, and data-mining applications need information gathered from the entire enterprise to be most effective. The enterprise data warehouse is used for this type of extensive data collection and analysis. Because of its scope and complexity, the enterprise data warehouse is usually managed by the central IT group.
As its name implies, an enterprise data warehouse contains information taken from throughout an organization. This is the most complex type of warehouse to build and maintain, since data must be merged from multiple systems into common subject areas.
Different systems often produce incompatible or inconsistent data, and sometimes the data must undergo several transformations before it can be integrated into a data warehouse in a meaningful way. Inmon estimated that 80 percent of the time, building such a data warehouse would be spent on extracting, cleaning, and loading data.
If your organization can benefit from data mining, building an enterprise data warehouse may be well worth the effort. Data-mining tools work with various statistical techniques for modeling data and for estimating and predicting outcomes based on what they have learned. They work best with large data sets.
There are many ways to build an enterprise data warehouse. One form uses a centralized approach, combining all the enterprise's data in a single, large data store. Another form takes a distributed approach, pulling data from multiple data marts.
Data Warehouse Components
Although a data warehouse sounds like a single entity, it is really a multitiered, multiapplication conglomerate that comprises several components. Each component may be handled by one or more pieces of hardware or software. No vendor has a complete data warehouse package.
Functionally, a data warehouse extracts data from operational systems and loads it into a holding area where it is "scrubbed" (that is, made to conform with warehouse standards), merged, time-stamped or dated in some way, and loaded into databases for use by data access tools.
Since data goes through a number of transformations and is ultimately placed in data structures different from the ones it came from, those changes are mapped in catalogs or dictionaries. Such catalogs are managed with metadata tools. Data that defines or describes data in the warehouse is called metadata. There are typically two kinds of metadata. Things that users need to know, such as table and column names and definitions, are called frontend metadata. Everything else, such as how a particular data element maps to its original database, is backend metadata.
Design and management tools are also important components of a data warehouse. Even though it contains nonvolatile data, a data warehouse is far from static; you can't just build one and forget about it. A data warehouse is a high-maintenance system that expands as the company's demands on it increase. Even if the design of the warehouse itself changes little, the designs of the production systems that feed it are likely to change. These changes must be captured in the data warehouse metadata. Unlike an OLTP system, which gives a snapshot of the present, a data warehouse is meant to provide a complete picture of the data over time. This means that whenever any data structure in a production system changes, you have to capture both the change and when it occurred in the warehouse; otherwise queries against historical data may yield peculiar results.
Practical Considerations
Time and money. In a 1996 study published by IDC ("A Study of the Financial Impact of Data Warehousing," IDC, 1996), the average cost of building a data warehouse was $2.2 million, with an average time of 2.3 years to break even. Ninety percent of the companies in the study achieved greater than 40 percent return on investment (ROI), and 50 percent achieved over 160 percent ROI. The average ROI over three years, cumulative, was about 400 percent, with a higher ROI for data marts. Clearly, building a data warehouse can be profitable, but it's not a quick hit. Your company should be aware of the amount of investment likely before any payback can be expected.
Space. Data warehouses require lots of disk space. When estimating how much storage space you'll need, don't look at just the current production systems. Remember that you'll be building a system of historical data. Most organizations will want to keep at least a year's worth of data, and if you want to do any kind of trend analysis, you will need several years. Also, reports and analysis usually need more than one index. Be generous in your estimates of disk space. Data warehouses measured in terabytes are not unusual, and some large organizations are even using petabytes. (A petabyte equals 1,024 terabytes.)
Consolidation. Combining data from multiple sources may reveal incompatibilities or problems with your OLTP systems. Consistency is especially important for data-mining applications, since most datamining tools teach themselves how to analyze the data by looking at it. You may need either to do extensive data scrubbing or to fix the OLTP systems. Talk to users about their data needs, and decide whether the result justifies the cost.
Security. Security considerations for data warehouses are different from those for OLTP systems. For a data warehouse to pay for itself, lots of users have to be able to benefit from it, and therefore more users will need access to data than are traditionally authorized by OLTP security. According to experts, a "right to know" mind-set must prevail over the "need to know" philosophy, which would restrict access to data warehouses and minimize their effectiveness. If your organization can't make that cultural change, a data warehouse may be a wasted investment.
User-friendliness. Data warehouses have to be user-friendly. Since a data warehouse is not a production system, users don't have to use it. And the fastest, most reliable system won't be used if the user doesn't understand it or finds it cumbersome to work with.
Project planning. A data warehouse sometimes takes years to implement. Before embarking on a data-warehousing project, determine what your business objectives for it are, what the potential costs are versus the benefits, what resources you'll need, and how much organizational commitment you'll need. Without good planning and user buy-in, the project is not likely to be successful.
As more and more corporations come to appreciate that the information they gather each day is an asset, they will rely more and more on data warehousing. But while a data warehouse can provide managers with the means to ask questions of their data and get back meaningful answers, it can't automatically make a company more profitable. A good technology can't substitute for good management. Whether a data warehouse becomes a valuable strategic tool or an expensive white elephant depends as much on the organization using it as on the technology.