DATA WAREHOUSE AND DATA MINING 10.07.97 by W H Inmon D S *
In 1990 the data warehouse appeared, and the world has not been the same since. Prior to data warehouses, computer systems were designed to capture, edit, and store detailed bits of data. These early systems are known as operational or legacy systems. While capturing and storing data efficiently in legacy operational systems is certainly a useful thing to do, accessing and analyzing operational data is hardly easy or optimal. Data ended up in "jail" in the 1960s- and 1970s-style legacy operational systems.
OLDER OPERATIONAL SYSTEMS
There are several important reasons why data was so hard to get to in the operational systems of yesterday. The first is that operational systems are notorious for being unintegrated. There is no uniform understanding of who a customer is, what a product is, what a sale or transaction is, and so forth. Each operational application has its own unique interpretation of the basic data that runs the corporation. Trying to achieve a uniform, consistent view of what is going on in the corporation is impossible with the many operational applications that a company has.
A second reason why operational systems are so inadequate when it comes to providing the corporation with information is that operational systems focus on very immediate data. Operational applications deal with today's bank account balance, today's insurance policy coverage, today's location of a shipment, and so forth. Every operational unit of data is accurate as of the moment the data is accessed and used. But there is another important dimension of data that is not addressed by looking at today's data.
Historical data is equally valuable to the corporation for many purposes, but operational systems do not address the capture and storage of historical data. And why is it that corporations need historical data? There are actually many uses, but one of the most obvious is for the corporation to start to get a handle on the annual rise and fall of business from one quarter to the next. Without historical data - three, four, or five years' worth - corporations simply have no feel for whether one quarter is really any better or worse than expected. A second reason why historical data is so valuable is that consumers are creatures of habit. When it comes to consumption, people form habits and patterns that are kept throughout their lives. Understanding what happened yesterday is a powerful predictor of what is going to happen tomorrow. And the key to understanding what happened yesterday is a carefully preserved collection of detailed historical data.
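The quarter-to-quarter comparison described above can be sketched in a few lines of Python. This is only an illustration: the revenue figures, the function names, and the choice of a simple same-quarter average as the "expected" value are all assumptions made for the example, not anything prescribed by the article.

```python
# Hypothetical quarterly revenue history: (year, quarter) -> revenue.
# All figures are invented for illustration only.
history = {
    (1994, 1): 100.0, (1994, 2): 120.0, (1994, 3): 90.0, (1994, 4): 150.0,
    (1995, 1): 105.0, (1995, 2): 126.0, (1995, 3): 99.0, (1995, 4): 165.0,
    (1996, 1): 110.0, (1996, 2): 138.0, (1996, 3): 108.0, (1996, 4): 180.0,
}


def expected_for_quarter(history, quarter, before_year):
    """Average revenue of the same quarter over all years before `before_year`."""
    values = [revenue for (year, q), revenue in history.items()
              if q == quarter and year < before_year]
    if not values:
        raise ValueError("no history available for this quarter")
    return sum(values) / len(values)


def compare_to_history(history, year, quarter, actual):
    """Fractional difference between the actual figure and the historical average."""
    expected = expected_for_quarter(history, quarter, year)
    return (actual - expected) / expected


# A third quarter of 110.0 looks weak next to the preceding second quarter
# (138.0), but the same-quarter history shows it is above the usual level.
third_quarter_1997 = compare_to_history(history, 1997, 3, 110.0)
```

With only today's figure in hand, the same 110.0 could not be judged either way; the three years of history are what make the comparison possible.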
And there is a third powerful reason why operational systems are so inadequate to meet the information needs of the corporation. Operational systems pay almost exclusive attention to detailed data. Summary data is alien to most operational systems. However, summary data is the very substance management needs to run the corporation. Management does not want to see a mountain of detail in order to make decisions. Yet operational systems are 99% pure detailed data.
DATA WAREHOUSES
The data warehouse addresses the heart of the inadequacies of operational systems. Data warehouses contain integrated data, historical data, and both summary and detailed data.
In short, data warehouses contain the very things that operational systems would need in order to be useful for corporate information. In light of these important architectural features it is no surprise that data warehousing, along with the Internet, is the most important advance in the world of technology in the 1990s.
But building a data warehouse - while it is a most important step - does not guarantee success. Once the warehouse is built, it must still be used and exploited. Data mining is the next logical step in completing the circle of effective decision support systems (DSS). With data mining, business patterns can be discovered, relationships between obscure variables can be examined, and long-term trends can be detected. In short, data mining fulfills the expectations of data warehousing in many regards.
An interesting question that almost immediately arises is whether data mining can be done without building a data warehouse. Does a corporation really have to go to the effort and investment of building a warehouse to start to use data mining technology successfully? The answer is that data mining can be done with no data warehouse at all. But can data mining be done EFFECTIVELY with no data warehouse? When effectiveness is considered, the answer is that data warehousing is absolutely essential for effective data mining.
Why is a data warehouse essential for corporations that are serious about data mining? Data warehouses prepare the raw data for analysis in an optimal manner. This preparation shows up very beneficially in many ways. One of the essences of a data warehouse is that data is integrated as it is placed in the data warehouse. This means that a lot of care is taken to bring uniformity and continuity to the understanding of common corporate objects, such as who is a customer, what is a transaction, and so forth. When the data warehouse has been built first, the data miner can dive into the analysis immediately and can start to achieve results immediately. But if the data miner does not have a data warehouse to operate from, then the miner must spend precious time (lots of precious time!) gathering the data, cleansing and scrubbing the data, integrating the data, and so forth. It will be a long time until the data miner is set to even start the analysis portion of data mining if there is no warehouse.
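To make the integration step concrete, here is a minimal Python sketch of what bringing uniformity to "who is a customer" can involve. The two source systems (a billing application and an order application), their field names, and their encodings are hypothetical, invented purely for the example; a real warehouse load would face many more sources and conflicts.

```python
def from_billing(record):
    """Convert a record from a hypothetical billing application.

    Assumed layout: numeric 'cust_no', 'name' as "LAST, FIRST", 'sex' as m/f.
    """
    last, first = [part.strip() for part in record["name"].split(",", 1)]
    return {
        "customer_id": str(record["cust_no"]).zfill(6),
        "name": f"{first} {last}".title(),
        "gender": record["sex"].upper(),
    }


def from_orders(record):
    """Convert a record from a hypothetical order application.

    Assumed layout: six-character 'customer', free-form 'full_name',
    'gender_code' as 1 (male) or 0 (female).
    """
    return {
        "customer_id": record["customer"],
        "name": " ".join(record["full_name"].split()).title(),
        "gender": {1: "M", 0: "F"}[record["gender_code"]],
    }


def integrate(billing_rows, order_rows):
    """Build one customer table keyed by a uniform customer_id."""
    customers = {}
    for row in billing_rows:
        customer = from_billing(row)
        customers[customer["customer_id"]] = customer
    for row in order_rows:
        customer = from_orders(row)
        # Keep the billing version when both systems know the customer.
        customers.setdefault(customer["customer_id"], customer)
    return customers


billing = [{"cust_no": 42, "name": "SMITH, JANE", "sex": "f"}]
orders = [
    {"customer": "000042", "full_name": "jane  smith", "gender_code": 0},
    {"customer": "000107", "full_name": "Raj Patel", "gender_code": 1},
]
customers = integrate(billing, orders)
```

Here Jane Smith appears in both applications under different keys, name formats, and gender codes, yet ends up as a single customer. This is the kind of work the data miner must do by hand when no warehouse has done it first.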
A second reason why the warehouse sets the stage for success in data mining is that the data warehouse pays close attention to historical data, collecting and organizing it. The data miner needs a wealth of historical data in order to find the patterns and relationships that are of interest to the corporation. If there is no central collection of historical data like the one that exists in the warehouse, then the data miner must go out and find the historical data to operate on. In some cases the data miner can find the historical data. But in other cases the historical data simply does not exist. When there is a data warehouse, the data miner can sit down and immediately start to work on the historical data inside the warehouse. The data miner is a long way from any meaningful analysis when the miner has to first gather and assimilate the data on which mining is to be done.
The third reason why data warehousing opens the door to effective data mining is that the warehouse contains both summary data and detailed data. Unquestionably, the miner needs the detailed data in order to do analysis. But the summary data is most useful in another way. Summary data is most useful at the outset of analysis when the data miner is planning an approach and needs to quickly look over the entire collection of detailed data. When there is a representative sample of different types of summary data, the miner can quickly survey what is and is not in the warehouse. The summary data can save the miner massive fruitless iterations of analysis.
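The relationship between detailed and summary data can be illustrated with a small Python sketch. The sales rows, the column layout, and the `summarize` helper are assumptions made for this example only; in a warehouse the summaries would normally be stored alongside the detail rather than recomputed on demand.

```python
from collections import defaultdict

# Hypothetical detailed sales rows: (region, product, amount).
detail = [
    ("East", "widget", 10.0),
    ("East", "gadget", 15.0),
    ("West", "widget", 20.0),
    ("West", "widget", 5.0),
    ("West", "gadget", 30.0),
]


def summarize(rows, key_index):
    """Roll detail rows up to a row count and total amount per key value."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in rows:
        bucket = totals[row[key_index]]
        bucket["count"] += 1
        bucket["total"] += row[2]
    return dict(totals)


by_region = summarize(detail, 0)   # one summary view of the detail
by_product = summarize(detail, 1)  # a different summary of the same detail
```

A glance at `by_region` and `by_product` tells the miner which regions and products are present and roughly how much data sits behind each, before any pass over the detail itself is planned.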
EFFECTIVENESS IN DATA MINING
So it is true that data mining can be done without a data warehouse. But EFFECTIVE data mining cannot be done without a data warehouse. The investment in the warehouse sets the stage for effective data mining.
---
For more information, see http://www.pine-cone.com