THE DATA WAREHOUSE VS THE PATTERN WAREHOUSE: PART I
by Kamran Parsaye, Information Discovery, Inc.
Introduction
The concept of a data warehouse was championed in the 1980s as a repository for corporate data elements. The idea was to create a central storage facility where everyone in the corporation could go and get "data" on demand, whenever they needed it. The central repository would also help increase corporate data quality and consistency because everyone obtained data from a single source. This idea has now achieved worldwide acceptance, and almost every Fortune 500 company has several data warehouse projects.
In the 1990s it became clear that data in warehouses are often too coarse and unmanageable for detailed decision making -- business users needed much more refined knowledge. Moreover, most organizations realized that what they really wanted was the knowledge, trends and patterns within the data -- not the data itself. The concept of "data mining" hence gained momentum, and the need for knowledge extraction from data became widely accepted. Business users expected to get refined knowledge, not data.
We can view the progress of the field over the last 30 years in terms of a series of steps, each providing better and more refined information.
Once data warehouses were subjected to data mining, business users encountered three immediate issues. First, most business users found the technical details of the data mining task more than they had bargained for. Second, piecemeal and fragmented analyses on the central warehouse began to give inconsistent results -- 10 business users could get 8 different answers from the same data, depending on how they approached it. Third, the response time for follow-up analyses from a large warehouse and the need for analyst intermediaries would often slow down the process of knowledge extraction. Using a Pattern Warehouse(TM) as an adjunct to the data warehouse solves all of these problems -- and provides additional benefits.
Data vs Patterns
Data is rough, knowledge is refined. Business users often want the refined knowledge, not the data. As a simple analogy, consider data as grapes and knowledge as wine -- data mining is then like the winemaking process. While a data warehouse is a storage facility for grapes, a Pattern Warehouse is like a wine cellar. Data mining tools are then like winemaking equipment. Although users can make their own wine by getting grapes from the warehouse, this takes both time and know-how -- and naturally most business users prefer bottled wine. Note that with a Pattern Warehouse, data mining still takes place behind the scenes, but the business user is unaware of it.
A Pattern Warehouse is a repository that holds historical patterns rather than historical data. With a Pattern Warehouse, almost all the relevant patterns in the data are found beforehand and stored for use by business users such as marketing analysts, bank branch managers, store managers, etc. Business users receive the interesting patterns of change every week or month, or can query the Pattern Warehouse at will.
Because of disk space limitations, many organizations only store 12 to 18 months’ worth of historical data -- and in some cases there are so many transactions that data for only a few months is actually available. However, because knowledge is so much more compact than data, the Pattern Warehouse is only a fraction of the size of the data warehouse, allowing the patterns of many years to be stored with ease, even when the data is no longer available.
To get a perspective on the time and space scales, consider an example where the recent operational data refers to one month of a bank’s customer information, while the historical data in the data warehouse goes back one year. However, the historical patterns in the Pattern Warehouse may go back as far as 5 or 10 years and still be a small fraction of the size of the data warehouse. This provides a huge amount of knowledge over time at a low cost for disk space -- and response time is far better than with the data warehouse because the patterns have already been extracted, ready for look-up. The result is an environment for long-term corporate knowledge management.
Data Analysis vs Pattern Storage
To distill information from a database we obviously need to perform analysis at some time. The key question is: "When?" In other words, does the analysis take place at the time the user needs the knowledge, or is it done beforehand, with the knowledge ready to access? Traditionally, data mining analyses were performed upon user request. The knowledge access paradigm rescues users from delayed analyses by pre-mining refined knowledge. Hence there are two distinct paradigms for empowering users with knowledge:
The Data Analysis Paradigm: in which users operate on data to discover information. This paradigm relies on the "analysis on demand" approach, i.e. when a user wants knowledge, analysis is performed.
The Knowledge Access Paradigm: in which the analysis is automatically done beforehand, refined patterns are pre-generated and users just get knowledge when needed, i.e. the "knowledge on demand" approach.
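The difference between the two paradigms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transaction records, the function names, and the use of monthly totals as the "pattern" are all hypothetical stand-ins, and real mined patterns would be far richer than a sum.

```python
from collections import defaultdict

# Hypothetical raw data: (month, branch, amount). Not from any real system.
TRANSACTIONS = [
    ("2024-01", "north", 120.0),
    ("2024-01", "south", 80.0),
    ("2024-02", "north", 150.0),
    ("2024-02", "south", 60.0),
]


def analyze_on_demand(transactions, branch):
    """Data analysis paradigm: scan the raw data every time a user asks."""
    totals = defaultdict(float)
    for month, row_branch, amount in transactions:
        if row_branch == branch:
            totals[month] += amount
    return dict(totals)


def build_pattern_store(transactions):
    """Knowledge access paradigm: run the analysis once, ahead of time."""
    store = defaultdict(lambda: defaultdict(float))
    for month, branch, amount in transactions:
        store[branch][month] += amount
    # Freeze into plain dicts so later look-ups cannot alter the store.
    return {branch: dict(months) for branch, months in store.items()}


def lookup(store, branch):
    """A user request is a dictionary look-up; no raw data is touched."""
    return store.get(branch, {})


# Computed once, then shared by every user.
PATTERN_STORE = build_pattern_store(TRANSACTIONS)
```

Both routes return the same answer for a given branch. The difference is when the work happens: `analyze_on_demand` repeats the scan for each request, while `lookup` reuses a result computed once for all users.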
The knowledge access paradigm provides a multitude of benefits to the business user:
Easy to Use, yet Powerful: Business users without technical know-how can access knowledge without training -- they just click on a graphical user interface from within a web browser. And the knowledge access approach is more powerful because multiple types of patterns are automatically merged to answer serious questions. With the analysis paradigm, business users inevitably rely on simple models and cannot deal with complex situations on their own.
Fast Response and Overall Efficiency: When a user requests knowledge, no analysis is needed and follow-up questions are answered quickly, without delay. Data mining on a very large database may take time, but pattern look-up is fast. And, because patterns are not re-computed each time for each user, the overall system efficiency is much higher. Computations take place once, and users access the refined knowledge again and again with ease.
Accuracy and Quality: Because sampling and extract files are avoided, the discovered patterns correspond to the entire database and have high accuracy, resulting in better decisions. And, because patterns are stored in a single repository, all users get similar answers, rather than relying on fragmented analyses. This is in contrast to the data analysis paradigm where different users may draw different conclusions from the same data.
Condensed Information: Because of disk space limitations, many organizations only store 12 or 24 months’ worth of historical data. However, because knowledge is so much more compact than data, the Pattern Warehouse is only a fraction of the size of the database, allowing many years’ worth of patterns to be stored with ease, even when the data is no longer available.
Up-to-Date Knowledge: Because the Pattern Warehouse is incrementally updated, recent patterns are always available. With the data analysis paradigm, there is usually not enough time to continuously analyze new data and often users are forced to rely on out-of-date analyses.
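As a rough sketch of how incremental updating and condensed storage can go together, the Python below keeps one small running summary per customer segment and folds each new batch into it. The names `RunningStat` and `update_patterns`, the segments, and the choice of a running mean as the stored "pattern" are assumptions made for illustration; the article does not describe how the Pattern Warehouse is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class RunningStat:
    """Compact summary of all values seen so far (hypothetical pattern record)."""
    count: int = 0
    total: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value

    @property
    def mean(self) -> float:
        # Guard against division by zero for an empty summary.
        return self.total / self.count if self.count else 0.0


def update_patterns(store, batch):
    """Fold a new batch of (segment, amount) rows into the existing summaries.

    Only the new rows are read; earlier raw data is not needed again.
    """
    for segment, amount in batch:
        store.setdefault(segment, RunningStat()).add(amount)
    return store


store = {}
# First period's data arrives and is summarized.
update_patterns(store, [("retail", 100.0), ("retail", 200.0)])
# Next period: only the new rows are processed.
update_patterns(store, [("retail", 300.0), ("corporate", 50.0)])
```

After both batches, each segment's summary reflects every row ever seen, yet the store holds just two numbers per segment. That is why, in a design like this, the raw rows could be archived or discarded while the summaries stay current.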
The knowledge access paradigm is a truly revolutionary idea with a multitude of business and technical benefits that reinforce each other. It avoids the possibility of 100 users getting 100 different answers from the same data, because now corporate knowledge is centralized. To get a sense of the benefits, consider the following comparison.
The Way it Was Without a Pattern Warehouse
Users had to manipulate raw data to find patterns with substantial effort.
Users needed training, or had to rely on analysts for pattern discovery.
Pattern analysis was ad hoc and fragmented; results often varied from user to user.
Analytical reports were cryptic and hard to understand, often with no explanations.
Turn-around time for follow-up questions was long after a first analysis.
Analysis was performed on extract files. Patterns were missed or were unreliable.
The Way it Is With a Pattern Warehouse
Patterns are found beforehand, users just access them with great ease. No training.
Users just click on a graphical user interface for pattern query.
Patterns are stored in a central repository. All business users get uniform answers.
Reports are in plain English text with graphs automatically generated on the intranet.
Turn-around is instant because patterns are just looked up, not re-computed each time.
The entire database is analyzed. Powerful and accurate patterns are found.
---
Part II of this commentary will appear in the next issue of DS*.
---
For more information, see http://www.datamining.com/.