THE PARADOX OF WAREHOUSE PATTERNS
by Dr. Kamran Parsaye, Information Discovery, Inc.


Introduction

As I showed in "Data Mining Beside a Warehouse" in the previous issue of DS*, while a data warehouse is a natural place for storing "data", a data mine is the natural place for performing influence-related analyses. In this article I show how mining beside a warehouse avoids a phenomenon that I have called the paradox of warehouse patterns.

To understand the paradox, let us note that the concepts of "large warehouse" and "useful pattern" often interact in a seemingly contradictory way. On one hand, the larger a warehouse, the richer its pattern content, i.e. as the warehouse grows, it includes more patterns. On the other hand, after a point, if we analyze "too large" a portion of a warehouse, patterns from different data segments begin to dilute each other and the number of useful patterns begins to decrease! So the paradox may be stated as follows: "The more data in the warehouse, the more patterns there are, and after a point the more data we analyze the fewer patterns we find!"

A few simple examples clarify this further. First, consider a vehicle warranty database. In order to find patterns for customer claims it is essential to store details of each claim in a large data warehouse. But does it make sense to analyze all of the warehouse at the same time? Does it make sense to ask: "what causes brake problems?" No. In practice, cars are built at different plants and different models of cars use different parts -- and some parts are now discontinued. Moreover, over the course of years the parts used in cars change, so analyzing the entire warehouse may tell us less than analyzing part of it. What works best in practice is to analyze the claims for a given model year for cars built at a given plant -- a segmentation task. Here the paradox of warehouse patterns comes into play in that by analyzing all of the warehouse at once we reduce the number of useful patterns we are likely to get!

As another example, consider a large data warehouse that includes details of a bank's customer accounts, marketing promotions, etc. There can be several business objectives for mining this data, including campaign analysis, customer retention, profitability, risk assessment, etc. To begin with, these are distinct business tasks and it does not make sense to mix the analyses -- hence each of the data mining exercises needs to be performed separately, and will require different data structures as well, because some are association analyses, some are clusterings, etc.

However, even the campaign analysis task itself should often not be performed on the entire warehouse. The bank may have undertaken 30 different marketing campaigns over the years, and these campaigns will usually have involved different products and gone to different customer segments -- some of the products are even discontinued now. To understand who responds best to marketing promotions, we need to analyze each campaign (or group of campaigns) separately because each case will involve patterns with distinct signatures. Mixing the analyses into one data mining exercise will simply dilute the differences between these signatures. And the campaigns are often different enough that mixing them may simply not make sense. So we need to have a separate "Analysis Session" for each group of campaigns.

To demonstrate this with a simple example, let us assume that those customers who are over 40 years old and have more than 2 children have a high response rate to credit card promotions. Now, let us also assume that customers who are less than 40 years old and have only 1 child are good prospects for new checking accounts. If we combine these campaigns within the same data mining study and simply look for customers who have a high response rate, these two patterns will dilute each other.
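The dilution effect can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the records, the response counts (80% and 10%) and the function names are invented for the example and do not come from any real campaign data.

```python
from collections import defaultdict

# Hypothetical toy records: (campaign, age_over_40, children, responded).
# All counts below are invented purely for illustration.
def make_records():
    records = []
    # Credit card campaign: older customers with more than 2 children respond well.
    records += [("credit_card", True, 3, True)] * 80
    records += [("credit_card", True, 3, False)] * 20
    records += [("credit_card", False, 1, True)] * 10
    records += [("credit_card", False, 1, False)] * 90
    # Checking account campaign: younger customers with 1 child respond well.
    records += [("checking", False, 1, True)] * 80
    records += [("checking", False, 1, False)] * 20
    records += [("checking", True, 3, True)] * 10
    records += [("checking", True, 3, False)] * 90
    return records

def response_rates(records, key):
    """Return the response rate per group, where key maps a record to its group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        hits[group] += rec[3]  # True counts as 1, False as 0
    return {group: hits[group] / totals[group] for group in totals}

records = make_records()

# Combined study: group by customer profile only, ignoring the campaign.
combined = response_rates(records, key=lambda r: (r[1], r[2]))

# Segmented study: group by campaign first, then by customer profile.
segmented = response_rates(records, key=lambda r: (r[0], r[1], r[2]))

for group, rate in sorted(combined.items()):
    print("combined ", group, rate)
for group, rate in sorted(segmented.items()):
    print("segmented", group, rate)
```

In this toy data both customer profiles show the same 45% response rate in the combined study, so neither stands out, while the per-campaign view shows 80% against 10%.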

Of course, we can get a rule that separates these campaigns and still displays the patterns. Thus, the smaller patterns may be found in the warehouse if we are prepared to accept large amounts of conditional segment information, e.g. "If Campaign = C12 and ... Then ...". However, in a large warehouse there are so many of these rules that the user will be overloaded with them. The best way is to analyze each group of campaigns separately.

The need for segmentation is even clearer when we consider predictive modeling. When trying to predict the response to a new campaign, it simply does not make sense to base the predictions on all previous campaigns that have ever taken place, but rather on those campaigns which are most similar to the one being considered. For instance, responses to campaigns for a new checking account may have little bearing on responses to campaigns for a new credit card or for refinancing a home. In this case, the paradox of warehouse patterns comes into play in that by considering more data, we lose accuracy. This is, of course, because some of the data will not be relevant to the task we are considering.

But what happens if there are one or two key indicators that are common to all of the campaigns? Will they be lost if we just analyze the campaigns a few at a time? Of course not. If a pattern holds strongly enough in the entire database, it will also hold in the segments. For instance, if people with more than 5 children never respond to campaigns, this fact will also be true in each individual campaign.
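A small sketch makes the reasoning concrete. The overall response rate of a group is a weighted average of its per-campaign rates, so an extreme overall rate such as 0% can only arise if the rate is 0% in every campaign that contains the group. The records and names below are hypothetical and chosen only to illustrate this.

```python
# Hypothetical toy records: (campaign, children, responded).
records = (
    [("C1", 6, False)] * 30
    + [("C1", 2, True)] * 40
    + [("C1", 2, False)] * 60
    + [("C2", 7, False)] * 20
    + [("C2", 1, True)] * 10
    + [("C2", 1, False)] * 90
)

def rate(rows):
    """Fraction of rows that responded, or None if there are no rows."""
    return sum(r[2] for r in rows) / len(rows) if rows else None

# Customers with more than 5 children, across the whole database.
large_family = [r for r in records if r[1] > 5]
overall = rate(large_family)

# The same group, looked at one campaign at a time.
per_campaign = {
    campaign: rate([r for r in large_family if r[0] == campaign])
    for campaign in sorted({r[0] for r in records})
}

print("overall:", overall)
print("per campaign:", per_campaign)
```

Weaker patterns carry no such guarantee in every segment, which is why the argument is stated for patterns that hold "strongly enough".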

Hence, most of the time it does not make sense to analyze all of a large warehouse because patterns are lost through dilution. To find useful patterns in a large warehouse, we usually have to select a segment (and not a sample) of data that fits a business objective, prepare it for analysis and then perform data mining. Looking at all of the data at once often hides the patterns, because the factors that apply to distinct business objectives often dilute each other. Hence, the thirst for information can go unquenched by looking at too much data.

The Concept of an Analysis Session

When using a datamine, we bring a segment (and not a sample) of data from a warehouse (or other sources) to the datamine and perform discovery or prediction. The process of mining this data segment is called an Analysis Session. For example, we may want to predict the response to a proposed direct mail campaign by analyzing previous campaigns for similar products, or we may want to know how customer retention has varied over various geographic regions, etc.
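The difference between a segment and a sample can be sketched as follows, using the warranty example. This is a minimal illustration: the claim records, the plant names and the helper functions `select_segment` and `select_sample` are all assumptions made for the example.

```python
import random

# Hypothetical warranty claims: (model_year, plant, part).
claims = [
    (year, plant, part)
    for year in (1994, 1995, 1996)
    for plant in ("plant_a", "plant_b")
    for part in ("brakes", "engine", "electrical")
    for _ in range(50)
]

def select_segment(rows, model_year, plant):
    """A segment: every row that matches the business criteria."""
    return [r for r in rows if r[0] == model_year and r[1] == plant]

def select_sample(rows, size, seed=0):
    """A sample: a random subset drawn from all rows, regardless of criteria."""
    return random.Random(seed).sample(rows, size)

segment = select_segment(claims, model_year=1995, plant="plant_a")
sample = select_sample(claims, size=len(segment))

# The segment is homogeneous with respect to the business objective;
# the sample mixes model years and plants in roughly their overall proportions.
print(len(segment), {(r[0], r[1]) for r in segment})
print(len(sample), len({(r[0], r[1]) for r in sample}))
```

Both subsets have the same size, but only the segment isolates the claims for one model year at one plant, which is what the analysis session needs.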

An analysis session may be either "structured" or "unstructured". A structured session is a more formal activity in which we set out with a specific task, e.g. analyzing profitability by customer segments and/or products. In fact, structured sessions are often performed in a routine manner, e.g. we may analyze costs, revenues or expenses every quarter, and understand the reasons for the trends. Or we may routinely perform forecasting for various items such as product demand in various markets. Or, we may look for unusual transactions that have taken place in the past 30 days. In fact, a structured analysis session usually takes one of three forms: a discovery, prediction or forensic analysis activity in which we perform a specific task.

An unstructured session is a "wild-ride" through the database, where the user wanders around without a goal, hoping to uncover something of interest by serendipity -- or with help from an "exploration-agent". This type of abstract wild-ride usually uncovers some very wild facts hidden in the data. And the mine is a natural place for this activity because the unexpected nature of queries may interfere with the more routine tasks for which the warehouse was designed, e.g. looking up the history of a specific claim.

The data in the datamine often needs to be enriched with aggregations. Again, let me emphasize that these are not just summaries, but additional elements added to the data. How these aggregations are built is partly decided by a business analysis. For instance, we may need to look at the number of credit cards a customer has as an item. And we may want to look at the "volume" of transactions the customer has had. We may also want to look at the number of claims a customer has had in an insurance setting, etc. These aggregations enrich the data and co-exist with the atomic level data in the mine.
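As a rough sketch of such enrichment, the Python below computes two per-customer aggregations, the number of credit cards and the transaction volume, and attaches them to each atomic-level row. The transaction rows, field names and the `build_aggregations` helper are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical atomic-level rows: (customer_id, card_id, amount).
transactions = [
    ("cust1", "cardA", 120.0),
    ("cust1", "cardB", 30.0),
    ("cust1", "cardA", 50.0),
    ("cust2", "cardC", 200.0),
]

def build_aggregations(rows):
    """Compute per-customer aggregations from atomic transaction rows."""
    cards = defaultdict(set)
    volume = defaultdict(float)
    for customer, card, amount in rows:
        cards[customer].add(card)    # distinct credit cards per customer
        volume[customer] += amount   # total transaction volume per customer
    return {
        customer: {"num_cards": len(cards[customer]), "volume": volume[customer]}
        for customer in cards
    }

aggregations = build_aggregations(transactions)

# Enriched rows keep every atomic field and add the aggregated elements,
# so the aggregations co-exist with the atomic data instead of replacing it.
enriched = [
    {"customer": customer, "card": card, "amount": amount, **aggregations[customer]}
    for customer, card, amount in transactions
]

for row in enriched:
    print(row)
```

Which aggregations to build remains a business decision; the code only shows that they sit alongside the detail rows rather than summarizing them away.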

References:
Parsaye, K., "Data Mining Beside a Warehouse". DS*.
Parsaye, K., "New Realms of Analysis". Database Programming & Design, April 1996.
"Large Scale Data Mining in Parallel". DBMS Magazine, March 1995.
Parsaye, K., Chignell, M.H., "Intelligent Database Tools and Applications". New York: John Wiley and Sons, 1993.
Parsaye, K., "Data Mines for Data Warehouses". Database Programming & Design, September 1996.
Parsaye, K., "OLAP and Data Mining: Bridging the Gap". Database Programming & Design, February 1997.
Parsaye, K., "Machine-Man Interaction". DM Review, September 1997.


For more information, see http://www.datamining.com