SMALL DATA, SMALL KNOWLEDGE:
THE PITFALLS OF SAMPLING AND SUMMARIZATION
by Dr. Kamran Parsaye, Information Discovery, Inc.
Introduction
When it is too daunting a task to look a large data warehouse straight in the eye, it is tempting to obtain a smaller "sample" of the data for analysis. While sampling may seem to offer a shortcut to data analysis, the end results are often less than desirable. Reluctance to look at the whole data set is often more expensive in the long term, because it yields lower-quality information.
In the previous two articles in DS* I described how data mining could take place beside a warehouse and how the paradox of warehouse patterns suggested segmentation. But we must be very careful not to confuse segmentation with sampling. I like intelligent segmentation for a data mine, but not sampling. Segmentation is an inherently different task from sampling. As we segment, we deliberately focus on a subset of the data (e.g. select one model year for a car, or select one marketing campaign), sharpening the focus of the analysis. But when we sample data, we lose information, because we throw away data without knowing what we keep and what we ignore.
In this article I discuss why sampling should be avoided in data mining. The heart of the argument is simple: "The data is there, so let us use it." The need for sampling a data warehouse usually comes from the fact that the algorithms used by a data mining system are not sufficiently mature to deal with the whole data set. Once the developers of the data mining system have had sufficient time to work on their algorithms, sampling is no longer necessary. This is not a future goal; it is possible now.
The Data is There, Use it
First, let us remember where sampling came from. Sampling was used within statistics in the real world because it was so difficult to have access to an entire population, e.g. one could not interview a million people, or one could not have access to a million manufactured components. Hence, sampling methods were developed to allow us to make some "rough calculations" without access to the entire population.
But does sampling not fly in the face of having a large database at all? Of course it does. We build databases of the behavior of one million customers exactly in order to have access to the entire population. Otherwise, we could just keep track of a small group of customers. The hardware technology for storing and analyzing large datasets provides an unprecedented opportunity for looking at historical patterns by making more data than ever before accessible for analysis.
At times, when we have a 100,000,000 record retail database, it may be suggested that a 100,000 record sample may be good enough. This is not so. Sampling will almost always result in a loss of information, in particular with respect to data fields with a large number of non-numeric values.
It is easy to see why this is the case. Consider a warehouse of 1,000 products and 500 stores. There are half a million combinations of how a product sells in each store. However, how one product sells in a store is of little interest compared to how products "sell together" in each store -- a problem known as Market Basket Analysis, e.g. how often do potato chips and beer sell together. There are 500 million possible combinations here (1,000 x 1,000 product pairs in each of 500 stores), and a 100,000 record sample can barely manage to scratch the surface. The sample will therefore be a really "rough" representation of the data, and will ignore key pieces of information. In using a small sample, one may as well ignore the product column! In effect we no longer have a large database, since we have reduced it by removing fields from it. Sampling a large warehouse for analysis thus almost defeats the purpose of having all the data there in the first place!
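The arithmetic behind this example can be checked with a short sketch. The variable names are illustrative only, and counting ordered product pairs (1,000 x 1,000) is an assumption made to reproduce the 500 million figure; counting each distinct pair once gives roughly half as many cells, and the conclusion is the same.

```python
import math

products = 1_000
stores = 500
sample_records = 100_000

# One cell per (product, store) combination.
product_store_cells = products * stores

# One cell per (product, product, store) combination, counting ordered pairs
# (an assumption chosen to match the 500 million figure in the text).
pair_store_cells = products * products * stores

# Alternative count: each distinct pair of different products once per store.
distinct_pair_store_cells = math.comb(products, 2) * stores

# Best case: every sampled record lands on a different cell.
max_coverage = sample_records / pair_store_cells

print(f"{product_store_cells:,}")        # 500,000
print(f"{pair_store_cells:,}")           # 500,000,000
print(f"{distinct_pair_store_cells:,}")  # 249,750,000
print(f"{max_coverage:.4%}")             # 0.0200%
```

Even under the most favorable assumption, the sample can touch at most two hundredths of one percent of the possible combinations.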
Apart from sampling, summarization may be used to reduce data sizes. But summarization can cause problems too. Summarizing two different data sets with the same method may produce the same result, and summarizing the same data set with two different methods may produce two different results. As another intuitive example of how "information loss" and "information distortion" can take place through summarization, consider a retail warehouse where Monday to Friday sales are exceptionally low for some stores, while weekend sales are exceptionally high for others. The summarization of daily sales data to weekly amounts will totally hide the fact that weekdays are "money losers", while weekends are "money makers" for some stores. In other words, key pieces of information are often lost through summarization, and there is no way to recover them by further analysis.
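The weekday and weekend effect can be illustrated with a minimal sketch. The store names and sales figures below are invented purely for illustration, and the helper functions are hypothetical; the point is only that two very different daily patterns can collapse to the same weekly total.

```python
# Hypothetical daily sales, Monday through Sunday (invented figures).
daily_sales = {
    "store_A": [100, 100, 100, 100, 100, 100, 100],  # flat all week
    "store_B": [20, 20, 20, 20, 20, 300, 300],       # weak weekdays, strong weekend
}

def weekly_total(days):
    """Summarize seven daily figures into one weekly amount."""
    return sum(days)

def weekday_weekend_split(days):
    """Keep one level of detail: (Mon-Fri total, Sat-Sun total)."""
    return sum(days[:5]), sum(days[5:])

for store, days in daily_sales.items():
    print(store, weekly_total(days), weekday_weekend_split(days))
# store_A 700 (500, 200)
# store_B 700 (100, 600)
```

Once only the weekly amount of 700 is stored, nothing in the summarized data distinguishes the two stores, and no further analysis of the weekly figures can recover the difference.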
Having established some of the problems with sampling, let me also note that (as expected) there are exceptions to the rule. In other words, the answer to the question: "Do we ever sample the data for analysis?" is "In some rare cases, yes."
Sampling is sometimes recommended to get a general feeling for the data, and in such cases it is advisable to take several samples and compare them. Sampling should be done only when the computing power is not sufficient to manage the task at hand within a given time-frame.
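One way to compare several samples is sketched below. The data is synthetic, and the product codes, sizes, seeds, and the `distinct_seen` helper are all assumptions made for illustration; the sketch simply draws a few random samples and checks how much of a high-cardinality field each one sees and how much they agree.

```python
import random

def distinct_seen(population, sample_size, seed):
    """Draw one random sample and return the set of distinct values in it."""
    rng = random.Random(seed)
    return set(rng.sample(population, sample_size))

# Synthetic "warehouse": 50,000 records spread over 5,000 product codes.
population = [f"P{i % 5000:04d}" for i in range(50_000)]
all_codes = set(population)

# Three independent samples of 500 records each (hypothetical sizes and seeds).
samples = [distinct_seen(population, 500, seed) for seed in (1, 2, 3)]

for number, seen in enumerate(samples, start=1):
    share = len(seen) / len(all_codes)
    print(f"sample {number}: {len(seen)} codes seen ({share:.1%} of all codes)")

# Codes that every sample happened to include.
common = set.intersection(*samples)
print(f"codes present in all three samples: {len(common)}")
```

No sample of 500 records can contain more than 500 of the 5,000 codes, so each one sees at most a tenth of the field's values, and different samples generally see different tenths. Disagreement between the samples is a sign that conclusions drawn from any single one of them should not be trusted.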
The question of analysis on the entire database is a slightly different, yet related, matter. In some cases, segmentation alone will not give us all the answers, and we may need to look at some of the overall characteristics of the database. But even in these cases we need not perform full exploratory analysis, and can simply compare some of the distributions within the warehouse with those within the data mine.
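Such a comparison of distributions can be sketched as follows. The field, its values, and the record counts are invented for illustration, and total variation distance is used here only as one simple, assumed choice of measure, not as a prescribed method.

```python
from collections import Counter

def relative_frequencies(values):
    """Return the share of records taking each value (values must be non-empty)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def total_variation_distance(p, q):
    """Half the sum of absolute differences: 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical "region" field in the full warehouse and in a segmented data mine.
warehouse_regions = ["north"] * 50 + ["south"] * 30 + ["west"] * 20
data_mine_regions = ["north"] * 10 + ["south"] * 30 + ["west"] * 10

p = relative_frequencies(warehouse_regions)
q = relative_frequencies(data_mine_regions)
print(round(total_variation_distance(p, q), 3))  # 0.3
```

A distance near zero suggests the data mine resembles the warehouse on that field; a large distance, as here, shows how far the segment departs from the overall population.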
Information Quality
As desktop data mining tools have become available, a serious issue with respect to sampling and information quality has emerged. End-users who have been told they can do data mining on their own get incorrect results, far more often than they realize.
In this scenario, the user somehow accesses a warehouse, samples or separates a moderate amount of data into a work-space, and interactively analyzes it, e.g. builds decision trees or uses an interactive OLAP/ROLAP tool to visualize aggregations. In these scenarios the analysis is affected by a large number of factors which the user unknowingly controls, but which are often random. These include dimension selection, sampling, tree-splits, etc.
As different users make different choices, the knowledge generated begins to vary, in the long term reducing the quality of corporate knowledge. As different users obtain different conclusions from the same data set, reliability vanishes, at times making the data warehouse a corporate liability rather than an asset.
References:
Parsaye, K., "Data Mining Beside a Warehouse", DS*.
Parsaye, K., "New Realms of Analysis", Database Programming & Design, April 1996.
"Large Scale Data Mining in Parallel", DBMS Magazine, March 1995.
Parsaye, K. and Chignell, M.H., "Intelligent Database Tools and Applications", New York: John Wiley and Sons, 1993.
Parsaye, K., "Data Mines for Data Warehouses", Database Programming & Design, September 1996.
Parsaye, K., "OLAP and Data Mining: Bridging the Gap", Database Programming & Design, February 1997.
Parsaye, K., "Machine-Man Interaction", DM Review, September 1997.
For more information, see http://www.datamining.com