BUILDING BRIDGES BETWEEN FORMAL STATISTICAL ANALYSIS
AND DATA MINING TECHNOLOGY
by Ed Colet
Data mining and the broader area of knowledge discovery from databases (KDD) have been recognized as a multi-disciplinary field. One of the obvious disciplines associated with data mining is formal statistical analysis. Last year, the premier data mining technical conference, KDD-97, was intentionally co-located with the American Statistical Association's (ASA) primary conference in order to encourage interaction. The data mining sessions at ASA, both tutorials and technical papers, were extremely well attended and well received. Roughly one year later, it's apparent that there could still be far more collaboration between practitioners.
Despite the obvious relationship and the interest among researchers, there seems to be a separation between practicing statisticians and active data miners in industrial settings. Many organizations have entire departments devoted to statistical analysis but are wary of incorporating data mining initiatives. By the same token, organizations often commit to investing in data mining tools and technologies (often in conjunction with their data warehousing efforts) but do so under the aegis of their Information Technology departments, without involving their statisticians and/or analysts. But for data mining to provide significant competitive advantages for an organization, there should be closer linkages between data mining technology and statistical methods.
In order to foster a healthy collaboration between the two, one needs to be aware of the fundamental differences that exist between data miners and statistical analysts. These differences exist in part because of historically different roots that philosophically influence how and what one does.
Different roots, different concerns, but the same overall goal. To start at the beginning, the roots of statistical analysis lie in Mathematics (some purely mathematical statisticians even go so far as to view collected data as irrelevant!). Statisticians use math as a tool to gain an understanding of the underlying process(es) that generated the data. Data mining has roots in the much younger field of Computer Science, with strong ties to AI and database systems. Its objective is to develop powerful systems, tools, and algorithms that efficiently process information. Ultimately, both statisticians and data miners share the same goal: to detect and describe trends and patterns in data and, if possible, to infer underlying causality and make generalizations. But the steps along the way to achieving this overall goal are fundamentally different.
A LITTLE BIT OF "BEEN THERE, DONE THAT"
Since they share the same goal, it's not surprising that the activities of data miners and statisticians frequently overlap. But statisticians sometimes view data miners as a bunch of computer scientists re-inventing (badly) techniques that have been in practice for decades. For example, a recent data mining paper about finding associations referred to (invented) a notion of correlation where values less than 1.0 indicated a negative association! This is in direct contrast to the common understanding among statisticians of correlation as a value ranging from -1.0 to +1.0, where values below zero indicate a negative association. This use of a familiar term for a familiar concept, but in an unfamiliar way, suggests that data miners would be better served by using tried and true techniques from statisticians.
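To make the mismatch concrete, the sketch below computes two numbers for the same pair of binary attributes: the statistician's Pearson correlation, and a ratio of the observed co-occurrence rate to the rate expected under independence. The ratio is only an assumed stand-in for the kind of measure described above (the paper's actual definition is not reproduced here), and the basket data and function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; always falls in [-1.0, +1.0]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cooccurrence_ratio(xs, ys):
    """P(x and y) / (P(x) * P(y)); 1.0 means independence.

    Assumed illustration of a measure where values below 1.0
    signal a negative association.
    """
    n = len(xs)
    p_x = sum(xs) / n
    p_y = sum(ys) / n
    p_xy = sum(1 for x, y in zip(xs, ys) if x and y) / n
    return p_xy / (p_x * p_y)

# Hypothetical market baskets: 1 = item present, 0 = item absent.
bread = [1, 1, 1, 1, 0, 0, 0, 0]
butter = [1, 0, 0, 0, 1, 1, 1, 0]

r = pearson(bread, butter)
ratio = cooccurrence_ratio(bread, butter)
print(r, ratio)
```

On these baskets both measures report a negative association, but on different scales: the Pearson coefficient is -0.5 (negative means below zero), while the ratio is 0.5 (negative means below one). Calling both "correlation" invites exactly the confusion described above.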
CONTROLLING THE SOURCE: DATA COLLECTION
Statistical analysis has traditionally addressed issues regarding data collection that precede any data analysis being conducted. "What type of experimental design is best?" "What are the dependent and independent variables?" "How many customers/users do we need to collect data from?" etc. Entire sub-disciplines with experts on survey and questionnaire design now exist. On the other hand, data mining is often undertaken on data that's already been collected and is readily available. Often the data were collected for a different purpose altogether, or are available as a byproduct of transactional systems that can now conveniently store them in databases. One result is the recognized importance of the "data-cleaning" effort that precedes data mining. In this case, one might not simply be able to use tried and true techniques from statistics. With extremely large heterogeneous data sets collected through methods that may not be known -- what formal mathematical techniques are appropriate? There seems to be a ripe opportunity for active collaborative research on this from all parties.
HOW TO LOOK FOR PATTERNS
According to statisticians (especially in the social sciences), data analysis should be hypothesis driven (e.g. test whether men spend more than women on product X). In contrast to this, data mining is data driven. One may not know what questions (hypotheses) to ask of the data because of its sheer volume (a large number of attributes and/or observations). Also, it may not be practical to formulate and query all possible hypotheses. The reason that data-driven analysis is viewed pejoratively by statisticians is that if one pokes around in one's data in order to formulate a hypothesis to test, then the hypothesis that is tested is ultimately biased in some way by the subset of data that's been explored.
In contrast, because data mining theoretically allows one to explore essentially the entire data set, it's not possible to introduce biases by virtue of looking at only part of the data. And even with a very specific hypothesis and a carefully constructed statistical model, it's sometimes apparent that another factor (either present in the data set or not) is at work (e.g. as indicated by the analysis of residuals in regression), and that one's hypothesis-driven statistical model does not fit the data.
Clearly, one should always explore the entire data-space of possibilities if one is able to because this presents the best possibility for discovering new knowledge. Statisticians would be well served to take advantage of some of the newer data mining algorithms that allow one to effectively investigate larger data-spaces. By way of analogy, one shouldn't limit oneself to looking for a set of keys under the lamp post just because that's where the light is!
DATA SET SIZE AND PERFORMANCE CONCERNS
Statistical analysis has traditionally been concerned with smaller data sets than those associated with data mining. Smaller data sets are arrived at by virtue of sampling methods or as the result of a specific experimental design. Smaller data sets are more manageable for the subsequent analytical computations, which were historically conducted first by hand, then with calculators, and now via sophisticated software. In contrast, one of the defining features associated with data mining has been the large size of the data sets -- megabytes and terabytes.
Much effort is devoted to ways to efficiently process large data sets via pre-computed aggregations (e.g. OLAP) or other techniques such as condensed representations using frequent sets. Again, there seem to be ripe opportunities for collaborative research on developing new methods to analyze large-scale data sets. Because many of the algorithms associated with data mining are designed to efficiently process large data sets, a primary concern has been performance and scalability. As a result, the mathematical techniques that data mining uses to detect trends in the data tend to be computationally simpler, and less mathematically rigorous, than those of statistical analysis.
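As a rough illustration of the pre-computed aggregation idea, the sketch below makes one pass over a set of transactions and stores counts and totals for every combination of two dimensions, so that later summary queries are answered from the stored cells rather than by rescanning the raw records. The records, dimension names, and functions are hypothetical, and a real OLAP system is far more elaborate than this.

```python
from collections import defaultdict

# Hypothetical transaction records: (region, product, amount).
transactions = [
    ("east", "widget", 10.0),
    ("east", "widget", 14.0),
    ("east", "gadget", 7.5),
    ("west", "widget", 9.0),
    ("west", "gadget", 12.5),
    ("west", "gadget", 11.0),
]

def build_cube(records):
    """One pass over the data; None in a key means 'all values'."""
    cube = defaultdict(lambda: [0, 0.0])  # key -> [count, total]
    for region, product, amount in records:
        for key in ((region, product), (region, None),
                    (None, product), (None, None)):
            cell = cube[key]
            cell[0] += 1
            cell[1] += amount
    return cube

def mean_amount(cube, region=None, product=None):
    """Answer a summary query from the stored cells only."""
    cell = cube.get((region, product))
    if cell is None:
        return None  # no records for this combination
    count, total = cell
    return total / count

cube = build_cube(transactions)
print(mean_amount(cube, region="east"))
print(mean_amount(cube, region="west", product="gadget"))
```

The trade-off is the one described above: the stored cells make queries fast, but they only support the summaries that were anticipated when the aggregation was built.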
Formal statistical analyses and methods are explicitly linked to the properties and characteristics of the data, with explicit assumptions about underlying distributions and such. Their methods also account for the possibilities of random or chance processes operating, and incorporate probability considerations. One reason that computational methods in data mining aren't as mathematically rigorous simply has to do with performance concerns.
Algorithms that also check for possible violations of underlying assumptions (e.g. that the data follow a Gaussian distribution) would slow system performance. But without examining the properties of the data, the validity, reliability, and ultimately the actionability of results can be affected.
SEEING WHAT'S IMPORTANT
Visualization has been recognized as a significant component of data mining. This is driven by the need to find effective ways for end-users to interpret complex trends and patterns in large amounts of data. Visualization also takes advantage of the availability of powerful systems that can now render complex graphics very quickly. Statistical analysis has its share of graphical techniques, but in practice, graphics have often been used to display results after formal quantitative methods have been applied. For example, a correlation coefficient (which assumes linearity) is frequently calculated and interpreted, when a simple scatter plot would make non-linear trends in the data apparent.
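The correlation example can be checked with a few lines of code. The sketch below uses made-up data in which y is completely determined by x through a curve, and compares the resulting correlation coefficient with that of a straight-line relationship. The data and function name are illustrative assumptions only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient (measures linear association only)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
curve = [x * x for x in xs]      # y depends entirely on x, but not linearly
line = [2 * x + 1 for x in xs]   # a straight-line relationship

r_curve = pearson(xs, curve)
r_line = pearson(xs, line)
print(r_curve, r_line)
```

The coefficient for the curved data is zero even though y is a perfect function of x, while the straight line gives a coefficient of approximately 1. A scatter plot of the curved data would show the U shape at a glance.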
The name of the game is to detect and make use of information buried within the data. A goal would be to transfer some of the formal rigor associated with statistics into data mining techniques, while capitalizing on the computational power associated with data mining technology. An obvious implementation would be a two-step sequence in which apparently interesting patterns are discovered via efficient algorithms and are then subjected to more rigorous validation techniques. It should go without saying that both active data miners and practicing statisticians should be involved.
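One way the two-step sequence might look in miniature is sketched below. Step one scans every pair of attributes in one half of the data for the strongest apparent relationship; step two tests that single candidate on the other half, which played no part in the search, using a permutation test. The synthetic data, the planted relationship between attributes 0 and 1, the choice of a permutation test, and all names are assumptions made for illustration.

```python
import math
import random
from itertools import combinations

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def column(rows, j):
    return [row[j] for row in rows]

def discover(rows):
    """Step 1: scan all attribute pairs, return the most correlated pair."""
    pairs = combinations(range(len(rows[0])), 2)
    return max(pairs, key=lambda p: abs(pearson(column(rows, p[0]),
                                                column(rows, p[1]))))

def permutation_p_value(xs, ys, rng, n_perm=200):
    """Step 2: how often does shuffled data look at least this strong?"""
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic data: attribute 1 tracks attribute 0; attributes 2-5 are noise.
rng = random.Random(0)
rows = []
for _ in range(200):
    base = rng.gauss(0, 1)
    rows.append([base, base + rng.gauss(0, 0.1)]
                + [rng.gauss(0, 1) for _ in range(4)])

explore, holdout = rows[:100], rows[100:]   # search on one half only
best_pair = discover(explore)
p_value = permutation_p_value(column(holdout, best_pair[0]),
                              column(holdout, best_pair[1]), rng)
print(best_pair, p_value)
```

Keeping the validation data separate from the data used for the search is what addresses the statistician's objection that a hypothesis is biased by the data explored to formulate it.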
---
For more information, see http://www.virtualgold.com