
DATA MINING: THE TWO CULTURES, PART I
by Robert Grossman


1. Introduction

Data mining is about finding patterns in data. The importance of data mining has grown dramatically as the amount of archived and warehoused digital data has grown.

The historical roots of data mining come primarily from two different directions: from statistics and from artificial intelligence. The statistical culture in data mining emphasizes the role of predictive modeling (PM). The artificial intelligence culture emphasizes the role of knowledge discovery (KD). See Figure 1 (below).

In this article we discuss these two data mining cultures. This is the first in a series of several articles which will discuss some key issues in data mining from this perspective.



        Statistics                     Artificial Intelligence
            |                                     |
            v                                     v
    PM Culture in DM  --->  Data Mining  <---  KD Culture in DM

Figure 1.

The two cultures in data mining. Data mining (DM) can be viewed as having two main forebears: statistics and artificial intelligence. The statistical tradition emphasizes predictive modeling (PM), while the artificial intelligence community emphasizes knowledge discovery (KD).

2. The Central Issue

The central issue is simple. To illustrate it, consider using data mining for fraud detection. In the PM tradition, given a credit card transaction, telephone call, or insurance claim x, the goal is to predict whether x is fraudulent as accurately as possible. This is usually treated as a classification problem (0 means no fraud, 1 means fraud). A classifier examines the attributes of x (such as the number of transactions during the past hour) and returns 0 or 1, indicating whether x is legitimate or fraudulent.
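As a minimal sketch of what such a classifier looks like, the Python below scores a transaction from two attributes and thresholds the score to return 0 or 1. The attribute names, the weights, and the logistic form are assumptions made for illustration; they are not fitted to any data and are not taken from a particular system.

```python
import math
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float        # transaction amount in dollars (hypothetical attribute)
    txns_past_hour: int  # number of transactions during the past hour


# Hypothetical weights, chosen by hand for illustration rather than learned.
WEIGHTS = {"bias": -4.0, "amount": 0.002, "txns_past_hour": 0.8}


def fraud_score(x: Transaction) -> float:
    """Map the attributes of x to a score between 0 and 1."""
    z = (WEIGHTS["bias"]
         + WEIGHTS["amount"] * x.amount
         + WEIGHTS["txns_past_hour"] * x.txns_past_hour)
    return 1.0 / (1.0 + math.exp(-z))


def classify(x: Transaction, threshold: float = 0.5) -> int:
    """Return 1 (fraud) if the score reaches the threshold, else 0 (no fraud)."""
    return 1 if fraud_score(x) >= threshold else 0


print(classify(Transaction(amount=25.0, txns_past_hour=1)))
print(classify(Transaction(amount=500.0, txns_past_hour=6)))
```

In practice the weights would be estimated from labeled historical transactions, and the threshold would be chosen to balance missed fraud against false alarms.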

Generally, more accurate classifiers are more complex. For example, a good tree-based fraud classifier for a large data set might contain thousands of nodes. At best this is difficult to interpret. This is a basic trade-off. In the PM tradition, ease of interpretation is given up in exchange for increased accuracy.
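To give a rough sense of scale, the sketch below computes the largest possible size of a binary decision tree of a given depth. It assumes binary splits and a completely filled tree, which real classifiers rarely are, so the figures are upper bounds rather than typical sizes.

```python
def max_nodes(depth: int) -> int:
    """Maximum number of nodes in a binary tree whose root is at depth 0."""
    return 2 ** (depth + 1) - 1


def max_leaves(depth: int) -> int:
    """Maximum number of leaves, i.e. of distinct root-to-leaf rules."""
    return 2 ** depth


for depth in (2, 5, 10):
    print(depth, max_nodes(depth), max_leaves(depth))
```

A tree of depth 2 has at most 7 nodes and 4 rules, which a person can read at a glance. At depth 10 the bounds are 2047 nodes and 1024 rules, which is already in the range where inspection by eye stops being practical.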

On the other hand, in the KD tradition, the goal is to extract useful facts from large data sets. To be useful, these facts must be easily interpretable and easily actionable. For example, an algorithm for extracting shallow trees might reveal that low-dollar transactions at cash machines outside of certain retail stores are highly correlated with fraud. The action here might be to put in place a rule which defers subsequent transactions for certain types of high-priced retail goods. This illustrates another basic trade-off. In the KD tradition, accuracy is given up in exchange for ease of interpretation and implementation.
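The simplest shallow tree has a single split. The sketch below searches for the one-condition rule (attribute equals value) with the highest fraud rate. It is an illustration only and not a specific published algorithm; the attribute names, the `min_support` parameter, and the six toy records are all invented for the example.

```python
from collections import namedtuple

Rule = namedtuple("Rule", "attribute value fraud_rate support")


def best_single_rule(records, label_key="fraud", min_support=2):
    """Return the rule 'attribute == value' with the highest fraud rate.

    Rules matching fewer than min_support records are ignored, since a
    rule backed by one or two cases is not a reliable fact.
    """
    best = None
    attributes = [k for k in records[0] if k != label_key]
    for attr in attributes:
        for value in sorted({r[attr] for r in records}):
            matched = [r for r in records if r[attr] == value]
            if len(matched) < min_support:
                continue
            rate = sum(r[label_key] for r in matched) / len(matched)
            if best is None or rate > best.fraud_rate:
                best = Rule(attr, value, rate, len(matched))
    return best


# Toy records, invented for illustration; 1 means fraud, 0 means no fraud.
TOY = [
    {"channel": "cash_machine", "amount_band": "low", "fraud": 1},
    {"channel": "cash_machine", "amount_band": "low", "fraud": 1},
    {"channel": "cash_machine", "amount_band": "high", "fraud": 0},
    {"channel": "store", "amount_band": "low", "fraud": 0},
    {"channel": "store", "amount_band": "low", "fraud": 0},
    {"channel": "store", "amount_band": "high", "fraud": 0},
]

rule = best_single_rule(TOY)
print(f"{rule.attribute} == {rule.value}: "
      f"fraud rate {rule.fraud_rate:.2f} over {rule.support} records")
```

The result is a single statement that an analyst can read and turn into an operational rule. It will usually classify less accurately than a large model, which is the trade-off described above.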

3. So What is Data Mining and the Data Mining Process?

Data mining is one step in the data mining process. The definition of data mining and of the data mining process differs somewhat between the two cultures.

A standard definition of data mining from the KD perspective is given by Fayyad, Piatetsky-Shapiro, and Smyth in reference 1 at the conclusion of this two-part article: "Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data."

Here is the PM perspective: Data mining is the automatic discovery of associations, clusters, changes, patterns, anomalies, and other significant structures in large data sets, and the exploitation of these to improve predictive modeling.

The data mining process consists of a sequence of steps, which are usually repeated in an iterative fashion (this will be discussed in greater detail in a subsequent article). The process typically includes:

   1) data preparation and cleaning
   2) data warehousing
   3) identifying relevant predictive attributes
   4) computing derived attributes
   5) data reduction and attribute projection
   6) extracting patterns relevant to the predictive attributes using one or more data mining algorithms
   7-pm) predictive modeling
   7-kd) knowledge extraction
   8-pm) scoring of operational and warehoused data
   8-kd) interactive data analysis and discovery
   9) validation, report preparation, and related activities
   10) repeating the process as necessary

Here, Steps 7 and 8 differ slightly between the two traditions, as indicated by the suffixes pm and kd. Data mining projects usually incorporate aspects of both the PM and KD cultures. A common strategy is for groups of analysts and modelers in an organization to focus on the KD aspects of data mining and for IT staff with operational responsibility to focus on the PM aspects of data mining.
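A few of these steps can be sketched as small functions composed into a pipeline, with the pm and kd branches sharing the same prepared data. The function names, record fields, threshold rule, and toy records below are hypothetical; the sketch covers only Steps 1, 4, 5, 7-kd, and 8-pm and stands in a fixed threshold for a fitted model.

```python
def prepare(records):
    """Step 1: drop records with missing attribute values."""
    return [r for r in records if all(v is not None for v in r.values())]


def derive(records):
    """Step 4: add a derived attribute, the average amount per transaction."""
    return [{**r, "avg_amount": r["total_amount"] / r["txn_count"]}
            for r in records]


def project(records, keep):
    """Step 5: keep only the selected attributes."""
    return [{k: r[k] for k in keep} for r in records]


def score_pm(records, threshold):
    """Step 8-pm: score each record with a hypothetical threshold model."""
    return [1 if r["avg_amount"] < threshold else 0 for r in records]


def summarize_kd(records, threshold):
    """Step 7-kd: state one readable fact about the data."""
    n = sum(1 for r in records if r["avg_amount"] < threshold)
    return (f"{n} of {len(records)} accounts average under "
            f"{threshold} dollars per transaction")


def run(records, threshold=20.0):
    data = project(derive(prepare(records)), keep=("account", "avg_amount"))
    return score_pm(data, threshold), summarize_kd(data, threshold)


# Toy records, invented for illustration.
RAW = [
    {"account": "A", "total_amount": 30.0, "txn_count": 3},
    {"account": "B", "total_amount": 500.0, "txn_count": 2},
    {"account": "C", "total_amount": None, "txn_count": 1},
]

scores, summary = run(RAW)
print(scores)
print(summary)
```

The pm branch returns one score per record, suitable for an operational system, while the kd branch returns a sentence meant for an analyst. Repeating the process (Step 10) would mean calling `run` again after changing the preparation, the derived attributes, or the threshold.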

---

For more information, see http://www.magnify.com

---

The concluding portion of this article will appear in the upcoming edition of D S *.

