
MAKING A DECISION BASED ON THE RISK OF BEING WRONG
By Ed Colet


The objective of data mining software is to find hidden patterns in data. With large data sets, this means relying on powerful algorithms and data processing routines that can handle the volume efficiently. But the determination of what constitutes an interesting pattern is usually based on a quantitative threshold implemented as part of the data mining algorithms. Set the threshold too low and too many patterns are presented to the user as interesting, many of them erroneously. Set the threshold too high and too few patterns are presented as interesting, so some meaningful patterns remain hidden in the data, away from the user. Whatever the threshold value, there is always a risk of error. How might valid decisions be made despite the pervasive possibility of error?

The threshold for classifying something in a dichotomous way (in this case, as "interesting" or "not interesting") is usually expressed in terms of the probability of making a Type I error. A Type I error is a misclassification in which something is marked as interesting when in reality it is not (i.e., the pattern is really due to chance). A Type I error is also known as a false positive. Conventional practice usually accepts a 5% risk of this error. This means that, of the patterns that are really due to chance, about 5% can be expected to be classified as interesting in error. Expressed this way, a smaller value is the stricter criterion: lowering the accepted risk from 5% to 1% means fewer patterns are flagged.
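The meaning of the 5% figure can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: every "pattern" is pure noise drawn from a standard normal distribution, the test is a two-sided z-test with known unit variance, and the names and sizes (`N_PATTERNS`, `SAMPLE_SIZE`, `false_positive_rate`) are hypothetical, not taken from any particular data mining product.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05         # accepted Type I error risk
N_PATTERNS = 5000    # hypothetical number of candidate patterns, all pure noise
SAMPLE_SIZE = 50     # hypothetical number of observations behind each pattern


def p_value_two_sided(sample):
    """Two-sided z-test p-value for 'the true mean is 0', assuming unit variance."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha=ALPHA, n_patterns=N_PATTERNS, seed=0):
    """Fraction of pure-noise patterns that get flagged as 'interesting'."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_patterns):
        noise = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]
        if p_value_two_sided(noise) < alpha:
            flagged += 1  # a Type I error: nothing real is there
    return flagged / n_patterns


if __name__ == "__main__":
    print(f"share of noise patterns flagged at alpha={ALPHA}: "
          f"{false_positive_rate():.3f}")
```

Because nothing real is present in the simulated data, every flagged pattern is a false positive, and the flagged share should land close to the chosen threshold.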

Classifying a pattern as not interesting when in reality it is interesting (i.e., the pattern is not due to chance) is called a Type II error. Think of it as a "miss". Unlike the probability of a Type I error, which is fixed by the choice of threshold, the probability of making a Type II error cannot be known exactly, but an estimate can be computed. This involves making some assumptions about the shape and placement of the underlying distribution representing "objective reality", whose actual shape and placement are known only after the fact. Nevertheless, the relationship between Type I and Type II errors is straightforward. For a given amount of data, if you decrease the probability of a Type I error, you increase the probability of making a Type II error, and vice versa. In a data mining analysis, how does one know what the appropriate threshold value should be and, therefore, what the tradeoffs are?
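The tradeoff can be made concrete once an assumption about "objective reality" is written down. The following sketch assumes a one-sided z-test in which a real pattern shifts the test statistic by a fixed amount, measured in standard errors. Both the function name `type_ii_error` and the shift of 2.0 are hypothetical choices for illustration, not values from the article.

```python
from statistics import NormalDist


def type_ii_error(alpha, shift):
    """Probability of a miss for a one-sided z-test.

    alpha: accepted Type I error risk (the threshold).
    shift: assumed true effect, in standard errors. This is the assumption
           about the underlying distribution; it is not known in practice.
    """
    std_normal = NormalDist()
    # A pattern is flagged when the test statistic exceeds this value.
    critical = std_normal.inv_cdf(1.0 - alpha)
    # Under the assumed alternative, the statistic is Normal(shift, 1);
    # a miss occurs when it falls below the critical value.
    return std_normal.cdf(critical - shift)


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01):
        beta = type_ii_error(alpha, shift=2.0)
        print(f"alpha={alpha:.2f}  ->  estimated Type II risk={beta:.3f}")
```

Running the loop shows the estimated Type II risk rising as the accepted Type I risk is lowered, which is the tradeoff described above. A different assumed shift gives different numbers but the same direction.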

One way is to have the user or analyst determine the threshold value in an analysis. This is effective if the user/analyst has knowledge of the domain, including the inherent tradeoffs associated with possible errors.

But in the context of data mining, it is quite possible that the analyst doesn't know the domain or the data and is using data mining to explore a large data set. If so, a default threshold level has to be set, perhaps according to conventional statistical practice, which would put it at 0.05. This value has the apparent advantage of a history of conventional practice behind it. It also appears to offer a usability feature, in that the user can "safely" accept the default. But conventional practice may not be appropriate for this particular data set or the domain it is drawn from. The interpretation of data mining results, and the decisions that follow, then have to be made very carefully.

A strategy for decision making would then explicitly take into account the costs or risks associated with being wrong. In certain contexts, a Type I error (a false positive) is worse than a Type II error (a miss); in others, the reverse holds. Because much data mining software is customized for a particular domain, it should also be possible to present the risk of errors in terms of a customized cost measure. Choosing the decision that minimizes the costs associated with each type of error can help the user determine the appropriate course of action. Consider a hypothetical example of data mining for a marketing campaign. Detected patterns suggest that a certain demographic group may be receptive to a marketing campaign. A Type I error means the company launches the campaign when it should not have, and the loss is the cost of the campaign. A Type II error means the company does not launch the campaign when it should have, and loses out on the opportunity for profit. In this case, the better decision is the one whose possible error carries the lower cost.
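One way to sketch the marketing example is to weight each error's cost by how likely that error is. This goes a step beyond the text: the probability that the pattern is real (`p_real`) is an added assumption that an analyst would have to estimate, and the function name and all dollar figures below are hypothetical.

```python
def choose_action(p_real, campaign_cost, missed_profit):
    """Pick the action with the lower expected cost of being wrong.

    p_real:        assumed probability that the detected pattern is real.
    campaign_cost: loss if the campaign is launched on a spurious pattern
                   (the cost of a Type I error).
    missed_profit: profit forgone if a real pattern is ignored
                   (the cost of a Type II error).
    """
    # Launching is wrong only if the pattern is spurious.
    expected_cost_launch = (1.0 - p_real) * campaign_cost
    # Not launching is wrong only if the pattern is real.
    expected_cost_skip = p_real * missed_profit
    if expected_cost_launch <= expected_cost_skip:
        return "launch", expected_cost_launch
    return "do not launch", expected_cost_skip


if __name__ == "__main__":
    # Hypothetical figures: a 100,000 campaign with 400,000 of profit at stake.
    for p in (0.5, 0.1):
        action, cost = choose_action(p, 100_000, 400_000)
        print(f"p_real={p}: {action} (expected cost of error {cost:,.0f})")
```

With these made-up figures, launching is the cheaper risk when the pattern is as likely real as not, while skipping the campaign becomes the cheaper risk when the pattern is probably spurious.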

Explicitly considering the possibility of being wrong can therefore lead one to make the right decision.

---

Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

