INTERPRETATIONS OF PROBABILITY
PART I: IRRECONCILABLE DIFFERENCES

By Ed Colet


The notion of probability is at the core of most data mining and data analytic tools. Yet there are two fundamentally different interpretations of the concept. One is the empirical view, also known as the long-run relative frequency view or the objective view of probability. The other is the subjective view, which incorporates a degree of belief or certainty about the likelihood of an event occurring. In Part I of this two-part column, I comment on the differences between these approaches. Next week, in Part II, I will comment on how the two approaches can actually be reconciled and, in the context of data mining practice, peacefully co-exist.

In the traditional interpretation, probability is the long-run relative frequency of an event. The idea of the "long run" is that if a process with a fixed set of outcomes were repeated many times over, the proportion of repetitions yielding each outcome would settle toward a fixed value, and that value can be expressed in the form of an equation. An example is the case of rolling a fair die. Because each of the six outcomes is equally likely, the theoretical probability of rolling a "6" (or any of the other five outcomes) is 1/6, or approximately 0.167. This probability is purely theoretical in that one never actually has to roll a die. In the objective view, expected probabilities can be expressed as equations, and the relevant sampling distributions are well understood.
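The long-run idea can be illustrated with a short simulation. This is only a sketch: it assumes a fair six-sided die modeled by Python's pseudo-random generator, and the function name relative_frequency and the fixed seed are illustrative choices, not part of any standard method.

```python
import random


def relative_frequency(target, n_rolls, seed=0):
    """Return the fraction of n_rolls simulated fair-die rolls equal to target."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == target)
    return hits / n_rolls


theoretical = 1 / 6  # the purely theoretical probability of any single face

# The observed relative frequency is compared with 1/6 as the number of rolls grows.
for n in (10, 100, 1_000, 100_000):
    freq = relative_frequency(6, n)
    print(f"{n:>7} rolls: relative frequency {freq:.4f} (theoretical {theoretical:.4f})")
```

With only a handful of rolls the observed frequency can be far from 1/6; with many rolls it should land close to it, which is the sense in which the theoretical value describes the long run.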

A different approach to probability is the subjective view. The subjective view permits one to represent belief, or to incorporate existing information, in one's conception of the problem. An example of this approach is a statement like, "The St. Louis Rams will probably win the Super Bowl". The notion of probability here is clear, yet it appears very different from the earlier notion of long-run frequencies: the game in question is played only once, so there is no long run of repetitions to count.

Adherents to the objective position object to the subjective view precisely because it is subjective. If analysts hold different subjective beliefs, widely different interpretations and conclusions can follow from an analysis of identical data. This cannot be good for any scientific or purportedly objective activity.

Adherents to the subjective view (such as Bayesians) object to the objective position because it requires one to ignore existing prior information and to rely on purely theoretical probabilities as the standard of comparison. They argue that one should be able to use all the information available and get the most out of the data in order for an analysis to be truly useful.
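Both objections can be made concrete with a small sketch. It assumes one particular way of encoding prior belief, a Beta prior on a win probability updated with observed wins and losses; that modeling choice, the function name posterior_mean, and all of the numbers are illustrative assumptions rather than anything prescribed in this column.

```python
def posterior_mean(prior_a, prior_b, wins, losses):
    """Mean of the Beta posterior obtained from a Beta(prior_a, prior_b) prior.

    prior_a and prior_b act like previously "seen" wins and losses,
    which is how prior information enters the estimate.
    """
    a = prior_a + wins
    b = prior_b + losses
    return a / (a + b)


# Identical (made-up) data for both analysts: 7 wins and 3 losses.
wins, losses = 7, 3

skeptic = posterior_mean(2, 8, wins, losses)   # prior belief centered near 20%
optimist = posterior_mean(8, 2, wins, losses)  # prior belief centered near 80%

print(f"skeptic:  {skeptic:.2f}")   # 0.45
print(f"optimist: {optimist:.2f}")  # 0.75
```

The two analysts see the same ten games and still reach different estimates, which is the objectivist's complaint. The subjectivist's reply is that each estimate uses information the raw counts alone would discard.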

In Part II next week, we'll see that the subjective view should really be viewed not as an alternative to, but as an extension of, the more common relative frequency interpretation, and why in many cases the outcomes of the two approaches will be the same. We'll also see that the practice of data mining and knowledge discovery actually incorporates both positions, and oftentimes results in truly surprising and interesting patterns.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

