INTERPRETATIONS OF PROBABILITY PART II: DATA MINING AND INTERESTING PATTERNS
by Ed Colet

Last week in part I of this series, I outlined two apparently fundamentally different interpretations of the notion of probability. Part II discusses how these different views are reconciled in the context of data mining, where probability notions are integral components of analyses. And as we'll see, surprising patterns are easily discovered.

To recap, one view of probability was the long-run relative frequency view held by the Objectivists. Expected probabilities are based on a theoretical long-run sequence of repeated observations. Often, the long-run frequency distributions are theoretical and specified by equations. In apparent contrast to this view was the subjective view, in which expected probabilities can incorporate personal beliefs. Rather than deal with theoretical sampling distributions specified as equations, this view relies strongly on simulation techniques to arrive at the applicable frequency distributions. With powerful computer systems, it has become feasible to simply run a simulation and then literally count the results.

With most simple probability problems, the objective and subjective (via simulations) outcomes will be similar, and the two approaches are for all intents and purposes the same. For example, if one simulates the repeated rolling of a fair die, the observed relative frequencies will closely match the theoretical probability of 1/6 for each number (although a simulation will rarely have each outcome come out at exactly 1/6). But for more complex problems, or problems incorporating prior knowledge and beliefs, the two approaches may not always arrive at similar results.
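The die example can be sketched in a few lines of Python. This is only an illustration: the function name, the number of rolls, and the fixed seed are assumptions made here, not part of the original discussion.

```python
import random
from collections import Counter


def simulate_die_rolls(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times and return relative frequencies."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    # "Literally count the results": frequency = count / number of rolls.
    return {face: counts[face] / n_rolls for face in range(1, 7)}


if __name__ == "__main__":
    frequencies = simulate_die_rolls(60_000)
    for face, freq in frequencies.items():
        print(f"face {face}: observed {freq:.4f}, theoretical {1 / 6:.4f}")
```

The observed frequencies should sit close to 1/6 without matching it exactly, and they tighten as the number of rolls grows.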

In the typical process of a data mining analysis, both objective and subjective views become reconciled. Data mining typically deals with large data stores. These large data stores can be used to represent a long-run history of events that have occurred. As such, they can be used as an empirical basis for determining expected probabilities. For example, a telecommunications company may have records that track the proportion of cell phone calls made by senior citizen customers. The expected probability that a call is made by a senior citizen can then literally be based on the long-run history available in the data. Routinely comparing the proportion of recent calls made by senior citizens against this long-run history can be used as a fraud detection mechanism: a large increase in the recent proportion, for instance, stands out against the expected value. Large discrepancies between recent patterns and the expected probability are flagged as interesting and subject to further investigation.
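One simple way such a comparison might be carried out is a z-score for a sample proportion against the long-run baseline. The sketch below is an assumption about how the check could look, not a description of any particular fraud detection system; the baseline of 8%, the sample sizes, and the threshold of 3 standard errors are all made-up illustrative values.

```python
import math


def proportion_z_score(baseline_p, recent_hits, recent_total):
    """z-score of a recent sample proportion against a long-run baseline proportion."""
    sample_p = recent_hits / recent_total
    # Standard error of a proportion if the baseline still holds.
    std_err = math.sqrt(baseline_p * (1 - baseline_p) / recent_total)
    return (sample_p - baseline_p) / std_err


def is_interesting(baseline_p, recent_hits, recent_total, threshold=3.0):
    """Flag the recent sample when it departs strongly from the baseline."""
    return abs(proportion_z_score(baseline_p, recent_hits, recent_total)) > threshold


if __name__ == "__main__":
    baseline = 0.08  # hypothetical long-run share of calls made by senior citizens
    print(is_interesting(baseline, 130, 1000))  # 13% of recent calls: flagged
    print(is_interesting(baseline, 85, 1000))   # 8.5% of recent calls: not flagged
```

The threshold is a judgment call: a lower value flags more patterns for investigation, at the cost of more false alarms.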

A common practice during a data mining and knowledge discovery analysis is the use of "what if" analysis. These "what if" analyses are essentially simulations, often used when there is no long-run history to draw from for answers. "What if" analyses allow analysts to incorporate their extensive domain knowledge and beliefs. To continue our telecommunications scenario, the company may consider selling a fraud detection alert service to senior citizens that will prevent fraudulent activity associated with their cellular accounts. How successful might this product be? The company's analysts can run various what-if scenarios (varying price points, costs, customer acceptance, etc.) to estimate the probability of successfully selling this service. So, the combination of drawing upon long-run histories and using powerful simulations reconciles what seemed to be fundamentally incompatible notions about probability.
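A what-if analysis of this kind can be sketched as a small Monte Carlo simulation. Everything numeric below is a hypothetical stand-in for an analyst's beliefs: the customer base, the fixed cost, the range of acceptance rates, the cost to serve a subscriber, and the assumption that acceptance falls off linearly as the price approaches a ceiling. "Success" is taken here to mean a positive monthly profit, which is also an assumption.

```python
import random


def simulate_profit(rng, price, n_customers, fixed_cost, max_price):
    """One what-if scenario: draw uncertain inputs and return the resulting profit."""
    base_acceptance = rng.uniform(0.05, 0.15)  # believed uptake if the service were free
    unit_cost = rng.uniform(1.0, 3.0)          # believed monthly cost per subscriber
    # Assumed demand curve: acceptance shrinks linearly to zero at max_price.
    acceptance = base_acceptance * max(0.0, 1.0 - price / max_price)
    subscribers = n_customers * acceptance
    return subscribers * (price - unit_cost) - fixed_cost


def probability_of_success(price, n_trials=10_000, n_customers=50_000,
                           fixed_cost=10_000.0, max_price=15.0, seed=0):
    """Share of simulated scenarios in which the service turns a profit."""
    rng = random.Random(seed)
    wins = sum(
        simulate_profit(rng, price, n_customers, fixed_cost, max_price) > 0
        for _ in range(n_trials)
    )
    return wins / n_trials


if __name__ == "__main__":
    for price in (2.0, 5.0, 8.0, 11.0, 15.0):
        print(f"price {price:5.2f}: P(success) ~ {probability_of_success(price):.3f}")
```

Changing the assumed ranges or the demand curve is exactly where the analyst's domain knowledge enters, and the estimated probability is only as good as those inputs.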

It is also the case that relying solely on a long-run frequentist view can be limiting. It cannot be used for modeling the probability of non-repeatable events, because repeating the event over the long run is not possible. For example, estimating the success of a marketing campaign (the proportion of customers buying) has to take into account various factors such as the product, the business plan, the competition, the state of the economy, etc. The notion of holding things constant, rolling back time repeatedly, and then counting the event's number of occurrences to determine a probability is neither meaningful nor possible. What is possible is the use of simulations that can vary these factors. In general, simulations are a powerful approach to modeling complex problems because they can reveal interesting and surprising outcomes.

Simulations can quickly reveal results that would be very difficult to derive from a purely theoretical (non-simulated) view. Parrondo's paradox is an example of this. It points out the surprising pattern that two games, each of which a player is expected to lose over the long run, can actually produce steady winnings if played alternately. Simulations show that a player who plays only game A, or only game B, 100 times loses money on average. But when the games are alternated in certain patterns (or even switched between at random), simulations routinely show that money accumulates. There must be a direct but subtle dependence between the two games for this to work, and that dependence is far from obvious from the rules of the games alone. Without understanding it, it is extremely difficult to determine a winning strategy without the use of simulations.
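The effect is easy to reproduce. The sketch below uses the commonly cited formulation of Parrondo's games, which is an assumption here since the article does not spell out the rules: game A is a coin flip with win probability 0.5 - e, and game B has win probability 0.1 - e when the player's capital is a multiple of 3 and 0.75 - e otherwise, with e = 0.005. The number of plays and trials, the seed, and the function names are illustrative choices.

```python
import random

EPSILON = 0.005  # small bias that makes each game losing on its own (assumed value)


def play_game_a(capital, rng):
    """Game A: a slightly unfavourable coin flip."""
    return capital + 1 if rng.random() < 0.5 - EPSILON else capital - 1


def play_game_b(capital, rng):
    """Game B: the coin used depends on whether capital is a multiple of 3."""
    win_prob = (0.10 if capital % 3 == 0 else 0.75) - EPSILON
    return capital + 1 if rng.random() < win_prob else capital - 1


def final_capital(strategy, n_plays, rng):
    """Play n_plays rounds under a strategy: 'A', 'B', 'AABB', or 'random'."""
    capital = 0
    for turn in range(n_plays):
        if strategy == "random":
            game = "A" if rng.random() < 0.5 else "B"
        elif strategy == "AABB":
            game = "AABB"[turn % 4]  # two plays of A, then two plays of B
        else:
            game = strategy
        if game == "A":
            capital = play_game_a(capital, rng)
        else:
            capital = play_game_b(capital, rng)
    return capital


def mean_final_capital(strategy, n_plays=1000, n_trials=500, seed=0):
    """Average final capital over many independent simulated players."""
    rng = random.Random(seed)
    total = sum(final_capital(strategy, n_plays, rng) for _ in range(n_trials))
    return total / n_trials


if __name__ == "__main__":
    for strategy in ("A", "B", "AABB", "random"):
        print(f"{strategy:>6}: mean final capital {mean_final_capital(strategy):+.2f}")
```

Under these assumed rules, the averages for A alone and B alone come out negative, while the AABB pattern and random switching come out positive. The link between the games in this formulation runs through the player's capital: game B's odds depend on it, and playing game A changes how often the capital lands on a multiple of 3.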

To conclude, in the context of data mining, both the objective and the subjective notions of probability are routinely employed in the analytical process to discover surprising but important patterns. In this sense, the subjective view is really an extension of the objective perspective. And as Parrondo's paradox illustrates, the use of simulations (popularized by subjective approaches) can readily uncover surprising patterns that point to a winning strategy.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com.
