INTERPRETATIONS OF PROBABILITY PART II: DATA MINING AND INTERESTING PATTERNS
by Ed Colet
Last week, in Part I of this series, I outlined two apparently fundamentally
different interpretations of the notion of probability. Part II discusses how
these different views are reconciled in the context of data mining, where
probability notions are integral components of analyses. And as we'll see,
simulations make surprising patterns easy to discover.
To recap, one view of probability was the long-run relative frequency view
held by the Objectivists. Expected probabilities are based on a theoretical
long-run sequence of repeated observations. Often, the long-run frequency
distributions are theoretical and specified by equations. In apparent contrast
to this view was the subjective view, in which expected probabilities can
incorporate personal beliefs. Rather than deal with theoretical sampling
distributions specified as equations, this view relies heavily on
simulation techniques to arrive at the applicable frequency distributions.
With powerful computer systems, it has become feasible to simply run a
simulation and then literally count the results.
With most simple probability problems, the objective and subjective (via
simulation) outcomes will be similar, and the two approaches are for all
intents and purposes the same. For example, if one simulates the repeated
rolling of a die, the observed relative frequencies will closely match the
theoretical probability of 1/6 for each number (although a simulation will
rarely have each outcome come up exactly 1/6 of the time). But for more
complex problems, or problems incorporating prior knowledge and beliefs, the
two approaches may not always arrive at similar results.
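The die example can be tried directly. The following is a minimal sketch in
Python, not part of the original analysis; the function name, the number of
rolls, and the seed are illustrative choices.

```python
import random
from collections import Counter


def simulate_die_rolls(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times.

    Returns a dict mapping each face (1-6) to its observed relative
    frequency. The seed is fixed only to make the run repeatable.
    """
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}


if __name__ == "__main__":
    freqs = simulate_die_rolls(60_000)
    for face, freq in freqs.items():
        # Observed frequencies hover near 1/6 but rarely equal it exactly.
        print(f"face {face}: observed {freq:.4f}, theoretical {1 / 6:.4f}")
```

Increasing the number of rolls should pull the counted frequencies closer to
1/6, which is the long-run behavior the objective view describes.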
In the typical process of a data mining analysis, both objective and
subjective views become reconciled. Data mining typically deals with large
data stores. These large data stores can be used to represent a long-run
history of events that have occurred. As such, they can be used as an
empirical basis for determining expected probabilities. For example, a
telecommunications company may have records that track the proportion of cell
phone calls made by senior citizen customers. As such, the expected
probability that a call is made by a senior citizen can literally be based on
an existing long-run history that is available in the data. Routinely
comparing the proportion of recent calls made by senior citizens against this
long-run history can be used as a fraud detection mechanism: a large increase,
for example, would stand out. Large discrepancies between recent patterns and
the expected probability are flagged as interesting and subject to further
investigation.
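One simple way to make "large discrepancy" concrete is a z-score for a sample
proportion. The sketch below is an assumption about how such a check might
look, not the method of any particular company; the baseline of 8%, the call
counts, and the threshold of 3 are made-up illustrative values.

```python
import math


def proportion_z_score(baseline_p, recent_hits, recent_total):
    """Standardized gap between a recent sample proportion and the
    long-run baseline proportion (normal approximation)."""
    p_hat = recent_hits / recent_total
    std_error = math.sqrt(baseline_p * (1 - baseline_p) / recent_total)
    return (p_hat - baseline_p) / std_error


def is_interesting(baseline_p, recent_hits, recent_total, threshold=3.0):
    """Flag the recent sample when it sits more than `threshold`
    standard errors away from the long-run baseline."""
    z = proportion_z_score(baseline_p, recent_hits, recent_total)
    return abs(z) > threshold


if __name__ == "__main__":
    baseline = 0.08  # hypothetical long-run share of calls by seniors
    for hits in (85, 130):  # hypothetical counts out of 1,000 recent calls
        z = proportion_z_score(baseline, hits, 1000)
        flagged = is_interesting(baseline, hits, 1000)
        print(f"{hits}/1000 calls: z = {z:.2f}, flagged = {flagged}")
```

The threshold is a judgment call: a lower value flags more patterns for
investigation at the cost of more false alarms.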
A common practice during a data mining and knowledge discovery analysis is
the use of "what if" analysis. These "what if" analyses are essentially
simulations, often used when there is no long-run history to draw from for
answers. "What if" analyses allow the user/analyst to incorporate their
extensive domain knowledge and beliefs. To continue our telecommunications
scenario, the company may consider selling a fraud detection alert service to
senior citizens that will prevent fraudulent activity associated with their
cellular accounts. How successful might this product be? The company's
analysts can run various what-if scenarios (varying price points, costs,
customer acceptance, etc.) to estimate the probability of successfully selling
this service. So, the combination of drawing upon long-run histories and
using powerful simulations reconciles what seemed to be fundamentally
incompatible notions about probability.
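A what-if analysis of this kind can be sketched as a small Monte Carlo
simulation in which the analyst's beliefs are expressed as ranges. Everything
in the sketch below is hypothetical: the customer count, the cost ranges, and
the assumption that acceptance falls linearly as the price approaches a
ceiling are stand-ins for real domain knowledge.

```python
import random


def probability_of_profit(price, n_customers=50_000, n_trials=10_000, seed=0):
    """Estimate the chance that the alert service turns a monthly profit
    at a given price, by simulating and then counting profitable trials.

    All ranges below are illustrative analyst beliefs, not real data.
    """
    rng = random.Random(seed)
    max_price = 10.0  # assumed price at which nobody would subscribe
    profitable = 0
    for _ in range(n_trials):
        base_acceptance = rng.uniform(0.05, 0.20)  # uptake if nearly free
        unit_cost = rng.uniform(1.0, 3.0)          # cost per subscriber
        fixed_cost = rng.uniform(5_000, 15_000)    # cost of running service
        acceptance = base_acceptance * max(0.0, 1.0 - price / max_price)
        subscribers = n_customers * acceptance
        profit = subscribers * (price - unit_cost) - fixed_cost
        if profit > 0:
            profitable += 1
    return profitable / n_trials


if __name__ == "__main__":
    for price in (2.0, 4.0, 6.0, 8.0):
        print(f"price {price:.2f}: P(profit) = {probability_of_profit(price):.3f}")
```

Changing any of the ranges is itself a what-if scenario: the analyst states a
belief, reruns the simulation, and counts the outcomes.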
It is also the case that relying solely on a long-run frequentist view can
be limiting. It cannot be used for modeling the probability of non-repeatable
events, because such events cannot be repeated over the long run. For
example, the probability that a marketing campaign succeeds (measured as the
proportion of customers buying) depends on various factors such as the
product, the business plan, the competition, the state of the economy, etc.
The notion of holding things constant and rolling back time repeatedly, then
counting the event's number of occurrences to determine a probability, is
neither meaningful nor possible. What is possible is the use
of simulations that can vary these factors. In general, simulations are a
powerful approach to modeling complex problems because they can reveal
interesting and surprising outcomes.
Simulations can quickly reveal results that would be very difficult to
derive from a purely theoretical (non-simulated) view. Parrondo's paradox is
an example of this. Parrondo's paradox points out the surprising pattern that
two games that each lose money in the long run can actually produce steady
winnings if played alternately. Simulations show that if a player plays only
game A or only game B 100 times, the player loses money on average. But by
alternating the games (or even randomly switching between them) simulations
routinely show that money accumulates into winnings. This works because of a
direct but subtle dependence between the two games: in the usual formulation,
the odds of one game depend on the player's current capital, which the other
game keeps changing. Unless this dependence is recognized, it is extremely
difficult to determine a winning strategy without the use of simulations.
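The effect is easy to reproduce by simulation. The article does not specify
the rules of the two games, so the sketch below assumes the commonly cited
coin-flip formulation: game A wins with probability 1/2 - e, and game B wins
with probability 1/10 - e when the player's capital is a multiple of 3 and
3/4 - e otherwise, with e = 0.005. The function names, the 200 plays, and the
5,000 trials are illustrative choices.

```python
import random

EPSILON = 0.005  # small bias that makes each game losing on its own


def play_game_a(capital, rng):
    """Game A: a slightly unfavorable coin flip for one unit."""
    return capital + (1 if rng.random() < 0.5 - EPSILON else -1)


def play_game_b(capital, rng):
    """Game B: the odds depend on whether capital is a multiple of 3."""
    p_win = (0.10 if capital % 3 == 0 else 0.75) - EPSILON
    return capital + (1 if rng.random() < p_win else -1)


def average_final_capital(strategy, n_plays=200, n_trials=5_000, seed=0):
    """Average capital after n_plays, starting from 0, over n_trials runs.

    strategy is "A" (only game A), "B" (only game B), or "random"
    (pick A or B with equal chance before every play).
    """
    if strategy not in ("A", "B", "random"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        capital = 0
        for _ in range(n_plays):
            if strategy == "A":
                game = play_game_a
            elif strategy == "B":
                game = play_game_b
            else:
                game = rng.choice((play_game_a, play_game_b))
            capital = game(capital, rng)
        total += capital
    return total / n_trials


if __name__ == "__main__":
    for strategy in ("A", "B", "random"):
        avg = average_final_capital(strategy)
        print(f"strategy {strategy}: average final capital {avg:+.2f}")
```

Under these assumed rules, each game played alone should drift downward on
average while the random mixture drifts upward. Individual runs still vary,
which is why the sketch averages over many trials.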
To conclude, in the context of data mining both the objectivist and the
subjective notions of probability are routinely employed in the analytical
process to discover surprising but important patterns. The subjective view is
thus really an extension of the objective perspective. And as Parrondo's
paradox illustrates, the use of simulations (popularized by subjective
approaches) can easily discover surprising patterns that point to a winning
strategy.
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com.