MAKING SURPRISING EVENTS LESS SURPRISING
by Ed Colet
Every once in a while a turn of events surprises us. Hindsight being 20/20,
we are often able to come up with a way to account for the turn of events.
So although events may have been predictable, they surprise us because no
one had thought to predict the particular outcome. In this column, I discuss
why certain predictable events aren't predicted, and how data mining can
help us better predict what was thought to be unpredictable. With better
prediction, surprising events ultimately become less surprising.
Surprising events regularly occur across a variety of domains, including
politics, financial markets, sports, and business practices, to name just a few.
The closeness of the popular vote in the state of Florida separating Gore
from Bush caught laypersons as well as political pundits by surprise. Even
George Bush and Al Gore along with their respective campaign staffs are
likely to have been surprised. In finance, the skyrocketing prices and
astronomical returns of technology stocks a year or two ago, followed by an
equally rapid decline, were a surprise. In sports, the upset win of the US
hockey team over the Soviet Union in the 1980 Winter Olympics was a surprise.
And the now famous data mining association pattern of sales between "beer
and diapers" is often cited as a surprising event.
Each of these events was followed by plausible accounts of why it occurred.
Flawed voter ballots, over-valued stocks in an Internet economy, the power
of motivation on one team coupled with complacency on the other, and the
effect of fatherhood have all been cited post hoc as important factors in
the respective events.
By definition, if these outcomes were predicted a priori by current modeling
approaches, and the predictions taken seriously, then these outcomes would
not be considered surprising. But if we can come up with plausible and
sophisticated accounts for the event, we could have predicted it -- but did
not. Why not? There are at least three reasons why a predictable event is
surprising.
One reason is that relevant data are not analyzed, or not available for
analysis. For example, one could argue that had the butterfly ballot used
in Florida been subjected to user testing, and the results of such
user-tests been analyzed, then the difficulties associated with the ballot
would have been apparent. Ultimately, an adverse effect on the distribution
of votes could have been avoided. Instead, we have an unprecedented and
possibly flawed count.
A second reason is that while some models may have predicted the outcome,
the prediction is not treated with a high degree of confidence and is thus
ignored. For example, there must have been particular betting odds that the
US hockey team would beat the Soviets in 1980. This implies that there
existed a model that gave a US win some chance. But few if any people were willing
to bet on those odds. In fact, US TV networks didn't even bother to show
the event live given the slim chance of a US win. The likelihood of a win
seemed to be outside expectations and beyond the realm of possibility. As
we know, we were then surprised by what actually occurred.
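To make the link between betting odds and an underlying prediction concrete, here is a minimal sketch of how odds against an outcome translate into an implied probability. The 9-to-1 figure is purely hypothetical and is not the actual 1980 line; the function name is also just illustrative.

```python
from fractions import Fraction


def implied_probability(odds_against: Fraction) -> Fraction:
    """Convert fractional odds against an outcome (e.g. 9/1) to a probability.

    Odds of a-to-1 against mean the outcome is expected once in (a + 1) tries.
    """
    return 1 / (1 + odds_against)


# Hypothetical odds, chosen only for illustration.
hypothetical_odds = Fraction(9, 1)
p_upset = implied_probability(hypothetical_odds)

print(f"Implied probability of the upset: {float(p_upset):.2f}")
```

Even long odds correspond to a probability well above zero. The prediction exists; it is simply discounted.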
A third reason that traditional models do not predict surprising events is
that traditional techniques may not be sensitive or powerful enough to make
accurate predictions. The pattern of sales of beer being linked to sales of
diapers is an example. Retailers have been analyzing sales of products for
years, yet it was only with a new analytical approach (data mining with
association rules) that surprising patterns buried within retail sales data
became apparent.
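As a rough illustration of what an association rule measures, the sketch below computes support, confidence, and lift for the rule "diapers -> beer". The six baskets are invented for this example and are not real retail data. It shows only the scoring of a single rule, not a full rule-mining algorithm.

```python
# Made-up market baskets, for illustration only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(antecedent | consequent, baskets)
    confidence = s_both / support(antecedent, baskets)
    lift = confidence / support(consequent, baskets)
    return s_both, confidence, lift


s, conf, lift = rule_metrics({"diapers"}, {"beer"}, transactions)
print(f"support={s:.2f} confidence={conf:.2f} lift={lift:.3f}")
```

A lift above 1 means the two items appear together more often than they would if they were purchased independently, which is the kind of pattern a data mining report would flag.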
In contrast to data mining, traditional analytical techniques often require
certain assumptions and apply certain mathematical operations to the data.
For example, independent observations are often assumed by formal
statistical tests. Data of past activity is the fodder used to predict
future events. Yet the data of past behaviors might really be a sequence of
subtly related and inter-dependent events rather than the independent
observations that statistical tests assume. And the various smoothing
operations applied to data by analytical techniques to smooth out noise and
variance may unfortunately also be smoothing out the key to more accurate
predictions. As chaos theorists would argue, it is the random perturbations
and the details that are important.
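A toy example shows how smoothing can erase exactly the detail that matters. The series below is invented: it is flat except for a single spike, and a simple moving average (one of many possible smoothing operations, chosen here only for illustration) flattens the spike completely.

```python
def moving_average(values, window):
    """Simple moving average over each full window of the series."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Invented series: steady activity with one unusual event in the middle.
series = [10, 10, 10, 10, 40, 10, 10, 10, 10]
smoothed = moving_average(series, window=5)

print("raw max:     ", max(series))
print("smoothed max:", max(smoothed))
print("smoothed:    ", smoothed)
```

Because every five-point window in this series contains the spike, the smoothed series comes out perfectly flat, and the one unusual event is no longer visible.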
Data mining as an analytical approach is not necessarily subject to these
limitations. First, scalability and the analysis of large data sets are
defining characteristics of data mining, so all data, whether thought to be
relevant or not, can be analyzed. Second, interesting patterns or
surprising predictions are exactly what data mining reports flag,
encouraging further analysis and attention to what may appear to be an
unlikely outcome. Third, new algorithms that build on and extend
traditional statistical analysis can find and expose new and surprising
patterns. Analyzing more data and drawing attention to surprising results
through the use of sophisticated algorithms leads to better predictive
models. Implementing data mining can make a surprising turn of events not
so surprising after all.
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com.