ON EXPECTING THE RARELY OCCURRING EVENT
by Ed Colet
Data mining tools and technologies are designed to find meaningful patterns
hidden in large amounts of data. In order for a pattern to be considered
"meaningful" or "interesting" it is usually evaluated in terms of some
standard or expectation -- such as a mathematically computed average, or an
expected probability. Most of us are familiar with these approaches. In
this column, I discuss an alternative and intriguing notion: using as the
standard for comparison an "average" that represents an event that will
rarely, if ever, occur.
Many data mining algorithms currently incorporate statistically based
approaches for analyzing numeric data. This often involves computing a
measure of central tendency in the data. This measure could be the
arithmetic average or mean. Depending on the nature of the distribution of
values in the data, another measure of central tendency might be used such
as the median or mode. The measure is then used as a standard and patterns
found in the data are compared to this standard. In a deviation detection
analysis, patterns that deviate from this standard are marked as
interesting. For example, if the average duration of a cellular phone call
made by a legitimate subscriber is known to be 1 minute and 42 seconds, then
calls that are very different from this average (e.g. 12 minutes) may
indicate fraudulent calls, and are marked as interesting.
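
The deviation check described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the call
durations are invented so that their mean matches the 1 minute 42 seconds
in the example, and the rule "more than three standard deviations from the
mean" is an assumed threshold, not one given in this column.

```python
from statistics import mean, stdev

# Hypothetical call durations in seconds for one legitimate subscriber.
# The values are invented for illustration; they are chosen so that the
# mean is 102 s (1 minute 42 seconds), as in the example in the text.
baseline_calls = [95, 110, 88, 102, 120, 97, 105, 99]


def is_interesting(duration, history, threshold=3.0):
    """Flag a call whose duration lies more than `threshold` sample
    standard deviations from the mean of `history`.

    The threshold of 3.0 is an assumption made for this sketch.
    """
    center = mean(history)
    spread = stdev(history)
    return abs(duration - center) / spread > threshold


# A 12-minute call (720 s) is far from the subscriber's average and is
# flagged; a 105-second call is not.
print(is_interesting(720, baseline_calls))
print(is_interesting(105, baseline_calls))
```

With skewed data, the median could be substituted for the mean as the
measure of central tendency, as noted earlier in this column.
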
Another statistically based method that is commonly used for comparing
patterns against a standard is to define the standard as the likelihood of
an event occurring by chance alone. If a pattern that is hidden in the
data is determined to have a very low likelihood of occurring simply by
chance, then it is marked as interesting. For example, the chance
probability that a visitor will place an order on a retailer's web site
might be determined to be once out of every six visits. Imagine that it is
discovered that some visitors make a purchase five times out of every six
visits. The likelihood of this occurring by chance alone is extremely low
(about 0.0006 for five purchases in a given six visits). Because this
probability is so low, the pattern is marked as statistically significant
and interesting.
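
One way to arrive at a figure of this size is the binomial distribution.
The sketch below assumes that each visit is an independent trial with a
1-in-6 chance of an order; the column does not say how the figure was
computed, so this model and the function names are assumptions made for
illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_tail(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


p_order = 1 / 6  # chance probability of an order on any one visit

# Exactly five orders in six visits: 6 * (1/6)^5 * (5/6) = 5/7776
exactly_five = binomial_pmf(5, 6, p_order)

# Five or more orders in six visits: 5/7776 + (1/6)^6 = 31/46656
at_least_five = binomial_tail(5, 6, p_order)

print(f"exactly five of six:  {exactly_five:.6f}")
print(f"at least five of six: {at_least_five:.6f}")
```

Under these assumptions the probability of exactly five orders is about
0.00064, which rounds to the 0.0006 quoted above; counting six orders out
of six as well raises it only slightly.
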
In contrast to the above approaches, a situation arose recently that
involved an intriguing way of defining an "average" as a standard for
comparison. The task was to develop a scoring and analysis system for a
particular aspect of a person's golf game. Driving a golf ball involves the
ability to hit it as far as one can down a fairway towards the green and the
hole. No golfer seriously expects the ball to go into the hole from a drive,
but every golfer has expectations about how far it will travel and where it
will land. It was proposed that drives be scored in relation to the
golfer's expectation using certain criteria (short, long, left, right) and
whether the golfer hit it better, worse or as expected. The expectation in
this case is the golfer's average drive. It turns out that an "average"
drive for a golfer is really an "ideal" drive -- one where the ball went as
expected. But this rarely happens, because the ball usually does not land
where the golfer wanted it to. A golfer asked about their average drive will
usually answer in terms of an ideal drive. (Thus, most golfers may have a
mistaken interpretation of "average" as applied to their golf game.)
Nevertheless, an "average" that rarely occurs can still serve as a valid and
useful standard for comparison.
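
A scoring scheme of this kind might look like the following sketch. The
labels (short, long, left, right) come from the proposal described above;
everything else is hypothetical: the `Drive` type, the tolerances, the
expected drive and the sample drives are all invented for illustration,
and the sketch does not attempt the better/worse judgment.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    distance: float  # yards travelled down the fairway
    offset: float    # yards from the target line; negative means left


def score_drive(actual, expected, dist_tol=10.0, side_tol=5.0):
    """Return labels describing how `actual` misses `expected`.

    An empty list means the drive went as expected. The tolerances
    are assumed values, not part of the original proposal.
    """
    labels = []
    if actual.distance < expected.distance - dist_tol:
        labels.append("short")
    elif actual.distance > expected.distance + dist_tol:
        labels.append("long")
    if actual.offset < expected.offset - side_tol:
        labels.append("left")
    elif actual.offset > expected.offset + side_tol:
        labels.append("right")
    return labels


# Hypothetical "average" (really ideal) drive and five actual drives.
expected = Drive(distance=250, offset=0)
drives = [
    Drive(231, -12),
    Drive(262, 3),
    Drive(248, 9),
    Drive(255, -2),
    Drive(238, 14),
]

scores = [score_drive(d, expected) for d in drives]
as_expected = sum(1 for s in scores if not s)

for drive, labels in zip(drives, scores):
    print(drive, labels or ["as expected"])
print(f"{as_expected} of {len(drives)} drives went as expected")
```

In this made-up sample only one drive in five matches the expectation,
which mirrors the point that the standard itself rarely occurs while
every deviation from it is still informative.
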
It turns out that this isn't an anomaly associated with some eccentricity
of golf; it occurs in certain day-to-day routines as well. To illustrate,
consider a friend who always orders duck whenever it's available on the
menu. This person is rarely satisfied with the meal, feeling that the duck
was overcooked, undercooked, over-seasoned, under-seasoned, or flawed in
some other way. There has never been a satisfactory duck meal. In this
case, the expectation is akin to the
Platonic Ideal for duck. What's interesting is that this person can vividly
recall the characteristics of every flawed duck meal from each restaurant.
Deviations from his standard remain memorable. As in golf, the standard of
comparison or the expectation is one that rarely occurs.
To return to the context of data mining, patterns are interesting
relative to a standard for comparison. The standard is commonly defined on a
statistical or mathematical basis. But relying solely on statistical
significance to define interesting patterns can yield patterns that the user
does not find useful in any practical way (e.g. they may represent
information that the user already knows). In other situations,
what an end-user may want to define as interesting and useful are patterns
that deviate from expectations that in actuality rarely occur. Taken to
its logical conclusion, this means that every occurrence is a deviation from
an expectation, and thus all occurrences become memorable or interesting.
What's necessary is a way for data mining tools to incorporate both
statistical and user-defined notions of an expectation, so that the patterns
presented to the user are truly interesting and useful.
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com