ON EXPECTING THE RARELY OCCURRING EVENT
by Ed Colet

Data mining tools and technologies are designed to find meaningful patterns hidden in large amounts of data. For a pattern to be considered "meaningful" or "interesting", it is usually evaluated against some standard or expectation -- such as a mathematically computed average, or an expected probability. Most of us are familiar with these approaches. In this column, I discuss an intriguing alternative: using as the standard for comparison an "average" that represents an event that will rarely, if ever, occur.

Many data mining algorithms currently incorporate statistically based approaches for analyzing numeric data. This often involves computing a measure of central tendency in the data, such as the arithmetic average or mean. Depending on the distribution of values in the data, another measure of central tendency, such as the median or mode, might be used instead. The measure then serves as a standard, and patterns found in the data are compared to it. In a deviation detection analysis, patterns that deviate from this standard are marked as interesting. For example, if the average duration of a cellular phone call made by a legitimate subscriber is known to be 1 minute and 42 seconds, then calls that are very different from this average (e.g. 12 minutes) may indicate fraud, and are marked as interesting.
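The deviation detection idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample durations, and the rule "more than three standard deviations from the mean" are all hypothetical choices, not something the approaches described here prescribe. The standard is built from calls known to be legitimate, so that an extreme call cannot distort the average it is compared against.

```python
from statistics import mean, stdev

def build_standard(legitimate_durations):
    """Summarize known-legitimate call durations (in seconds) as (mean, stdev)."""
    return mean(legitimate_durations), stdev(legitimate_durations)

def flag_deviations(durations, standard, threshold=3.0):
    """Return the durations lying more than `threshold` standard deviations
    from the mean of the standard. The threshold of 3.0 is an assumption."""
    center, spread = standard
    return [d for d in durations if abs(d - center) > threshold * spread]

# Hypothetical durations of legitimate calls; their mean is 102 s (1 min 42 s).
legitimate = [95, 102, 110, 98, 105, 100, 104]
standard = build_standard(legitimate)

# New calls to screen; 720 s is the 12-minute call from the example.
new_calls = [101, 720, 97, 110]
interesting = flag_deviations(new_calls, standard)
```

With these made-up numbers only the 12-minute call is marked as interesting. A median-based standard could be swapped in where the distribution of durations is skewed.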

Another statistically based method that is commonly used for comparing patterns against a standard is to define the standard as the likelihood of an event occurring by chance alone. If a pattern hidden in the data is determined to have a very low likelihood of occurring simply by chance, then it is marked as interesting. For example, the chance probability that a visitor will place an order on a retailer's web site might be determined to be one in every six visits. Imagine that it is discovered that some visitors make a purchase five times out of every six visits. The likelihood of this occurring by chance alone is extremely low (about 0.0006 for exactly five purchases in six visits, assuming the visits are independent). Because this probability is so low, the pattern is marked as statistically significant and interesting.
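The 0.0006 figure can be reproduced with a short calculation. The sketch below assumes a binomial model, that is, independent visits that each have the same one-in-six chance of a purchase; the function names are illustrative only.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_tail(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

p_purchase = 1 / 6  # chance probability of an order on any one visit

exactly_five = binomial_pmf(5, 6, p_purchase)    # 5/7776, about 0.00064
five_or_more = binomial_tail(5, 6, p_purchase)   # 31/46656, about 0.00066
```

Whether one counts exactly five purchases or five or more, the chance probability is well under one in a thousand, which is why the pattern stands out.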

In contrast to the above approaches, a recent situation involved an intriguing way of defining an "average" as a standard for comparison. The task was to develop a scoring and analysis system for a particular aspect of a person's golf game: the drive. Driving a golf ball means hitting it as far as one can down the fairway towards the green and the hole. No golfer seriously expects a drive to go into the hole, but every golfer has expectations about how far the ball will travel and where it will land. It was proposed that drives be scored in relation to the golfer's expectation, using certain criteria (short, long, left, right) and whether the golfer hit the ball better than, worse than, or as expected. The expectation in this case is the golfer's average drive. It turns out that an "average" drive for a golfer is really an "ideal" drive -- one where the ball went as expected. For the most part this rarely happens, because the ball usually does not land where the golfer wanted it to. Golfers asked about their average drive usually answer in terms of an ideal drive. (Thus, most golfers may have a mistaken interpretation of "average" as applied to their golf game.) Nevertheless, an "average" that rarely occurs can still serve as a valid and useful standard for comparison.
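As a rough illustration of scoring a drive against the golfer's own expectation, consider the sketch below. It is not the scoring system that was proposed: the function name, the units (yards), the sign convention for lateral offset (negative means left of the intended line), and the tolerance values are all assumptions made for the example.

```python
def score_drive(distance, lateral, expected_distance,
                distance_tol=10.0, lateral_tol=5.0):
    """Label a drive relative to the golfer's expected ("ideal") drive.

    distance          -- how far the ball travelled, in yards
    lateral           -- offset from the intended line, in yards
                         (negative = left, positive = right; an assumption)
    expected_distance -- the distance the golfer expects to hit
    """
    labels = []
    if distance < expected_distance - distance_tol:
        labels.append("short")
    elif distance > expected_distance + distance_tol:
        labels.append("long")
    if lateral < -lateral_tol:
        labels.append("left")
    elif lateral > lateral_tol:
        labels.append("right")
    # A drive within both tolerances matches the expectation.
    return labels or ["as expected"]

missed = score_drive(230, -12, expected_distance=250)  # ["short", "left"]
ideal = score_drive(252, 2, expected_distance=250)     # ["as expected"]
```

The tighter the tolerances, the more rarely a drive is scored "as expected", which mirrors the observation that the golfer's "average" is an ideal that seldom occurs.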

It turns out that this isn't an anomaly associated with the eccentricities of golf; it occurs in certain day-to-day routines as well. To illustrate, consider a friend who always selects duck whenever it's available on the menu. This person is rarely satisfied with the meal, feeling that the duck was overcooked, undercooked, over-seasoned, under-seasoned, or otherwise flawed. There has never been a satisfactory duck meal. In this case, the expectation is akin to the Platonic Ideal for duck. What's interesting is that this person can vividly recall the characteristics of every flawed duck meal from each restaurant. Deviations from his standard remain memorable. As in golf, the standard of comparison, or the expectation, is one that rarely occurs.

To return to the context of data mining, patterns are interesting relative to a standard for comparison. A statistical or mathematical basis for defining that standard is what is commonly implemented. But relying solely on statistical significance to define interesting patterns can yield patterns that the user does not find useful in any practical way (e.g. they may represent information that the user already knows). In other situations, what an end-user wants to define as interesting and useful are patterns that deviate from expectations that, in actuality, rarely occur. If one takes this to its logical conclusion, every occurrence will be a deviation from the expectation, and thus all occurrences become memorable or interesting. What is needed is a way for data mining tools to incorporate both statistical and user-defined notions of an expectation, so that the patterns presented to the user are truly interesting and useful.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com
