ON CHANCE OCCURRENCES AND FALSE COINCIDENCES
by Ed Colet

Data mining technologies are designed to find potentially meaningful patterns buried deep within large amounts of data. In today's electronic age it is becoming easier to store more and more facts and information with each transaction. One consequence is that rare coincidences are likely to appear in the data. How, then, does one determine whether a rare event is truly meaningful?

When numerous attributes are stored in a large dataset, the probability of an association, correlation, or co-occurrence between any two particular attributes can be quite low, assuming that they are independent. Because the probability of a particular event occurring is so low, it is tempting to think that when this event (i.e. an association) does occur, it must be meaningful.

What is overlooked is that in a large dataset of numerous attributes, while the probability of any single specific event occurring is low, the probability that some events of this type will occur is actually quite high. In the process of investigating these seemingly interesting patterns the facts can be woven into a strikingly suggestive theory.
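The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the article: every pair of attributes is treated as an independent trial, and each pair is given the same small probability `p_single` of showing a purely coincidental association. The function name and the numbers used are hypothetical.

```python
from math import comb


def prob_at_least_one(n_attributes: int, p_single: float) -> float:
    """Chance that at least one attribute pair shows a coincidental association.

    Assumes (for illustration only) that all pairs are independent and that
    each pair has the same probability p_single of a chance association.
    """
    n_pairs = comb(n_attributes, 2)           # number of distinct attribute pairs
    return 1.0 - (1.0 - p_single) ** n_pairs  # complement of "no pair matches"


if __name__ == "__main__":
    # Hypothetical attribute counts; p_single = 0.01 is an assumed value.
    for n in (2, 10, 50, 200):
        print(n, comb(n, 2), round(prob_at_least_one(n, 0.01), 4))
```

With only two attributes there is a single pair, so the result is just `p_single`. As the number of attributes grows, the number of pairs grows roughly with its square, and the chance of finding at least one coincidence approaches certainty.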

To illustrate, imagine that many facts about Presidents Lincoln and Kennedy were stored in a database. Then the following coincidences can be observed, as noted by the mathematician John Allen Paulos: "Lincoln was elected President in 1860, Kennedy in 1960. Their names both consist of seven letters. Lincoln had a secretary named Kennedy and Kennedy had one named Lincoln. Lincoln and Kennedy were assassinated by John Wilkes Booth and (allegedly) Lee Harvey Oswald respectively, men who went by three names and who advocated unpopular political positions. Booth shot Lincoln in a theater and fled to a warehouse; Oswald shot Kennedy from a warehouse and fled to a theater."

Is there anything truly significant about these observations? No (unless one subscribes to a conspiracy theory). So it is important that such apparent similarities be viewed as entertaining coincidences rather than as meaningful and important information. There are at least two approaches that help with this. The first is formal statistical analysis, which adds rigor to the analytical process by taking into account how the probability of spurious correlations grows as the size of the dataset and the number of attributes and facts increase -- once this is accounted for, such spurious correlations usually fall below the threshold for statistical significance. The problem is that formal analysis sometimes also eliminates truly interesting patterns.
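One common way to build this kind of protection into an analysis is the Bonferroni correction, which divides the significance level by the number of tests performed. The article does not name a specific method, so this is a swapped-in example rather than the author's approach; the function name and the p-values below are hypothetical.

```python
from typing import List


def bonferroni_significant(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag which tests remain significant after a Bonferroni correction.

    Each p-value is compared against alpha divided by the number of tests,
    so the bar gets stricter as more associations are examined.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # stricter per-test threshold
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values from three tested associations.
    print(bonferroni_significant([0.001, 0.02, 0.04]))
```

In this sketch, the associations with p-values of 0.02 and 0.04 would pass an uncorrected 0.05 threshold but are rejected after correction. If either of them reflected a real effect, it would be lost, which is the trade-off noted above.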

A second approach is to rely upon the context that a domain expert brings to the table to determine whether a pattern is potentially meaningful. The knowledge of the domain expert is usually the key to knowing whether a pattern is truly important or spurious. A data-mining algorithm often treats many facts as equally important, but a domain expert can provide the knowledge needed to determine which facts actually matter. Returning to our example of the assassinated Presidents, the number of letters in a person's name, the names of the secretaries, etc. are not important -- i.e. they do not have any intrinsic significance. Therefore, it is easy for a domain expert to see the above coincidences as just that -- coincidences.

In conclusion, coincidences and spurious patterns are likely to arise in large datasets. Statistical formalisms offer some protection by keeping most of them from reaching statistical significance. But it is just as important for a domain expert to provide the expertise and contextual knowledge needed to determine which facts are important enough to be truly meaningful.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com.
