ON CHANCE OCCURRENCES AND FALSE COINCIDENCES
by Ed Colet
Data mining technologies are designed to find potentially meaningful patterns
buried deep within large amounts of data. In today's electronic age it is
becoming easier to store more and more facts and information with each
transaction. One consequence is that rare coincidences are likely to appear,
which raises a question: how does one determine whether such a rare event is
truly meaningful?
When there are numerous attributes stored in large datasets, the probability
of an association, correlation, or co-occurrence between any two attributes
can be quite low, assuming that they are independent. Because the probability
of a particular event occurring is so low, it is tempting to conclude that when
this event (i.e. an association) does occur, it must be meaningful.
What is overlooked is that in a large dataset of numerous attributes, while
the probability of any single specific event occurring is low, the probability
that some events of this type will occur is actually quite high. In the
process of investigating these seemingly interesting patterns the facts can be
woven into a strikingly suggestive theory.
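The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not from the article: every pair of attributes is treated as an independent trial, and each pair has the same small probability (the hypothetical p_single) of showing a chance association.

```python
from math import comb


def prob_at_least_one(num_attributes: int, p_single: float) -> float:
    """Probability that at least one attribute pair shows a chance association.

    Assumes (for illustration only) that all pairs are independent and that
    each pair has the same probability p_single of a coincidental match.
    """
    num_pairs = comb(num_attributes, 2)  # number of distinct attribute pairs
    # Complement rule: 1 - P(no pair shows a coincidence)
    return 1.0 - (1.0 - p_single) ** num_pairs


if __name__ == "__main__":
    # A 1-in-1000 coincidence per pair, for growing numbers of attributes
    for n in (10, 50, 100):
        print(n, comb(n, 2), round(prob_at_least_one(n, 0.001), 3))
```

With only a handful of attributes the chance of seeing any coincidence stays small, but because the number of pairs grows roughly with the square of the number of attributes, the chance of seeing at least one rises quickly toward certainty.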
To illustrate, imagine that many facts about Presidents Lincoln and Kennedy
were stored in a database. Then the following coincidences can be observed as
noted by the mathematician John Allen Paulos: "Lincoln was elected President
in 1860, Kennedy in 1960. Their names both consist of seven letters. Lincoln
had a secretary named Kennedy and Kennedy had one named Lincoln. Lincoln and
Kennedy were assassinated by John Wilkes Booth and (allegedly) Lee Harvey
Oswald respectively, men who went by three names and who advocated unpopular
political positions. Booth shot Lincoln in a theater and fled to a warehouse;
Oswald shot Kennedy from a warehouse and fled to a theater."
Is there anything truly significant about this observation? No (unless one
subscribes to a conspiracy theory). So it is important that such apparent
similarities be viewed as entertaining coincidences, rather than meaningful
and important information. There are at least two approaches that help with
this. The first is formal statistical analysis, which brings rigor to the
analytical process by accounting for the growing probability of spurious
correlations as the size of the dataset and/or the number of attributes and
facts increases -- so that such spurious correlations usually fall below the
threshold for statistical significance. The drawback is that formal analysis
sometimes also eliminates truly interesting patterns.
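One common way to make the first approach concrete is a Bonferroni correction, which tightens the significance threshold as the number of tests grows. The article does not name a specific method, so the sketch below is an assumption chosen for illustration; the function names and the p-values are hypothetical.

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / num_tests


def significant(p_values, alpha=0.05):
    """Flag which p-values stay significant once all tests are accounted for."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values from 100 pairwise tests (made up for illustration):
    # one very strong pattern, one moderate pattern, and 98 unremarkable results.
    p_values = [0.0001, 0.01] + [0.5] * 98
    flags = significant(p_values)
    print("corrected threshold:", bonferroni_threshold(0.05, len(p_values)))
    print("patterns kept:", sum(flags))
```

In this made-up example the moderate pattern (p = 0.01) would pass an uncorrected 0.05 threshold but is discarded after correction, which is the trade-off noted above: stricter thresholds suppress spurious patterns at the cost of sometimes dropping real ones.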
A second approach is to rely upon context brought to the table by a domain
expert to determine whether a pattern is potentially meaningful. The
knowledge of the domain expert is usually the key to knowing if a pattern is
truly important or spurious. A data-mining algorithm often treats many facts
as equally relevant, whereas a domain expert can provide the knowledge needed
to determine which facts are important and meaningful. Returning to
our example of the assassinated Presidents, the number of letters in a
person's name, the names of the secretaries, etc. are not important -- i.e. do
not have any intrinsic significance. Therefore, it is easy for a domain
expert to interpret and see the above coincidence as just that -- a
coincidence.
In conclusion, coincidences and patterns are likely to arise from large
datasets. Statistical formalisms offer some protection against such
coincidences reaching statistical significance. But it is just as important
for a domain expert
to provide the expertise and contextual knowledge to determine which facts are
important enough to be truly meaningful.
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com.