
GOLDEN MEANS: INTERPRETING BIZARRE PATTERNS IN DATA MINING RESULTS
by Inderpal Bhandari, executive editor at large


I have often been asked what should be done when a data mining exercise produces a really weird pattern. An anecdote that has been making the rounds by e-mail has made this question much easier for me to answer. I paraphrase it below.

A complaint was received by a division of a major automobile manufacturer: "I write to you again, and I don't blame you for ignoring my first letter, because I am in an unusual situation. Our family has a tradition: we eat ice cream for dessert every night. But the kind of ice cream varies; we vote every night on the flavor we should have. It then falls upon me to drive to the store to get it.

"Recently, I purchased a new car from your company and since then my nightly dash for dessert has turned sour, albeit only once in a while. Specifically, every time I buy vanilla ice cream, my car does not start for the trip back home. If I get any other flavor, the car starts right away. What is it about your car and ice cream? I ask you in all earnestness."

A troubleshooter was dispatched to check this out. Although he expected a wild goose chase, he found that the complainant was very serious about his predicament. It was clearly not a frivolous complaint. After dinner, he decided to accompany the man to the ice cream store. The vote was for vanilla that night and, sure enough, after they completed their purchase and returned to the car, it did not start easily.

The troubleshooter returned for three more nights. The first night, chocolate. The car started right up. The second night, strawberry. No problem again. The third night, vanilla. The car did not start up. The troubleshooter decided that the choice of ice cream could not be the problem. He began to take notes, jotting down all sorts of data: time of day, type of gas used, time to drive back and forth, and so on.

He observed that it took considerably less time to buy vanilla than any other flavor. Vanilla, being the most popular flavor, was in a separate case at the front of the store to facilitate a quick pickup. All the other flavors were kept in the back of the store at a different counter where it took considerably longer to find the flavor and purchase it. Now the question became why the car wouldn't start if the visit was short.

Once the duration of visit became the problem -- not the vanilla ice cream -- the troubleshooter quickly came up with the answer: vapor lock. It was happening every night, but the extra time taken to get the other flavors allowed the engine to cool down sufficiently to start. When the flavor was vanilla, the engine was still too hot for the vapor lock to dissipate.

As is often the case with e-mail, I cannot vouch for the accuracy of the above story. Such doubts notwithstanding, I am indebted to its author, for it helps explain what should be done when weird patterns surface, as they sometimes do, during the course of a data mining exercise. The solution is to collect additional data and repeat the analysis.

Strange patterns that persist are often the result of hidden causes that are not directly captured by the data but are in fact being captured fortuitously by an attribute that serves as a surrogate. Thus, the flavor of ice cream could act as a surrogate for the time elapsed during the purchase of ice cream on account of the layout of the store.
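The surrogate effect can be made concrete with a small sketch. The trip log below is entirely hypothetical (the anecdote reports no numbers), and the five-minute cutoff separating a "short" visit from a "long" one is an assumption chosen only for illustration. Grouping the same records by flavor and by visit length yields the same split, which is what it means for flavor to stand in for time elapsed.

```python
from collections import defaultdict

# Hypothetical trip log: (flavor, minutes in store, car started).
# All values are invented for illustration.
TRIPS = [
    ("vanilla", 2, False),
    ("chocolate", 9, True),
    ("strawberry", 8, True),
    ("vanilla", 3, False),
    ("chocolate", 10, True),
    ("vanilla", 2, False),
]

SHORT_VISIT_MINUTES = 5  # assumed cutoff, not taken from the anecdote


def failure_rate_by(trips, key):
    """Return {group: fraction of trips in that group where the car failed to start}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [failures, total]
    for trip in trips:
        group = key(trip)
        started = trip[2]
        counts[group][0] += 0 if started else 1
        counts[group][1] += 1
    return {group: fails / total for group, (fails, total) in counts.items()}


# The "weird" pattern: failure rate grouped by flavor.
by_flavor = failure_rate_by(TRIPS, key=lambda t: t[0])

# The same records grouped by the hidden cause: length of the visit.
by_visit = failure_rate_by(
    TRIPS, key=lambda t: "short" if t[1] < SHORT_VISIT_MINUTES else "long"
)

print(by_flavor)
print(by_visit)
```

In this made-up log every vanilla trip is also a short trip, so the two groupings cannot be told apart from the data alone.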

However, in order to interpret a pattern, one relies on the meanings of the labels of attributes. A surrogate changes the meaning entirely, making the pattern weird, i.e., unintelligible and not amenable to interpretation. Collection of additional data, especially if it directly captures the hitherto-hidden cause, solves the problem.
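As a rough sketch of why additional data helps, suppose the hidden cause is recorded directly and a few more trips are observed in which flavor and visit length no longer coincide. Everything below is hypothetical: the records, the five-minute cutoff, and the two candidate rules are assumptions made for illustration, not part of the anecdote.

```python
# Hypothetical trips recorded after the troubleshooter starts timing each visit:
# (flavor, minutes in store, car started). All values are invented.
TRIPS_WITH_TIMING = [
    ("vanilla", 2, False),
    ("chocolate", 9, True),
    ("strawberry", 8, True),
    ("vanilla", 3, False),
    # Additional observations that separate flavor from visit length:
    ("vanilla", 8, True),     # assumed long queue at the front case
    ("chocolate", 3, False),  # assumed unusually quick purchase
]

SHORT_VISIT_MINUTES = 5  # assumed cutoff


def rule_accuracy(trips, predicts_failure):
    """Fraction of trips where the rule's failure prediction matches what happened."""
    correct = sum(
        1 for trip in trips if predicts_failure(trip) == (not trip[2])
    )
    return correct / len(trips)


# Rule read off the weird pattern: "the car fails when the flavor is vanilla".
flavor_rule = rule_accuracy(TRIPS_WITH_TIMING, lambda t: t[0] == "vanilla")

# Rule based on the directly captured cause: "the car fails after a short visit".
duration_rule = rule_accuracy(
    TRIPS_WITH_TIMING, lambda t: t[1] < SHORT_VISIT_MINUTES
)

print(flavor_rule, duration_rule)
```

On these invented records the duration rule accounts for every trip while the flavor rule misses the two new ones, and the pattern now carries a label whose meaning can be interpreted.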
---
Inderpal Bhandari can be reached via http://www.virtualgold.com

