
GETTING DIFFERENT ANSWERS TO THE SAME BUSINESS QUESTIONS - PART I
By Ed Colet


In a data mining engagement, a business organization will formulate a business question of strategic importance that it hopes can be resolved by finding answers buried within its databases. Opening the databases to analysis will result in the discovery of many different potential answers -- not all of which are consistent or agree with each other. In Part I of this series, I address how and why different analytical approaches to the same data can yield different results. Part II will appear next week and address how the business user can decide which of the many discovered results are likely to be truly useful, and thus provide the highest return on the investment in data analysis.

Different analytic approaches can yield different results. Two broad approaches in common use are (1) conventional statistical data analysis and (2) newer data mining technologies.

The conventional inferential statistical approach can be characterized as follows. Given the business question and the characteristics of the data available for analysis, certain statistical analyses are known to be appropriate and valid. The results of the analysis are then interpreted to resolve the business question, and the appropriate business actions are taken.

For example, a business question may be to predict whether a person will cancel or retain his/her insurance policy. A multiple logistic regression analysis is an appropriate way to predict a cancel/non-cancel outcome from several other measured attributes (interval or categorical). The final logistic regression model and its coefficients can then shed light on which attributes are predictive of canceling or renewing a policy. More specifically, a particular hypothesis that cancellations and renewals differ with respect to the level of another attribute in the data (e.g., number of claims) can be explicitly tested using well-established statistical methods (e.g., a t-test).
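To make the hypothesis test concrete, here is a minimal sketch of a two-sample t-test (Welch's version, which does not assume equal variances) comparing the number of claims for cancelled and renewed policies. The claim counts are invented purely for illustration, and the function name `welch_t` is my own; nothing here comes from a real engagement.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical data: claims filed per policyholder (made up for illustration).
claims_cancelled = [3, 4, 2, 5, 4, 3, 6, 4]
claims_renewed = [1, 2, 0, 1, 3, 2, 1, 2]

def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # Squared standard error of each group mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, dof = welch_t(claims_cancelled, claims_renewed)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A t statistic far from zero suggests that the two groups differ in their mean number of claims. Turning it into a p-value requires the t distribution, which a statistics package supplies (for instance, `scipy.stats.ttest_ind` with `equal_var=False` performs the same test).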

The assumptions behind statistical techniques are well known because they are based on the mathematics of probability theory and the study of sampling distributions. The use of a statistical technique can be judged appropriate or not on the basis of both (a) whether the underlying assumptions are adhered to and (b) whether decisions made in the course of analysis that can affect the results are justified (e.g., changing the criterion for entry of independent variables into the model to a non-conventional value). The process is well formed. But because the process is so well formed (perhaps even constrained), interesting trends may be missed, either because a misguided hypothesis was formed or because attributes in the data were excluded from the model. Diagnostic procedures can determine whether the statistical model fits well, but they provide little guidance about what subtle patterns may have been missed.

Data mining technology is characterized by newer, recently developed analytical techniques. But it is often not entirely clear (at least not to the degree of certainty that exists in the statistical community) which data mining technique to apply to a specific business question. And given the complexity of some business issues, the choice of a "tried and true" statistical method may not be obvious either. General techniques do exist for general classes of problems -- classifying high vs. low spenders is a different task from finding associations among the items that these high vs. low spenders buy. But even within a classification task, several different approaches are available, and they may produce different results. An unfortunate consequence is that a neural network classifier may not classify records in the same way that a decision tree classifier would.
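The disagreement between classifiers can be seen even in a toy setting. The sketch below trains two deliberately simplified stand-ins on the same records: a one-split decision "stump" in place of a full decision tree, and a single perceptron in place of a full neural network. The customer records, the feature meanings, and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical customer records: (visits per month, average basket size).
# Label 1 = high spender, 0 = low spender. All values are made up.
records = [(2, 3), (8, 1), (1, 8), (4, 4), (9, 3), (3, 9), (6, 6), (7, 5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

def train_stump(xs, ys):
    """Fit a one-split decision tree: the best single-feature threshold."""
    best = None  # (errors, feature index, threshold, label predicted above it)
    for f in range(len(xs[0])):
        values = sorted({x[f] for x in xs})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # candidate split halfway between neighbors
            for above in (1, 0):
                errors = sum((above if x[f] > thr else 1 - above) != y
                             for x, y in zip(xs, ys))
                if best is None or errors < best[0]:
                    best = (errors, f, thr, above)
    _, f, thr, above = best
    return lambda x: above if x[f] > thr else 1 - above

def train_perceptron(xs, ys, max_epochs=10_000):
    """Fit a single perceptron with the classic mistake-driven update rule."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            target = 1 if y == 1 else -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if target * score <= 0:  # misclassified: nudge the boundary
                w = [wi + target * xi for wi, xi in zip(w, x)]
                b += target
                mistakes += 1
        if mistakes == 0:  # every training record is classified correctly
            break
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

stump = train_stump(records, labels)
perceptron = train_perceptron(records, labels)

for x, y in zip(records, labels):
    s, p = stump(x), perceptron(x)
    flag = "" if s == p else "  <-- classifiers disagree"
    print(f"{x}: actual={y} stump={s} perceptron={p}{flag}")
```

In this made-up data the high spenders are those whose two attributes are jointly large, so a boundary that cuts diagonally across both attributes separates the classes, while any single-attribute split cannot. The two methods therefore assign different classes to some of the very same records, even though both were given identical data.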

Data mining algorithms, unlike conventional statistical methods, are optimized to analyze large data sets, and thus subtle trends are more likely to become apparent. But as we've seen, the subtle patterns output by one data mining method may differ from those output by another. So our business user is faced with different answers to the same business question. Next week I'll discuss how this dilemma can be resolved.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

