
DATA MINING AS A USER-DIRECTED PROCESS
By Ed Colet


Data mining is often portrayed as a purely automated process of discovering patterns in large amounts of data. Automation is seen as the way to handle data comprehensively when there are simply too many variables and too many records for a user to "manually" find out which combinations are most important and informative. But one objection to data mining is that data analysis should not be "data-driven"; it should instead be directed by questions and theories that are then tested using the data. Automated analysis driven purely by the data can lead to the discovery of spurious effects that are due to chance but may be erroneously interpreted as meaningful. Another objection is that it shouldn't be necessary to analyze terabytes of data to find answers when sampling methods can extract a smaller but representative portion of the data that can provide them (although many data mining algorithms do rely on sampling). In this column, I'll argue that in practice data mining is not a purely data-driven, automated analysis. Much of it is a user-directed process.
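The objection about spurious effects can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings (20 records, 200 fields, all pure random noise; the function names and parameters are hypothetical and not part of any data mining product): it searches many unrelated fields for the one most correlated with a target.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def strongest_chance_correlation(n_records=20, n_fields=200, seed=0):
    """Compare the first noise field with the best of many noise fields.

    Returns (|r| of the first field, largest |r| over all fields), where
    every field and the target are independent random noise.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_records)]
    correlations = []
    for _ in range(n_fields):
        field = [rng.gauss(0, 1) for _ in range(n_records)]
        correlations.append(abs(pearson(field, target)))
    return correlations[0], max(correlations)

first_field, best_field = strongest_chance_correlation()
print(f"first field examined: |r| = {first_field:.2f}")
print(f"best of 200 fields:   |r| = {best_field:.2f}")
```

Every field here is noise, yet the best of many fields will usually look far more strongly related to the target than a single field chosen in advance. That is the risk of an unguided search, and it is why a discovered pattern should be checked against a question or theory and verified on separate data.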

In a data mining engagement there is a close collaboration between a domain expert, a quantitative analyst and a technology expert. The domain expert understands the business issues, the analyst's expertise may be in statistics, and the technology person understands databases and data access. Depending on the data mining implementation, these people can be from separate companies or in-house within the same organization, and in some cases a single person embodies all three roles. A commonly cited rule of thumb is that roughly 80% of the resources in an engagement are devoted to data pre-processing and other activities prior to any mining or analysis. It is during data pre-processing that much of the analysis is actually directed by the user(s). Without careful collaboration among these parties during the pre-processing work, one is liable to wind up with a situation of "garbage in, garbage out".

Pre-processing is largely a matter of understanding the business issues as they pertain to the data and its characteristics. The business issue is the general problem to be solved (e.g. reduce churn, enhance revenue, detect fraud, etc.). What types of data mining algorithms are applicable to these issues, and what data are needed and available? Understanding the data and their characteristics means learning what fields and columns are in the data set; what values these fields take and how those values are distributed; whether null values are present and how missing data should be handled; and how sets of fields relate to one another (for example, certain fields may be derived measures of other fields and therefore systematically dependent on them). Continuously distributed fields may need to be categorized, and new fields may need to be added to act as indicator values (0's and 1's). All of these are typical issues addressed during pre-processing, and all involve close collaboration between the parties.
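As a rough illustration of these steps, the sketch below profiles a field, fills in missing values, categorizes a continuous field, and adds a 0/1 indicator. The records, field names, cutoff, and the choice of median imputation are all assumptions made for the example; in a real engagement each of those choices would come out of the collaboration described above.

```python
from collections import Counter
from statistics import median

# Hypothetical customer records; None marks a missing value.
records = [
    {"customer_id": 1, "monthly_spend": 20.0, "plan": "basic", "churned": "no"},
    {"customer_id": 2, "monthly_spend": None, "plan": "premium", "churned": "yes"},
    {"customer_id": 3, "monthly_spend": 75.0, "plan": "basic", "churned": "no"},
    {"customer_id": 4, "monthly_spend": 130.0, "plan": None, "churned": "yes"},
]

def profile(rows, field):
    """Summarize one field: null count and the distribution of its values."""
    values = [row[field] for row in rows]
    present = [v for v in values if v is not None]
    return {"nulls": len(values) - len(present), "counts": Counter(present)}

def preprocess(rows, spend_cutoff=50.0):
    """Return new rows with imputed spend, a spend band, and a 0/1 indicator."""
    known = [r["monthly_spend"] for r in rows if r["monthly_spend"] is not None]
    fill = median(known)  # one possible missing-data policy, chosen for illustration
    out = []
    for r in rows:
        spend = r["monthly_spend"] if r["monthly_spend"] is not None else fill
        out.append({
            **r,
            "monthly_spend": spend,
            # Categorize a continuously distributed field.
            "spend_band": "high" if spend >= spend_cutoff else "low",
            # Add an indicator field (0's and 1's).
            "churned_flag": 1 if r["churned"] == "yes" else 0,
        })
    return out

plan_profile = profile(records, "plan")
cleaned = preprocess(records)
```

None of these decisions is purely mechanical. Whether a missing spend value should be filled with the median, dropped, or treated as informative in its own right is a question for the domain expert as much as for the analyst.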

Why is pre-processing so labor intensive, and can it be streamlined without sacrificing the benefits gained? It is labor intensive because the domain that the data is drawn from can be complex. For example, the ups and downs in the financial industry may not be well understood even by those who've spent entire careers analyzing these patterns. Data elements can be complicated too: a single field such as a "Price/Earnings ratio" is a remarkably succinct representation of a substantial amount of domain knowledge.

Can this labor-intensive process be streamlined and made more efficient? Probably. We already have standards to handle the display of data. At a basic level there's HTML, and more comprehensively and more recently there's a proposal by Oracle for the use of "portlets" to define the way business information from several sources can be combined and displayed by computer systems. We even have developing technology standards for understanding the content of data (based on XML). It's possible that these can be useful in streamlining the pre-analytical process of understanding a data set before mining.

To conclude, automated data mining is only one part of a comprehensive set of analyses that should also include traditional statistical analysis (to verify discovered patterns) and database querying (to ask for specific results). A comprehensive view of data mining sees it not simply as an automated analysis of large data sets, but as a user-directed process of applying technology to solve a business issue.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

