Next Article Table of Contents Previous Article

Analysis & Commentary:

CHALLENGES FOR DATA MINERS, STATISTICIANS AND CLIENTS WHO DEPEND UPON AND FUND THEM
by Arnold Goodman, UCI Center for Statistical Consulting

Our world is increasingly overwhelmed by the massive amounts of complex data begging for an effective method to be transformed into interpretable knowledge. Discovering such knowledge from the data requires the informative patterns mined from data to be generalized into predictive models (sufficient for most business or practical purposes) that first suggest knowledge and then facilitate its acceptance in the world beyond this data (a goal for important or scientific purposes).

Uncertainties inherent in both the collection and processing of data demand to be accounted for in prediction confidence intervals, hypothesis test levels and serious knowledge evaluations. Data mining and statistics will inevitably grow toward each other in the next ten years because data mining will not become knowledge discovery without statistical thinking, statistics will not be able to succeed on massive and complex datasets without data mining approaches, and both will be driven by clients who own the data and compel them to work together in knowledge discovery.

A basic challenge is for the results to work almost all (not only some) of the time, account for uncertainty outside (not only inside) the data, and add valuable knowledge to a client's world.

Data miners tend to be ignorant of statistics and client's domain, statisticians tend to be ignorant of data mining and client's domain, and clients tend to be ignorant of data mining and statistics. Unfortunately, they also tend to be inhibited by myopic points of view: computer scientists focus upon database manipulations and processing algorithms, statisticians focus upon identifying and handling uncertainties, and clients focus upon integrating knowledge into the knowledge domain.

Knowledge discovery rests on the three balanced legs of computer science, statistics and client knowledge: it will not stand either on one leg or on two legs, or even on three unbalanced legs. Successful knowledge discovery needs a substantial commitment to collaboration from all three.

A maturity challenge is for data miners, statisticians and clients to recognize their dependence on each other and for all of them to widen their focus until true collaboration becomes reality.

Lack of critical knowledge is not excused by insufficient time in an outdated curriculum or any difficulties in acquiring it through personal efforts. Data miners, statisticians and scientists can overcome these difficulties by reading "Principles of Data Mining" and "The Elements of Statistical Learning" (GPS: see www.kdnuggets.com/publications/books.html), attending Interface Symposia, and attracting that critical knowledge to their meetings.

A leadership challenge is to create an atmosphere appreciating critical external knowledge, and to place those who possess it at the core of its conferences rather than in the periphery.

"The Brain Makers" (by HP Newquist) and "Mind Matters" (by James P. Hogan) document artificial intelligence going from over-promising in the early 1960's to under-performing in the early 1970's and expert systems going from over-promising in the early 1980's to under-performing in the early 1990's, despite vestiges remaining when computer technology is finally able to keep the promises of computer science. Is machine learning under-promising and over-performing because it has learned its lesson from this history?

The critical challenge for us all is to view the challenges as opportunities for our joint success.

Arnold Goodman is Associate Director, UCI Center for Statistical Consulting.

courtesy of KDnuggets

Top of Page


Previous Article  |  Table of Contents  |  Next Article