
GOLDEN MEANS: LIES, DAMNED LIES AND STATISTICS
by Inderpal Bhandari, executive editor at large


Last week, I mentioned that in October 1997 my company, Virtual Gold, Inc., and I hosted a conference entitled "The Evolution of Data Mining: Technical Strategies to Beat your Competition by Year 2000". I then went on to describe some of the new directions that were fostered at the conference.

One direction that I forgot to mention is the mining of multimedia data, namely, image, audio and video data. In this article, I build up to a message from the conference on the relationship between data warehousing, digital libraries, and data mining. In our discussion below, data mining is used in the strict sense, namely, to refer to a process based on statistical analysis of data that leads one to some startling, counter-intuitive discovery of knowledge.

In his autobiography, Mark Twain quoted a remark he attributed to Disraeli: that there are three kinds of lies -- lies, damned lies and statistics. The reference to statistics, of course, underscores the point that misuse of such numbers can run the gamut -- they can be confounded and misinterpreted, or even manipulated and abused.

In traditional statistical analysis, we are usually concerned with proving or disproving some physical hypothesis, e.g., does smoking cause lung cancer? One would design a statistical experiment around such a hypothesis. And there could be parties with interests vested in the two different sides of the hypothesis. For example, presumably, clever statisticians can be found both amongst tobacco researchers and cancer researchers. You can see how a cat-and-mouse game could develop between opposing camps, which, in turn, would provoke the wit of a Mark Twain.

Data mining differs from traditional statistical analysis in that there is no such hypothesis to begin with; one is on a fishing expedition. However, if one does not know what one is going to find, then one cannot know up front what data ought to be collected to bring out that finding.

This suggests that the patterns of a data mining exercise will likely only capture the gist of the knowledge to be discovered. To go from pattern to knowledge, one must put that pattern in a context in which it can be interpreted. For example, in my work with the coaches of the National Basketball Association, seldom is a pattern from data mining sufficient by itself to suggest to the coach what action he should take. The coach usually has to watch the video tape recording of game footage associated with the pattern to understand what should be done. The video recording is the context in which the pattern can be meaningfully interpreted.
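As a rough illustration of going from pattern to context, the sketch below looks up the stretches of game tape behind the events that make up a pattern. Everything in it is hypothetical: the GameEvent record, its fields, the sample rows and the tape offsets are invented for illustration and do not describe any system actually used with the NBA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameEvent:
    """One row of the statistical data set (hypothetical schema)."""
    game_id: str
    player: str
    action: str
    made: bool
    tape_start_s: int  # assumed offset into the game tape, in seconds
    tape_end_s: int


# Invented sample data standing in for the mined data set.
EVENTS = [
    GameEvent("G1", "Player A", "jump shot", False, 125, 140),
    GameEvent("G1", "Player B", "jump shot", True, 300, 312),
    GameEvent("G2", "Player A", "jump shot", False, 48, 60),
    GameEvent("G2", "Player A", "layup", True, 610, 618),
]


def clips_for_pattern(events, player, action):
    """Return (game_id, start, end) for every event behind the pattern."""
    return [
        (e.game_id, e.tape_start_s, e.tape_end_s)
        for e in events
        if e.player == player and e.action == action
    ]


# Suppose mining flagged "Player A, jump shot" as a pattern worth a look.
clips = clips_for_pattern(EVENTS, "Player A", "jump shot")
for game_id, start, end in clips:
    print(f"{game_id}: watch tape from {start}s to {end}s")
```

The point of the sketch is only that each statistical record carries a pointer into the footage, so a pattern can be followed back to what physically happened.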

And that brings us to digital libraries. Contextual information is likely to be multimedia information since it is a record of physical events that have occurred. Such information is likely to reside in a digital library. On the other hand, data mining is done on a statistical data set, which will likely reside in a data warehouse. It follows then that there is a clear need to relate the digital library to the data warehouse, preferably at the time that both are designed.

Unfortunately, this is not the case today. You will not hear much about digital libraries at a data warehousing conference, and vice versa. The link that I have described above is not generally appreciated, even in obvious situations such as the one described below.

Some time ago, I was talking with the chief analyst of a major market research organization. He mentioned the difficulty of interpreting patterns when attempting to mine survey data, such as that collected by a telephone survey. I asked him if they recorded the conversations during the survey. The answer was yes. But they had never thought to go back to those recordings to understand the significance of data mining patterns. In fact, there was no mechanism to link the audio data that contained the comments made by a respondent to the relational database that contained the multiple-choice answers chosen by that respondent.
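The missing mechanism could be as simple as a shared respondent key. The sketch below is one possible arrangement, not the organization's actual setup: the table names, column names, sample rows and audio file paths are all assumptions made for illustration, and an in-memory SQLite database stands in for the relational database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with invented survey data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE answers (
            respondent_id INTEGER,
            question      TEXT,
            choice        TEXT
        );
        CREATE TABLE recordings (
            respondent_id INTEGER PRIMARY KEY,
            audio_path    TEXT  -- hypothetical location in the audio archive
        );
        """
    )
    conn.executemany(
        "INSERT INTO answers VALUES (?, ?, ?)",
        [(1, "Q3", "C"), (2, "Q3", "A"), (3, "Q3", "C"), (1, "Q4", "B")],
    )
    conn.executemany(
        "INSERT INTO recordings VALUES (?, ?)",
        [(1, "calls/0001.wav"), (2, "calls/0002.wav"), (3, "calls/0003.wav")],
    )
    return conn


def recordings_for_pattern(conn, question, choice):
    """Find the recordings of respondents who gave a particular answer."""
    return conn.execute(
        """
        SELECT r.respondent_id, r.audio_path
        FROM answers AS a
        JOIN recordings AS r ON r.respondent_id = a.respondent_id
        WHERE a.question = ? AND a.choice = ?
        ORDER BY r.respondent_id
        """,
        (question, choice),
    ).fetchall()


conn = build_demo_db()
# Suppose a mined pattern involves respondents who chose "C" on question Q3.
matches = recordings_for_pattern(conn, "Q3", "C")
for respondent_id, audio_path in matches:
    print(f"respondent {respondent_id}: listen to {audio_path}")
```

With a link of this kind in place, an analyst puzzled by a pattern can pull up the calls behind it and hear what the respondents actually said.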

The bottom line is that data mining connects the data warehouse to the digital library. If your organization has one department building a data warehouse and another department building a digital library, you may want to make sure that they understand this relationship -- now. Once their architectures are completed, it may be too late.

Interested in learning more about digital libraries, data warehouses and their role in data mining? Contact us at http://www.virtualgold.com

