
DATA MINING...OR JUST PICKING UP ROCKS
by Thomas "Tony" Rathburn


INTRODUCTION

For decades, businesses have recognized the potential of the vast quantities of data they collect. Transaction processing systems hold the potential to reveal dramatic improvements in the way businesses operate. The methods employed to extract this information have been diverse.

Early efforts relied largely on statistical techniques to capture basic relationships and descriptive facts. As analysts and technology became more sophisticated, other tools were tested. Today, a wide variety of tools and techniques exist for the extraction of information content from enterprise data. Managing this effort requires a diverse set of skills ranging from project management to technical expertise to domain-specific knowledge.

There is no magic in this effort. The concept of artificial intelligence has not lived up to its hype. The reality is that the search for information is specific to the goals of the project. It is unrealistic to expect any technology to conceive the problem in a manner consistent with any specific organization. It is realistic to develop a tool-box of techniques to analyze data for the development of improvements in decision making processes.

No one technology provides all the answers. The organizations achieving the best results understand the strengths and weaknesses of the technologies they employ, and they understand the problem they are working on. They integrate their tools to achieve incremental improvements in specific applications. They are driven by goals. They are not walking around picking up rocks to see if any one might have value to them.

INFORMATION EVOLUTION

A history of industrial evolution might include significant events like the discovery of fire, the invention of the wheel, the first printing press, the cotton gin, and the development of the assembly line. This evolution would also have to include the conceptual development necessary to move from tradesmen operating on an individual basis to multinational organizations employing the specialized skills of hundreds of thousands of people in a coordinated manner.

The information evolution can also be examined in this manner. The development of technology like the computer, the personal computer, mass data storage devices, the modem and communications devices, networking technology, and the Internet will undoubtedly be seen as significant from a historical perspective.

We are also witnessing the conceptual development of information processing. We have moved from centralized control of large transaction-based systems to PC-based reporting and analysis systems. Recently, we have seen the integration of these individual efforts into networked systems. As organizations reinforce the value of these individual and small group efforts, they are once again asserting control over the information assets of the organization in legacy systems and data warehousing efforts. Almost simultaneously, however, the limitations of these massive systems are being offset by decentralization efforts, such as data marts.

What is apparent from watching the development of information processing is that many technologies are being brought to bear on a problem of great significance. Conceptually, we are attempting to apply our approaches from experience to this new way of thinking. The techniques that are applicable are being kept. Those found wanting are being discarded. What is not apparent to many people directly involved in this process is that there is not one right answer to the questions being posed. In fact there are many, each alternative with its own performance level.

It is important to recognize that the information processing systems being developed are intended to support human decision systems. There are no predetermined goals and objectives, other than those instituted by the sponsors of the project. There are few causal reasons for the decisions being supported.

In most cases, we are attempting to predict or classify other human behaviors, based on past experience. We need to recognize the inconsistencies between people, and within the same people at different points in time. We need to accept the idea of probabilistically correct decision making. We cannot expect that every individual decision will be correct any more than a casino operator expects to win every hand of cards or roll of the dice.
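
The casino comparison can be made concrete with a little arithmetic. The following is a minimal sketch, not a model of any real game: it assumes even-money bets that are independent of one another, and the 53% win rate is a hypothetical figure chosen only for illustration. It computes the chance of winning more than half of a series of bets, which shows how a small edge that often loses a single decision becomes dependable over many decisions.

```python
from math import comb


def prob_ahead(p_win: float, n_bets: int) -> float:
    """Probability of winning more than half of n_bets independent bets,
    each of which is won with probability p_win (binomial distribution)."""
    return sum(
        comb(n_bets, k) * p_win**k * (1 - p_win) ** (n_bets - k)
        for k in range(n_bets // 2 + 1, n_bets + 1)
    )


# Hypothetical edge: the decision rule is right 53% of the time.
P_WIN = 0.53

# Odd bet counts are used so that a tie is impossible.
for n in (1, 99, 999):
    print(f"{n:>4} bets: chance of winning the majority = {prob_ahead(P_WIN, n):.3f}")
```

Under these assumptions, a single decision is wrong nearly half the time, yet the chance of coming out ahead rises steadily as the number of decisions grows. This is the sense in which a model can be probabilistically correct without being right in every individual case.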

If the data analyst is developing a mathematical model, he is acting very much like the customer of a casino who has never gambled before. By observing the behavior of others and the outcomes that result, the data analyst attempts to develop a set of rules that will result in success.

Many extremely talented people have set out to improve on the decision making of their group, or even of their organization. The most common single cause of failure for these people is a lack of a clear definition of both the game they are playing and of winning. Imagine the analyst who has developed a perfectly good model of winning at blackjack suddenly walking up to a table playing five-card stud.

Our expectations of technology are often unrealistic to the point that we expect grand solutions to significant problems with little effort. We expect software to understand our problem and what we deem to be important in its solution. We then expect the software to look at our data, in whatever form we may have collected it, and distill a magic solution. In most cases, we would have a better chance of success by searching for a bottle with a genie in it.

In modeling human behaviors for decision making, we need to begin by clearly defining the parameters under which we will operate, the constraints that will be placed on use, current performance levels and the performance metrics used to define success.
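
One way to force this definition to be explicit is to write it down as a small record before any data is examined. The sketch below is only an illustration: the `ProjectGoal` class, its field names, and the campaign figures are all hypothetical and do not come from any particular tool or project.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProjectGoal:
    """Hypothetical record of what 'winning' means for one project."""

    metric: str       # the performance metric used to define success
    baseline: float   # current performance level, measured before modeling
    target: float     # level the sponsors agree counts as success
    constraints: List[str] = field(default_factory=list)  # limits placed on use

    def improvement(self, observed: float) -> float:
        """Relative improvement of an observed result over the baseline."""
        return (observed - self.baseline) / self.baseline

    def is_success(self, observed: float) -> bool:
        """True when the observed result meets or exceeds the agreed target."""
        return observed >= self.target


# Illustrative numbers only: a mailing campaign with a 2.0% response rate today.
goal = ProjectGoal(
    metric="campaign response rate",
    baseline=0.020,
    target=0.025,
    constraints=["no change to the product", "same mailing budget"],
)

observed = 0.023  # hypothetical result from a candidate model
print(f"improvement over baseline: {goal.improvement(observed):.0%}")
print(f"meets the agreed target:   {goal.is_success(observed)}")
```

A result can improve on the baseline and still fall short of the agreed target, which is why both numbers need to be fixed before modeling begins.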

IMPROVING PERFORMANCE

Our goal is improving performance, however we define performance. Each individual, and each organization, has its own definition. It should be a priority of any project to begin with a clear definition of performance. Far too many projects begin with collecting and examining data to see "what we can find out." Can you imagine a gold mining operation ordering a load of gravel from the local distributor on the off chance of finding some gold in it?

Ultimately, we are seeking better decision making for a particular problem. We are attempting to modify our behavior when faced with a set of circumstances. The characteristics of others, and our expectations of their behaviors, define the circumstances we face in making a decision. Goals and performance metrics need to be specific and measurable. "Make more money" is not acceptable. It is simply too general. Does it imply that we can use any and all means to achieve this simple goal? Can we change the product, modify any system, endure any hardship? Or are there other parameters and constraints?

Do we have any experience with this problem, or are we starting from scratch? Do we have a decision making process with well defined parameters in place? Are we seeking incremental improvement to an existing process?

We accept that we are attempting to improve our performance, either individually or organizationally, by enhancing our understanding of the problem. We have developed good, complete definitions of our problem, our current status, and of success. Now, what can we do to reach success?

THE DATA MINING ENVIRONMENT

The development of better decision making models has fallen under many titles. Fifteen years ago, the author did not realize that he was doing "data mining" or "knowledge discovery." At the time, these efforts were called "exploratory data analysis" by statisticians, and "knowledge engineering" by the practitioners of expert systems.

Whatever the title placed on these efforts today, many technologies and techniques can be introduced to the effort. Each technology, and each specific implementation of that technology, has its own set of strengths and weaknesses. The key to high levels of success lies in understanding how the strengths of one technology can offset the weaknesses of another, and then implementing a solution that integrates the strengths of many technologies to improve the problem YOU are working on.
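
A simple form of this integration is to let several models vote on each case. The sketch below is one possible arrangement, not a recommended design: the three models, their thresholds, and the customer fields are hypothetical stand-ins for, say, a rule based system, a regression score, and a recency heuristic.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

Record = Dict[str, float]
Model = Callable[[Record], str]


# Three hypothetical models, each looking at the customer in a different way.
def rule_model(r: Record) -> str:
    return "respond" if r["prior_purchases"] >= 2 else "ignore"


def score_model(r: Record) -> str:
    score = 0.1 * r["prior_purchases"] + 0.02 * r["months_as_customer"]
    return "respond" if score > 0.5 else "ignore"


def recency_model(r: Record) -> str:
    return "respond" if r["days_since_last_order"] <= 90 else "ignore"


def majority_vote(models: Sequence[Model], record: Record) -> str:
    """Return the label predicted by the largest number of models."""
    votes = Counter(model(record) for model in models)
    return votes.most_common(1)[0][0]


models = [rule_model, score_model, recency_model]

# Illustrative customer: the rule model says "ignore", the other two disagree.
customer = {"prior_purchases": 1, "months_as_customer": 30, "days_since_last_order": 45}
print(majority_vote(models, customer))
```

With an odd number of models and two possible labels, a tie cannot occur. Voting helps only to the extent that the models make different mistakes, which is why understanding the weaknesses of each technology matters as much as knowing its strengths.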

This article offers five main segments to the data mining environment:

Each segment of the data mining environment offers many alternatives, and each raises a number of issues for the practitioner to address. While it is obviously beyond the scope of this article to address all of these segments, or even one of them, in any detail, it is important for the practitioner to recognize the interaction between these components of the data mining environment.

Each of the segments can be decomposed into a number of technologies. Within each of the technologies, a number of alternative solutions exist. A brief list of technologies applicable to the Data Analysis and Modeling segment appears below as an example of the breadth available.

DATA ANALYSIS AND MODELING
   DATA VISUALIZATION
   STATISTICS
      Statistics - General
      CART
      CHAID
      Factor Analysis
      K Nearest Neighbor
      Logistic Regression
      MARS
      Optimization
      Principal Components Analysis
      Regression
   ADVANCED TECHNOLOGIES
      Advanced Technology - General
      Case Based Reasoning
      Decision Trees
      Fuzzy Logic
      Genetic Algorithms
      Neural Nets
      Non-linear Dynamical Systems
      Rough Sets
      Rule Based Systems

CONCLUSION

This article attempts to place the emphasis of data mining where it belongs: on improving performance by clearly identifying the goals of the project, and recognizing the myriad of tools and techniques available for that purpose. The practitioner who can integrate these tools effectively can reach well beyond the level of any one tool used independently. The practitioner is also well advised to consider the overall environment in which the data mining effort takes place.

