HOW MUCH SHOULD DATA MINING TECHNOLOGY AUTOMATE?
by Ed Colet

Data mining is the automated discovery of meaningful patterns hidden in large amounts of data. The entire process, from the initial collection of raw data to strategic decisions, can be viewed as falling within the scope of data mining technology. In this column, I outline how data mining technologies have positively influenced both data storage and data querying/analysis. Because decision-making follows analysis, I also discuss whether the automated aspects of data mining can or should be extended to the decision-making process as well.

Data mining has influenced various aspects of data storage. Large data sets are the norm for data mining technologies. In the past, when data sets were relatively small, storage and analysis could be managed quite well "by hand" and there was little need for the automated techniques associated with modern data mining. Today's large data sets are the result of more affordable products and easier techniques for collecting and storing data. As data became easier to collect and store, a natural consequence was the desire to analyze it. Data mining offered the ability to comprehensively analyze large amounts of data using powerful algorithms. But it also made clear that analyzing clean data was important, and data mining influenced many techniques for pre-processing and cleaning data. Today, much data is stored in data marts and data warehouses -- repositories designed to hold "clean" data well suited for analysis, whether that analysis is simple querying or sophisticated mining.

Data mining has extended the scope of what is feasible in querying and analyzing data. Verification-driven analysis is characterized by an end-user formulating specific hypotheses and issuing queries to verify them against the data. Data mining, by contrast, is a discovery-driven process. Rather than requiring an end-user to formulate hypotheses or issue queries, data mining provides algorithms that can automatically and comprehensively explore the data set and extract patterns that the end-user may not have known to ask about. It thereby facilitates the discovery of new and interesting information. It would be shortsighted to infer that discovery-driven analysis is better than or should replace verification-driven approaches, or to accept the related notion that data mining software should replace data analysts. A better conclusion is that data mining technologies can improve analysts' ability to investigate hypotheses and refine their models.
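
The contrast between the two styles of analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the basket data, the item names, the thresholds, and the helper names `confidence` and `discover_rules` are all hypothetical, and the brute-force pair counting stands in for whatever algorithm a real data mining product would use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set is one shopping basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"bread", "milk"},
    {"beer", "chips", "bread"},
]


def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


# Verification-driven: the analyst states one hypothesis and checks it.
hypothesis_conf = confidence(baskets, "bread", "butter")


def discover_rules(baskets, min_support=2, min_confidence=0.8):
    """Discovery-driven: score every item pair, keep the strong rules."""
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), n_both in pair_counts.items():
        if n_both < min_support:
            continue
        # Each pair yields two candidate rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            conf = n_both / item_counts[x]
            if conf >= min_confidence:
                rules.append((x, y, n_both, conf))
    # Strongest rules first; ties broken alphabetically for stable output.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


if __name__ == "__main__":
    print(f"bread -> butter confidence: {hypothesis_conf:.2f}")
    for x, y, support, conf in discover_rules(baskets):
        print(f"{x} -> {y}: support={support}, confidence={conf:.2f}")
```

In this toy data the analyst's hypothesis (bread buyers also buy butter) holds only half the time, while the exhaustive pass surfaces rules the analyst never asked about, such as beer and chips. Both modes are useful: the discovered rules become new hypotheses for the analyst to investigate.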

Data mining can be thought of as a tool that improves analysts' capabilities. The scope of data mining and knowledge discovery extends beyond the storage and analysis of data to the process of interpretation and decision-making. Many data mining products offer features designed to help the end-user interpret the patterns that the mining algorithms discovered in the data. Visualization techniques ranging from simple graphics (e.g., pie charts, bar graphs, scatter plots) to more complex views (parallel-coordinate windows, fractal foam plots, animations, etc.) are all available for presenting and viewing results. An alternative to visualization is the use of text and natural language descriptions of patterns -- explicitly pointing out that patterns are interesting because of their deviation from a standard measure for comparison, and/or their probability value(s).
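
As a rough illustration of the text-based alternative, the Python sketch below turns a pattern into a sentence that states its deviation from a baseline and a probability value. Everything here is an assumption made for the example: the function names, the wording of the sentence, the sample numbers, and the choice of a one-sided binomial tail as the probability measure. It is not a description of how any particular product reports patterns.

```python
from math import comb


def binomial_pmf(k, trials, p):
    """Probability of exactly k successes in `trials` attempts at rate p."""
    return comb(trials, k) * p**k * (1 - p) ** (trials - k)


def tail_probability(successes, trials, baseline_rate, upper):
    """One-sided chance of a count at least this far from the baseline."""
    if upper:
        counts = range(successes, trials + 1)
    else:
        counts = range(0, successes + 1)
    return sum(binomial_pmf(k, trials, baseline_rate) for k in counts)


def describe_pattern(label, successes, trials, baseline_rate):
    """Render a pattern as a sentence instead of a chart."""
    observed = successes / trials
    above = observed >= baseline_rate
    points = abs(observed - baseline_rate) * 100
    p_value = tail_probability(successes, trials, baseline_rate, upper=above)
    direction = "above" if above else "below"
    return (
        f"{label}: {observed:.0%} ({successes} of {trials}), "
        f"{points:.0f} points {direction} the baseline of {baseline_rate:.0%}; "
        f"chance probability of a deviation this large: {p_value:.3f}"
    )


if __name__ == "__main__":
    # Hypothetical numbers, not taken from any real game data.
    print(describe_pattern("Shots made with player A on court", 12, 20, 0.45))
```

The point of the sentence form is that it names the comparison standard outright, so the reader does not have to infer from a chart why the pattern was flagged.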

Following the discovery and reporting of patterns are decisions and actions. These have been the exclusive purview of the end-user decision-maker, drawing upon his or her extensive domain knowledge. Although decision-making is related to data mining, the automated aspects of data mining technologies have not really extended to the decision-making process. This remains true despite the fact that the success and value of a data mining application often depend on the quality of the decisions that are made. For example, if two evenly matched basketball teams are both using the "Advanced Scout" data mining software, they will both be mining the same game data, and will therefore both discover the same patterns. The difference can boil down to which coaching staff makes the better decisions and comes up with the better strategy to win the game. When we developed this software, we recognized that our technology was unlikely to do a better job at complex decision-making than a skilled domain expert (e.g., an experienced NBA coach). Thus, the software was explicitly designed only to extract patterns and to make it easier for a coach to interpret them. Decision-making was left to the coach.

Automatically executing decisions and actions based on discovered patterns may be useful in constrained, application-specific domains. Depending on the number and type of valid actions/decisions, such an application can be successfully developed. Chess is one domain in which we've seen an application (Deep Blue) that automatically went from patterns (positions on a board) to selecting the best action (a move), and that application has performed well at the highest level. Also, in application domains where certain actions can be characterized as "routine" or "low-level", such actions can be automatically initiated following data mining analyses. For example, data mining of supermarket purchases may find a pattern where purchases of product X are associated with purchases of product Y. A routine decision would be to check and order inventory for both product X and Y. This can be automatically initiated by the system based on the pattern, rather than relying on the store manager to decide to check or order inventory. (This assumes that decisions about inventory levels or product orders are "routine".) More complex decisions, such as changing the price of either or both items or running an advertising campaign for the product(s), are perhaps best left to the store manager or domain expert.
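
The supermarket example can be sketched as a small routing step that separates routine actions from decisions reserved for a person. The Python below is a minimal sketch, not a real inventory system: the `AssociationPattern` class, the `plan_actions` function, the action names, the stock figures, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AssociationPattern:
    """A discovered pattern: purchases of item_x go with purchases of item_y."""
    item_x: str
    item_y: str
    confidence: float


def plan_actions(pattern, stock, reorder_point, min_confidence=0.7):
    """Split follow-up work into automated actions and manager decisions.

    Returns (automated, for_manager), each a list of (action, target) tuples.
    """
    automated, for_manager = [], []
    if pattern.confidence < min_confidence:
        # Weak pattern: take no action at all.
        return automated, for_manager

    # Routine, low-level actions the system may initiate on its own.
    for item in (pattern.item_x, pattern.item_y):
        automated.append(("check_inventory", item))
        if stock.get(item, 0) < reorder_point:
            automated.append(("reorder", item))

    # Complex decisions are only queued for the store manager.
    pair = (pattern.item_x, pattern.item_y)
    for_manager.append(("consider_price_change", pair))
    for_manager.append(("consider_ad_campaign", pair))
    return automated, for_manager


if __name__ == "__main__":
    pattern = AssociationPattern("X", "Y", confidence=0.9)
    automated, for_manager = plan_actions(pattern, {"X": 3, "Y": 50}, reorder_point=10)
    print("automated:", automated)
    print("for manager:", for_manager)
```

Keeping the two lists separate makes the boundary explicit: which actions count as "routine" is a policy choice encoded in the function, and anything outside that set is surfaced to a person rather than executed.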

Although it is possible to implement a system that maps discovered patterns to particular actions/decisions, for the most part it is not likely to be a better (or more desirable) approach than the current model of leaving decision-making to a skilled domain expert. There are at least two reasons for this. First, the very nature of data mining implies that patterns are interesting because they are infrequent and/or hidden within the data (if not, the patterns would most likely already be known, and thus be neither new nor interesting). Thus it would be difficult to create (ahead of time) a useful mapping from patterns to appropriate actions, because the patterns are (by definition) not known. Second, even if a few patterns are known, there are likely to be many possible actions to take. Automatically evaluating which action is "best" for a particular pattern can be difficult. Until a computer system can do better at creatively arriving at the right decision and action, this is an area better left to human intervention than to automated processing.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com
