HOW MUCH SHOULD DATA MINING TECHNOLOGY AUTOMATE?
by Ed Colet
Data mining is the ability to automatically discover meaningful patterns
hidden in large amounts of data. The entire process, from the initial
collection of raw data to strategic decisions, can be viewed as falling
within the scope of data mining technology. In this column, I outline how
the technologies of data mining have influenced aspects of both data storage
and data querying/analysis in positive ways. Because decision-making follows
analysis, I also discuss whether automated aspects of data mining can or
should be extended to the decision-making process as well.
Data mining has influenced various aspects of data storage. Large data sets
are the norm for data mining technologies. In the past, when data sets were
relatively small, storage and analysis could be managed quite well "by hand"
and there was little need for the automated techniques associated with
modern data mining. Today's large data sets are the result of more
affordable products and easier techniques for collecting and storing data.
As more and more data became easier to collect and store, a natural
consequence was the desire to analyze this data. Data mining offered the
ability to comprehensively analyze large amounts of data using powerful
algorithms. But it also made clear that analyzing clean data was important,
and data mining influenced many techniques for pre-processing and cleaning
data. Today, much data is stored in data marts and data warehouses --
repositories that are designed to store "clean" data well suited for
analysis, whether the analysis be simple querying or sophisticated mining.
Data mining has extended the scope of what is feasible in querying and
analyzing data. Verification-driven analysis is
characterized by an end-user formulating specific hypotheses and issuing
queries that are then verified using the data. Data mining is characterized
as a discovery-driven rather than a verification-driven process. Rather than
requiring an end-user to formulate hypotheses or issue queries, data mining
provides algorithms that can automatically and comprehensively explore the
data set and extract patterns that the end-user may not have known to ask
about. This facilitates the discovery of new and interesting information.
It would be shortsighted, however, to infer that discovery-driven analysis
is better than, or should replace, verification-driven approaches, or to
accept the related notion that data mining software should replace data
analysts. A better conclusion is that data mining technologies can improve
analysts' ability to investigate hypotheses and refine their models.
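The contrast between the two styles of analysis can be made concrete with a
small sketch. The transactions, the function names (verify, discover) and
the min_count threshold below are all hypothetical and chosen only for
illustration; real mining algorithms are considerably more sophisticated
than this exhaustive pair count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]


def verify(transactions, a, b):
    """Verification-driven: the analyst names a pair and checks it."""
    return sum(1 for t in transactions if a in t and b in t)


def discover(transactions, min_count=2):
    """Discovery-driven: count every pair and keep the frequent ones."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# The analyst must already suspect that bread and milk go together.
print(verify(transactions, "bread", "milk"))

# No hypothesis is needed; frequent pairs surface on their own.
print(discover(transactions))
```

In this sketch the discovery pass also surfaces the beer and chips pairing,
which the analyst never asked about, while the verification query remains
the natural way to follow up on any one pattern.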
Data mining can be thought of as a tool that improves analysts'
capabilities. The scope of data mining and knowledge discovery extends beyond
the storage and analysis of data to the process of interpretation
and decision-making. Many data mining products offer features that are
designed to facilitate the end-user's interpretation of patterns that were
discovered in the data via the mining algorithms. Visualization techniques
ranging from simple graphics (e.g., pie charts, bar graphs, scatter plots)
to more complex views (parallel-coordinate windows, fractal foam plots,
animations, etc.) are all available for presenting and viewing results. An
alternative to visualization is the use of text and natural
language descriptions to describe patterns -- explicitly pointing out that
patterns are interesting due to their deviations from a standard measure for
comparison, and/or their probability value(s).
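As a rough illustration of the text-based alternative, the sketch below
turns a pattern into a sentence only when it deviates enough from a baseline.
The function name describe, the min_gap threshold and the example numbers are
assumptions made for this sketch, not taken from any particular product, and
no probability value is computed here.

```python
def describe(label, observed, baseline, min_gap=0.10):
    """Return a sentence if observed deviates from baseline by at least
    min_gap (both are rates between 0 and 1); otherwise return None."""
    gap = observed - baseline
    if abs(gap) < min_gap:
        return None  # too close to the baseline to be worth reporting
    direction = "above" if gap > 0 else "below"
    return (
        f"{label}: {observed:.0%} versus a baseline of {baseline:.0%} "
        f"({abs(gap) * 100:.0f} points {direction})."
    )


# Hypothetical figures, for illustration only.
print(describe("Shooting with lineup A on court", 0.62, 0.45))
print(describe("Shooting with lineup B on court", 0.47, 0.45))
```

The first call yields a sentence because the rate sits well above the
baseline; the second returns None, so nothing is reported to the end-user.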
Following the discovery and reporting of patterns are decisions and actions.
These have been the exclusive purview of the end-user decision-maker, drawing
upon his/her extensive domain knowledge. Although decision-making is related
to data mining, the automated aspects of data mining technologies have not
really extended to the decision-making process. This remains true despite
the fact that the success and value of a data mining application
often depend on the quality of the decisions that are made. For example, if two
evenly matched basketball teams are both using "Advanced Scout" data mining
software, they will both be mining the same game data, and will therefore
both discover the same patterns. The difference can boil down to which
coaching staff makes the better decisions and comes up with the better
strategy to win the game. When we developed this software, we recognized
that our technology was unlikely to do a better job at complex
decision-making than a skilled domain expert (e.g., the experienced
NBA coach). Thus, the software was explicitly designed only to extract
patterns and to make them easier for a coach to interpret. Decision-making
was left to the coach.
Automatically executing decisions and actions based on discovered patterns
may be useful in constrained application-specific domains. Depending on the
number or type of valid actions/decisions, such an application can be
successfully developed. Chess is one domain in which we've seen an application
(Deep Blue) that automatically went from patterns (positions on a board) to
selecting the best action (a move), and that application has performed
well at the highest level. Also, in application domains in which certain
actions can be characterized as "routine" or "low-level", such actions
can be automatically initiated following data mining analyses. For
example, data mining of supermarket purchases may find a pattern where
purchases of product X are associated with purchases of product Y. A routine
decision would be to check and order inventory for both products X and Y.
This can be automatically initiated by the system based on the pattern,
rather than relying on the store manager to decide to check or order
inventory. (This assumes that the decisions about inventory levels or
product orders are "routine".) More complex decisions, such as changing the
price of either or both items or running an advertising campaign for the
product(s), are perhaps best left to the store manager or domain expert.
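The supermarket example can be sketched as a simple routing step that
automates only the actions labeled routine and defers the rest. The action
names, the ROUTINE set and the route_actions function are hypothetical;
which actions count as routine is a business assumption that the store
would have to supply.

```python
# Assumption: only these actions are considered safe to automate.
ROUTINE = {"check_inventory", "reorder"}


def route_actions(products, candidate_actions):
    """Split candidate actions into those the system initiates itself
    (one per product in the pattern) and those left to the manager."""
    automatic = [
        (action, product)
        for action in candidate_actions
        if action in ROUTINE
        for product in products
    ]
    for_manager = [a for a in candidate_actions if a not in ROUTINE]
    return automatic, for_manager


# A discovered pattern: purchases of X are associated with purchases of Y.
automatic, for_manager = route_actions(
    ("X", "Y"),
    ["check_inventory", "change_price", "run_ad_campaign"],
)
print(automatic)    # inventory checks for X and Y, run by the system
print(for_manager)  # pricing and advertising, deferred to the manager
```

Keeping the routine set explicit and small makes it easy to audit what the
system is allowed to do without a person in the loop.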
Although it is possible to implement a system that maps discovered
patterns to particular actions/decisions, for the most part it is
not likely to be a better (or more desirable) approach than the current
model of leaving decision-making to a skilled domain expert. There are two
reasons for this. First, the very nature of data mining implies
that patterns are interesting because they are infrequent and/or hidden
within the data (if not, the patterns would most likely already be
known, and thus neither new nor interesting). Thus it would be difficult to
create (ahead of time) a useful mapping from such patterns to appropriate
actions, because the patterns are (by definition) not known. Second, even
if the patterns are known, there are likely to be many
possible actions to take. Automatically evaluating which action is "best"
for a particular pattern can be difficult. Until a computer system can
do better at creatively arriving at the right decision and action, it is an
area better left to human intervention rather than automated processing.
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com