HUMAN AND DATA MINING EXPERTISE
By Ed Colet
Data mining technologies are closely tied to human expertise at many
levels. We've seen that successful solutions have often required a team of
analytical experts, domain experts, and technology experts. In this column,
I examine the underlying aspects of expertise -- and how two facets of
expertise, when implemented in a data mining solution, can make it
successful and provide long-term, sustained benefits.
Imagine a situation in which, for a given business problem, there is a person
who can routinely point out hidden but meaningful information that
strategically affects your decision making for the better. This person
would be held in high regard as an invaluable asset and compensated
accordingly to reward and sustain this type of performance. But consider a
slightly different case in which a person routinely points out hidden
information -- but information that seems to come out of the proverbial left
field -- seemingly random contributions that are only loosely connected to
the problem at hand. Because it is difficult to know what to make of this
information, it's not useful in any practical way, although it may be mildly
entertaining. This person would be considered an eccentric rather than an
expert. One would think that the difference between the expert and the
eccentric lies in their contributions to the business problem (or lack
thereof). But it's deeper than this -- it goes to their underlying
information processing abilities.
Now consider a data mining solution or application in the role of the
person in the above scenario. The "expert" system is useful; the
"eccentric" system is not. Even if both systems relied on the same
algorithms to analyze the same data, the design and deployment of the
solution determine the type of system it will become.
How can expertise be designed into a system?
An answer comes from research in cognitive science that focused on human
expertise. It is known that a characteristic of expertise is the presence
of an extensive knowledge base. As we'll see, the existence of a
substantial knowledge base helps one encode information in meaningful ways.
Research (e.g. by Hayes) has shown that regardless of the domain there are
no shortcuts for humans in building their knowledge base -- it takes about
10 years to develop a sufficiently rich knowledge base. For a businessman
this could result from years of experience in the field; for an expert chess
player it could come from years of experience with sophisticated chess
strategies.
In addition to an extensive knowledge base, cognitive science has identified
a second and subtler component of expertise: the use of superior encoding
strategies. Early studies (e.g. by de Groot) examined the performance of
chess players to try to account for the ability of experts to remember
positions of pieces on the board, and even to re-create entire games.
Superior memory seemed to be the obvious explanation. But later work by
Chase and Simon determined that experts didn't have better memory than
non-experts; rather, their ability to remember positions of chess pieces
on a board came from encoding meaningful and named patterns (e.g. Queen's
Gambit) into memory. When pieces were placed
in random or impossible positions, experts were no better than novices at
remembering them. What's important is that experts don't encode discrete
and isolated items, but encode whole formations, and thus it appeared that
experts could encode "more" information.
For a data mining system to become as useful as an expert, two aspects of
expertise should also be in place -- a rich knowledge/data store, and
effective encoding or presentation of information. Fortunately, unlike
humans, developing a knowledge base need not take 10 years. A knowledge
base consists of a long-term data store of facts and relationships among
facts. The facts may already exist in the form of repositories of
historical data stored in a company's databases. Past and present queries
and systematic analyses of such data are a start for building the
relationships among facts. Transforming these into a rule system is one way
to create a knowledge base. A second approach is to automatically discern
and learn the relationships among facts (attributes in data) and to store
this learned information back into the knowledge base as it's accumulated.
In other words, data mining results then become part of the data for future
mining.
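The second approach can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the method used in any
particular product: the KnowledgeBase class, its learn and support
methods, and the sample attribute names are all hypothetical, and
"relationships among facts" is reduced here to simple co-occurrence counts
of attribute pairs.

```python
from collections import Counter
from itertools import combinations


class KnowledgeBase:
    """Hypothetical store of relationships learned from past records."""

    def __init__(self):
        self.n_records = 0            # how many facts (records) have been seen
        self.pair_counts = Counter()  # learned relationships: attribute pairs

    def learn(self, records):
        """Fold a batch of records into the accumulated relationships."""
        for record in records:
            self.n_records += 1
            # Count every unordered pair of attributes in this record.
            for pair in combinations(sorted(set(record)), 2):
                self.pair_counts[pair] += 1

    def support(self, a, b):
        """Fraction of all records seen so far that contain both attributes."""
        if self.n_records == 0:
            return 0.0
        return self.pair_counts[tuple(sorted((a, b)))] / self.n_records


# Illustrative data only: historical records first, then a later batch.
kb = KnowledgeBase()
kb.learn([["home_game", "win"], ["home_game", "loss"]])
kb.learn([["win", "home_game"]])
```

Each call to learn adds to what was stored before, so results from earlier
batches remain available as context when later data is analyzed.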
A knowledge base makes it possible for the second principle of expertise to
occur. Superior and effective encoding/presentation of new information can
be implemented as follows. New information is determined to be interesting
and potentially useful only in the context of what's stored in the knowledge
base. This approach has been implemented in some of our work with IBM's
Advanced Scout software for professional basketball coaches. Advanced Scout
takes patterns and automatically indexes them with video segments so that
interesting patterns can be placed in an appropriate context (video) for
interpretation. So, rather than being presented in isolation, a pattern is
presented against a context of prior knowledge. The end result is much
like a human expert pointing out interesting and useful information, rather
than isolated and random information.
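One way to picture "interesting only in the context of what's stored" is
the sketch below. It is not how Advanced Scout works; it assumes,
purely for illustration, that the knowledge base holds a historical rate
for each named pattern and that a new pattern is worth showing when its
observed rate departs from that baseline by some threshold. The function
name rank_patterns, the threshold value, and the pattern names are
hypothetical.

```python
def rank_patterns(new_rates, baseline_rates, threshold=0.1):
    """Split new patterns into surprising ones and ones lacking context."""
    surprising = []   # (pattern, deviation) pairs that differ from history
    no_context = []   # patterns the knowledge base knows nothing about
    for name, rate in new_rates.items():
        if name not in baseline_rates:
            no_context.append(name)
            continue
        deviation = abs(rate - baseline_rates[name])
        if deviation >= threshold:
            surprising.append((name, deviation))
    # Largest departure from prior knowledge first.
    surprising.sort(key=lambda item: item[1], reverse=True)
    return surprising, no_context


# Illustrative numbers only.
baseline = {"A": 0.50, "B": 0.30}
new = {"A": 0.52, "B": 0.55, "C": 0.40}
surprising, no_context = rank_patterns(new, baseline)
```

In this example pattern B stands out against its history, pattern A does
not, and pattern C cannot be judged because there is no stored context for
it, which is the situation the "eccentric" system is always in.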
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com.