THE PUSH AND PULL OF INFORMATION
by Ed Colet
Data mining is the ability to automatically detect trends and patterns
hidden in data. Because patterns are detected automatically by sophisticated
software systems, it is not necessary for the end-user/analyst to know ahead
of time what questions to ask in order to reveal trends and patterns in the
data. In fact, the user may not need to ask any questions at all in order
to be presented with a set of interesting results. But it is still necessary
for the user to examine, interpret and act upon data mining results. As such,
one can characterize data mining results as being "pushed" to the user as
opposed to being "pulled" out by the user through the use of traditional
database queries and/or conventional statistical tests. In this column, I
comment on the somewhat limited success of push technologies and offer some
ideas for how data mining results that are pushed to the end-user can avoid
the same difficulties.
The promise of push
The key notion of push technologies is that information is delivered to the
user. This is in contrast to the notion of having the user pull information.
Having just the right information delivered to you from the tangled web of the
Internet is more appealing than forcing you to search and retrieve this
information yourself. We've all experienced the frustration of wading
through a plethora of irrelevant information when searching the Web with
what we initially thought were well-defined criteria.
Early applications of push technologies pertained to the delivery of news
and information content. PointCast was an early pioneer that delivered news
and information on topics in which users had indicated an interest. Today,
PointCast has ceased to exist (replaced by EntryPoint), and push
technologies have lost some favor as an effective means of delivering
personalized content.
Today, end-users are able to specify their interests, and providers are able
to extensively profile customers in order to deliver customized goods and
services (sometimes with the use of data mining). But despite achieving this
paradox of mass-produced customization, it still seems that most of what is
delivered is less useful and less relevant than what should be expected.
Rather than using (reading, viewing, etc.) 100% of what's delivered, we find
only a small percentage of it useful. If what's delivered is
supposed to be customized to our individual interests, why aren't we using all
of it?
The use of information
There are a variety of reasons why we sample only a small percentage of
what is delivered to us. One is that there are
cognitive limits to what one can reasonably absorb and retain. Even if all
the information delivered were useful and relevant, it would not be possible
to comprehend it all. There are various approaches that try to increase the
amount that can be comprehended (e.g. use of data visualization and graphics),
and new tools to address this (e.g. knowledge management applications).
A second aspect has to do with users having a limited amount of time. For
the most part, the amount of time we allocate to a task is proportional to
its importance. For example, we might set aside 15 minutes to read the
newspaper and a few hours for academic journals and trade publications.
Information that might be relevant, but that we don't consider important
enough to make time for, won't get our attention. (Perhaps personalized news
delivery services should also ask us about the amount of time available, in
addition to asking about topics of interest, and deliver more or less
information accordingly.)
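The parenthetical suggestion above can be illustrated with a small Python
sketch. It is only one possible reading of the idea, under stated
assumptions: the Item class, its relevance score, the reading-time estimates
and the select_for_budget function are all hypothetical names invented for
this illustration, not part of any real news delivery service.

```python
from dataclasses import dataclass


@dataclass
class Item:
    title: str
    relevance: float  # hypothetical 0..1 score from the user's interest profile
    minutes: float    # hypothetical estimate of reading time


def select_for_budget(items, minutes_available):
    """Greedily pick the most relevant items that fit in the time budget."""
    chosen = []
    remaining = minutes_available
    # Consider the most relevant items first.
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        if item.minutes <= remaining:
            chosen.append(item)
            remaining -= item.minutes
    return chosen


if __name__ == "__main__":
    inbox = [
        Item("Premiership round-up", 0.9, 10),
        Item("Transfer rumours", 0.8, 10),
        Item("Fixture list", 0.5, 4),
    ]
    for item in select_for_budget(inbox, minutes_available=15):
        print(item.title)
```

A greedy pass is used here purely for simplicity. It does not guarantee the
best possible use of the available time, but it shows how a stated time
budget could shape how much is delivered.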
Another reason why 100% of information isn't used may be that the
mechanisms for specifying one's interests are not refined enough. For
example, I may indicate an interest in following soccer news, but I may not be
able to specify an interest only in the English Premiership and none in the
Mexican League. Information about Mexican League games will thus be
delivered to me and ignored.
It is also possible that the apparent usefulness of information varies
depending on its context. For example, an article about Mexican soccer
appearing in the Wall Street Journal will pique my interest enough that I
read the whole article, while the same news appearing in a soccer magazine
might be skipped. The fact that the WSJ runs the story makes it more
interesting than a soccer magazine's reporting of it.
Last but not least, receiving pushed information is largely a passive
activity that requires little from the user (other than the initial step of
specifying interests). In general, getting people actively involved in tasks,
and having them succeed, is associated with a greater feeling of
effectiveness and usefulness. For example, learning and understanding are
more effective through reading than through watching TV, because the former
is active and the latter is passive.
The use of data mining results
Because data mining results are often pushed to the user, there is a risk
that important trends and patterns are not acted upon because they are
susceptible to the factors discussed above. One way to reduce this risk is
to present results in a form that is compatible with the cognitive limits of
human information processing. Depending on the
information, graphs and visualizations can be effective at presenting a lot of
information succinctly. It is also possible to monitor which results the
user finds most useful (e.g., "sales patterns associated with regions"), and
have the data mining system adaptively prioritize these patterns in
subsequent analyses. Last but not least, allowing the user to interact with
the data, incorporate their domain knowledge, and issue follow-up queries
ensures that the process employs the right mix of active and passive
involvement from the user.
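To make the idea of adaptive prioritization concrete, here is a minimal
Python sketch. It assumes that each discovered pattern carries a category
label and that the system can record whether the user found a presented
pattern useful. The PatternPrioritizer class, its method names and the
smoothing rule are hypothetical choices made for illustration; they do not
describe any particular data mining product.

```python
from collections import defaultdict


class PatternPrioritizer:
    """Ranks patterns by how useful the user found their category before."""

    def __init__(self):
        self.shown = defaultdict(int)  # times a category was presented
        self.used = defaultdict(int)   # times the user marked it useful

    def record(self, category, was_useful):
        """Store one piece of feedback about a presented pattern."""
        self.shown[category] += 1
        if was_useful:
            self.used[category] += 1

    def score(self, category):
        # Smoothed usefulness rate; a category with no feedback scores 0.5,
        # so new kinds of patterns are neither buried nor over-promoted.
        return (self.used[category] + 1) / (self.shown[category] + 2)

    def rank(self, patterns):
        """patterns: list of (category, description) pairs."""
        return sorted(patterns, key=lambda p: self.score(p[0]), reverse=True)


if __name__ == "__main__":
    prioritizer = PatternPrioritizer()
    for _ in range(3):
        prioritizer.record("sales by region", True)
    for _ in range(2):
        prioritizer.record("sales by weekday", False)

    new_patterns = [
        ("sales by weekday", "Tuesday dip"),
        ("sales by product", "New line growing"),
        ("sales by region", "Northeast surge"),
    ]
    for category, description in prioritizer.rank(new_patterns):
        print(category, "-", description)
```

In this sketch the user still does some active work by giving feedback,
while the system passively reorders what it pushes next, which is one way
of mixing the two kinds of involvement.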
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com.