
THE PUSH AND PULL OF INFORMATION
by Ed Colet

Data mining is the automatic detection of trends and patterns hidden in data. Because sophisticated software systems detect the patterns automatically, the end-user/analyst does not need to know ahead of time what questions to ask in order to reveal them. In fact, the user may not have to ask any questions at all to be presented with a set of interesting results. It is still necessary, however, for the user to examine, interpret and act upon those results. As such, one can characterize data mining results as being "pushed" to the user, as opposed to being "pulled" out by the user through traditional database queries or conventional statistical tests. In this column, I comment on the somewhat limited success of push technologies -- and offer some ideas for how data mining results that are pushed to the end-user can avoid the same difficulties.

The promise of push

The key notion of push technologies is that information is delivered to the user, in contrast to having the user pull it. Having just the right information delivered to you from the tangled web of the Internet is more appealing than being forced to search for and retrieve it yourself. We've all experienced the frustration of wading through a plethora of irrelevant information when searching the Web with what we initially thought were well-defined criteria.

Early applications of push technologies pertained to the delivery of news and information content. PointCast was an early pioneer that delivered news and information on topics in which users had indicated an interest. Today, PointCast has ceased to exist (replaced by EntryPoint), and push technologies have lost some favor as an effective means of delivering personalized content.

Today, end-users are able to specify their interests, and providers are able to profile customers extensively in order to deliver customized goods and services (sometimes with the use of data mining). But despite achieving this paradox of mass-produced customization, most of what is delivered still seems less useful and less relevant than one would expect. Rather than using (reading, viewing, etc.) 100% of what's delivered, we find only a small percentage of it useful. If what's delivered is supposed to be customized to our individual interests, why aren't we using all of it?

The use of information

There are a variety of reasons why we sample only a small percentage of what is delivered. One is that there are cognitive limits to what one can reasonably absorb and retain. Even if all the information delivered were useful and relevant, it would not be possible to comprehend it all. Various approaches try to increase the amount that can be comprehended (e.g. data visualization and graphics), and new tools are emerging to address the problem (e.g. knowledge management applications).

A second aspect has to do with users having a limited amount of time. For the most part, the amount of time we allocate to a task is proportional to its importance, and we all allocate our hours and minutes accordingly. For example, there are 15 minutes to read the newspaper, and a few hours devoted to academic journals and trade publications. Information that might be relevant but that we don't make time for -- i.e. that we don't consider important enough -- won't get our attention. (Perhaps personalized news delivery services should also ask us about the amount of time available, in addition to asking about topics of interest -- and deliver more or less information accordingly.)
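As a rough illustration of that last suggestion, the sketch below picks which items to deliver given a time budget. It is only a sketch under stated assumptions: the function name `select_within_budget`, the relevance scores and the reading-time estimates are all hypothetical, and a real service would have to estimate them somehow.

```python
def select_within_budget(items, minutes_available):
    """Pick items to deliver without exceeding the reader's time budget.

    items: list of (title, relevance, minutes_to_read) tuples. The
    relevance scores and reading times are assumed inputs, not something
    the column specifies.
    """
    selected = []
    remaining = minutes_available
    # Consider the most relevant items first.
    for title, relevance, minutes in sorted(items, key=lambda i: i[1], reverse=True):
        # Greedy choice: keep an item only if it still fits in the budget.
        if minutes <= remaining:
            selected.append(title)
            remaining -= minutes
    return selected


# Hypothetical example: a reader with 15 minutes for the morning news.
news = [
    ("Premiership results", 0.9, 10),
    ("Market summary", 0.8, 8),
    ("Weather", 0.5, 4),
]
morning_picks = select_within_budget(news, 15)
```

The greedy rule favors relevance over fitting in as many items as possible; a service could just as reasonably make the opposite choice.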

Another reason why 100% of information isn't used may be that the mechanisms for specifying one's interests are not refined enough. For example, I may indicate an interest in following soccer news, but I may not be able to specify that I am interested only in the English Premiership and not in the Mexican League. Information about the Mexican League games will thus be delivered and ignored.

It is also possible that the apparent usefulness of information varies depending on its context. For example, an article about Mexican soccer appearing in the Wall Street Journal will pique my interest sufficiently to read the whole article, while the same news appearing in a soccer magazine might be skipped. The fact that the WSJ runs the story makes it more interesting than a soccer magazine's reporting of it.

Last but not least, receiving pushed information is largely a passive activity that requires little from the user (other than the initial step of specifying interests). In general, getting people actively involved in tasks, and succeeding at them, is associated with a greater feeling of effectiveness and usefulness. For example, learning and understanding are more effective through reading than through watching TV because the former is active and the latter is passive.

The use of data mining results

Because data mining results are often pushed to the user, they are susceptible to the factors discussed above, and there is a risk that important trends and patterns are not acted upon. One way to guard against this is to present results in a form that is compatible with the cognitive limits of human information processing. Depending on the information, graphs and visualizations can be effective at presenting a lot of information succinctly. It is also possible to monitor which results the user finds most useful (e.g. "sales patterns associated with regions"), and have the data mining system adaptively prioritize these patterns in subsequent analyses. Last but not least, allowing the user to interact with the data, incorporate their domain knowledge and issue follow-up queries ensures that the process employs the right mix of active and passive involvement from the user.
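The idea of monitoring what the user finds useful and reprioritizing accordingly can be sketched in a few lines. This is a minimal illustration under assumptions of my own, not a description of any particular product: the class `PatternPrioritizer`, the pattern categories and the interest scores are all hypothetical.

```python
from collections import Counter


class PatternPrioritizer:
    """Rank mined patterns by how often the user found their category useful."""

    def __init__(self):
        # Number of times the user marked each pattern category as useful.
        self.useful_counts = Counter()

    def record_feedback(self, category, useful):
        """Record whether the user found a result of this category useful."""
        if useful:
            self.useful_counts[category] += 1

    def rank(self, patterns):
        """Order (category, description, score) tuples for presentation.

        Categories the user has favored come first; ties are broken by the
        mining system's own interest score (an assumed input).
        """
        return sorted(
            patterns,
            key=lambda p: (self.useful_counts[p[0]], p[2]),
            reverse=True,
        )


# Hypothetical session: the user keeps acting on region-related results.
prioritizer = PatternPrioritizer()
prioritizer.record_feedback("region", True)
prioritizer.record_feedback("region", True)
prioritizer.record_feedback("season", False)

mined = [
    ("season", "Sales peak in the fourth quarter", 0.9),
    ("region", "Sales dip in one region", 0.6),
    ("product", "Two products sell together", 0.7),
]
ranked = prioritizer.rank(mined)
```

A simple count is the crudest possible signal; a real system would probably also let old feedback decay so that changing interests are picked up.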


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com.
