
RUSHING INTO THE NEXT CENTURY
by Ed Colet


As we enter the new millennium, I find myself agreeing with the author James Gleick's observation that a defining quality of the modern age has been the "acceleration of just about everything". From my perspective, data mining is a technology designed to let businesses act on information more quickly than before. Making use of information hidden in data provides competitive advantages, but only if you can harness this knowledge in a timely manner, i.e. as soon as the opportunities present themselves and before your competitors take advantage of them. In today's column, I speculate on how the faster turnaround times and accelerated schedules of today may affect the evolution of data mining technologies in this new year and beyond.

Data mining technologies are designed to handle large amounts of data. In the past, data had been collected and stored because advances in database technology made it technically feasible to do so. Storing it was one thing; analyzing it was another, simply because the technologies for large-scale data analysis were lagging behind database technologies. As a consequence, there was more data than people knew what to do with. Data mining technology, a relatively recent development, provides the ability to tap into these large data stores and extract meaningful patterns buried within them.

But in this age of rapid change and accelerated pace, these large stores of historical data can have somewhat limited utility. For example, imagine that a retailer has accumulated years of data pertaining to the organization of products on its supermarket shelves, along with sales and revenue figures. The retailer has discovered surprising combinations of products that sell well together, and so re-organizes the layout of products on its shelves to fully exploit this hidden pattern. After this intervention, what happens to the historical data? Can it really be mined as before for other hidden patterns now that the state of things has changed? Because the historical data describe a layout that no longer exists, any comparisons of new trends with this repository may be less meaningful and useful.

Moving away from brick-and-mortar retailers to the online world, one finds that similar issues exist. In the online world, the notion of historical data may not span multiple years but only the previous year. This year's online sales were in the billions of dollars, and traffic volume was correspondingly higher than last year's (e.g. Barnesandnoble.com volume up approximately 200%, Yahoo up 385%, as reported in InternetWeek Newsletter, 12/29/1999). As e-commerce grows and online sales inevitably increase, how should one determine historical benchmarks for meaningful comparisons?

A meaningful standard of comparison is important not only for interpreting results, but also at the level of the programming code that carries out the analytical routines in data mining software. In order to identify an "interesting pattern", algorithms typically compare a set of current values against a standard of comparison (based on expected probabilities that may in turn be based on historical data). Historical baselines and standards of comparison may have to be revised frequently in order for truly meaningful results to be output.
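
To make this kind of comparison concrete, here is a minimal sketch in Python. It assumes the standard of comparison is a single historical rate and that the current values are a count of hits out of a number of trials. The function names, the normal-approximation z-score and the threshold of 3 are illustrative assumptions, not the routine used by any particular data mining product.

```python
import math


def interestingness(baseline_rate, observed_hits, trials):
    """Z-score of observed_hits against the count expected under baseline_rate.

    Hypothetical helper: uses the normal approximation to the binomial.
    """
    expected = baseline_rate * trials
    std_dev = math.sqrt(trials * baseline_rate * (1.0 - baseline_rate))
    if std_dev == 0:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    return (observed_hits - expected) / std_dev


def is_interesting(baseline_rate, observed_hits, trials, threshold=3.0):
    """Flag a pattern when it deviates from the baseline by >= threshold std devs."""
    return abs(interestingness(baseline_rate, observed_hits, trials)) >= threshold


if __name__ == "__main__":
    # Same current values, two different baselines.
    print(is_interesting(0.10, 140, 1000))  # judged against an old 10% baseline
    print(is_interesting(0.14, 140, 1000))  # judged against a revised 14% baseline
```

Against the 10% baseline, 140 hits in 1,000 trials stands out. Once the baseline is revised to 14%, the same figure is unremarkable, which is why a stale baseline can make an algorithm report patterns that are no longer news.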

The fact that the "useful shelf life" of stored data is becoming shorter and shorter has implications for the future direction of data mining technologies. The trend appears to be towards more frequent and cumulative data mining analyses, and perhaps eventually towards real-time monitoring systems.

Today, much data mining analysis is conducted "off-line": data are stored and reports are generated. But this is changing. Not too long ago, a large financial institution generated its data mining reports quarterly. It was recognized that this wasn't timely enough to be useful, and after substantial re-engineering, the institution was able to reduce the interval to one month - but this was still an off-line process. A second example is the use of data mining by one of the teams in the NBA with IBM's Advanced Scout software. When we designed the software, it was intended for pre-game and post-game analysis. But one of the teams discovered that it could actually be used during half-time of the current game. Why wait to take advantage of information after the game is over, when it can be put to use for the second half? Both of these examples illustrate a shift towards more frequent data mining, from an off-line process towards a more on-line activity.

As the intervals between analytical runs shorten, there is a trend away from a reliance upon historical data, in favor of a cumulative reliance on past results and reports. Incidentally, a Bayesian framework is well suited to this type of approach, and it's not a coincidence that there is a notable increase in the number of published papers on Bayesian methods (Science, v 286, 11/19/99).
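
As an illustration of why a Bayesian framework suits cumulative analysis, the sketch below uses a Beta-Binomial model in Python, in which the result of each run becomes the starting point for the next. The class name `BetaBelief`, the uniform starting prior and the example counts are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Belief about an unknown rate, held as a Beta(alpha, beta) distribution."""

    alpha: float = 1.0  # assumed uniform prior: Beta(1, 1)
    beta: float = 1.0

    def update(self, successes, failures):
        """Return the posterior after one analytical run (conjugate update)."""
        return BetaBelief(self.alpha + successes, self.beta + failures)

    @property
    def mean(self):
        """Posterior mean estimate of the rate."""
        return self.alpha / (self.alpha + self.beta)


if __name__ == "__main__":
    # Hypothetical (successes, failures) counts from three successive runs.
    runs = [(12, 88), (20, 80), (25, 75)]
    belief = BetaBelief()
    for successes, failures in runs:
        belief = belief.update(successes, failures)  # posterior becomes next prior
        print(round(belief.mean, 3))
```

Each run only needs the previous summary (two numbers here) rather than the full historical data store, and updating run by run gives the same answer as one analysis over all the data at once.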

Taken to its logical conclusion, more frequent data mining analyses evolving into real-time, ongoing data mining essentially means that data mining will evolve into a monitoring system - an automated system designed to alert the right people to interesting (or alarming) trends as they're detected. We currently see this kind of real-time analysis in web-site technologies, albeit for system performance purposes such as load balancing. It's not a stretch to imagine adopting such real-time data mining analysis in order to deliver personalized and dynamic content to web-site visitors.
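
A monitoring system of this kind can be sketched very simply. The Python example below assumes a single numeric measure arriving one value at a time and raises an alert when a new value strays too far from a short rolling window of recent values. The class name `TrendMonitor`, the window size and the three-standard-deviation rule are hypothetical choices, not a description of any existing product.

```python
from collections import deque
from statistics import mean, pstdev


class TrendMonitor:
    """Flags values that deviate sharply from a rolling window of recent history."""

    def __init__(self, window=20, threshold=3.0, min_history=5):
        self.history = deque(maxlen=window)  # oldest values drop off automatically
        self.threshold = threshold
        self.min_history = min_history

    def observe(self, value):
        """Return True if value warrants an alert, then record it in the window."""
        alert = False
        if len(self.history) >= self.min_history:
            center = mean(self.history)
            spread = pstdev(self.history)
            if spread > 0 and abs(value - center) > self.threshold * spread:
                alert = True
        self.history.append(value)
        return alert


if __name__ == "__main__":
    monitor = TrendMonitor(window=10)
    for value in [100, 102, 98, 101, 99, 100, 160, 101]:
        if monitor.observe(value):
            print("alert:", value)
```

Note that the baseline here is only the last few observations, so it revises itself continuously. That keeps the comparison current, but it also means an unusual value, once recorded, widens the window's spread and makes the next alert harder to trigger.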

The evolution of data mining into a monitoring technology has some interesting implications. In the case of fraud detection systems, it's overwhelmingly beneficial. But in other domains one may need to be cautious. For example, relying on a monitoring system that mines economic or stock market data to guide personal investments may be risky. Is a sudden drop in a stock simply a random perturbation or the beginning of a downward spiral? How would one know? Hindsight being 20-20, one would know by comparing it to historical trends and a long-term view of past fluctuations. But as we've seen, in this accelerated age of Internet time, one's view of history may be too shortened to be useful.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

