MAKING USE OF INFORMATION THAT'S NOT AVAILABLE
by Ed Colet

While a decision support system such as a data mining tool can provide a lot of information and reveal hidden trends and patterns, the ultimate decision is still made in the face of uncertainty and missing information. If the critical piece, or the key, is unavailable, what's the best that can be done? This column outlines some thoughts on this question, beginning with some breaking news about cryptography.

Dr. Michael Rabin and Yan Zong Ding of Harvard University are making the rounds on the academic circuit presenting their idea of a practical and secure approach to encryption. They offer a mathematical proof that a message encrypted this way is unbreakable. The reason is that the key is built from a string of random numbers that is used only during the encryption and decryption process. The key is never retained and never stored in a computer's memory; it exists only for that moment, vanishes afterward, and thus remains secret forever.

In other words, the fact that the key vanishes means that the causal sequence of events is broken, and it is not possible to recreate the sequence of events exactly. For example, a message "A" is encrypted with a key as an "X" and then decrypted with the key to recover the "A". Without the key being available outside the "live" process for transforming the "A" to "X" to "A", the sequence is broken.
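The "A" to "X" to "A" round trip can be sketched in a few lines of Python. This is only an illustration of a key that exists for the duration of one process; it is a simple one-time-pad style XOR, not the Rabin and Ding scheme itself, and the names xor_bytes and live_round_trip are invented for this example.

```python
import secrets
from typing import Tuple


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte."""
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


def live_round_trip(message: bytes) -> Tuple[bytes, bytes]:
    """Encrypt and decrypt inside one 'live' call, then let the key go."""
    key = secrets.token_bytes(len(message))  # random key, local to this call
    ciphertext = xor_bytes(message, key)     # "A" -> "X"
    recovered = xor_bytes(ciphertext, key)   # "X" -> "A"
    # Dropping the reference is illustrative only; it is not a secure wipe.
    del key
    return ciphertext, recovered


ciphertext, recovered = live_round_trip(b"A")
# recovered equals b"A"; the ciphertext alone says nothing about the key,
# which is no longer reachable once the call has returned.
```

Once live_round_trip returns, the caller holds the ciphertext and the recovered message but has no way to ask for the key again, which is the sense in which the sequence is broken.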

What's interesting in the context of data mining is the notion that because the key (the mathematical formula for encoding and decoding using a stream of random numbers) is never stored in a computer's memory, it can never be recovered. If information is not stored in memory or in a computer database, is it reachable by data mining technologies for decision-making? Quite often, missing and unrecoverable information holds the key to a business insight, a pattern interpretation, or an optimal decision. Decision-making is often carried out in the absence of critical information.

In data mining, dealing with missing information is a common issue. There are basically two approaches to handling it. One is to minimize the amount of missing information by storing and recording as much information as possible. Decreasing costs of data storage, increasing computational power, improved efficiencies in data management, and the prevalence of electronic transactions have all made it easier to store larger amounts of data. The assumption is that increasing amounts of data serve to reduce the uncertainty of information. For example, sophisticated retailers collect much more than a record of the item that was purchased. Data identifying the customer, the method of payment used, what other items were purchased, the time of day, and other point-of-sale information may all be stored. This data may then be augmented with additional demographic information about customers, product inventory levels, and more. In terms of data mining, there is a richer set of data to analyze, and a greater number of hidden trends and patterns that can reveal themselves.

A second approach to handling missing information is to derive or infer it rather than store it. In fact, a database design principle is to compute information rather than store it in the tables whenever it is possible to do so (i.e., it is usually cheaper to compute than to store information). While it may be relatively trivial to derive simple information (e.g., the average amount spent by customers in a week, given a record of their daily purchases), deriving more complex information is more difficult. Returning to our example of the approach to encryption, what if the key to understanding is not in the data, and cannot be derived via a straightforward computation?
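As a small illustration of the "compute rather than store" idea, the weekly average mentioned above can be derived on demand from daily purchase records. The sketch below assumes a hypothetical record layout of (customer_id, date, amount) and interprets "average amount spent by customers in a week" as the mean of each customer's total for that ISO week; the function name and the sample records are made up for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def average_weekly_spend(purchases, iso_year, iso_week):
    """Mean of per-customer totals for one ISO week, or None if no purchases."""
    totals = defaultdict(float)
    for customer_id, day, amount in purchases:
        year, week, _weekday = day.isocalendar()
        if (year, week) == (iso_year, iso_week):
            totals[customer_id] += amount
    return mean(totals.values()) if totals else None


# Hypothetical daily purchase records; only these are stored.
purchases = [
    ("c1", date(2024, 1, 1), 10.0),
    ("c1", date(2024, 1, 3), 20.0),
    ("c2", date(2024, 1, 2), 50.0),
    ("c2", date(2024, 1, 8), 5.0),   # falls in the following ISO week
]

print(average_weekly_spend(purchases, 2024, 1))  # 40.0
```

Nothing about the weekly figure is stored; it is recomputed from the daily records whenever it is needed. The harder cases the column goes on to discuss are those where no such formula over the stored fields exists.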

The best answer here is to rely on human expertise combined with sound analytic principles. This underscores the need for a human to be integrated into the data mining cycle. A data mining system can rely on computational power to discover patterns hidden in large amounts of data. These data mining results are then subject to interpretation by the domain expert. For example, an NBA coach using Advanced Scout software may interpret patterns by drawing upon his substantial expertise about the game to understand the causal nature of a surprising pattern. Quite often, it is factors that are not recorded in the data itself, and not derivable via computation, that turn out to be most important. A hypothesis about a causal factor can then be further investigated via rigorous hypothesis testing to best account for the finding.
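The column does not say which statistical test would be used for this follow-up step. As one possible sketch, a two-sided permutation test on the difference in means asks how often a gap as large as the observed one would appear if the expert's proposed factor made no difference. The function name, the parameters, and the idea of comparing two groups of observations (for instance, outcomes with and without the suspected factor present) are assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)            # fixed seed keeps the sketch repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)              # relabel observations at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)
```

A small p-value suggests the gap between the groups is unlikely to be a labeling accident; it does not by itself confirm the expert's causal explanation, which still rests on knowledge outside the data.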

In terms of decision-making, it will often be the case that critical information is missing. A computationally powerful data mining system, coupled with human expertise at making the correct inferences, often represents the optimal solution for decision-making in the face of uncertain and missing information.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com
