MAKING USE OF INFORMATION THAT'S NOT AVAILABLE
by Ed Colet
While a decision support system such as a data mining tool can provide a lot
of information and reveal hidden trends and patterns, the ultimate decision is
still made in the face of uncertainty and missing information. If the
critical piece, or the key, is unavailable, what's the best that can be done?
This column outlines some thoughts on this question, beginning with some
breaking news about cryptography.
Dr. Michael Rabin and Yan Zong Ding of Harvard University are making the
rounds on the academic circuit presenting their idea of a practical and secure
approach to encryption. They offer a mathematical proof that a message
encrypted this way is unbreakable. This is because the key relies on a string
of random numbers that is used only during the encryption and decryption
process. The key is never retained, never stored in a computer's memory,
exists only for that moment, vanishes afterward, and is thus secret forever.
In other words, the fact that the key vanishes means that the causal sequence
of events is broken, and it is not possible to recreate the sequence of events
exactly. For example, a message "A" is encrypted with a key as an "X" and then
decrypted with the key to recover the "A". Without the key being available
outside the "live" process for transforming the "A" to "X" to "A", the
sequence is broken.
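The "A" to "X" to "A" round trip can be illustrated with a minimal Python
sketch. This is not Rabin and Ding's actual scheme, which is not detailed
here; it is a plain XOR with a one-time random key, and the function names
xor_bytes and round_trip are hypothetical. It only shows a key that exists
inside the "live" process and is never returned or stored.

```python
import secrets
from typing import Tuple


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the corresponding byte of key."""
    if len(data) != len(key):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


def round_trip(message: bytes) -> Tuple[bytes, bytes]:
    """Encrypt and decrypt within one call; the key never leaves this scope."""
    # Assumption: a fresh random key as long as the message (one-time-pad style).
    key = secrets.token_bytes(len(message))
    ciphertext = xor_bytes(message, key)    # "A" -> "X"
    recovered = xor_bytes(ciphertext, key)  # "X" -> "A"
    # The key is a local variable and is not returned. Note that Python gives
    # no guarantee about when the underlying memory is actually cleared.
    return ciphertext, recovered


ciphertext, recovered = round_trip(b"A")
print(recovered)  # the original message
```

Once round_trip returns, a caller holding only the ciphertext has no way to
replay the transformation, which is the broken sequence described above.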
What's interesting in the context of data mining is the notion that because
the key (the mathematical formula for encoding and decoding using a stream of
random numbers) is never stored in a computer's memory, it can never be
recovered. If information is not stored in memory, or in a computer
database, is it reachable by data mining technologies for decision-making?
Quite often, missing and unrecoverable information holds the key to a business
insight, to the interpretation of a pattern, and to optimal decision-making.
Decision-making is almost always carried out in the absence of some critical
information.
In data mining, dealing with missing information is a common issue. There are
basically two approaches to handling this. One is to minimize the amount of
missing information by storing and recording as much information as possible.
Decreasing costs of data storage, increasing computational power, improved
efficiencies in data management, and the prevalence of electronic transactions
have all contributed to making it easier to store larger amounts of data. The
assumption is that increasing amounts of data serve to reduce the uncertainty
of information. For example, sophisticated retailers collect much more than a
record of the item that was purchased. Data identifying the customer, the
method of payment used, what other items were purchased, time of day, and
other point-of-sale information may all be stored. This data may then be
augmented with additional demographic information of customers, product
inventory levels, and more. In terms of data mining, there's a richer body
of data to analyze, and a greater number of hidden trends and patterns that
can reveal themselves.
A second approach to handling missing information is to derive or infer it
rather than storing it. In fact, a database design principle is to compute
information rather than store it in the tables if it's possible to do so
(i.e., it's usually cheaper to compute than to store information). While it may
be relatively trivial to derive simple information (e.g., the average amount spent
by customers in a week, given a record of their daily purchases), deriving
more complex information is more difficult. Returning to our example of the
approach to encryption, what if the key to understanding is not in the data,
and cannot be derived via a straightforward computation?
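The simple case mentioned above, deriving an average rather than storing it,
might look like the following Python sketch. The records, the customer IDs,
and the function names are all hypothetical, and "average amount spent by
customers in a week" is assumed here to mean the mean of each customer's
weekly total.

```python
from collections import defaultdict

# Hypothetical daily purchase records for one week:
# (customer_id, day_of_week, amount_spent)
daily_purchases = [
    ("c1", "Mon", 20.0),
    ("c1", "Wed", 40.0),
    ("c2", "Tue", 15.0),
    ("c2", "Fri", 25.0),
    ("c2", "Sat", 50.0),
]


def weekly_total_per_customer(records):
    """Sum each customer's daily purchases into a weekly total."""
    totals = defaultdict(float)
    for customer_id, _day, amount in records:
        totals[customer_id] += amount
    return dict(totals)


def average_weekly_spend(records):
    """Mean of the per-customer weekly totals; 0.0 if there are no records."""
    totals = weekly_total_per_customer(records)
    if not totals:
        return 0.0
    return sum(totals.values()) / len(totals)


print(average_weekly_spend(daily_purchases))  # 75.0
```

Nothing beyond the daily records needs to be stored; the weekly figure is
recomputed on demand. The harder cases are the ones where no such formula
over the stored records exists.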
The best answer here is to rely on human expertise combined with sound analytic
principles. This underscores the need for a human to be integrated in
the cycle of data mining technologies. A data mining system can rely on
computational power to discover patterns hidden in large amounts of data.
These data mining results are then subject to interpretation by the domain
expert. For example, an NBA coach using Advanced Scout software may draw upon
his substantial expertise about the game to understand the causal nature of a
surprising pattern. Quite often, the factors that turn out to be most
important are ones that are not recorded in the data itself and are not
derivable via computation. A hypothesis about a causal
factor can then be further investigated via rigorous hypothesis testing to
best account for the finding.
In terms of decision-making, it will often be the case that critical
information will be missing. A computationally powerful data mining system,
coupled with human expertise at making the correct inferences, often represents
the optimal solution for decision making in the face of uncertain and missing
information.
Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible
for developing analytical methods for data mining and for investigating human
factors and usability issues of business intelligence systems. At present, he
is in the final stage of completing a doctoral dissertation in the Cognition
and Perception program at New York University's Department of Psychology. Ed
has also worked for IBM Research at the T.J. Watson Research Center. At IBM,
Ed was a member of the group that developed Advanced Scout, the data mining
application for NBA teams. His research interests focus on statistical methods
and human factors.
For more information, see www.virtualgold.com