WHEN WHAT IS MISSING IS WHAT IS INTERESTING
by Ed Colet
Data analysis is concerned with the discovery and examination of trends and
patterns that are present in the data. There are various ways to achieve this
objective, but all share the fundamental notion that patterns to be examined
are present in the data. In this column, I discuss the notion that what is
not in the data can be interesting -- and in certain situations more useful to
know. But what does it mean to ask an end user or data analyst to examine a
pattern that is absent? How can what is missing in the data be brought to the
analyst's attention? And does this notion go against current approaches to
data analysis?
Going against prevailing methods and approaches used for data analysis is
not new for those of us involved in data mining. For example, data mining's
fundamental approach was initially anathema to traditional database querying
(and conventional statistical analysis), but now has been recognized to be
beneficial and useful. In database querying the user/analyst has to know (a
priori or ahead of time) what questions to ask, and answers to these queries
are retrieved. In contrast to this, data mining expands the scope of inquiry
by finding answers to questions that the user/analyst did not know to ask,
but whose answers are interesting enough that they should be brought to the
user's attention. The famous historical example is the association between
sales of diapers and sales of beer (were it not for data mining, who would
have known to ask about this?). Current data mining algorithms and approaches
are designed to discover interesting patterns buried within data.
A step beyond the current approach in data mining is to alert the
user/analyst to trends and patterns that should be in the data but are not,
without requiring the user to ask for them a priori. Currently this is not
widely implemented, perhaps because there is already too much output
presented to the user for examination and interpretation. Reporting what is
not in the data in addition to what is in the data may overburden the user's
task of making sense of it all.
Current approaches to data analysis also put little emphasis on a
data-analytic approach in which the absence of an outcome is seen as a valid
objective (e.g., an academic paper whose experiments failed to show an effect
is not likely to be published). In general, data-analytic approaches use one
of the following methods:
(1) Predict the absence of a pattern, and flag the presence of it. This is
the current approach of null hypothesis testing. Although one is truly
interested in the presence of an "effect," one has to assume and test the
null hypothesis that there is no effect. Only if there is enough evidence to
reject the null hypothesis does one conclude that an effect is present.
(2) Predict the absence of a pattern, and flag the absence of a result. This
is merely the outcome of a null hypothesis being retained.
(3) Predict the presence of a pattern, and flag the discovery of it. This is
the approach of "confirmatory testing," and it reflects the somewhat flawed
way in which people naturally reason.
(4) Predict the presence of a pattern, and flag the absence or lack of it. It
is this last approach that is not used as much as it possibly should be,
although it is used to a limited extent to test the claims or predictions of
a specific model (i.e., theory predicts X, test for X). But in general, when
predicted outcomes do not occur, the result is not deemed very "interesting".
An example of this last approach, identifying patterns that are missing from
the data and drawing attention to that fact, is the data mining system used
by the California Tax Board. The state of California apparently has as many
as half a million residents who fail to file state income taxes. Through the
use of data mining, the California Tax Board can detect the fraudulent
practice of not filing taxes. By examining historical data (past tax returns)
as well as third-party data (federal W-2 forms), the state can determine who
should file a return, and then identify those who do not file. The use of
historical and third-party data makes it relatively easy to determine
expected trends and patterns -- and then detect their absence.
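The idea of deriving who should appear from historical and third-party data,
and then reporting who does not, can be sketched as a simple set difference.
This is only an illustrative sketch, not a description of the tax board's
actual system: the function name, the taxpayer IDs, and the three input lists
are all hypothetical.

```python
def find_missing_filers(prior_filers, w2_recipients, current_filers):
    """Return IDs that are expected to file but are absent from current filings.

    The expected population is assumed to be anyone who filed in the past
    or who appears in third-party wage data (hypothetical inputs).
    """
    expected = set(prior_filers) | set(w2_recipients)
    return sorted(expected - set(current_filers))


# Made-up IDs for illustration only.
prior_filers = ["A100", "A101", "A102"]    # historical data
w2_recipients = ["A102", "A103"]           # third-party data
current_filers = ["A100", "A103"]          # returns actually received

missing = find_missing_filers(prior_filers, w2_recipients, current_filers)
print(missing)
```

Note that the interesting output here is exactly the set of records that do
not exist in the current data; nothing in the current filings alone could
have produced it.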
As such, the objective is to identify the absence or lack of a pattern in
data -- and it is this absence that is marked as truly interesting. Because
most data mining algorithms already involve the computation of expected
frequencies or associations, it is possible for current algorithms to also
output cases in which an expected pattern is absent (or occurs much less
often than expected). This approach may prove to be especially valuable.
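As a rough illustration of reusing expected frequencies to surface missing
patterns, the sketch below compares the observed co-occurrence of each pair
of items with the count expected if the two items were independent, and
reports pairs that occur far less often than expected. The independence
baseline, the ratio called "lift" here, the thresholds, and the sample
baskets are all assumptions made for this example, not a method taken from
any particular data mining product.

```python
from collections import Counter
from itertools import combinations


def underrepresented_pairs(transactions, max_lift=0.5, min_expected=1.0):
    """Flag item pairs whose observed count is far below the expected count.

    Expected count assumes independence: count(a) * count(b) / n.
    Pairs with too small an expected count are skipped as uninformative.
    """
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    results = []
    for a, b in combinations(sorted(item_counts), 2):
        expected = item_counts[a] * item_counts[b] / n
        if expected < min_expected:
            continue
        observed = pair_counts[(a, b)]  # Counter returns 0 for unseen pairs
        lift = observed / expected
        if lift <= max_lift:
            results.append((a, b, observed, expected, lift))
    # Most strongly under-represented pairs first.
    return sorted(results, key=lambda r: r[4])


# Made-up baskets for illustration only.
transactions = [
    ["coffee", "milk"], ["coffee", "milk"], ["coffee", "sugar"],
    ["coffee", "milk"], ["tea", "milk"], ["tea", "sugar"],
    ["tea", "milk"], ["tea", "sugar"],
]

for a, b, observed, expected, lift in underrepresented_pairs(transactions):
    print(f"{a} + {b}: observed {observed}, expected {expected:.2f}")
```

The same counts that a typical association algorithm computes to find
frequent pairs are used here in the opposite direction, which is why little
extra work is needed to report what is missing.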
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com.