
WHEN WHAT IS MISSING IS WHAT IS INTERESTING
by Ed Colet

Data analysis is concerned with the discovery and examination of trends and patterns in data. There are various ways to achieve this objective, but all share the fundamental notion that the patterns to be examined are present in the data. In this column, I discuss the notion that what is not in the data can be interesting -- and, in certain situations, more useful to know. But what does it mean to ask an end user/data analyst to examine a pattern that is absent? How can what is missing in the data be brought to the analyst's attention? And does this notion go against current approaches to data analysis?

Going against prevailing methods and approaches to data analysis is not new for those of us involved in data mining. For example, data mining's fundamental approach was initially anathema to traditional database querying (and conventional statistical analysis), but it is now recognized as beneficial and useful. In database querying, the user/analyst has to know ahead of time (a priori) what questions to ask, and the answers to those queries are retrieved. In contrast, data mining expands the scope of inquiry by finding answers to questions that the user/analyst did not know to ask, but whose answers are interesting enough that they should be brought to the user's attention. The famous historical example is the association between sales of diapers and sales of beer (were it not for data mining, who would have known to ask about this?). Current data mining algorithms and approaches are designed to discover interesting patterns buried within data.

A step beyond the current approach in data mining is to alert the user/analyst to trends and patterns that should be in the data but are not, again without requiring the user to ask for them a priori. Currently this is not widely implemented, perhaps because too much output is already presented to the user for examination and interpretation. Reporting what is not in the data in addition to what is in the data may overburden the user who has to make sense of it all.

Current approaches to data analysis also do not put much emphasis on a data-analytic approach in which the absence of an outcome is seen as a valid objective (e.g., an academic paper whose experiments failed to show an effect is not likely to be published). In general, data-analytic approaches use one of the following methods:

(1) Predict the absence of a pattern, and flag the presence of it. This is the current approach of null hypothesis testing. Although one is truly interested in the presence of an "effect", one has to assume and test the null hypothesis that there is no effect. Only if there is enough evidence to reject the null hypothesis does one conclude that an effect is present.

(2) Predict the absence of a pattern, and flag the absence of a result. This is merely the outcome of a null hypothesis being retained.

(3) Predict the presence of a pattern, and flag the discovery of it. This is the approach of "confirmatory testing", and it mirrors the somewhat flawed way in which people naturally reason.

(4) Predict the presence of a pattern, and flag the absence or lack of it.

It is this last approach that is not used as much as it possibly should be -- although it is used to a limited extent to test the claims or predictions of a specific model (e.g., theory predicts X; test for X). But in general, when predicted outcomes do not occur, the result is not deemed very "interesting".

One example of this latter approach, identifying patterns that are missing in the data and drawing attention to that fact, is the data mining system used by the California Tax Board. The state of California apparently has as many as half a million residents who fail to file state income tax returns. Through the use of data mining, the California Tax Board can detect the fraudulent practice of not filing taxes. By examining historical data (past tax returns) as well as third-party data (federal W-2 forms), the state can determine who should file a return, and then identify those who don't. The use of historical and third-party data makes it relatively easy to determine expected trends and patterns -- and then to detect their absence. As such, the objective is to identify the absence or lack of a pattern in the data -- and it is this absence that is marked as truly interesting. Because most data mining algorithms already compute expected frequencies or associations, current algorithms could also output cases in which an expected pattern is missing (or occurs much less often than expected). This approach may prove to be especially valuable.
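
The idea of predicting a pattern and then flagging its absence can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of the California system: the function names, the identifiers, the independence baseline, and the `max_ratio` and `min_expected` thresholds are all hypothetical choices made for this example. The first function lists people who were expected to file but did not; the second flags item pairs that occur together much less often than their individual frequencies would predict.

```python
from collections import Counter
from itertools import combinations


def missing_filers(expected_filers, actual_filers):
    """Return IDs that were expected to file but have no return on record.

    `expected_filers` is assumed to come from historical or third-party
    data; `actual_filers` from the returns actually received.
    """
    return sorted(set(expected_filers) - set(actual_filers))


def underrepresented_pairs(transactions, max_ratio=0.5, min_expected=1.0):
    """Flag item pairs seen together far less often than expected.

    The expected co-occurrence count assumes the two items are independent:
    count(a) * count(b) / number_of_transactions. A pair is flagged when its
    observed count is at most `max_ratio` times that expectation. Both
    thresholds are illustrative assumptions.
    """
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )

    flagged = []
    for a, b in combinations(sorted(item_counts), 2):
        expected = item_counts[a] * item_counts[b] / n
        observed = pair_counts[(a, b)]  # Counter gives 0 for unseen pairs
        if expected >= min_expected and observed <= max_ratio * expected:
            flagged.append((a, b, observed, expected))
    return flagged
```

For instance, the expected filers could be the union of last year's filers and the people named on third-party wage forms, passed to `missing_filers` together with this year's filers. The second function reports pairs that never occur at all, which a conventional frequent-pattern search would silently omit.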


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com.
