A SUMMARY OF SUMMARY METRICS
By Ed Colet
Data mining is a technology for the automated detection of meaningful patterns hidden in large amounts of data. Due to the sheer volume of data, it's often not practical to operate at a level that considers each and every underlying record in the data set. Instead, various summary measures or aggregations are pre-computed and the data mining algorithms then operate against these summary measures. But the particular way a summary measure is defined can be important in the subsequent analysis, and the subsequent interpretation of results. Use an inappropriate one and the results can be misleading. But given the prevalence of automated pre-computing of data, how can one ensure that appropriate summary measures are used so that valid interpretations follow?
Summary measures are important for at least two reasons. One reason is that they reduce the amount of data to be stored and analyzed and so subsequent system processing times improve. An example of this is OLAP (Online Analytical Processing), which makes it possible to query large data sets relatively quickly. A second reason for summary measures is that they can clean up unwanted noise represented as random variations in the data. Think of it as a type of smoothing operation.
Given a set of data, there are a myriad of ways to summarize the data so that it can be represented as a single number. But technically, a summary measure is a "good one" only if it has certain statistical properties that show it to be unbiased, consistent, sufficient, and relatively efficient. Common summary measures include the mode, the median, and the mean (average). Of course, based on the nature of the data, some are better than the others.
The mode is defined as the most frequently occurring value in the data set. If the data are plotted as a frequency distribution, the mode is the peak of the distribution. But if the values are grouped into class intervals and the frequency of each class interval is plotted, then the mode is the most frequent class interval. This can be problematic because the mode then depends on the number and size of the defined intervals. A bimodal distribution is also problematic, because a single reported mode hides the second peak. The mode is therefore best reserved for simple summaries and for truly categorical data sets.
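The sensitivity of the mode to the choice of class intervals can be sketched in a few lines of Python. This is only an illustration: the function name modal_interval and the data values are invented for this example, and intervals are assumed to be half-open, [low, high).

```python
from collections import Counter

def modal_interval(values, width):
    """Return (low, high) for the class interval of the given width holding the most values."""
    # Assign each value to the interval [k*width, (k+1)*width) and count per interval.
    counts = Counter((v // width) * width for v in values)
    # Highest count wins; ties go to the lowest interval so the result is deterministic.
    low, _ = max(counts.items(), key=lambda item: (item[1], -item[0]))
    return low, low + width

# Invented values, chosen only to illustrate the effect.
values = [1, 2, 8, 9, 10, 11, 12, 18, 19, 20, 21, 22, 23]

wide = modal_interval(values, 10)   # intervals of width 10
narrow = modal_interval(values, 5)  # intervals of width 5
print(wide, narrow)
```

With intervals of width 10 the modal interval is 10 to 20, but with intervals of width 5 it moves to 20 to 25, even though the underlying values are the same.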
If the values in a data set are rank ordered, then the median is defined as the value that represents the middle case. If there is an even number of cases, the data won't have an observed middle value, and so the median is the value halfway between the two middle cases. The median is also defined as the point at or below which exactly 50 percent of cases fall. It cannot be used for purely categorical (nominal) data sets because unordered categories can't be ranked, but it is commonly used for ordinal or interval data (data that can be ordered and/or placed in meaningful intervals).
The mean is the most common summary measure for numerical data and is computed as the simple arithmetic average (variations can be weighted means, harmonic means, and geometric means).
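As a minimal sketch of the definitions above, the following Python computes the three measures for a small made-up data set. The median function is written out by hand to mirror the odd and even cases described earlier; the data values are invented for illustration, and the standard library's statistics module is used for the mean and mode.

```python
import statistics

def median(values):
    """Middle value of the rank-ordered data; halfway between the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: an observed middle case exists
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: halfway between the two middle cases

data = [2, 4, 4, 6, 9]  # invented values

print("mode:  ", statistics.mode(data))   # most frequently occurring value
print("median:", median(data))            # middle case of the ordered data
print("mean:  ", statistics.mean(data))   # simple arithmetic average
print("median of an even-sized set:", median([2, 4, 6, 9]))
```

For this data the mode and median are both 4 and the mean is 5; for the four-value set the median falls halfway between 4 and 6.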
An analysis of salary data gives a simple example of how these measures differ. Because salaries can vary widely, with extremely high (or low) values, their frequency distributions can be quite skewed. Although a mean is readily computable, the median is usually the most appropriate summary measure in this case. This is because the mean is pulled upward (or downward) by extreme values, whereas the median is largely insensitive to them.
In a data mining analysis of patterns associated with salaries, a hypothetical result may suggest that plastic surgeons in California earn substantially more than plastic surgeons in New York. But this may be the result of a few extremely wealthy doctors in California whose high salaries are based on their treatments of a few celebrities. An analysis using median values instead of means may more accurately reveal that salaries of plastic surgeons in California are not that different from those in New York.
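The effect can be reproduced with a few hypothetical numbers. The salary figures below (in thousands of dollars) are invented purely for illustration and do not come from any real survey; the sketch assumes one extreme earner in the California sample.

```python
import statistics

# Hypothetical annual salaries in thousands of dollars (invented for illustration).
california = [200, 210, 220, 230, 5000]  # includes one extreme "celebrity practice" value
new_york = [195, 205, 215, 225, 235]

for state, salaries in (("California", california), ("New York", new_york)):
    mean = statistics.mean(salaries)
    median = statistics.median(salaries)
    print(f"{state}: mean = {mean}, median = {median}")
```

With these numbers the means are 1172 and 215, which suggests a very large gap between the two states, while the medians are 220 and 215, which suggests almost none. A single extreme value is enough to produce the difference.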
Thus it is important to know how any summary values that underlie a data mining analysis were determined. Ideally, users can apply their domain expertise, explore the characteristics of the data with the aid of visualization tools, and then conduct analyses using the appropriate underlying measures. Truly automated processing, in which no domain expert or end user can configure certain parameters, can be dangerously misleading.
---
Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.
For more information, see http://www.virtualgold.com.