A FEW MOMENTS TO BETTER UNDERSTAND YOUR DATA
by Ed Colet

Describing trends and patterns in data can be difficult, especially when the datasets are large and the analysts or end users are not familiar with the data. Data mining addresses this situation by applying computing power and algorithms that explore the entire dataset and output summaries describing its trends. Many of these algorithms implement techniques developed in statistics. In this column, I discuss four statistical measures called "moments" that concisely describe the essential characteristics of a large dataset. The first two moments are already common in data mining tools, and perhaps the latter two will become more common as well.

The mean or average is often called the "first moment" of a distribution, and most of us are familiar with its straightforward computation. Given a group of data scores or observations, the mean is the sum of the scores divided by the number of scores. The mean gives a single number that represents the central trend in the data; in essence, one number summarizes an entire dataset. For example, imagine that a company has thousands of data records on the duration of phone calls. By computing an average duration (e.g. 4 minutes) the company has a summary of the data. But the mean alone is only partially useful for understanding the data. Knowing that the average call is 4 minutes doesn't help one know what to make of other calls, such as those that are 2 minutes or 7 minutes long. To draw conclusions about calls that are shorter or longer than the average, one needs a sense of how much variation or spread there is in the data, so that calls that differ from the average have a context in which they can be evaluated.
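As a minimal sketch of this computation, the Python snippet below averages two small sets of call durations. Both lists are made-up values chosen for illustration, not data from any real company.

```python
from statistics import fmean

# Hypothetical call durations in minutes (illustrative values only).
durations = [2.0, 3.5, 4.0, 4.5, 6.0]

# First moment: the sum of the scores divided by the number of scores.
mean_duration = fmean(durations)

# A second, equally hypothetical dataset with the same mean but far less spread.
tight_durations = [3.9, 4.0, 4.0, 4.0, 4.1]
tight_mean = fmean(tight_durations)

print(mean_duration, tight_mean)
```

Both lists average about 4 minutes, yet a 6-minute call appears in the first list and would be far outside anything in the second. The mean by itself cannot tell the two situations apart.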

The spread of values in the data is measured by the variance or the standard deviation (the square root of the variance). The variance is referred to as the "second moment about the mean". Knowing that calls average 4 minutes and that the standard deviation is 30 seconds tells us more about the data than merely knowing the mean. If the data are normally distributed, one can conclude that approximately 95% of all calls were between 3 and 5 minutes long (the mean plus and minus 2 standard deviations). A call that is 6 minutes long (4 standard deviations away from the mean) is therefore a rare event in the data. By the same token, if the standard deviation turned out to be 3 minutes, the data values are more variable: a 6-minute call is less than 1 standard deviation from the mean, and therefore not that rare in the data.
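The arithmetic in this example can be checked with a short Python sketch. It assumes, as the text does, that call durations are normally distributed; the helper `z_score` is a name introduced here for illustration.

```python
from statistics import NormalDist

mean = 4.0  # average call length in minutes, as in the example


def z_score(duration, mean, sd):
    """Number of standard deviations a call lies from the mean."""
    return (duration - mean) / sd


# Case 1: standard deviation of 30 seconds (0.5 minutes).
calls = NormalDist(mu=mean, sigma=0.5)
# Share of calls between 3 and 5 minutes, i.e. within 2 standard deviations.
share_3_to_5 = calls.cdf(5.0) - calls.cdf(3.0)
z_tight = z_score(6.0, mean, 0.5)

# Case 2: standard deviation of 3 minutes.
z_wide = z_score(6.0, mean, 3.0)

print(round(share_3_to_5, 3), z_tight, round(z_wide, 2))
```

Under the normality assumption, roughly 95% of calls fall between 3 and 5 minutes. The same 6-minute call is 4 standard deviations out in the first case and well under 1 in the second.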

By knowing what the first and second moments are, one has two numbers that describe the central tendency in the data as well as the variability of values. Most data mining tools can readily compute these two statistical measures. But it is rare to find data mining tools that readily output higher moments of a distribution such as the third and fourth moments.

The "third moment about the mean" measures the skewness of the distribution, or its degree of asymmetry. (In practice the third moment is usually divided by the cube of the standard deviation, so that the skewness value does not depend on the units of the data.) If a distribution has most values in a peak located on the left side, with a long and heavy tail spreading out towards the right, then it is positively skewed. Negatively skewed distributions have long tails pointing in the opposite direction. Knowing a value for skewness provides more information about one's data. For example, if our hypothetical phone call data were not normally distributed but positively skewed, one would know that even though the mean is 4 minutes and the standard deviation is 30 seconds, there are calls that were much longer than the 4-minute average. These calls show up in the right tail of the distribution. In fact, because a call cannot be less than 0 minutes long but can in principle last any length of time, one would expect a distribution of phone call durations to be positively skewed in most cases.
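A minimal Python sketch of a skewness calculation follows. It uses the common standardized form (the third moment about the mean divided by the cube of the standard deviation) with population formulas; some statistical packages apply a sample-size correction, so their values can differ slightly. The function names and the data are hypothetical.

```python
from statistics import fmean


def central_moment(data, k):
    """k-th moment about the mean: the average of (x - mean) ** k."""
    mu = fmean(data)
    return fmean([(x - mu) ** k for x in data])


def skewness(data):
    """Third central moment divided by the cube of the standard deviation."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


# Hypothetical call durations in minutes: many short calls, a few very long ones.
skewed_calls = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]
# A symmetric set of values for comparison.
symmetric_calls = [1, 2, 3, 4, 5]

print(skewness(skewed_calls))     # positive: long tail to the right
print(skewness(symmetric_calls))  # zero: no asymmetry
```

Reversing the sign of every value in `skewed_calls` would flip the tail to the left and make the result negative.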

The "fourth moment about the mean" measures the kurtosis, or the degree of "peakedness" (or "pointedness") of a distribution relative to a normal distribution. A distribution with a relatively high peak has a high kurtosis value and is called leptokurtic. Flat-topped distributions have low kurtosis values and are referred to as platykurtic. The normal distribution is mesokurtic (neither peaked nor flat-topped). If the distribution is symmetrical, the kurtosis value gives one a sense of how tightly values cluster at or close to the average value.
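The sketch below computes kurtosis in Python as the fourth moment about the mean divided by the square of the variance, a common standardized form under which a normal distribution scores 3. Many tools report "excess kurtosis" instead, which subtracts 3 so that the normal distribution scores 0. The function names and both datasets are hypothetical.

```python
from statistics import fmean


def central_moment(data, k):
    """k-th moment about the mean: the average of (x - mean) ** k."""
    mu = fmean(data)
    return fmean([(x - mu) ** k for x in data])


def kurtosis(data):
    """Fourth central moment divided by the squared variance (normal = 3)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


# Hypothetical peaked data: most values sit exactly at the centre.
peaked = [5] * 8 + [0, 10]
# Hypothetical flat-topped data: values spread evenly from 1 to 10.
flat = list(range(1, 11))

print(kurtosis(peaked), kurtosis(flat))
```

On this scale the peaked dataset scores above 3 (leptokurtic) and the flat one scores below 3 (platykurtic).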

An alternative way to fully describe the distribution is via a mathematical function or equation. But if the general mathematical function that describes the distribution is unknown, or the descriptive equation is difficult to state, then knowing the entire set of moments can in many cases determine the characteristics of the distribution exactly, without having to formulate an equation. The first four moments (mean, variance, skewness and kurtosis) specify the central trend, the spread of values, the degree of asymmetry and the peakedness of the data values comprising the distribution. Thus, with only a few moments, one can go from several thousand data points to a set of only 4 values and arrive at a firm understanding of the essential characteristics of the data.
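To illustrate the reduction from thousands of records to four numbers, here is a small Python sketch. The call durations are simulated from an exponential distribution with a 4-minute mean, which is purely an assumption made to get a positively skewed dataset; `describe` and `Moments` are names invented for this example, and population formulas are used throughout.

```python
import random
from collections import namedtuple
from statistics import fmean

Moments = namedtuple("Moments", "mean variance skewness kurtosis")


def describe(data):
    """Reduce a dataset to its first four moments (population formulas)."""
    mu = fmean(data)
    # Second, third and fourth moments about the mean.
    m2, m3, m4 = (fmean([(x - mu) ** k for x in data]) for k in (2, 3, 4))
    # Skewness and kurtosis are standardized by powers of the variance.
    return Moments(mu, m2, m3 / m2 ** 1.5, m4 / m2 ** 2)


# Simulated stand-in for thousands of call records (not real data).
rng = random.Random(0)
durations = [rng.expovariate(1 / 4.0) for _ in range(5000)]

summary = describe(durations)
print(summary)
```

The four values in `summary` stand in for all 5000 simulated records: a mean near 4 minutes, a positive skewness reflecting the long right tail, and a kurtosis above the normal distribution's value of 3.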


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com
