A FEW MOMENTS TO BETTER UNDERSTAND YOUR DATA
by Ed Colet
Describing trends and patterns in data can be a difficult task, especially if
the datasets are large and the analysts or end users are not familiar with the
data. Data mining addresses this situation by applying computing power
and powerful algorithms to explore the entire dataset comprehensively and
output summaries that describe trends. Many algorithms implement techniques
developed from statistics. In this column, I discuss four statistical
measures called "moments" that are useful for concisely describing the
essential characteristics of a large data set. The first two moments are
already common in many data mining tools, and perhaps the latter two moments
will become more common as well.
The mean or average is often called the "first moment" of a distribution and
most of us are familiar with its straightforward computation. Given a group
of data scores or observations, the mean is the sum divided by the number of
scores. The use of the mean is one way to have a single number that best
represents the central trend in the data -- in essence one has a single
number that summarizes an entire dataset. For example, imagine that a
company has thousands of data records on the duration of phone calls. By
computing an average duration (e.g. 4 minutes) the company has a summary of
the data. But knowing the mean alone goes only part of the way toward
understanding the data. Knowing that the average call is 4 minutes doesn't
tell one what to make of other calls -- such as those that
are 2 minutes or 7 minutes long. In order to draw conclusions about
calls that are shorter or longer than the average, one has to have a sense
of how much variation or spread there is in the data, so that calls that
differ from the average have a context in which they can be evaluated.
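To make the point concrete, here is a minimal Python sketch. The two lists of call durations are invented for illustration and the helper name average is an assumption, not something taken from a particular data mining tool. Both lists share the same mean even though the calls in them look very different.

```python
def average(scores):
    """Mean: the sum of the scores divided by the number of scores."""
    return sum(scores) / len(scores)

# Hypothetical call durations in minutes (invented for illustration).
tight_calls = [3.5, 4.0, 4.0, 4.5, 4.0]    # every call is close to 4 minutes
spread_calls = [0.5, 1.0, 4.0, 6.5, 8.0]   # calls range from 30 seconds to 8 minutes

print(average(tight_calls))   # 4.0
print(average(spread_calls))  # 4.0
```

The mean alone cannot tell these two sets of calls apart, which is why a measure of spread is needed next.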
The spread of values in the data is measured by the variance or the standard
deviation (square root of the variance). The variance is referred to as the
"second moment about the mean". Knowing that calls average 4 minutes and
that the standard deviation is 30 seconds tells us more about the data than
merely knowing the mean. If the data are normally distributed, one can
conclude that approximately 95% of all calls were between 3 and 5 minutes
long (plus and minus 2 standard deviations). A call that is 6 minutes long
(4 standard deviations away from the mean) is therefore a rare event in the
data. By the same token, if the standard deviation turned out to be 3
minutes, the data values would be more variable, and a 6-minute
call would be less than 1 standard deviation away, and therefore not that
rare in the data.
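The arithmetic in this example can be checked with a short Python sketch. It assumes the call durations follow a normal distribution, as the paragraph above does, and uses NormalDist from Python's standard statistics module. The helper z_score is a name introduced here for illustration.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """How many standard deviations a value lies from the mean."""
    return (value - mean) / sd

mean_minutes = 4.0

# With a standard deviation of 30 seconds (0.5 minutes), a 6-minute call is rare.
print(z_score(6.0, mean_minutes, 0.5))            # 4.0

# With a standard deviation of 3 minutes, the same call is unremarkable.
print(round(z_score(6.0, mean_minutes, 3.0), 2))  # 0.67

# Share of calls between 3 and 5 minutes, assuming a normal distribution.
calls = NormalDist(mu=4.0, sigma=0.5)
share = calls.cdf(5.0) - calls.cdf(3.0)
print(round(share, 4))                            # 0.9545
```

The same 6-minute call is either 4 standard deviations or about two thirds of a standard deviation from the mean, depending only on the spread of the data.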
By knowing what the first and second moments are, one has two numbers that
describe the central tendency in the data as well as the variability of
values. Most data mining tools can readily compute these two statistical
measures. But it is rare to find data mining tools that readily output
higher moments of a distribution such as the third and fourth moments.
The "third moment about the mean" measures the skewness of the distribution,
or its degree of asymmetry. (The skewness statistic is usually reported as this
moment divided by the cube of the standard deviation.) If a distribution has
most values in a peak located on the left side with a long and heavy tail
spreading out towards the right, then it is positively skewed. (Negatively
skewed distributions have long tails pointing in the opposite direction.)
Knowing a value for skewness
provides one with more information about one's data. For example, if our
hypothetical phone call data were not normally distributed but positively
skewed, one would know that even though the mean is 4 minutes and the standard
deviation is 30 seconds, there must be calls that were much longer than the
4-minute average. These calls show up in the right tail of the distribution. In
fact, because calls cannot be shorter than 0 minutes but can run to almost any
length, one would expect a distribution of call durations to be
positively skewed in most cases.
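As a rough illustration, skewness can be computed as the third moment about the mean divided by the cube of the standard deviation. The sketch below uses that common population-style formula; individual tools may apply small-sample corrections, so treat it as one reasonable convention rather than the definitive one. The data values and the function name skewness are invented for this example.

```python
from statistics import mean, median

def skewness(data):
    """Third moment about the mean, scaled by the standard deviation cubed."""
    m = mean(data)
    n = len(data)
    m2 = sum((x - m) ** 2 for x in data) / n   # second moment (variance)
    m3 = sum((x - m) ** 3 for x in data) / n   # third moment
    return m3 / m2 ** 1.5

# Hypothetical call durations in minutes; both lists have a mean of 4.
symmetric_calls = [2.0, 3.0, 4.0, 5.0, 6.0]
skewed_calls = [1.0, 2.0, 2.0, 3.0, 12.0]      # one very long call

print(skewness(symmetric_calls))               # 0.0
print(skewness(skewed_calls))                  # positive, roughly 1.4
print(mean(skewed_calls), median(skewed_calls))  # 4.0 2.0
```

In the skewed list the single 12-minute call pulls the mean well above the median, which is the signature of a long right tail.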
The "fourth moment about the mean" measures the kurtosis, or the degree of
"peakedness" (or "pointedness") of a distribution relative to a symmetric
normal distribution. If the distribution has a relatively high peak, it will
have a high kurtosis value and is called leptokurtic. Flat-topped distributions
have low kurtosis values and are referred to as platykurtic. The normal
distribution is mesokurtic (neither peaked nor flat-topped). If the
distribution is symmetrical, knowing the kurtosis value gives one a sense of
how tightly values cluster at or close to the average value.
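A small sketch may help here as well. It computes kurtosis as the fourth moment about the mean divided by the squared variance, a convention under which a normal distribution scores 3. Some tools report "excess" kurtosis instead, which subtracts 3 so that the normal distribution scores 0. The two lists and the function name kurtosis are invented for illustration.

```python
def kurtosis(data):
    """Fourth moment about the mean, scaled by the squared variance."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n   # second moment (variance)
    m4 = sum((x - m) ** 4 for x in data) / n   # fourth moment
    return m4 / m2 ** 2

# Hypothetical call durations in minutes; both lists are symmetric around 4.
flat_calls = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]     # evenly spread, flat-topped
peaked_calls = [1.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0]   # most calls sit on the mean

print(kurtosis(flat_calls))                # 1.75 (below 3: platykurtic)
print(round(kurtosis(peaked_calls), 2))    # 3.5  (above 3: leptokurtic)
```

Both lists have the same mean and the same range, yet the kurtosis value separates the flat one from the peaked one.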
An alternative way to fully describe the distribution is via a mathematical
function or equation. But if the general mathematical function to describe
the distribution is unknown, and/or the descriptive equation is difficult to
state, then knowing the entire set of moments can, in many cases, determine the
characteristics of the distribution without having to formulate an
equation. Knowing the first four moments (mean, variance,
skewness and kurtosis) specifies the central trend, the spread of values,
the degree of asymmetry and the peakedness of the data values comprising the
distribution. Thus, with only a few moments, one can go from several
thousand data points to a set of only 4 values and arrive at a firm
understanding of the essential characteristics of the data.
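The reduction from thousands of records to four numbers might look like the following sketch. The 5,000 durations are synthetic: they are placed at evenly spaced quantiles of an exponential distribution with a 4-minute mean, a shape chosen here only because it is positively skewed, not because real call data must follow it. The function names central_moment and four_moment_summary are likewise assumptions made for this example.

```python
import math

def central_moment(data, k):
    """k-th moment about the mean."""
    n = len(data)
    m = sum(data) / n
    return sum((x - m) ** k for x in data) / n

def four_moment_summary(data):
    """Return (mean, variance, skewness, kurtosis) for a list of numbers."""
    mean = sum(data) / len(data)
    variance = central_moment(data, 2)
    skewness = central_moment(data, 3) / variance ** 1.5
    kurtosis = central_moment(data, 4) / variance ** 2
    return mean, variance, skewness, kurtosis

# Synthetic durations in minutes (an assumption, for illustration only):
# evenly spaced quantiles of an exponential distribution with mean 4.
n = 5000
durations = [-4.0 * math.log(1.0 - (i + 0.5) / n) for i in range(n)]

summary = four_moment_summary(durations)
print([round(value, 2) for value in summary])
```

For data shaped like this, the summary should show a mean near 4 minutes, a positive skewness and a kurtosis above 3, which together describe a distribution with many short calls and a long tail of lengthy ones.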
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com