THE (IN)SIGNIFICANCE OF STATISTICAL SIGNIFICANCE
by Ed Colet
Many analytical techniques in data mining applications are based on formal
statistical analysis. One of the formal aspects is the notion of
statistical significance of discovered patterns. In this column, I discuss
the notion of statistical significance. Although statistical significance
may imply a result based purely on formal and objective parameters, it can
quite easily be influenced by purely subjective judgments about two
parameters -- alpha levels and the articulation of a null hypothesis.
A statistically significant result means that the value of an observed
result falls outside a range of values that one would expect if the result
were simply due to chance. For example, assume that an e-commerce retailer
sees that male shoppers spend $27 more on average than female shoppers while
shopping online via the Web site. Is this $27 difference simply due to chance, or
indicative of an important trend? Statistical significance can be measured
for a variety of test statistics -- based on a numerical difference as in
this example, or other measures such as a correlation coefficient. The
bottom line is whether the value of the test statistic could plausibly have
occurred by chance. If that is unlikely, then it's statistically significant.
In terms of probabilities (p-values), a statistically significant result
means that the probability of getting a result at least as extreme as the
one observed, if chance alone were operating, is less than a reference
probability value used to establish statistical significance.
This reference value to establish statistical significance is also known as
an alpha-level. By convention, one usually uses an alpha level of 10%, 5%
or 1%. The choice of alpha-level reflects the degree of "risk of being
wrong" that one is willing to accept. Theoretically speaking, an alpha of
5% means that if one were to hypothetically perform a test 100 times in
situations where only chance is operating, one would reach the wrong
conclusion in about 5 of these tests. Arriving at the wrong conclusion in
this context means that one has concluded that the results are not due to
chance, when in fact they were.
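
To make the "about 5 in 100" idea concrete, here is a minimal Python sketch
that simulates many tests in which chance alone is operating and counts how
often a test is wrongly declared significant. The group size, average spend,
standard deviation and function names are illustrative assumptions, not
figures from the retailer example.

```python
import random
from statistics import NormalDist

def two_tailed_p_value(z):
    """Two-tailed p-value for a z statistic under a standard normal null."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(alpha, trials=2000, n=30, sigma=40.0, seed=0):
    """Fraction of simulated tests that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    std_error = sigma * (2 / n) ** 0.5  # std. error of a difference of two means
    rejections = 0
    for _ in range(trials):
        # Both groups share the same true mean (assumed $100), so any
        # observed difference between them is due to chance alone.
        males = [rng.gauss(100.0, sigma) for _ in range(n)]
        females = [rng.gauss(100.0, sigma) for _ in range(n)]
        z = (sum(males) / n - sum(females) / n) / std_error
        if two_tailed_p_value(z) < alpha:
            rejections += 1
    return rejections / trials

rate_05 = false_positive_rate(0.05)
rate_10 = false_positive_rate(0.10)
print(f"alpha = 0.05: wrong conclusion in {rate_05:.1%} of tests")
print(f"alpha = 0.10: wrong conclusion in {rate_10:.1%} of tests")
```

Under these assumptions the two rates should land near 5% and 10%; the
exact figures depend on the random seed and the number of trials.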
Although convention has dictated the use of alpha levels of 10%, 5% and 1%
as typical reference values for statistical significance, one is certainly
free to establish any reference value one wishes to use. One could, for
example, decide to use a reference value of 20% or 30%, and it would be an
acceptable reference value if there were strong justifications for this
decision (although for most purposes, justifying such a high alpha level
may be difficult).
A high alpha-level is referred to as a more liberal test because it's
"easier" to reach statistical significance. But the possibility of making a
Type I error (concluding effects aren't due to chance when they really are)
is also higher. A low alpha-level reflects a more conservative test because
it's more difficult to reach statistical significance. So to minimize
error, should one decide to routinely use a low alpha level? The answer is
"No". The reason is that while selecting a lower alpha value lowers the
risk of a Type I error, it raises the risk of a Type II error. A Type II
error is concluding that the results are due to chance when they really
are not. For a given sample size, decreasing the risk of one type of error
raises the risk of the other. The nature of the circumstances, and the
tradeoffs between Type I
and Type II errors dictate what significance level to use -- and this is
really a subjective judgment on the part of the analyst.
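
The tradeoff can be sketched numerically. The Python snippet below assumes
a two-tailed z-test, a real spending difference of $27, and a standard error
of $12; the standard error and the function name are hypothetical values
chosen only for illustration. It computes the risk of a Type II error at
each of the conventional alpha levels.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def type_ii_error_rate(alpha, true_difference, std_error):
    """Chance of missing a real effect with a two-tailed z-test at this alpha."""
    z_crit = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    # When the effect is real, the z statistic is centered here, not on zero.
    shift = true_difference / std_error
    # Probability that z lands between the two cutoffs, so nothing is rejected.
    return STANDARD_NORMAL.cdf(z_crit - shift) - STANDARD_NORMAL.cdf(-z_crit - shift)

TRUE_DIFFERENCE = 27.0  # assumed real difference, in dollars
STD_ERROR = 12.0        # hypothetical standard error, in dollars

error_rates = {
    alpha: type_ii_error_rate(alpha, TRUE_DIFFERENCE, STD_ERROR)
    for alpha in (0.10, 0.05, 0.01)
}
for alpha, beta in error_rates.items():
    print(f"alpha = {alpha:.2f}: risk of Type II error = {beta:.1%}")
```

With the sample held fixed, each step down in alpha buys a lower Type I
risk at the price of a higher Type II risk.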
So, given an observed result from a particular sample of data, statistical
significance is determined by comparing the likelihood of getting the
observed value simply by chance against the alpha level. If the p-value for
the observed result is less than alpha, then one concludes that one's
observed result is not likely to have been due to chance (although one can
never be certain -- but is willing to risk being wrong), and the result is
called statistically significant.
How does one calculate the probability of getting the observed result?
Computing the p-value for one's observed result and eventually arriving at a
conclusion is based on the following logic. First one specifies the
characteristics of the sampling distribution that corresponds to a null
hypothesis. The typical null hypothesis is the claim of "no effect," that
is, that only chance is operating. In our example of spending differences
between males
and females, the null hypothesis could mean a zero difference between
spending of males vs. females. This sampling distribution would be a normal
(Gaussian) curve centered on a value of zero (no spending difference). For a
two-tailed test, a significance level of 5% means that 2.5% of the area
under the curve in each tail of the distribution represents a "region of
rejection," corresponding to values outside an expected range for chance.
If $27 falls within one of these regions, then the null hypothesis is
rejected, and the observed result is said to be different from chance and
statistically significant.
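
As a rough illustration of the regions of rejection, the Python sketch
below converts a 5% alpha level into dollar cutoffs on a normal sampling
distribution centered on zero. The column gives no measure of spread, so the
$12 standard error (and the function name) is an assumption made purely for
the example.

```python
from statistics import NormalDist

def rejection_cutoffs(null_value, std_error, alpha):
    """Lower and upper cutoffs of a two-tailed rejection region."""
    # Each tail holds alpha / 2 of the area under the curve.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return null_value - z_crit * std_error, null_value + z_crit * std_error

STD_ERROR = 12.0  # hypothetical standard error of the difference, in dollars
observed = 27.0   # observed spending difference from the example

lower, upper = rejection_cutoffs(null_value=0.0, std_error=STD_ERROR, alpha=0.05)
in_rejection_region = observed < lower or observed > upper

print(f"reject the null if the difference is below {lower:.2f} or above {upper:.2f}")
print("statistically significant:", in_rejection_region)
```

With this assumed standard error the cutoffs fall at roughly plus or minus
$23.52, so a $27 difference lies in the upper region of rejection. A larger
standard error would widen the cutoffs and could reverse that conclusion.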
Note that in formulating the null hypothesis, one could just as easily have
decided to frame a null hypothesis in which the sampling distribution is not
centered on zero. As long as this non-zero null hypothesis can be
justified, this is also perfectly acceptable. For example, in our
hypothetical scenario it may be the case that male shoppers are known to
have substantially higher incomes than female shoppers, and products are
marketed to males, so they would be expected to spend about $25 more per
visit than females. A difference of $25 could then serve as the expected
difference under the null hypothesis. The question now is whether a
difference of $27 departs from the expected difference of $25 by more than
chance alone would explain. Note that formulating the specifics of the
null hypothesis is really a subjective judgment on the part of the
analyst -- and obviously influences whether results turn out statistically
significant or not.
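
The effect of this choice is easy to see in a small Python sketch. It
computes a two-tailed p-value for the same observed $27 difference under
each null hypothesis; the $12 standard error and the function name are again
hypothetical, introduced only for illustration.

```python
from statistics import NormalDist

def two_tailed_p_value(observed, null_value, std_error):
    """p-value for an observed difference under a normal sampling distribution."""
    z = (observed - null_value) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

STD_ERROR = 12.0  # hypothetical standard error of the difference, in dollars
ALPHA = 0.05
observed = 27.0

p_zero_null = two_tailed_p_value(observed, null_value=0.0, std_error=STD_ERROR)
p_25_null = two_tailed_p_value(observed, null_value=25.0, std_error=STD_ERROR)

for label, p in (("null = $0", p_zero_null), ("null = $25", p_25_null)):
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{label}: p = {p:.3f} ({verdict} at alpha = {ALPHA})")
```

Under these assumptions the same data are significant against a null of
zero and nowhere near significant against a null of $25.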
So, when a result comes out to be statistically significant, it means that
it is unlikely to have been due to chance. But as we've seen, there are
subjective decisions about the alpha level and subjective judgments in
specifying the null hypothesis. Statistical significance is influenced by
these judgments -- and if these judgments are not sound, then for all
practical purposes statistical significance is insignificant.
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York University's
Department of Psychology. Ed has also worked for IBM Research at the T.J.
Watson Research Center. At IBM, Ed was a member of the group that developed
Advanced Scout, the data mining application for NBA teams. His research
interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com