
THE (IN)SIGNIFICANCE OF STATISTICAL SIGNIFICANCE
by Ed Colet

Many analytical techniques in data mining applications are based on formal statistical analysis, and part of that formal apparatus is the notion of the statistical significance of discovered patterns. In this column, I discuss that notion. Although statistical significance may suggest a result based purely on formal and objective criteria, it can quite easily be influenced by subjective judgments about two parameters -- the alpha level and the articulation of the null hypothesis.

A statistically significant result means that the value of an observed result falls outside the range of values that one would expect if the result were simply due to chance. For example, assume that an e-commerce retailer sees that male shoppers spend $27 more than female shoppers while shopping online via the Web site. Is this $27 difference simply due to chance, or is it indicative of an important trend? Statistical significance can be assessed for a variety of test statistics -- a numerical difference as in this example, or other measures such as a correlation coefficient. The bottom line is whether a value of the test statistic this extreme could plausibly have occurred by chance. If not, then the result is statistically significant.

In terms of probabilities (p-values), a statistically significant result means that the probability of getting a result at least as extreme as the one observed, assuming that only chance is operating, is less than a reference probability value used to establish statistical significance. This reference value is known as the alpha level. By convention, one usually uses an alpha level of 10%, 5% or 1%. The choice of alpha level reflects the degree of "risk of being wrong" that one is willing to accept. Theoretically speaking, an alpha of 5% means that if chance alone were operating and one were to hypothetically perform a test 100 times, one would reach the wrong conclusion in about 5 of these tests. Arriving at the wrong conclusion in this context means concluding that the results are not due to chance when in fact they were.
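The "about 5 in 100" reading of alpha can be checked with a small simulation. The sketch below is only an illustration under assumed numbers: the population mean ($100), standard deviation ($40), sample size (50 shoppers per group) and the function name false_positive_rate are all hypothetical and do not come from any real retailer. Both groups are drawn from the same population, so every "significant" difference is a wrong conclusion.

```python
import random
from statistics import NormalDist


def false_positive_rate(alpha=0.05, trials=2000, n=50, sigma=40.0, seed=0):
    """Fraction of tests that reject a null hypothesis that is actually true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    se = sigma * (2 / n) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(trials):
        # Same population for both groups: any difference is pure chance.
        males = [rng.gauss(100.0, sigma) for _ in range(n)]
        females = [rng.gauss(100.0, sigma) for _ in range(n)]
        z = (sum(males) / n - sum(females) / n) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
print(f"False-positive rate at alpha = 5%: {rate:.3f}")
```

The printed rate should land close to 0.05, varying a little with the seed and the number of trials.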

Although convention has dictated the use of alpha levels of 10%, 5% and 1% as typical reference values for statistical significance, one is certainly free to establish any reference value one wishes to use. One could, for example, decide to use a reference value of 20% or 30%. This would be an acceptable reference value if there were strong justifications for the decision (although for most purposes, justifying such a high alpha level may be difficult).

A high alpha level is referred to as a more liberal test because it's "easier" to reach statistical significance. But the possibility of making a Type I error (concluding effects aren't due to chance when they really are) is also higher. A low alpha level reflects a more conservative test because it's more difficult to reach statistical significance. So to minimize error, should one decide to routinely use a low alpha level? The answer is "No". The reason is that while selecting a lower alpha level lowers the risk of a Type I error, it raises the risk of a Type II error. A Type II error is concluding that the results are due to chance when they really aren't. For a given amount of data, decreasing the risk of one type of error raises the risk of the other. The nature of the circumstances, and the tradeoffs between Type I and Type II errors, dictate what significance level to use -- and this is really a subjective judgment on the part of the analyst.
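The tradeoff can be made concrete for a two-tailed test on a normally distributed difference. The sketch below is illustrative only: it assumes a true spending difference of $27 and a standard error of $12, both hypothetical values chosen for the example, and the helper name type_ii_error is mine.

```python
from statistics import NormalDist


def type_ii_error(alpha, true_diff, se):
    """Chance of missing a real effect with a two-tailed z-test at level alpha."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)  # critical value for this alpha
    shift = true_diff / se  # true effect in standard-error units
    # Power: chance the test statistic lands in either rejection region.
    power = (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)
    return 1 - power


TRUE_DIFF = 27.0  # hypothetical true difference in dollars
SE = 12.0  # hypothetical standard error of the difference

betas = {a: type_ii_error(a, TRUE_DIFF, SE) for a in (0.10, 0.05, 0.01)}
for alpha, beta in betas.items():
    print(f"alpha (Type I risk) = {alpha:.2f} -> Type II risk = {beta:.3f}")
```

Under these assumptions the Type II risk rises as alpha falls, which is the tradeoff described above.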

So, given an observed result from a particular sample of data, statistical significance is determined by comparing the p-value (the probability of getting a value at least as extreme as the observed one simply by chance) against the alpha level. If the p-value for the observed result is less than alpha, then one concludes that the observed result is not likely to have been due to chance (although one can never be certain, one is willing to risk being wrong), and the result is called statistically significant.

How does one calculate the probability of getting the observed result? Computing the p-value for one's observed result and eventually arriving at a conclusion is based on the following logic. First, one specifies the characteristics of the sampling distribution that corresponds to a null hypothesis. The typical null hypothesis is the claim that there is "no effect", or that only chance is operating. In our example of spending differences between males and females, the null hypothesis could be a zero difference between the spending of males and females. The sampling distribution would then typically be modeled as a normal (Gaussian) curve centered on a value of zero (no spending difference). A significance level of 5% means that 5% of the area under the curve, split as 2.5% in each tail of the distribution, represents "regions of rejection" -- corresponding to values outside the expected range for chance. If $27 falls within one of these regions, then the null hypothesis is rejected, and the observed result is said to be different from chance and statistically significant.
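The regions of rejection can be located numerically once a spread is attached to the null distribution. The column gives no standard error for the $27 difference, so the $12 used below is purely an assumed value for illustration; with real data it would be estimated from the sample.

```python
from statistics import NormalDist

ALPHA = 0.05  # significance level
SE = 12.0  # hypothetical standard error of the spending difference
OBSERVED = 27.0  # observed difference in dollars

# Sampling distribution under the null hypothesis of zero difference.
null_dist = NormalDist(mu=0.0, sigma=SE)

# Each tail holds alpha / 2 of the area under the curve.
lower = null_dist.inv_cdf(ALPHA / 2)
upper = null_dist.inv_cdf(1 - ALPHA / 2)

in_rejection_region = OBSERVED < lower or OBSERVED > upper
print(f"Regions of rejection: below {lower:.2f} or above {upper:.2f}")
print(f"Observed {OBSERVED:.2f} rejects the null: {in_rejection_region}")
```

With this assumed standard error the boundaries sit at roughly plus and minus $23.52, so $27 falls in the upper region. A larger standard error would widen the boundaries and could reverse the conclusion.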

Note that in formulating the null hypothesis, one could just as easily have decided to frame a null hypothesis in which the sampling distribution is not centered on zero. As long as this non-zero null hypothesis can be justified, it is also perfectly acceptable. For example, in our hypothetical scenario it may be the case that male shoppers are known to have substantially higher incomes than female shoppers, and products are marketed to males, so they should be spending about $25 more per visit than females. A difference of $25 could then be used as the expected difference, and hence as the null hypothesis. The question now is whether a difference of $27 departs from the expected difference of $25 by more than chance alone would explain. Note that formulating the specifics of the null hypothesis is really a subjective judgment on the part of the analyst -- and it obviously influences whether results turn out statistically significant or not.
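To see how much the choice of null hypothesis matters, one can compute a two-tailed p-value for the same $27 difference against each null. This is a minimal sketch assuming a normal sampling distribution and a standard error of $12; that figure is hypothetical (the column does not supply one), and the helper two_tailed_p_value is introduced here only for illustration.

```python
from statistics import NormalDist


def two_tailed_p_value(observed, null_value, se):
    """Probability, under the null, of a result at least this far from null_value."""
    z = (observed - null_value) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05
SE = 12.0  # hypothetical standard error of the spending difference
OBSERVED = 27.0  # observed difference in dollars

for null_value in (0.0, 25.0):
    p = two_tailed_p_value(OBSERVED, null_value, SE)
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"null = ${null_value:.0f}: p = {p:.3f} ({verdict} at alpha = {ALPHA:.0%})")
```

With these assumed numbers, the same observation is significant at the 5% level against a null of zero but not against a null of $25.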

So, when a result comes out to be statistically significant, it means that it is unlikely to have been due to chance. But as we've seen, there are subjective decisions about the alpha level and subjective judgments in specifying the null hypothesis. Statistical significance is influenced by these judgments -- and if these judgments are not sound, then for all practical purposes statistical significance is insignificant.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com
