
THE USE OF CONFIDENCE INTERVALS TO GO BEYOND SAMPLED DATA
By Ed Colet


A data mining analysis is much more open ended than a conventional statistical test. Rather than starting with a hypothesis, one starts with data and eventually arrives at a conclusion. Along the way, many hypotheses may be entertained, often too many to test in a conventional manner. Data that are mined may have initially been collected for reasons that have nothing to do with data mining, but the data happen to be relevant and available. So when one discovers interesting and surprising patterns as a result of data mining, the real questions to ask are: (a) does this pattern hold beyond this data set, and beyond this sample? and (b) what's the true picture of what might be going on if I were to extrapolate beyond the data that are sampled? In this column, I outline the use of Confidence Intervals as a tool for answering these types of questions.

Consider a hypothetical example: An organization has data pertaining to product recalls and refund requests, as well as customer satisfaction surveys. Thus, the proportions of customers that seem to be satisfied or dissatisfied with the products are readily computable. Products that fail to satisfy at least 65% of their target audience are discontinued. But there is a product targeted to a specific set of customers who are notorious for refusing to answer customer surveys. A data mining analysis of the recall data and a few surveys reveals that 59% of these customers are satisfied. Should management discontinue the product?

A Confidence Interval (CI) can allow one to make an informed decision. Our observed result of .59 is based on a sample, and represents a point estimate of the true value that exists for the population (the target audience beyond the sample). A CI is an interval bounded by lower and upper values that we claim (with a stated degree of confidence) will contain the "true" value for the population. So, the bottom line is: if, starting from our result of .59, we can show with 95% confidence that the true satisfaction level could plausibly reach the 65% threshold, then the product will continue.

Computing a Confidence Interval:

The CI for a proportion is computed from the following expressions, where (p') is our observed proportion, (Z) is the "critical value" (a number that is tied to a degree of certainty), and SE(p') is the "standard error" for a proportion.

p' - [Z * SE(p')] will give you the value for the lower bound of the interval and

p' + [Z * SE(p')] will give you the upper bound.

These show that the interval is simply (p') plus or minus a margin of error. The margin of error is the product of two components: the critical value Z and the standard error. Each confidence level corresponds to a unique value of Z. The value of SE(p') is affected by the size of the sample as well as the observed proportion.
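As a minimal sketch of the two expressions above, the Python function below computes the lower and upper bounds for a proportion. It assumes the standard large-sample formula SE(p') = sqrt(p'(1 - p') / n), which this column does not spell out. The sample size of 100 is a made-up figure, since the example does not state how many customers were surveyed, and the names `standard_error` and `proportion_ci` are purely illustrative.

```python
import math
from typing import Tuple


def standard_error(p_obs: float, n: int) -> float:
    """Standard error of a sample proportion (normal approximation, assumed)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0.0 <= p_obs <= 1.0:
        raise ValueError("a proportion must lie between 0 and 1")
    return math.sqrt(p_obs * (1.0 - p_obs) / n)


def proportion_ci(p_obs: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Return (lower, upper) = p' -/+ Z * SE(p'), clipped to the range [0, 1]."""
    margin = z * standard_error(p_obs, n)
    return max(0.0, p_obs - margin), min(1.0, p_obs + margin)


# Hypothetical sample size: the column does not say how many surveys there were.
lower, upper = proportion_ci(0.59, n=100)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

The normal approximation behind this sketch tends to be unreliable for very small samples or for proportions close to 0 or 1, so the result is best read as a rough guide in those cases.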

Particular values of Z are linked with confidence levels as follows. Mathematically, in a purely theoretical situation, 95% of all values within a normal (Gaussian) distribution are within 1.96 standard deviations (SD) of the mean, and 99% of all values are within 2.58 SDs. Thus, Z values for 95% and 99% confidence are 1.96 and 2.58, respectively. But in practice, if one wanted 95% confidence, one can't simply determine values for the lower and upper bounds by multiplying 1.96 by the SD, because the "true" SD (i.e., in the population) is unknown. So we estimate this variability from the observed proportion and the size of our sample, and this is what the SE(p') component represents.
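The link between a confidence level and its Z value can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the helper name `critical_value` is an illustrative choice, not something defined in this column.

```python
from statistics import NormalDist


def critical_value(confidence: float) -> float:
    """Two-sided critical value Z for a confidence level such as 0.95."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Leave (1 - confidence) / 2 in each tail of the standard normal curve.
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)


for level in (0.95, 0.99):
    print(f"{level:.0%} confidence -> Z = {critical_value(level):.2f}")
```

Rounded to two decimals, this reproduces the 1.96 and 2.58 quoted above.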

It follows that we can increase our confidence level by increasing the Z value (from 1.96 to 2.58 or more). The price of increased confidence is a wider interval. As an extreme, we can say with 100% confidence that the true proportion of satisfied customers is within an interval of 0 to 1.0. Alternatively, we can reduce the SE(p') value by increasing the sample size (surveying more people). The two adjustments work in different directions: a larger Z raises our confidence but widens the margin of error, while a larger sample narrows the margin of error at any given confidence level.
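The trade-off between Z and sample size can be illustrated with a few numbers. This is a rough sketch that again assumes SE(p') = sqrt(p'(1 - p') / n); the sample sizes of 100 and 400 are hypothetical, and `margin_of_error` is an illustrative name.

```python
import math


def margin_of_error(p_obs: float, n: int, z: float) -> float:
    """Margin of error Z * SE(p'), with SE(p') = sqrt(p'(1 - p') / n) assumed."""
    return z * math.sqrt(p_obs * (1.0 - p_obs) / n)


p_obs = 0.59  # observed proportion from the example
for n in (100, 400):  # hypothetical sample sizes
    for z in (1.96, 2.58):  # 95% and 99% confidence
        print(f"n={n:4d}  Z={z:.2f}  margin=+/-{margin_of_error(p_obs, n, z):.3f}")
```

Under these assumptions, moving from Z = 1.96 to Z = 2.58 widens the margin, while quadrupling the sample size halves it.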

To return to our example: even if the sample result initially shows a satisfaction level below the .65 criterion, the product need not be discontinued. If it can be shown, with 95% confidence and an acceptably low margin of error, that the true value in the population lies within an interval that contains .65, then the data are consistent with the population at large meeting the satisfaction criterion.
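The decision rule for the example can be sketched as a check of whether the .65 criterion falls inside the interval. The observed .59 and the .65 criterion come from the example; both sample sizes are hypothetical, the SE formula sqrt(p'(1 - p') / n) is assumed, and `interval_contains` is an illustrative name.

```python
import math


def interval_contains(p_obs: float, n: int, target: float, z: float = 1.96) -> bool:
    """True if target lies within p' -/+ Z * SE(p')."""
    margin = z * math.sqrt(p_obs * (1.0 - p_obs) / n)
    return p_obs - margin <= target <= p_obs + margin


# Both sample sizes are made up to show how the outcome depends on n.
for n in (50, 1000):
    verdict = "contains" if interval_contains(0.59, n, 0.65) else "excludes"
    print(f"n={n}: the 95% interval {verdict} 0.65")
```

With only 50 responses the interval is wide enough to contain .65, whereas with 1,000 responses it is not. The same observed proportion can therefore support different decisions depending on how much data stands behind it.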

Confidence intervals are a useful analytical tool for decision making because they link patterns that are based on a sample (as most data mining results are) to what may be occurring in the corresponding population beyond the sample.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

