
USING A CORRELATION RATIO TO MEASURE AN ASSOCIATION
By Ed Colet


A prevalent issue in data mining analysis is computing a measure of association between attributes or variables. The way the underlying algorithms compute this measure can vary widely from one application to another. Some seem to be ad hoc, while others are firmly rooted in statistics, such as a common measure of correlation known as the "Pearson r". In this column, I briefly describe the basic approach to computing a Pearson value, and point out some caveats to be aware of, especially with the types of data typically analyzed with data mining software. Because of these caveats, it's important to consider a slightly different computational approach that may be better suited to certain data mining situations.

A common measure of association, known as the Pearson correlation coefficient and denoted as "r", is really just a measure of a linear relationship between two variables plotted on the x and y axes of a scatter-plot. For example, a retail organization may have data from several store locations that shows a scatter of points indicating that, in general, as sales (x-axis) increase, revenue (y-axis) increases. One essentially summarizes this scatter of data points by fitting a straight line through these points in the best possible way. The best fitting straight line is determined by placing a line through these points in such a way that the distance of each point from this line is minimized (technically, it's the sum of the squared distances that is minimized). The line can then be expressed in the form of an equation, and this serves as one's model.
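The line-fitting idea just described can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook formulas for the Pearson r and the least-squares line; the function names and the store figures are hypothetical and are not taken from any particular data mining product.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept of the line minimizing the sum of squared
    vertical distances from the points to the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical figures for five store locations (made up for illustration).
sales = [10, 20, 30, 40, 50]
revenue = [105, 195, 310, 390, 500]

r = pearson_r(sales, revenue)
slope, intercept = least_squares_line(sales, revenue)
print(f"r = {r:.3f}; model: revenue = {slope:.2f} * sales + {intercept:.2f}")
```

A value of r near +1 or -1 says the points lie close to a straight line; it says nothing about whether a straight line is the right shape for the data.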

Some cautions with the approach just described should be noted, especially in the context of data mining. First is the assumption that the relationship is linear. It's possible that the relationship is curvilinear, in which case the best fitting straight line will never be the best model for the data. Second, the variables plotted on the scatter-plot are assumed to be truly quantitative (i.e. have an orderly progression), and can thus be ordered on the axes. But in many data mining situations, one cannot safely assume that the relationship between attributes is linear, and it may not be practical to view scatter-plots of all of the possible pair-wise comparisons. In data mining situations, it is also the case that many variables are categorical rather than numerical (e.g. a variable identifying customer segments). So, let's hypothetically assume that there's a curvilinear relationship between customer segments and sales (i.e. assume for the moment that customer segments are ordered on the x-axis so that as x increases, the sales values on the y-axis also increase and then level off). The retail organization wants to see if there's an association between the customer segments and sales. Clearly, using the approach described earlier will be misleading.

In this case, there is an alternative computational approach, also available from statistics, to measure this type of association. And as we'll see, imposing an ordered sequence upon customer segments is not problematic. The "correlation ratio" measure (denoted by the Greek letter "eta") is computed in the following way: Partition the x-axis (customer segments) into columns and calculate a mean sales amount for each column (interval). In some cases the partitioning of x is readily suggested by the attribute in question. The distance of each data point inside an interval from that column's mean is then computed. These squared distances are summed and compared with the total squared distance of all points from the overall mean: the less scatter that remains within the columns, the closer eta is to 1. Just as the x-axis is partitioned into columns, the y-axis can be partitioned into rows and the same procedure applied. We then have an "eta of x on y" and an "eta of y on x". The two form the correlation ratio.
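The column-by-column computation can be sketched as follows. The article does not spell out a formula, so this sketch assumes the standard textbook definition, in which eta squared equals one minus the ratio of the within-column sum of squares to the total sum of squares. The function name and the segment data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def correlation_ratio(categories, values):
    """Correlation ratio (eta) of a numeric variable on a grouping variable.

    Assumes the textbook definition: eta^2 = 1 - SS_within / SS_total.
    """
    # Partition the data into "columns", one per category or interval.
    columns = defaultdict(list)
    for category, value in zip(categories, values):
        columns[category].append(value)

    # Total variation: squared distance of every point from the overall mean.
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # the numeric variable does not vary at all

    # Within-column variation: squared distance of each point from the
    # mean of its own column.
    ss_within = 0.0
    for members in columns.values():
        column_mean = sum(members) / len(members)
        ss_within += sum((v - column_mean) ** 2 for v in members)

    # Guard against tiny negative values caused by floating-point rounding.
    return sqrt(max(0.0, 1.0 - ss_within / ss_total))


# Hypothetical customer segments and sales amounts (made up for illustration).
segments = ["A", "A", "B", "B", "C", "C"]
sales = [10, 12, 20, 22, 21, 23]

eta = correlation_ratio(segments, sales)
print(f"eta of sales on segment = {eta:.3f}")
```

Swapping the roles of the two variables (binning sales into rows and averaging a numeric x within each row) would give the other eta mentioned above. That only makes sense when x is itself numeric.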

This computational approach has several positive implications. If the underlying pattern were linear, then this approach would result in a value similar to what would have been computed using the "Pearson r". Thus there's no harm done if the relationship is actually linear. Because the computation is based on the distance of each point from the mean of its interval, one can think of the computations as operating on "separate strips of data". As such, imposing an ordered sequence onto an attribute doesn't have an effect on the end result of the computations. In other words, it becomes possible to quantitatively analyze a categorical variable whose values do not have an inherent ordering.
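The "separate strips" point can be checked with a small experiment. The sketch below uses made-up sales figures that rise and then level off across four segments, and compares two arbitrary numeric codings of the same segments. The helper functions assume the standard textbook definitions of the Pearson r and of eta; all names and numbers are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient (textbook formula)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_ratio(categories, values):
    """Correlation ratio eta, assuming eta^2 = 1 - SS_within / SS_total."""
    strips = defaultdict(list)
    for category, value in zip(categories, values):
        strips[category].append(value)
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_within = 0.0
    for members in strips.values():
        strip_mean = sum(members) / len(members)
        ss_within += sum((v - strip_mean) ** 2 for v in members)
    return sqrt(max(0.0, 1.0 - ss_within / ss_total))


# Hypothetical data: sales rise and then level off across four segments.
segment_codes = [1, 1, 2, 2, 3, 3, 4, 4]
sales = [10, 12, 20, 22, 25, 27, 26, 28]

# The same segments under a different, equally arbitrary numeric coding.
recoding = {1: 3, 2: 1, 3: 4, 4: 2}
shuffled_codes = [recoding[code] for code in segment_codes]

eta_original = correlation_ratio(segment_codes, sales)
eta_shuffled = correlation_ratio(shuffled_codes, sales)
r_original = pearson_r(segment_codes, sales)
r_shuffled = pearson_r(shuffled_codes, sales)

print(f"eta: {eta_original:.3f} (original coding), {eta_shuffled:.3f} (shuffled)")
print(f"r:   {r_original:.3f} (original coding), {r_shuffled:.3f} (shuffled)")
```

In this sketch eta is the same under both codings, because each strip contains the same points whatever its label, while r changes with the coding. When each distinct x value forms its own strip, as here, eta is never smaller than the absolute value of r, and the two coincide when the strip means fall exactly on a straight line.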

So, in data mining, where the nature of the relationship is often unclear (linear or not?) and some variables may be purely categorical (and thus unordered), it is still possible to arrive at a well-founded measure of association.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

