
DISCOVERING UNDERLYING CAUSES: PART I
by Ed Colet

A basic analytical approach in many data mining algorithms is the discovery of associations among the attributes or variables in large amounts of data.

These associations are often referred to as correlations. A correlation reflects the discovery of a dependency among the attributes, and knowing that dependency is often quite sufficient for predicting new values. In this column, I address what can be thought of as the opposite of the forward-looking perspective of predictive modeling. Rather than basing decisions on predicted values, one can also base them on looking backwards to underlying causal effects. In Part I of this series, I describe the common computational approach for determining a correlation, and the current ways to investigate causality. Next week, in Part II, I discuss some newer approaches for modeling causal relationships.

The discovery of an association among attributes in a dataset essentially boils down to an assessment of probabilities. By definition, if 'A' and 'B' are independent, then the probability of their joint occurrence is equal to the product of their individual probabilities. Formally, P(AB) = P(A)P(B). If this equation does not hold true, then 'A' and 'B' are not independent. In other words, a dependency between 'A' and 'B' exists.
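As a minimal sketch of this independence check, the Python below estimates the three probabilities from record counts and reports how far P(AB) is from P(A)P(B). The function name `dependence_gap` and the representation of records as pairs of booleans are assumptions made for illustration, not part of any particular data mining package.

```python
def dependence_gap(records):
    """Return P(AB) - P(A)P(B), estimated from (a, b) boolean pairs.

    A result of zero is what independence predicts; a positive or
    negative result indicates a dependency between A and B.
    """
    n = len(records)
    p_a = sum(1 for a, _ in records if a) / n
    p_b = sum(1 for _, b in records if b) / n
    p_ab = sum(1 for a, b in records if a and b) / n
    return p_ab - p_a * p_b


# Hypothetical records: A and B occur together in every record or not at all.
dependent = [(True, True), (True, True), (False, False), (False, False)]
# Hypothetical records: every combination of A and B is equally frequent.
independent = [(True, True), (True, False), (False, True), (False, False)]

print(dependence_gap(dependent))
print(dependence_gap(independent))
```

In real data the estimated gap is rarely exactly zero because of sampling variation, so in practice a statistical test would be needed to decide whether a small gap is meaningful.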

The magnitude and direction of the dependence between A and B can also be measured formally. A statistical correlation yields a value that ranges from -1.0 to +1.0. The computation can be illustrated as follows. Given a dataset of N records and attributes 'A' and 'B', we can generate the distribution of values for 'A' and the distribution of values for 'B'. Scores on these distributions can be expressed as z-scores, which is simply a transformation that centers the mean of a distribution on 0 and sets its standard deviation to 1. This makes it possible to compare distributions that have different underlying measures, such as stress-level (blood pressure) and age (in years). Average stress-level and average age are now set to a value of zero in each distribution.

A correlation is simply the average of the cross products of these z-scores. Formally, this is "r = sum of [Z(A) x Z(B)] / N". Because the sum of cross products is maximized when the multiplied numbers are the same, the highest value for a correlation results when the two sets of z-scores are identical. This means that each record's z-score on one distribution is exactly the same as its z-score on the other distribution.

The magnitude of the correlation reflects the degree of association, while the sign (+ or -) reflects the direction of change in one value relative to the other. A positive correlation means that the values of both attributes change in the same direction; a negative correlation means that as one value increases, the other decreases.
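The z-score computation described above can be sketched in a few lines of Python. The function names `z_scores` and `correlation` are illustrative choices. The sketch assumes the standard deviation is computed by dividing by N (the population form), which is what makes the average of cross products land exactly in the range -1.0 to +1.0, and it assumes neither attribute is constant.

```python
import math


def z_scores(values):
    """Transform values so that their mean is 0 and standard deviation is 1."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divide by N), matching r = sum / N.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def correlation(a, b):
    """r = sum of [Z(A) x Z(B)] / N."""
    za, zb = z_scores(a), z_scores(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)


# Made-up values: the second attribute is exactly twice the first, so the
# z-scores are identical and r reaches its maximum.
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
# Made-up values: one attribute falls as the other rises.
print(correlation([1, 2, 3], [3, 2, 1]))
```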

But given a correlation value, one often wants to know more about the nature of the relationship between 'A' and 'B' than the correlation value can tell us. Correlation does not mean causation. It also doesn't tell us which attribute depends on the other. Does 'A' depend on 'B', or does 'B' depend on 'A'? A correlation between age and stress-level does not conclusively imply that aging causes stress, or vice versa. There are two basic approaches to investigating causality. One is to remove or control for the presence of other possible causal influences; the other is to determine and test for underlying mechanisms.

An ever-present counter-claim to any suggestion of causality based on an observed correlation is the possibility that a third variable is operating as an underlying cause. For example, Socio-Economic Status (SES) may be a third factor that is responsible for affecting both stress levels and age. The way to address the role of a moderating attribute or a confounding variable is to remove the influence of, or control for the effect of, this confound. In our hypothetical example, dividing records into class intervals of SES levels, and then computing the correlation between the original two attributes within each SES class interval, can determine if SES is a moderating factor. If it turns out that there is little or no relationship between the original two variables within the intervals, then this third variable can be said to explain the original finding.
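The class-interval approach can be sketched as follows. The helper names `pearson` and `correlation_by_stratum`, the record layout of (stratum, a, b) tuples, and the numbers are all hypothetical; the data are constructed so that the two attributes are strongly correlated overall yet uncorrelated within each stratum. The sketch assumes every stratum has some variation in both attributes, since otherwise the correlation is undefined.

```python
import math
from collections import defaultdict


def pearson(a, b):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cross / math.sqrt(ss_a * ss_b)


def correlation_by_stratum(records):
    """Group (stratum, a, b) records by stratum and correlate a with b in each."""
    groups = defaultdict(list)
    for stratum, a, b in records:
        groups[stratum].append((a, b))
    return {
        stratum: pearson([p[0] for p in pairs], [p[1] for p in pairs])
        for stratum, pairs in groups.items()
    }


# Made-up records: (class interval of the third variable, attribute A, attribute B).
records = [
    ("low", 1, 2), ("low", 2, 1), ("low", 1, 1), ("low", 2, 2),
    ("high", 11, 12), ("high", 12, 11), ("high", 11, 11), ("high", 12, 12),
]

overall = pearson([r[1] for r in records], [r[2] for r in records])
within = correlation_by_stratum(records)
print(overall)  # strongly positive across all records
print(within)   # no relationship inside either class interval
```

Here the overall association disappears once the records are split by class interval, which is the pattern that would suggest the third variable accounts for the original correlation.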

This can also be done mathematically, by computing what are called "partial correlations". The correlation of 'A' and 'B' controlling for 'C' can be easily computed from the three pairwise correlations for AB, AC, and BC. The same technique generalizes to more than three attributes, and is then referred to as "higher order partial correlations". It is even possible to control for the effect of the third attribute on only one of the other two attributes (rather than both), by computing "semi-partial correlations". These approaches all serve to address whether another attribute is responsible for an observed effect.
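For reference, the standard first-order formulas take only the three pairwise correlations as input. The sketch below is one way to write them in Python; the function and parameter names are illustrative, the input values are made up, and the code assumes that none of the correlations involving 'C' is exactly -1 or +1 (which would make the denominator zero).

```python
import math


def partial_correlation(r_ab, r_ac, r_bc):
    """Correlation of A and B with C controlled for in both A and B."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def semipartial_correlation(r_ab, r_ac, r_bc):
    """Correlation of A with B after removing C's influence from B only."""
    return (r_ab - r_ac * r_bc) / math.sqrt(1 - r_bc ** 2)


# Made-up correlations in which C fully accounts for the A-B association:
# r_ab equals r_ac * r_bc, so the partial correlation is zero.
print(partial_correlation(0.48, 0.6, 0.8))
# If C is unrelated to both A and B, controlling for it changes nothing.
print(partial_correlation(0.5, 0.0, 0.0))
```

A partial correlation near zero alongside a substantial raw correlation is the numerical counterpart of the class-interval result in which the relationship vanishes within each interval.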

Note that a weakness of computing partial or semi-partial correlations is that it is always possible to claim that another factor, one that is not even in the data, is causally responsible for the observed effect. Partial and semi-partial correlations cannot address this type of counter-argument. In this situation, causality can instead be addressed by proposing and investigating underlying mechanisms that explain the observed relationship. In fact, the correlation between smoking and lung cancer, and the claim that smoking causes lung cancer, were addressed using this approach of proposing an underlying mechanism. The causal link was postulated via a mechanism in which smoking introduces toxic substances to human tissues, and these toxins are responsible for cancer. This mechanism implied other patterns such as "smokers have elevated rates of other respiratory diseases". Empirical support for these implied patterns, also established via correlations, ultimately made for a coherent and persuasive argument for a toxic-smoke mechanism as the underlying causal factor.

In terms of data mining software, it is possible to investigate the presence of confounding or moderating factors by computing partial or semi-partial correlations whenever those factors are recorded in the data. Given that today's datasets are quite large, with numerous attributes, this can be routinely implemented. When causal factors outside the data may be responsible for observed patterns, it is necessary for a skilled domain expert to postulate one or more underlying causal mechanisms. These causal mechanisms imply the presence of other patterns that can then be investigated, again through the use of correlations. Taken together, these approaches make it possible to establish a strong understanding of the causal factors underlying an observed association.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com
