DISCOVERING UNDERLYING CAUSES: PART I
by Ed Colet
A basic analytical approach in many data mining algorithms is the discovery
of associations among the attributes or variables in large amounts of
data.
These associations are often referred to as correlations. A correlation
reflects a dependency among the attributes, and knowing that dependency is
often sufficient for predicting new values. In this column, I address
what can be thought of as the opposite of the forward-looking perspective of
predictive modeling. Rather than making decisions on predicted values,
decisions can also be made on the basis of looking backwards to underlying
causal effects. In part I of this series, I describe the common computational
approach for determining a correlation, and the current ways to investigate
causality. Next week, in Part II, I discuss some newer approaches for
modeling causal relationships.
The discovery of an association among attributes in a dataset essentially
boils down to an assessment of probabilities. By definition, if 'A' and 'B'
are independent, then the probability of their joint occurrence is equal to
the product of their individual probabilities. Formally, P(AB) = P(A)P(B). If
this equation does not hold true, then A and B are not independent. In other
words, a dependency between 'A' and 'B' exists.
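As a rough illustration of the independence check, the sketch below estimates P(AB), P(A), and P(B) from a handful of records and compares the joint probability with the product. The function name `independence_gap` and the toy records are hypothetical and are not taken from any particular data mining package.

```python
def independence_gap(records):
    """Return (P(A and B), P(A) * P(B)) estimated from (a, b) boolean pairs."""
    n = len(records)
    p_a = sum(1 for a, _ in records if a) / n
    p_b = sum(1 for _, b in records if b) / n
    p_ab = sum(1 for a, b in records if a and b) / n
    return p_ab, p_a * p_b


# Hypothetical toy data: each record is (A occurred, B occurred).
records = (
    [(True, True)] * 4
    + [(True, False)] * 1
    + [(False, True)] * 1
    + [(False, False)] * 4
)

p_ab, p_product = independence_gap(records)
# A joint probability well above (or below) the product suggests a dependency.
print(p_ab, p_product)
```

In a finite sample the two numbers will rarely match exactly even when 'A' and 'B' are independent, so in practice the gap would be judged with a significance test rather than by strict equality.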
The magnitude and direction of the dependence between A and B can also be
measured formally. In terms of a statistical correlation, the result will be a
value that ranges from -1.0 to +1.0. The computation of this statistical
correlation can be illustrated as follows. Given a dataset of N records and
attributes 'A' and 'B', we can generate the distribution of values for 'A' and
a distribution for values of 'B'. Scores on these distributions can be
expressed as z-scores, a transformation under which the mean of a
distribution is centered on a value of 0, with a standard deviation of 1. This
makes it possible to compare distributions that have different underlying
measures such as stress-level (blood pressure) and age (in years). Average
stress-level and average age are now set to a value of zero in each
distribution. A correlation is simply the average of the cross products of
these z-scores. Formally, this is "r = sum of [Z(A) x Z(B)] / N". Because
the sum of cross products is always maximized when the multiplied numbers are
the same, the highest value for a correlation will result when multiplying a
set of z-scores that are identical. This means that each record's z-score value
on one distribution is exactly the same as its value on the other
distribution. The magnitude of the correlation reflects the degree of
association, while the sign (+ or -) reflects the direction of change in one
value relative to the other -- i.e. a positive correlation means that the
values of both attributes change in the same direction; a negative correlation
means that as one value increases, the other decreases.
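The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not a production routine: the function names and the age and blood pressure values are made up for the example, and the standard deviation divides by N so that it matches "r = sum of [Z(A) x Z(B)] / N".

```python
import math


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divide by N, not N - 1).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def correlation(a, b):
    """Average of the cross products of the z-scores of a and b."""
    za, zb = z_scores(a), z_scores(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)


# Hypothetical records: age in years and a blood pressure reading.
age = [25, 35, 45, 55, 65]
blood_pressure = [118, 121, 127, 130, 139]

r = correlation(age, blood_pressure)
print(round(r, 3))  # strongly positive for this toy data
```

Correlating an attribute with itself gives the maximum of +1.0, and correlating it with its own negation gives -1.0, which matches the identical z-scores argument above.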
But given a correlation value, one often wants to know more about the nature
of the relationship between 'A' and 'B' than the correlation value can tell
us. Correlation does not mean causation. It also doesn't tell us which
attribute depends on the other. Does 'A' depend on 'B', or 'B' depend on
'A'? A correlation between age and stress-level does not conclusively imply
that aging causes stress, or vice versa. There are two basic approaches to
investigating causality. One is to remove or control for the presence of other
possible causal influences; the other is to determine and test for underlying
mechanisms.
An ever-present counter-claim to any suggestion of causality based on an
observed correlation is the possibility that a third variable is operating as
an underlying cause. For example, Socio-Economic Status (SES) levels may be a
third factor that is responsible for affecting both stress levels and age.
The way to address the role of a moderating attribute or a confounding
variable is to remove the influence of, or control for the effect of, this
confound. In our hypothetical example, dividing records into class intervals
of SES levels, and then computing the correlation between the original two
attributes within each SES class interval, can determine if SES is a moderating
factor. If it turns out that there is little or no relationship between the
original two variables within the intervals, then this third variable can be
said to explain the original finding.
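The stratified check described above might look like the following sketch. The records are fabricated toy values chosen so that the overall correlation is strong while the correlation inside each SES interval vanishes; the names `pearson` and `correlation_by_stratum` are hypothetical helpers, not part of any specific tool.

```python
import math
from collections import defaultdict


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_by_stratum(records):
    """records: (stratum, a, b) tuples. Returns {stratum: correlation of a, b}."""
    groups = defaultdict(list)
    for stratum, a, b in records:
        groups[stratum].append((a, b))
    return {
        s: pearson([a for a, _ in pairs], [b for _, b in pairs])
        for s, pairs in groups.items()
    }


# Fabricated toy records: (SES interval, age, stress level).
records = [
    ("low", 30, 5), ("low", 31, 7), ("low", 32, 7), ("low", 33, 5),
    ("high", 50, 10), ("high", 51, 12), ("high", 52, 12), ("high", 53, 10),
]

overall = pearson([a for _, a, _ in records], [b for _, _, b in records])
by_ses = correlation_by_stratum(records)
print(round(overall, 2), by_ses)
```

Here the overall correlation is above 0.9, yet within each SES interval it is zero, which is the pattern that would point to SES as the explanation for the original finding.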
Controlling for a third attribute can also be done mathematically, by computing
what are called "partial correlations". The correlation of 'A' and 'B'
controlling for 'C' can be easily computed knowing the component correlations
between 'AB', 'AC', and 'BC'. The same technique can generalize to more than
three attributes, and is referred to as "higher order partial correlations".
It is even possible to remove the effect of a third attribute from only one of
the other attributes (rather than both), by computing "semi-partial
correlations". These approaches
all serve to address whether another attribute is responsible for an observed
effect.
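For three attributes, the standard first-order formulas need only the three component correlations. The sketch below uses those textbook formulas; the function names and the input correlations (0.5, 0.6, 0.8) are illustrative assumptions rather than values from any real dataset.

```python
import math


def partial_correlation(r_ab, r_ac, r_bc):
    """Correlation of A and B with C removed from both."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def semipartial_correlation(r_ab, r_ac, r_bc):
    """Correlation of A with B after C is removed from B only."""
    return (r_ab - r_ac * r_bc) / math.sqrt(1 - r_bc ** 2)


# Hypothetical component correlations among A, B, and C.
r_ab, r_ac, r_bc = 0.5, 0.6, 0.8

r_partial = partial_correlation(r_ab, r_ac, r_bc)
r_semipartial = semipartial_correlation(r_ab, r_ac, r_bc)
print(round(r_partial, 3), round(r_semipartial, 3))
```

With these inputs, most of the original 0.5 correlation between 'A' and 'B' disappears once 'C' is controlled for. When 'C' is uncorrelated with both 'A' and 'B', the partial correlation reduces to the original correlation.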
Note that a weakness of computing partial or semi-partial correlations is
that it is always possible to claim that another factor that is not even in
the data is causally responsible for the observed effect. The approach of
partial or semi-partial correlations cannot address this type of counter
argument. In this situation, causality can be addressed by proposing and
investigating underlying mechanisms that explain the observed relationship.
In fact, the correlation between smoking and lung cancer, and the
claim that smoking causes lung cancer were addressed using this approach of
proposing an underlying mechanism. The causal link was postulated via a
mechanism in which smoking introduces toxic substances into human tissues, and
these toxins are responsible for cancer. This mechanism implied other patterns
such as "smokers have elevated rates of other respiratory diseases". Empirical
support for these implied patterns, itself established via correlations,
ultimately made for a coherent and persuasive argument for a
toxic-smoke-mechanism as the underlying causal factor.
In terms of data mining software, it is possible to investigate the presence
of confounding or moderating factors via the computation of partial or
semi-partial correlations whenever those factors are present in the data. Given
that today's datasets are quite large with numerous attributes, this can be
routinely implemented. For causal factors outside the data that may be
responsible for observed patterns, a skilled domain expert must postulate one
or more underlying causal mechanisms. These causal mechanisms imply the
presence of other patterns that can then be investigated, also through the
use of correlations. Taken together, these approaches make it possible to
establish a strong understanding of the causal factors underlying an observed
association.
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York University's
Department of Psychology. Ed has also worked for IBM Research at the T.J.
Watson Research Center. At IBM, Ed was a member of the group that developed
Advanced Scout, the data mining application for NBA teams. His research
interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com