
THE BASICS OF BAYESIAN ANALYSIS
By Ed Colet


Underlying many data mining approaches is the notion of a Bayesian analysis. A Bayesian approach is an elegant framework in which one can formally express initial beliefs as prior probabilities and determine how these beliefs should change in light of the data. In other words, a Bayesian approach shows how much one's belief should change after data are analyzed. In this column, I'll highlight the statistical basics of this analytical approach, provide an example, and note some cautions to be aware of.

Subtle but significant patterns in data can change an initial belief in a hypothesis in meaningful ways, by contradicting or reaffirming initial notions. In a Bayesian approach, it is the data that alter the odds of a hypothesis being true. This of course implies that one must have a sense of what the odds in favor of a hypothesis are prior to (or without) having the data.

There are three elements in a Bayesian approach: the posterior odds, the prior odds, and a relative likelihood ratio. Each is a ratio of probabilities involving a hypothesis ("H") and data ("D"). The prior odds express one's initial belief: the probability that the hypothesis is true versus the probability that it is not. The relative likelihood is the probability of the observed patterns in the data assuming the hypothesis is true versus the probability of the same patterns assuming the hypothesis is not true. The posterior odds are the product of the prior odds and the relative likelihood. All of this can be formally expressed in the odds-ratio form of Bayes' theorem, where "H" is the hypothesis, "D" represents the data, "~" denotes negation, and P(H|D) denotes the probability of the hypothesis given the data.

Bayes' theorem: Posterior odds are equal to prior odds multiplied by the likelihood ratio.

P(H|D)/P(~H|D) = P(H)/P(~H) x P(D|H)/P(D|~H)
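As a minimal sketch of the odds form above, the following Python helpers convert between probabilities and odds and apply the update. The function names are illustrative assumptions, not part of any library, and the sketch assumes P(D|~H) is greater than zero so that the likelihood ratio is defined.

```python
def probability_to_odds(p):
    """Convert a probability in [0, 1) to odds in favor: p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return p / (1.0 - p)


def odds_to_probability(odds):
    """Convert odds in favor (>= 0) back to a probability: odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


def posterior_odds(prior_odds, p_data_given_h, p_data_given_not_h):
    """Odds form of Bayes' theorem: prior odds times the likelihood ratio."""
    if p_data_given_not_h <= 0:
        raise ValueError("P(D|~H) must be positive for the ratio to be defined")
    likelihood_ratio = p_data_given_h / p_data_given_not_h
    return prior_odds * likelihood_ratio
```

Working in odds keeps the update a single multiplication; the conversion helpers are only needed when a result has to be reported as a probability.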

An example can clarify things: Imagine that a large hardware chain wants to adjust its advertising campaigns because it suspects that males and females have different spending patterns. Data are collected and an analysis is undertaken. A conventional statistical analysis of spending by gender shows that if we assume spending to be equal (the null hypothesis), then the probability of getting the spending patterns observed in our data comes out to only 1 in 100. Therefore we can reject the null hypothesis and conclude that male and female spending differs. The hardware company executives may well greet this news with little excitement, because it doesn't really tell them anything new. The "devil's advocate" approach of testing a null hypothesis thus shows some of the limitations of what can be concluded from this conventional statistical test.

In contrast, a Bayesian analysis can look at the issue in this way: Because the executives suspect that males spend more than females, they put the initial odds that this is true at 2:1 in favor. So, if this belief is true, what is the probability of getting the patterns observed in the data? And would the results change this initial belief? Let's assume that, if the hypothesis is true, the probability of getting our observed data turns out to be 3 in 100. The relative likelihood is therefore 3:1 (3/100 divided by the 1/100 obtained earlier under the assumption of equal spending). Plugging these values into the Bayes expression, we multiply the prior odds (2:1) by the relative likelihood (3:1) and compute posterior odds of 6:1 in favor of "H". The data have raised the odds from 2:1 in favor of "H" to 6:1, a threefold change equal to the relative likelihood. In terms of the hardware company's marketing campaign, this analysis shows that the data support the executives' initial suspicion and make it considerably more credible than they had assumed. A marketing campaign directed heavily toward males may be in order.
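The arithmetic of this example can be checked with a short Python sketch. It uses the prior odds of 2:1 and the assumed P(D|H) of 3/100 from the text, and takes P(D|~H) to be the 1-in-100 figure from the conventional analysis; the variable names are illustrative. Converting the odds to probabilities is an added step, not something the example itself reports.

```python
from fractions import Fraction

# Values from the hardware-chain example.
prior_odds = Fraction(2, 1)            # executives' initial belief, 2:1 in favor of H
p_data_given_h = Fraction(3, 100)      # assumed probability of the data if H is true
p_data_given_not_h = Fraction(1, 100)  # probability of the data under equal spending

# Relative likelihood and posterior odds (odds form of Bayes' theorem).
likelihood_ratio = p_data_given_h / p_data_given_not_h
posterior_odds = prior_odds * likelihood_ratio

# Odds in favor converted to probabilities: odds / (1 + odds).
prior_probability = prior_odds / (1 + prior_odds)
posterior_probability = posterior_odds / (1 + posterior_odds)

print("likelihood ratio:", likelihood_ratio)            # 3
print("posterior odds:", posterior_odds)                # 6
print("prior probability:", prior_probability)          # 2/3
print("posterior probability:", posterior_probability)  # 6/7
```

Expressed as probabilities, the belief in "H" moves from 2/3 to 6/7, or from roughly 67% to roughly 86%.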

Some cautions should be noted with regard to the values that enter into the computations. Note what happens if the prior probability of a hypothesis is equal to zero: the prior odds are zero, and so the posterior odds will always be zero. In other words, if a hypothesis is initially held to be certainly false, then no data can ever change that belief! A response to this "flaw" is that prior probabilities should be based not on opinion but on prior data analysis, in which P(H) is unlikely to turn out to be exactly zero. Even so, prior probabilities can be difficult to estimate accurately. Note also that this approach involves a comparison between posterior and prior probabilities. A large difference between them indicates only that there is a discrepancy between what one initially believed and what is suggested by the data. This may be independent of whether the effect in the data is statistically or practically significant. Despite these cautions, a Bayesian approach is a useful analytical framework for both analyzing and interpreting patterns in data.
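The zero-prior caution can be illustrated with a small Python sketch. The likelihood ratios below are made-up values chosen only for illustration, and multiplying them in sequence assumes that the successive pieces of evidence are independent given the hypothesis; the function name is hypothetical.

```python
from fractions import Fraction


def run_updates(prior_odds, likelihood_ratios):
    """Apply a sequence of likelihood ratios to the prior odds, one at a time."""
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds = odds * ratio
    return odds


# Hypothetical evidence, each piece favoring H more strongly than the last.
evidence = [3, 10, 100]

# A prior of exactly zero cannot be moved by any amount of evidence.
certain_skeptic = run_updates(Fraction(0), evidence)

# A very small but nonzero prior (1:1000) responds to the same evidence.
near_skeptic = run_updates(Fraction(1, 1000), evidence)

print("zero prior ->", certain_skeptic)   # 0
print("1:1000 prior ->", near_skeptic)    # 3
```

Under these assumed values, the same evidence leaves the zero prior at zero but carries the 1:1000 prior to odds of 3:1 in favor, which is why a prior of exactly zero is best avoided.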


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

