PREDICTIVE MODELS PART II: EVALUATING GOODNESS-OF-FIT
by Ed Colet

Data mining and other analytical techniques are often carried out with the intention of developing a predictive model. Last week, in Part I, I discussed the process of building predictive models and the concepts of overfitting and underfitting, two ways in which a predictive model may fail to fit the data well. It is important that a predictive model fit the data well, because subsequent decisions and actions are based on its predictions. This week, in Part II of the column, I discuss quantitative measures used to assess the "goodness of fit" of various classes of predictive models.

Predictive models can take many forms, such as a mathematical equation, a neural network, or a decision tree, to name a few. Each of these forms offers its own "goodness of fit" measures for assessing performance. But rather than looking at these model-specific measures, it is more useful to look at general measures that apply to a class of models. The class can be defined by the purpose of the model or by its general form. In this way it becomes possible to evaluate and compare the performance of the different predictive models being considered.

Models for classification are popular in the business domain. Classification attempts to determine the category or class to which a new incoming data record should be assigned. An example might be the use of a model to classify a loan applicant as either likely to default (therefore deny the loan) or not likely to default (therefore approve the loan). A neural network, a decision tree, or another such model can be used for this purpose. The performance of models used for classification is typically evaluated in terms of two criteria -- accuracy and error rates.

Accuracy can be measured in two ways. One is the proportion of people granted loans who successfully repaid them. This can be evaluated directly from the data, using either the evaluation data set or incoming data as it is monitored. The second is the proportion of people denied loans who would have defaulted. While this is difficult to compute directly from the data, since a denied applicant never gets the chance to repay or default, it can be calculated as a theoretical probability. Error rates can be measured in two ways as well. The first is the proportion of people granted a loan who defaulted; the second is the proportion of people denied a loan who could have repaid it. As with the accuracy measures, the first is readily computable from the data, and the second can be computed as a theoretical probability. Whether computed from the data or as theoretical probabilities, both an accuracy rate and an error rate are necessary to assess the performance of a classifier.
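The four rates described above can be sketched in a few lines of Python. This is only an illustration under a strong assumption: that the true repayment outcome is known for every record, including applicants the model would have denied (for example, a historical evaluation set in which everyone received a loan). The function name, the dictionary keys, and the toy data are all hypothetical.

```python
def loan_classifier_rates(predicted_approve, actually_repaid):
    """Return the two accuracy rates and two error rates for a loan classifier.

    predicted_approve: list of bools, True if the model approves the loan.
    actually_repaid:   list of bools, True if the applicant repaid.
    Assumes the outcome is known for every record (a hypothetical setup).
    """
    counts = {"granted_repaid": 0, "granted_defaulted": 0,
              "denied_repaid": 0, "denied_defaulted": 0}
    for approve, repaid in zip(predicted_approve, actually_repaid):
        decision = "granted" if approve else "denied"
        outcome = "repaid" if repaid else "defaulted"
        counts[decision + "_" + outcome] += 1

    granted = counts["granted_repaid"] + counts["granted_defaulted"]
    denied = counts["denied_repaid"] + counts["denied_defaulted"]

    def rate(count, total):
        # A rate is undefined when no applicant falls in the group.
        return count / total if total else None

    return {
        # Accuracy: granted and repaid; denied and would have defaulted.
        "granted_repaid": rate(counts["granted_repaid"], granted),
        "denied_defaulted": rate(counts["denied_defaulted"], denied),
        # Error: granted but defaulted; denied but could have repaid.
        "granted_defaulted": rate(counts["granted_defaulted"], granted),
        "denied_repaid": rate(counts["denied_repaid"], denied),
    }


# Toy evaluation set: eight applicants, four approved and four denied.
predicted_approve = [True, True, True, True, False, False, False, False]
actually_repaid = [True, True, True, False, False, False, True, False]
rates = loan_classifier_rates(predicted_approve, actually_repaid)
print(rates)
```

Within each group (granted or denied), the accuracy rate and the error rate sum to one, which is why one of each is enough to describe that group.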

Predictive models can also take the form of mathematical equations. A statistical linear regression model is an example of a model that can be expressed as an equation -- in this case, the equation for a straight line, Y = mX + b, where 'm' is the slope, 'b' is the intercept, and 'X' is a factor used to predict a value for 'Y'. In fact, any model that can be expressed as a generalized linear model (GLM) can be represented as an equation. This includes statistical analysis of variance (ANOVA) techniques, which can be viewed as a special case of regression, which is in turn a special case of linear modeling. Goodness of fit for linear models is often indicated by the value of R-squared. R-squared is the ratio of the amount of variation in the data that is accounted for by the model to the total amount of variation in the data. An R-squared value of 1.0 means that all variability in the data is explained by the model (a perfect fit). A low value for R-squared indicates that very little of the variability in the data can be accounted for by the model (a poor fit).
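As a rough illustration of the ratio just described, the sketch below computes R-squared as one minus the unexplained (residual) variation over the total variation. For a least-squares line with an intercept, this equals the explained-over-total ratio. The function name and the toy data are hypothetical; the slope 2.2 and intercept -0.5 are the least-squares values for these four points.

```python
def r_squared(observed, predicted):
    """R-squared: share of the variation in `observed` accounted for by `predicted`."""
    mean_y = sum(observed) / len(observed)
    # Total variation of the data around its mean.
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    # Variation left unexplained by the model (squared residuals).
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total


# Toy data and the straight line Y = mX + b fitted to it.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]
m, b = 2.2, -0.5
predictions = [m * x + b for x in xs]

r2 = r_squared(ys, predictions)
print(round(r2, 4))

# A line that passes through every point gives a perfect fit.
perfect = r_squared([3, 5, 7, 9], [2 * x + 1 for x in xs])
print(perfect)
```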

An adjusted R-squared measure is often also reported as a goodness of fit measure. Adjusted R-squared takes into account the number of factors in the model (the 'X's in a straight-line equation) that are used to predict Y. Usually, as the number of factors increases, the model appears better at predicting Y, because plain R-squared on the fitted data does not decrease when a factor is added. Adjusted R-squared values "adjust" for the number of factors present, making it possible to compare the performance of several different models that have different numbers of factors in them. The best-fitting model is the one with the highest adjusted R-squared value.
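The column does not give a formula for the adjustment; a commonly used one is 1 - (1 - R-squared)(n - 1)/(n - p - 1), where n is the number of data points and p is the number of factors. The sketch below assumes that formula, and the two models and their R-squared values are made up for illustration.

```python
def adjusted_r_squared(r2, n_points, n_factors):
    """Adjusted R-squared using the common formula
    1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    denominator = n_points - n_factors - 1
    if denominator <= 0:
        raise ValueError("need more data points than factors plus one")
    return 1 - (1 - r2) * (n_points - 1) / denominator


# Two hypothetical models fitted to the same 30 data points.
# Model B has a slightly higher R-squared but uses three times as many factors.
adj_a = adjusted_r_squared(0.80, n_points=30, n_factors=2)
adj_b = adjusted_r_squared(0.81, n_points=30, n_factors=6)

print(round(adj_a, 4), round(adj_b, 4))
print("prefer model A" if adj_a > adj_b else "prefer model B")
```

In this made-up comparison, the extra factors in model B buy too little additional fit to justify themselves, so the adjusted measure favors model A.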

If including more terms or factors in a model usually leads to higher R-squared values, should one conclude that it is always better to include as many factors as possible in order to account for as much variation in the data as possible? The answer is "no", for two reasons. The first is that parsimonious models are usually more comprehensible, even if they may not fit the data as well. The second has to do with the risk of overfitting, which is tied to the degrees of freedom associated with the model, as described below.

The number of factors or parameters of a model relative to the number of data points used to estimate the model is another consideration in evaluating a model. This is loosely referred to as the degrees of freedom of a model. A model with too many degrees of freedom is one that has more parameters than there are data points used to estimate them. Note that the number of data points is not necessarily the same as the number of data records. A computed average (e.g. the average amount spent by high-income customers) represents a single data point, even though it may be based on 500 records of customers characterized as high income. A consequence of having a model with too many degrees of freedom is that the model will be able to account for, or fit, the data of any outcome, but it may not be useful for predicting which particular outcome will occur.

In terms of degrees of freedom, and as a general rule, there should be more data points than parameters. The intuition for this can be illustrated as follows. Imagine a single data point in (x, y) coordinates. There are an infinite number of straight lines that can be drawn through this point. In other words, a linear model of this single data point will have an infinite number of degrees of freedom. Note that a straight-line equation has two parameters (slope and intercept). In this case, the number of parameters is greater than the number of data points -- and with an infinite number of possible lines, we have no way of modeling the data. As two points are needed to define a straight line, a second data point is necessary to determine the straight-line equation that fits this data space. With two parameters and two data points, our model fits the data exactly, but now we run the risk of overfitting the model. The fix is to add a third data point. Finding the best-fitting straight line through these three points will result in a line that passes close to, but not exactly through, the data points (unless the points are all perfectly aligned). Using three data points to fit two parameters gives us more data points than parameters, and a better chance of finding a model that fits the data with a reduced risk of overfitting. (In fact, this approach describes least-squares linear regression.)
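The one-, two-, and three-point cases above can be tried out with a small least-squares sketch. The function names and the sample points are hypothetical, and the code uses the textbook closed-form solution for a single predictor rather than any particular statistics package.

```python
def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) points."""
    n = len(points)
    if n < 2:
        # Two parameters but fewer than two points: infinitely many lines fit.
        raise ValueError("fewer data points than parameters")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are equal; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(points, slope, intercept):
    """Vertical distance from each point to the fitted line."""
    return [y - (slope * x + intercept) for x, y in points]


two_points = [(1, 2), (3, 3)]
three_points = [(1, 2), (3, 3), (2, 4)]

# Two points, two parameters: the line passes exactly through both.
m2, b2 = fit_line(two_points)
res2 = residuals(two_points, m2, b2)

# Three points, two parameters: the line passes near, not through, the points.
m3, b3 = fit_line(three_points)
res3 = residuals(three_points, m3, b3)

print(m2, b2, res2)
print(m3, b3, res3)
```

With two points the residuals are all zero, which says nothing about how well the line would predict a new point; with three points the non-zero residuals begin to show how far the data stray from the line.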

To recap, regardless of the particular model used, there are general measures available to assess a model's goodness of fit to the data. Accuracy and error rates can be broadly applied to any model used for a specific business purpose, such as classification. R-squared and adjusted R-squared measures can be broadly used to compare many types of statistically based models with each other. And last but not least, one can rely upon the degrees of freedom of a model in evaluating its goodness of fit and predictive utility.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com.
