BUILDING PREDICTIVE MODELS: PART I
by Ed Colet

Building predictive models, Part I: The notion of underfitting and overfitting.

Data mining and other analytical techniques are often carried out with the intention of developing a predictive model. A predictive model can take many forms: it may be a mathematical equation, a decision tree, a neural network, or a flow-chart diagram. While accurate predictive models can sometimes shed light on the underlying process that generated the data, they also provide expectations about data that have yet to be collected and stored. In Part I of this column, I discuss the process of modeling and the notions of underfitting and overfitting. I continue next week, in Part II, with a discussion of "goodness of fit" measures for evaluating the fit and performance of models.

Predictive modeling is useful in any situation where a decision or action has to be taken and there are adverse consequences associated with the wrong decision. Example scenarios are an individual's decision to buy or sell stock, or an institution's decision to grant or deny a loan application. A poor decision can be costly. Predictive models can be used in these scenarios to help make a decision, and thus serve as a risk management tool as well.

The process of developing a predictive model often involves three distinct but related data sets. The first is the training set: the data that are analyzed to produce an initial model. The second is the evaluation set. Once a model has been developed, it is tested on the evaluation set, which is important both for refining the model if necessary and for getting a sense of how well the model will perform on the data it will later be applied to. The third data set is the data that the final model will be applied to. One can, of course, build a model without creating training and evaluation sets, but one then runs the risk of not knowing whether the model is underfitting or overfitting the data before applying it to new data. Underfitting and overfitting are two problems that adversely affect the accuracy and usefulness of a model.
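As a rough illustration of the first two data sets, the following Python sketch shuffles a collection of records and sets a fraction aside for evaluation. The function name split_train_eval, the 30% evaluation fraction, and the made-up loan records are assumptions chosen for the example, not part of any particular tool.

```python
import random

def split_train_eval(records, eval_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training, evaluation)."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_eval = int(round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical loan applications, invented for illustration only.
applications = [{"id": i, "salary": 30000 + 1000 * i} for i in range(10)]
training_set, evaluation_set = split_train_eval(applications)
print(len(training_set), "training records,", len(evaluation_set), "evaluation records")
```

A random shuffle is only appropriate when the records have no meaningful order. For data ordered in time, such as daily closing prices, a random or interleaved split can hide overfitting.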

Underfitting refers to a model that is too general and fails to find interesting patterns in the data. This can result from not including important variables as inputs during the model building process. In terms of a loan application scenario, the analyst or model-builder may include annual salary as an important factor, but may exclude information about the applicant's job, which may turn out to be important. For example, jobs that are seasonal (swimming pool maintenance, landscaping, etc.) may affect the person's ability to submit monthly payments during the times when work is slower -- information that is not reflected by their annual salary. This might suggest that as much information as possible should be included as inputs during the model building process to avoid underfitting.
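The seasonal-job example can be sketched in a few lines of Python. The "model" here is a deliberately simple lookup table that predicts the most common outcome for each combination of input values. It is a stand-in chosen for brevity, and the records, field names, and helper functions are all made up for illustration.

```python
from collections import Counter, defaultdict

def fit_majority_table(rows, features):
    """Map each combination of feature values to its most common outcome."""
    counts = defaultdict(Counter)
    for row in rows:
        key = tuple(row[f] for f in features)
        counts[key][row["defaulted"]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def accuracy(table, rows, features, fallback=False):
    """Fraction of rows whose outcome the table predicts correctly."""
    hits = 0
    for row in rows:
        key = tuple(row[f] for f in features)
        hits += table.get(key, fallback) == row["defaulted"]
    return hits / len(rows)

# Invented records: applicants in the same salary band differ by job type.
rows = [
    {"salary_band": "mid", "seasonal_job": True, "defaulted": True},
    {"salary_band": "mid", "seasonal_job": True, "defaulted": True},
    {"salary_band": "mid", "seasonal_job": False, "defaulted": False},
    {"salary_band": "mid", "seasonal_job": False, "defaulted": False},
    {"salary_band": "mid", "seasonal_job": False, "defaulted": False},
    {"salary_band": "high", "seasonal_job": False, "defaulted": False},
]

salary_only = ["salary_band"]
with_job = ["salary_band", "seasonal_job"]
acc_salary_only = accuracy(fit_majority_table(rows, salary_only), rows, salary_only)
acc_with_job = accuracy(fit_majority_table(rows, with_job), rows, with_job)
print(f"salary only: {acc_salary_only:.2f}, salary plus job type: {acc_with_job:.2f}")
```

With salary as the only input, the table cannot separate the seasonal workers from the others in the same salary band, so it misses every default in this toy data. Adding the job-type input removes that blind spot.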

Including as much information as possible, however, can lead to the problem of overfitting. Overfitting refers to models that are too specific, or too sensitive to the particulars of the data (the training set) used to build the model. This can be due to using too many variables as inputs, to a non-representative training set, or to both. In the loan application scenario, if many of the people in the training set who defaulted on their loans happened to be named "Smith" (a popular name), then the model (e.g. a decision tree) may learn the rule that an applicant whose last name is "Smith" should be denied the loan. In refining the model, perhaps last name should not serve as an input. More importantly, the characteristics of the data used to build the model have to be representative of the data at large -- it's unlikely that people named Smith have a disproportionately high rate of defaulting on loans.

Overfitting can be detected and corrected using the evaluation data set. If accurate performance on the training set is due to particular characteristics of that data (e.g. a person's last name), then performance will be poor on the evaluation set, as long as the evaluation set does not share these idiosyncrasies. Refining the model (e.g. by pruning the decision tree) involves adjusting it until performance is roughly equivalent on both the training and evaluation data sets. The performance at that point generally gives the analyst an idea of how well the model may perform on "real" data.
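The comparison described above can be sketched in Python using the "Smith" scenario. The lookup-table model below is a simplified stand-in for a decision tree, and dropping the last-name input stands in for pruning. The applicant records, field names, and helper functions are invented for illustration.

```python
from collections import Counter, defaultdict

def fit(rows, features):
    """Learn the most common outcome for each combination of feature values."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[tuple(r[f] for f in features)][r["defaulted"]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def accuracy(model, rows, features):
    """Fraction predicted correctly; unseen combinations predict 'no default'."""
    correct = sum(
        model.get(tuple(r[f] for f in features), False) == r["defaulted"]
        for r in rows
    )
    return correct / len(rows)

def applicant(last_name, low_income, defaulted):
    return {"last_name": last_name, "low_income": low_income, "defaulted": defaulted}

# Made-up training set in which every Smith happens to have defaulted.
training = [
    applicant("Smith", True, True), applicant("Smith", False, True),
    applicant("Smith", True, True), applicant("Jones", False, False),
    applicant("Lee", True, True), applicant("Lee", False, False),
    applicant("Jones", False, False), applicant("Brown", True, False),
]
# Made-up evaluation set that does not share that quirk.
evaluation = [
    applicant("Smith", False, False), applicant("Smith", False, False),
    applicant("Garcia", True, True), applicant("Jones", False, False),
    applicant("Brown", True, False), applicant("Lee", False, False),
    applicant("Smith", True, True), applicant("Patel", False, True),
]

with_name = ["last_name", "low_income"]
without_name = ["low_income"]
results = {}
for label, features in (("with last name", with_name), ("without last name", without_name)):
    model = fit(training, features)
    results[label] = (accuracy(model, training, features),
                      accuracy(model, evaluation, features))
    print(label, "-> training %.2f, evaluation %.2f" % results[label])
```

In this toy data, the model that uses last name fits the training set perfectly but does much worse on the evaluation set. The model without last name scores lower on the training set but performs about the same on both sets, which is the pattern the refinement step aims for.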

Overfitting can still occur despite the use of training and evaluation data sets. This can be the result of creating the training and evaluation sets poorly, so that neither is representative of the subsequent data that the model will be applied to. For example, there was a student building a model to predict stock market performance. One year's worth of data were partitioned into equal-sized training and evaluation sets. A quantitative model was developed based on the training set, and its predictive power on this data set was impressive. It was equally impressive on the evaluation data set, apparently requiring no refinement. But when applied to the next year's data it performed miserably, despite the fact that there were no significant events related to the stock market. The reason had to do with how the training and evaluation data sets were created. They were based on the daily closing values of alternate days: day 1's close was assigned to the training set, day 2's to the evaluation set, day 3's to the training set, day 4's to the evaluation set, and so forth. As a result, the overfitting that occurred when the model picked up factors tied to the temporal fluctuations in the stock market's closing values was carried over into the evaluation set as well.
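The effect of the alternate-day split can be illustrated with a small Python sketch. The student's actual model and data are not described here, so everything in the sketch is an assumption: the closing values are a synthetic drift-plus-cycle series, and the "model" simply memorizes the training days and predicts the close of the nearest one. It compares the alternate-day split with a chronological split in which the later half of the series is held out.

```python
import math

def synthetic_close(day):
    """Made-up closing value: a slow upward drift plus a smooth cycle."""
    return 100 + 0.5 * day + 5 * math.sin(day / 5)

def nearest_day_prediction(day, training_days):
    """Predict the close of the closest training day (ties go to the earlier day)."""
    nearest = min(training_days, key=lambda d: (abs(d - day), d))
    return synthetic_close(nearest)

def mean_abs_error(eval_days, training_days):
    """Average absolute prediction error over the evaluation days."""
    errors = [abs(synthetic_close(d) - nearest_day_prediction(d, training_days))
              for d in eval_days]
    return sum(errors) / len(errors)

days = list(range(200))

# Alternate-day split: every evaluation day sits next to a training day.
alt_train, alt_eval = days[0::2], days[1::2]
# Chronological split: train on the first half, evaluate on the later half.
chrono_train, chrono_eval = days[:100], days[100:]

alternate_mae = mean_abs_error(alt_eval, alt_train)
chronological_mae = mean_abs_error(chrono_eval, chrono_train)
print(f"alternate-day split error: {alternate_mae:.2f}")
print(f"chronological split error: {chronological_mae:.2f}")
```

Because each evaluation day in the alternate-day split is adjacent to a training day, a model that has merely memorized the training values looks accurate on the evaluation set. Holding out a later block of days exposes the problem before the model is applied to the next year's data.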

To summarize, the process of model building can influence the eventual model that is developed. Underfitting and overfitting are two problems that can arise out of the process. Next week's column continues with a discussion of ways to quantitatively measure and evaluate the performance and the fit of predictive models.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com
