BUILDING PREDICTIVE MODELS: PART I
by Ed Colet
Building predictive models, Part I: The notion of underfitting and
overfitting.
Data mining and other analytical techniques are often carried out with the
intention of developing a predictive model. A predictive model can take
many forms. It may be a mathematical equation, a decision tree, a neural
network, or a flow-chart diagram. While accurate predictive models can
sometimes shed light on the underlying process that generated the data,
predictive models also provide one with expectations about data that have
yet to be collected and stored. In Part I of this column, I discuss the
process of modeling and the notions of underfitting and overfitting. I
continue next week, in Part II, with a discussion of "goodness of fit"
measures for evaluating the fit and performance of models.
Predictive modeling is useful in any situation where a decision or action
has to be taken and there are adverse consequences associated with the wrong
decision. Example scenarios are an individual's decision to buy or sell
stock, or an institution's decision to grant or deny a loan application. A
poor decision can be costly. Predictive models can be used in these
scenarios to help make a decision, and thus serve as risk management
tools as well.
The process of developing a predictive model often involves the use of three
distinct but related data sets. The first is the training set: the data
that are analyzed to produce an initial model. The second is the evaluation
set, on which the model is tested once it has been developed. It is
important for refining the model if necessary, and for getting a sense of
how well the model will perform on the data it will subsequently be applied
to. The third is the data to which the final model will be applied. One
can, of course, build a model without creating training and evaluation data
sets, but one then runs the risk of not knowing whether the model is
underfitting or overfitting the data before applying it to new data.
Underfitting and overfitting are two problems that will adversely affect the
accuracy and usefulness of the model.
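
As a rough illustration of the first two data sets, the sketch below divides a set of historical records into a training set and an evaluation set. The function name split_records, the 50/50 fraction, the fixed seed, and the placeholder records are all assumptions made for this example; they are not taken from any particular tool. The third data set, the new data the final model is applied to, does not exist yet at this stage.

```python
import random

def split_records(records, train_fraction=0.5, seed=0):
    """Shuffle a copy of the records, then cut it into training and evaluation sets."""
    shuffled = list(records)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder records standing in for historical cases (e.g. past loan applications).
historical = [{"id": i} for i in range(10)]
training_set, evaluation_set = split_records(historical)
```

A random shuffle is only one way to make the cut. For data that are ordered in time, it may not be the right choice.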
Underfitting refers to a model that is too general and fails to find
interesting patterns in the data. This can result from not including
important variables as inputs during the model building process. In terms
of a loan application scenario, the analyst or model-builder may include
annual salary as an important factor, but may exclude information about the
applicant's job, which may turn out to be important. For example, seasonal
jobs (swimming pool maintenance, landscaping, etc.) may affect a person's
ability to make monthly payments during the times when work is slower,
and this information is not reflected in the annual salary. This
might suggest that as much information as possible should be included as
inputs during the model building process to avoid underfitting.
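
The loan example can be made concrete with a toy sketch. Everything here is hypothetical: the six applicants, the salary thresholds, and both rules are invented purely to illustrate the point, and neither rule is a real credit model. The salary-only rule cannot tell seasonal workers apart from others earning the same amount, so it misses their defaults.

```python
# Hypothetical applicants: (annual_salary, has_seasonal_job, defaulted)
applicants = [
    (40000, True, True),
    (40000, False, False),
    (42000, True, True),
    (42000, False, False),
    (20000, False, True),
    (60000, True, False),
]

def salary_only_rule(salary, seasonal):
    """Too general: predicts default from salary alone and ignores the job."""
    return salary < 30000

def salary_and_job_rule(salary, seasonal):
    """Also uses the job information that the first rule leaves out."""
    return salary < 30000 or (seasonal and salary < 50000)

def accuracy(rule, rows):
    """Fraction of rows where the rule's prediction matches the actual outcome."""
    correct = sum(rule(salary, seasonal) == defaulted
                  for salary, seasonal, defaulted in rows)
    return correct / len(rows)

underfit_accuracy = accuracy(salary_only_rule, applicants)
richer_accuracy = accuracy(salary_and_job_rule, applicants)
```

On this invented data the salary-only rule gets four of six cases right, while the rule that also sees the job type gets all six.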
Including as much information as possible to develop a model can, however,
lead to the problem of overfitting. Overfitting refers to models that are too
specific, or too sensitive to the particulars of the data (the training set)
used to build the model. This can be due to having too many variables used as
inputs and/or a non-representative training set. In the loan application
scenario, if many of the people in the training set who defaulted on their
loans happened to be named "Smith" (a common name), then the model (e.g. a
decision tree) may learn to deny the loan whenever the applicant's last name
is "Smith". In refining the model, perhaps last name should not serve as an
input. More
importantly, the characteristics of the data used to build the model have to
be representative of the data at large -- it's unlikely that people named
Smith have a disproportionately high rate of defaulting on loans.
Overfitting can be detected and corrected using the evaluation data set. If
accurate performance on the training set is due to particular characteristics
in that data (e.g. a person's last name), then performance will be poor on
the evaluation set as long as the evaluation set does not share these
idiosyncrasies. Refining the model (e.g. by pruning the decision tree)
involves adjusting it until performance is roughly equivalent on both the
training and evaluation data sets. This will generally give the analyst an
idea of how well the model may perform on "real" data.
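
The "Smith" scenario and the training/evaluation comparison can be sketched together. The records, the majority-vote rule, and the function names fit_name_rule and accuracy are all hypothetical, chosen only to make the gap visible; a real decision tree would be built by a proper learning algorithm. The rule is learned from last names in the training set, then scored on both sets.

```python
from collections import defaultdict

# Hypothetical records: (last_name, defaulted)
training_set = [
    ("Smith", True), ("Smith", True), ("Smith", True),
    ("Jones", False), ("Lee", False), ("Patel", False),
]
evaluation_set = [
    ("Smith", False), ("Smith", False), ("Smith", True),
    ("Jones", False), ("Lee", False), ("Patel", False),
]

def fit_name_rule(rows):
    """For each last name, predict default if most training rows with it defaulted."""
    counts = defaultdict(lambda: [0, 0])     # name -> [defaults, total]
    for name, defaulted in rows:
        counts[name][0] += int(defaulted)
        counts[name][1] += 1
    return {name: defaults * 2 > total
            for name, (defaults, total) in counts.items()}

def accuracy(rule, rows):
    """Fraction of rows predicted correctly; unseen names are predicted as no default."""
    correct = sum(rule.get(name, False) == defaulted for name, defaulted in rows)
    return correct / len(rows)

rule = fit_name_rule(training_set)
train_accuracy = accuracy(rule, training_set)
eval_accuracy = accuracy(rule, evaluation_set)
gap = train_accuracy - eval_accuracy         # a large gap hints at overfitting
```

Here the rule is perfect on the training set but wrong on two of the six evaluation records. This kind of gap would prompt the analyst to drop last name as an input or otherwise simplify the model.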
Overfitting can still occur despite the use of a training and evaluation
data set. This can result from constructing the training and evaluation
sets poorly, so that neither is representative of the subsequent data that
the model will be applied to. For example, a student was building a model
to predict stock market performance. One year's worth of data was
partitioned into equal-sized training and evaluation sets. A quantitative
model was developed based on the training set, and its predictive power on
this data set was impressive. It was equally impressive on the evaluation
data set, apparently requiring no refinement. But when applied to the next
year's data it performed miserably, despite the fact that there were no
significant events related to the stock market. The reason had to do with
how the training and evaluation data sets were created. The training and
evaluation sets were based on the daily closing values of alternate days.
Day 1's close was assigned to the training set, day 2 to the evaluation set,
day 3 to the training set, day 4 to the evaluation set, and so forth.
Because each evaluation day sat next to a training day, the two sets
shared the same idiosyncrasies. As a result, the overfitting that occurred
when the model picked up factors tied to the temporal fluctuations in the
stock market's closing values was carried over into the evaluation set as well.
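
The alternate-day partition can be written out in a few lines. For contrast, the sketch also shows a chronological split, where the evaluation set comes entirely after the training set. That alternative is an assumption added here as a common remedy; it is not something the student in the example did. Day numbers stand in for the closing values, and all function names are hypothetical.

```python
def alternate_day_split(days):
    """Days 1, 3, 5, ... go to training; days 2, 4, 6, ... go to evaluation."""
    return days[0::2], days[1::2]

def chronological_split(days, train_fraction=0.5):
    """Earlier days go to training; all later days go to evaluation."""
    cut = int(len(days) * train_fraction)
    return days[:cut], days[cut:]

def nearest_training_gap(train_days, eval_days):
    """For each evaluation day, the distance in days to the closest training day."""
    return [min(abs(e - t) for t in train_days) for e in eval_days]

days = list(range(1, 11))                    # day numbers 1..10 as placeholders
alt_train, alt_eval = alternate_day_split(days)
chrono_train, chrono_eval = chronological_split(days)

alt_gaps = nearest_training_gap(alt_train, alt_eval)
chrono_gaps = nearest_training_gap(chrono_train, chrono_eval)
```

With alternate days, every evaluation day is exactly one day away from a training day, so the evaluation set is hardly an independent check. With the chronological split, the gaps grow as the evaluation period moves away from the training period, which is closer to how the model will actually be used on next year's data.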
To summarize, the process of model building can influence the eventual model
that is developed. Underfitting and overfitting are two problems that can
arise out of the process. Next week's column continues with a discussion of
ways to quantitatively measure and evaluate the performance and the fit of
predictive models.
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York University's
Department of Psychology. Ed has also worked for IBM Research at the T.J.
Watson Research Center. At IBM, Ed was a member of the group that developed
Advanced Scout, the data mining application for NBA teams. His research
interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com