CLUSTERING AND CLASSIFICATION: DATA MINING APPROACHES
by Ed Colet
Two common data mining techniques for finding hidden patterns in data are
clustering and classification analyses. Although classification and
clustering are often mentioned in the same breath, they are different
analytical approaches. In this column, I describe the similarities and
differences between these related but distinct approaches.
Imagine a database of customer records, where each record represents a
customer's attributes. These can include identifiers such as name and
address, demographic information such as gender and age, and financial
attributes such as income and revenue spent. Clustering is an automated
process to group related records together. Related records are grouped
together on the basis of having similar values for attributes. This approach
of segmenting the database via clustering analysis is often used as an
exploratory technique because it is not necessary for the end-user/analyst to
specify ahead of time how records should be related together. In fact, the
objective of the analysis is often to discover segments or clusters, and then
examine the attributes and values that define the clusters or segments. As
such, interesting and surprising ways of grouping customers together can
become apparent, and this in turn can be used to drive marketing and promotion
strategies to target specific types of customers.
There are a variety of algorithms used for clustering, but many of the most
common ones share the property of iteratively assigning records to a cluster,
calculating a measure (usually similarity and/or distinctiveness), and
re-assigning records to clusters until the calculated measures stop changing
appreciably, indicating that the process has converged to stable segments.
Records within a cluster are
more similar to each other, and more different from records that are in other
clusters. Depending on the particular implementation, there are a variety of
measures of similarity that are used (e.g. based on spatial distance, based on
statistical variability, or even adaptations of Condorcet values used in
voting schemes), but the overall goal is for the approach to converge to
groups of related records.
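As a rough illustration of the assign, measure, and re-assign loop described above, here is a minimal k-means style sketch in Python. It is only one of the many possible clustering algorithms, it uses squared spatial distance as its similarity measure, and the customer data (age, income in thousands), the function names, and the parameters are all hypothetical, chosen for this example rather than taken from any particular product.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(records):
    """Attribute-wise mean of a non-empty list of records."""
    return tuple(sum(col) / len(records) for col in zip(*records))

def kmeans(records, k, max_iter=100, tol=1e-9, seed=0):
    """Group records into k clusters by iterative re-assignment."""
    rng = random.Random(seed)
    centers = rng.sample(records, k)  # start from k randomly chosen records
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for r in records:
            nearest = min(range(k), key=lambda i: dist2(r, centers[i]))
            clusters[nearest].append(r)
        # Update step: each center moves to the mean of its current members.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Convergence check: stop once the centers barely move.
        shift = max(dist2(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, clusters

# Hypothetical customers: (age, income in thousands of dollars).
customers = [(22, 25), (25, 28), (24, 30), (55, 90), (60, 95), (58, 88)]
centers, clusters = kmeans(customers, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```

Note that the analyst supplies only the number of clusters, not their definitions; the segments themselves emerge from the data. In practice, attributes on very different scales would normally be standardized before distances are computed.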
Classification is a different technique than clustering. Classification is
similar to clustering in that it also segments customer records into distinct
segments called classes. But unlike clustering, a classification analysis
requires that the end-user/analyst know ahead of time how classes are defined.
For example, classes can be defined to represent the likelihood that a
customer defaults on a loan (Yes/No). It is necessary that each record in the
dataset used to build the classifier already have a value for the attribute
used to define classes. Because each record has a value for the attribute
used to define the classes, and because the end-user decides on the attribute
to use, classification is much less exploratory than clustering. The
objective of a classifier is not to explore the data to discover interesting
segments, but rather to decide how new records should be classified -- i.e. is
this new customer likely to default on the loan?
Classification routines in data mining also use a variety of algorithms --
and the particular algorithm used can affect the way records are classified. A
common approach for classifiers is to use decision trees to partition and
segment records. New records can be classified by traversing the tree from
the root through branches and nodes, to a leaf representing a class. The path
a record takes through a decision tree can then be represented as a rule. For
example, "If Income<$30,000 and age<25 and debt=High, then Default
Class=Yes."
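To make the traversal concrete, the following Python sketch classifies a record by walking a small hand-built tree from the root to a leaf and reports the path as a rule. The tree, its thresholds, the attribute names, and the helper functions are hypothetical, modeled loosely on the example rule above; a real classifier would learn the tree from labeled records.

```python
# Hypothetical tree: internal nodes hold a test, leaves hold a class label.
TREE = {
    "test": ("income", "<", 30000),
    "yes": {
        "test": ("age", "<", 25),
        "yes": {
            "test": ("debt", "=", "High"),
            "yes": {"label": "Yes"},
            "no": {"label": "No"},
        },
        "no": {"label": "No"},
    },
    "no": {"label": "No"},
}

def passes(record, test):
    """Evaluate one node's test against a record."""
    attr, op, value = test
    return record[attr] < value if op == "<" else record[attr] == value

def classify(record, node=TREE):
    """Walk from the root to a leaf; return (class label, rule text)."""
    conditions = []
    while "label" not in node:
        attr, op, value = node["test"]
        if passes(record, node["test"]):
            conditions.append(f"{attr}{op}{value}")
            node = node["yes"]
        else:
            # Record the negated test so the rule describes the path taken.
            negated = ">=" if op == "<" else "!="
            conditions.append(f"{attr}{negated}{value}")
            node = node["no"]
    rule = ("If " + " and ".join(conditions)
            + f", then Default Class={node['label']}")
    return node["label"], rule

new_customer = {"income": 22000, "age": 23, "debt": "High"}
label, rule = classify(new_customer)
print(label)
print(rule)
```

Each leaf corresponds to exactly one such rule, so the whole tree can be read as a set of mutually exclusive rules.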
But the sequential way in which a decision tree splits records (i.e. the
most discriminative attribute-values [e.g. Income] appear early in the tree)
can make a tree overly sensitive to its initial splits. Therefore, in
evaluating the goodness of fit of a tree, it is
important to examine the error rate for each leaf node (proportion of records
incorrectly classified). A nice property of decision tree classifiers is
that, because paths can be expressed as rules, the measures used to evaluate
the usefulness of rules, such as Support, Confidence and Lift, can also be
used to evaluate the usefulness of the tree.
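As a sketch of how these measures might be computed for a single leaf's rule, the Python below uses the usual association-rule definitions: support is the fraction of all records that match both the rule's conditions and its class, confidence is the fraction of records matching the conditions that also have the class, and lift is confidence divided by the overall frequency of the class. The ten records and the function name are hypothetical, invented for illustration only.

```python
def rule_metrics(records, antecedent, consequent):
    """Support, confidence, lift, and leaf error rate for one rule.

    antecedent and consequent are functions that take a record and
    return True or False.
    """
    n = len(records)
    covered = [r for r in records if antecedent(r)]   # records reaching the leaf
    correct = [r for r in covered if consequent(r)]   # ...that have the leaf's class
    class_count = sum(1 for r in records if consequent(r))
    if not covered or class_count == 0:
        raise ValueError("rule covers no records or class never occurs")
    support = len(correct) / n
    confidence = len(correct) / len(covered)
    lift = confidence / (class_count / n)
    error_rate = 1 - confidence  # proportion of the leaf's records misclassified
    return support, confidence, lift, error_rate

# Hypothetical records: (low_income, defaulted).
records = (
    [(True, True)] * 3       # low income, defaulted
    + [(True, False)] * 1    # low income, did not default
    + [(False, True)] * 1    # higher income, defaulted
    + [(False, False)] * 5   # higher income, did not default
)

support, confidence, lift, error_rate = rule_metrics(
    records,
    antecedent=lambda r: r[0],   # "Income is low"
    consequent=lambda r: r[1],   # "Default Class=Yes"
)
print(support, confidence, lift, error_rate)
```

A lift well above 1 suggests that the leaf identifies defaulters much better than the overall default rate would, while a high error rate at a leaf flags a path whose predictions should be treated with caution.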
To conclude, although clustering and classification are often used for
purposes of segmenting data records, they have different objectives and
achieve their segmentations in different ways. Knowing which approach to
use is important for decision-making.
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com