ANALYZING HIGH-DIMENSIONAL DATA
by Ed Colet
Data mining and decision-making are typically conducted on very large data
sets containing numerous attributes (dimensions) as well as numerous
records. As the number of dimensions grows, the number of possible
interactions and interdependencies among dimensions grows exponentially.
Interesting patterns in the data are often based on finding and reporting
meaningful interactions. In this column, I describe how current approaches to
finding such interactions in large data sets yield only a snapshot of a
pattern in the data -- when what may be important is a dynamic view of
the pattern(s). Mathematical frameworks and techniques for achieving this more
dynamic view already exist, and are outlined here in general terms.
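To make the exponential growth concrete, here is a minimal sketch that counts candidate interactions. It assumes that an "interaction" means any subset of two or more attributes; that definition and the function name count_interactions are illustrative assumptions, not something this column specifies.

```python
from math import comb

def count_interactions(n_attributes: int) -> int:
    """Count attribute subsets of size 2 or more (assumed meaning of 'interaction')."""
    return sum(comb(n_attributes, k) for k in range(2, n_attributes + 1))

# The count equals 2**n - n - 1, so each added attribute roughly doubles it.
for n in (5, 10, 20, 30):
    print(n, count_interactions(n))
```

Under this assumption, even a modest data set with 30 attributes has over a billion candidate attribute combinations to examine.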
There are many algorithms currently implemented for detecting patterns in
large amounts of data. Most of them share the quality that the reported result
represents a "snapshot" of a relationship among some of the attributes in the
data set. Association rule analysis, a common example, finds patterns in the
data that are expressed as rules of the form "if A then B". A customer
retention analysis may find the rule that if the customer's age is between 26
and 35, then the retention rate is "Low". Simple rules and more complex
patterns (involving more attributes) alike are essentially static summary
descriptions of some trend in the data. Because the description is static, the
nature of the underlying interaction or dependency remains unclear.
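As a rough illustration of how static such a rule is, the sketch below scores the age/retention rule on a handful of records using support and confidence, the two measures conventionally attached to association rules. The records are invented for illustration and the helper rule_stats is hypothetical; no real customer data or specific mining library is implied.

```python
# Hypothetical toy records: (age, retention) pairs invented for illustration.
records = [
    (28, "Low"), (31, "Low"), (34, "High"), (27, "Low"),
    (45, "High"), (52, "High"), (39, "Low"), (61, "High"),
]

def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule 'if antecedent then consequent'."""
    matches_a = [r for r in records if antecedent(r)]
    matches_both = [r for r in matches_a if consequent(r)]
    support = len(matches_both) / len(records)
    confidence = len(matches_both) / len(matches_a) if matches_a else 0.0
    return support, confidence

support, confidence = rule_stats(
    records,
    antecedent=lambda r: 26 <= r[0] <= 35,   # age between 26 and 35
    consequent=lambda r: r[1] == "Low",      # retention is "Low"
)
print(f"support={support:.3f} confidence={confidence:.3f}")
```

The whole rule collapses to two numbers. They say how often the pattern holds, but nothing about which other attributes drive it.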
For high-dimensional data sets, what is also important, and potentially more
interesting, is the discovery of complex patterns in which the nature of the
attributes' interdependencies can be revealed and understood. Doing so
requires going beyond results that represent static snapshots of the data.
Fortunately, the mathematical framework for presenting a dynamic view of
high-dimensional data already exists.
Mathematically speaking, it is a basic fact that the number of coordinates
(or numbers) needed to represent a point equals the number of dimensions of
the space. Two coordinates (x, y) are necessary to represent a point in
2-dimensional space. Three numbers are necessary to represent a point in
3-dimensional space. A 4-dimensional entity, such as those of Einstein's
space-time theories, is a 3-dimensional object represented over time (a
fourth dimension). Thus, by extension, n coordinates are necessary to
represent a point in n-dimensional space.
One useful technique for comprehending an n-dimensional space is to represent
it as a sequence of projections in fewer than n dimensions. This can be
illustrated by taking a 3-dimensional object (e.g. a chair) and viewing it as
a sequence of 2-dimensional images (e.g. its shadow throughout the day). At
certain times (with the sun directly overhead) it may not be apparent that
the shadow is that of a chair -- and at other times the 2-dimensional shadow
will show some distorted lengths and skewed angles. But over the sequence of
views, the object becomes completely recognizable and can be understood as a
chair.
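The shadow analogy can be sketched in a few lines. This is a minimal illustration, assuming a crude six-point "chair" and simple orthographic projections (dropping one coordinate after a rotation). The point set and the function names are hypothetical and chosen only to show how the views differ.

```python
import math

# Hypothetical "chair": four seat corners plus two top-of-backrest points (x, y, z).
chair = [
    (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1),  # seat at height 1
    (0, 0, 2), (0, 1, 2),                        # top of the backrest
]

def rotate_about_vertical(point, angle):
    """Rotate a 3-D point about the z (vertical) axis by angle in radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def project_side(points, angle):
    """Rotate the object, then drop y: a 2-D (x, z) side view."""
    rotated = (rotate_about_vertical(p, angle) for p in points)
    return [(round(x, 3), z) for x, _, z in rotated]

def project_top(points):
    """Drop z: the 2-D (x, y) view from directly overhead."""
    return [(x, y) for x, y, _ in points]

# A sequence of side views, followed by the overhead view.
for degrees in (0, 30, 60, 90):
    print(degrees, sorted(set(project_side(chair, math.radians(degrees)))))
print("overhead", sorted(set(project_top(chair))))
```

In the overhead view the backrest points land exactly on two seat corners, so the shadow is just a square. The side views show the backrest rising above the seat, and only the sequence of views together conveys the full shape.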
Applying this same principle to data analysis involves the development of
appropriate "dimension reduction" algorithms applied to large data sets as
well as graphics and animation techniques so that the process of data analysis
becomes an observational science. (To some extent, techniques such as
projection pursuit represent this approach.) As we know, observation is an
integral part of scientific understanding. In terms of the cognitive aspects
of pattern perception, it is known that users can visually detect interesting
patterns and relationships in an appropriate visualization even before they're
fully understood. To return to the earlier example in customer retention --
rather than simply reporting that 26-35 year olds have a low retention rate,
it would be possible for the user to actually observe possible factors in the
data that are involved in the failure to retain customers in this age group.
Detecting patterns in this way moves from the current "single snapshot" views
of data towards one in which the underlying process that generated the data in
the first place can be observed -- and ultimately better understood.
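The idea of searching over projections can be sketched as follows. This is a deliberately simplified stand-in for projection pursuit, not the method itself. It scans 1-dimensional projections of synthetic 2-dimensional data and keeps the direction with the highest variance. Variance is an assumed "interestingness" index chosen for simplicity, and it makes the search equivalent to finding the first principal component. Projection pursuit proper typically uses indices that measure departure from normality. All names and data here are hypothetical.

```python
import math
import random

def projection_index(values):
    """Variance of the projected values (an assumed, simplest-possible index)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_projection(points, steps=180):
    """Scan 1-D projection directions over [0, pi) and return (angle, score)."""
    best_angle, best_score = 0.0, -1.0
    for i in range(steps):
        angle = math.pi * i / steps
        ux, uy = math.cos(angle), math.sin(angle)
        score = projection_index([x * ux + y * uy for x, y in points])
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score

# Synthetic data: points spread along the 45-degree diagonal plus small noise.
rng = random.Random(0)
points = []
for _ in range(200):
    t = rng.uniform(-1, 1)
    points.append((t + rng.gauss(0, 0.05), t + rng.gauss(0, 0.05)))

angle, score = best_projection(points)
print(f"best direction is about {math.degrees(angle):.0f} degrees")
```

Animating the intermediate projections, rather than reporting only the winning angle, is what would turn this search into the kind of observable sequence of views described above.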
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com