ANALYZING HIGH-DIMENSIONAL DATA
by Ed Colet

Data mining and decision-making are typically conducted on very large data sets containing numerous attributes (dimensions) as well as numerous records. As the number of dimensions grows, the number of possible interactions and interdependencies among them grows exponentially. Interesting patterns in the data are often found by identifying and reporting meaningful interactions. In this column, I describe how current approaches to finding such interactions in large data sets capture only a snapshot of a pattern in the data, when what may be important is a dynamic view of the pattern(s). Mathematical frameworks and techniques for achieving this more dynamic view already exist, and are outlined in general terms here.
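
To make the exponential growth concrete, here is a minimal sketch that counts candidate interactions. It assumes that an "interaction" means any subset of two or more attributes, which is my simplification rather than a definition given in this column; the function name is hypothetical.

```python
from math import comb

def interaction_count(d):
    """Count the attribute subsets of size 2..d among d attributes.

    Assumption: every subset of two or more attributes is one
    candidate interaction. The closed form is 2**d - d - 1.
    """
    return sum(comb(d, k) for k in range(2, d + 1))

for d in (5, 10, 20):
    print(d, interaction_count(d))
```

Under that assumption, doubling the number of attributes from 10 to 20 takes the count from about a thousand to about a million.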

Many algorithms are currently implemented for detecting patterns in large amounts of data. Most of them share one quality: the reported result represents a "snapshot" of a relationship among some of the attributes in the data set. Association rule analysis, a common example, finds patterns in the data that are expressed as rules of the form "if A then B". A customer retention analysis may find a rule that if the customer's age is between 26 and 35, then the retention rate is "Low". Simple rules, and even more complex patterns involving more attributes, all share the quality that they essentially represent a static summary of some trend in the data. Because the description is static, it does not make the nature of the interaction or dependency clear.
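
As a rough illustration of what such a rule summarizes, the sketch below computes the two statistics usually reported for an association rule, support and confidence, for the retention example. The records are invented for illustration only, and the function name is hypothetical.

```python
# Toy records invented for illustration: (age_band, retention).
records = [
    ("26-35", "Low"),
    ("26-35", "Low"),
    ("26-35", "High"),
    ("36-45", "High"),
    ("36-45", "High"),
    ("18-25", "Low"),
    ("26-35", "Low"),
    ("36-45", "Low"),
]

def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for 'if antecedent then consequent'."""
    n = len(records)
    # Records that match the "if" part of the rule.
    n_a = sum(1 for age, _ in records if age == antecedent)
    # Records that match both the "if" and the "then" part.
    n_ab = sum(1 for age, ret in records
               if age == antecedent and ret == consequent)
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

support, confidence = rule_stats(records, "26-35", "Low")
print(support, confidence)
```

The two numbers describe how often the rule holds, but nothing about why it holds, which is the sense in which the rule is a static snapshot.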

For high-dimensional data sets, what is also important, and potentially more interesting, is the discovery of complex patterns in which the nature of the attributes' interdependencies can be revealed and understood. Doing so requires going beyond results that represent static snapshots of the data. Fortunately, the mathematical framework for presenting a dynamic view of high-dimensional data already exists.

Mathematically speaking, it is a basic fact that the number of coordinates (numbers) needed to specify a point equals the number of dimensions of the space. Two coordinates (x, y) are necessary to represent a point in 2-dimensional space. Three coordinates are necessary to represent a point in 3-dimensional space. A 4-dimensional entity, such as those of Einstein's space-time theories, can be thought of as a 3-dimensional object represented over time (the fourth dimension). Thus, by extension, n coordinates are necessary to represent a point in n-dimensional space.

One useful technique for comprehending an n-dimensional space is to represent it as a sequence of projections in fewer than n dimensions. This can be illustrated by taking a 3-dimensional object (e.g. a chair) and viewing it as a sequence of 2-dimensional images (e.g. its shadow throughout the day). At certain times (with the sun directly overhead) it may not be apparent that the shadow is that of a chair, and at other times the 2-dimensional shadow will show some distorted lengths and skewed angles. But over the sequence of views, the object can be completely recognized and understood as a chair.
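
The shadow analogy can be sketched in a few lines. The example below uses a unit cube in place of the chair, rotates it about the vertical axis, and drops the depth coordinate to obtain each 2-dimensional view. The cube, the choice of angles, and the function names are all assumptions made for illustration.

```python
import math
from itertools import product

def rotate_y(point, theta):
    """Rotate a 3-D point about the y-axis by theta radians."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)

def project_xy(point):
    """Orthographic projection: keep x and y, discard depth (z)."""
    x, y, _ = point
    # Rounding merges points that differ only by floating-point noise.
    return (round(x, 6), round(y, 6))

def shadow(points, theta):
    """Return the set of distinct 2-D points seen at rotation theta."""
    return {project_xy(rotate_y(p, theta)) for p in points}

# The eight corners of a unit cube stand in for the 3-D object.
cube = list(product((0.0, 1.0), repeat=3))

views = {deg: shadow(cube, math.radians(deg)) for deg in (0, 30, 60, 90)}
for deg, pts in views.items():
    print(deg, len(pts))
```

At 0 and 90 degrees pairs of corners fall on top of each other, so only four distinct points are visible and the view looks like a flat square. At 30 and 60 degrees all eight corners are separated. No single view is sufficient, but the sequence of views reveals the cube.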

Applying this same principle to data analysis involves developing appropriate "dimension reduction" algorithms for large data sets, together with graphics and animation techniques, so that the process of data analysis becomes an observational science. (To some extent, techniques such as projection pursuit represent this approach.) As we know, observation is an integral part of scientific understanding. In terms of the cognitive aspects of pattern perception, it is known that users can visually detect interesting patterns and relationships in an appropriate visualization even before they fully understand them. To return to the earlier customer retention example: rather than simply reporting that 26-35 year olds have a low retention rate, it would be possible for the user to actually observe the factors in the data that may be involved in the failure to retain customers in this age group. Detecting patterns in this way moves from the current "single snapshot" views of data toward one in which the underlying process that generated the data in the first place can be observed, and ultimately better understood.
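
To give a flavor of the idea behind projection pursuit, the following sketch scans a set of 1-dimensional projections of a small 2-dimensional data set and scores each one with a "projection index". It is a brute-force toy under stated assumptions, not a full projection-pursuit implementation: the data are synthetic, the angle grid is coarse, and the use of kurtosis as the index is my choice for illustration.

```python
import math

def variance_and_kurtosis(values):
    """Return (variance, kurtosis) using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m2, m4 / (m2 ** 2)

def project(points, degrees):
    """Project 2-D points onto the unit direction at the given angle."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [c * x + s * y for x, y in points]

# Synthetic data: two groups at x = -1 and x = +1, each spread
# evenly along y from -5 to 5. The groups are the "pattern".
points = [(x, -5.0 + 0.5 * k) for x in (-1.0, 1.0) for k in range(21)]

angles = range(0, 180, 15)
stats = {a: variance_and_kurtosis(project(points, a)) for a in angles}

# Index 1: largest variance (the direction of greatest spread).
max_var_angle = max(angles, key=lambda a: stats[a][0])
# Index 2 (assumed): smallest kurtosis, which favors two-group views.
min_kurt_angle = min(angles, key=lambda a: stats[a][1])

print(max_var_angle, min_kurt_angle)
```

In this toy data the direction of greatest spread (90 degrees, the y-axis) hides the two groups, while the low-kurtosis direction (0 degrees, the x-axis) separates them. Which projections count as "interesting" therefore depends on the index chosen, and stepping through the projections in sequence is what gives the user something to observe.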


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com
