
ON THE CONTRIBUTION OF PARTS TO A WHOLE
by Ed Colet, Virtual Gold, Inc.


Current technologies make it relatively easy and cost-effective to collect and store large amounts of data. As a consequence, large databases now have more and more fields, and such data sets contain too many variables or attributes to analyze conveniently "by hand". This has given rise to technologies such as data mining, in which significant patterns made up of unexpected but meaningful combinations of attributes can be discovered automatically. But having more attributes means that more combinations of those attributes become possible. Discovering an interesting combination of attributes is one thing; knowing which of the attributes within a pattern are most (or least) important is another.

This weekend, the National Football League (NFL) will conduct its annual draft, in which professional NFL teams select new players from the college ranks. Deciding whom to select is not a decision to be taken lightly, since the choice represents a multi-million dollar, (hopefully) long-term, investment in a player by a team. So, prior to draft day, a lot of research is done. A player's collegiate career statistics are extensively analyzed, and video footage of his games is reviewed at length. During the NFL-wide scouting camp (the Combine), players' sprint times, strength tests, and other physical measurements are taken. Teams interested in particular players ask them to come in for private workouts as well. Height, weight, etc. are measured and re-measured. Personality questionnaires and other paper-and-pencil evaluations, such as the Wonderlic test, are also used. All of this is intended to indicate whether a team should select the player on draft day and whether it is likely to recoup its investment via more wins, more ticket sales, and ultimately a Super Bowl championship.

In the final days before the selections, it's interesting to note which aspects of a player are considered. Beyond the routine considerations of strength and quickness, more tangential aspects receive attention. Aspects that seem to have little to do with performance on the field, such as whether the player braids his hair or has tattoos, are noted. The New York Times quotes a former scout, Russell Lande, who notes that these days, "more and more teams are paying attention to the things that count the least. . .". In general, devoting attention to things that don't count isn't productive.

In the context of data mining, sets of attributes that affect a measure of interest are automatically discovered by the software. For example, a predictive data-mining model of the "number of sales" may discover that the "discount amount" and the "temperature of the day" are relevant attributes. But are these attributes equally relevant? How should one act upon this information? Oftentimes we know which attributes are important, but not how or why this is so. In some cases, the circumstances dictate how to treat an attribute. In this example, the discount amount can be acted upon directly. Temperature is not controllable in the same way. One may decide to set prices based upon the temperature, but without knowing the extent to which temperature affects sales, it's not known whether doing so would be worthwhile. For more complex patterns that contain more attributes, the appropriate actions may not be clear.

One solution would be to borrow from techniques in statistics. In a regression model to predict the number of sales, the relative contribution of "temperature" and "discount amount" can be determined. The coefficients of the resulting regression equation tell us how much the "number of sales" is expected to change for each unit change in "temperature" and for each unit change in "discount amount", with the other attribute held constant. The overall contribution of the model, that is, the proportion of variance accounted for by these parameters, is also readily computable and is typically expressed as the R-squared value. This is one reason that regression is so powerful. But it may not be possible to develop an analogous way to measure the contribution of an attribute in certain data mining situations. For example, accounting for variance may not be the appropriate thing to measure, and/or the assumptions necessary for a regression approach may not hold in the data mining context. In certain data mining techniques, e.g., neural networks, the contribution of each attribute used as a feature during training cannot be read directly from the network once it is trained and used for decision making.
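
As a rough illustration of the regression approach described above, the following Python sketch fits "number of sales" to "temperature" and "discount amount" by ordinary least squares and reports the coefficients and the R-squared value. Everything in it is an assumption made for illustration: the data are synthetic, the "true" effects (2 sales per degree, 3 sales per unit of discount) are invented, and the function names fit_ols, r_squared, and std_dev are hypothetical helpers rather than part of any particular data mining product.

```python
import random


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows holds one list of predictor values per observation; an
    intercept column is added here. Returns [intercept, b1, b2, ...].
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(rows, y, coefs):
    """Proportion of the variance in y accounted for by the fitted model."""
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def std_dev(values):
    """Sample standard deviation."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5


# Synthetic data (an assumption, not real sales figures).
rng = random.Random(0)
rows, sales = [], []
for _ in range(200):
    temperature = rng.uniform(0, 35)   # temperature of the day
    discount = rng.uniform(0, 20)      # discount amount
    rows.append([temperature, discount])
    sales.append(50 + 2.0 * temperature + 3.0 * discount + rng.gauss(0, 5))

intercept, b_temp, b_disc = fit_ols(rows, sales)
r2 = r_squared(rows, sales, [intercept, b_temp, b_disc])

# Standardized coefficients put both attributes on a common scale.
sd_sales = std_dev(sales)
beta_temp = b_temp * std_dev([r[0] for r in rows]) / sd_sales
beta_disc = b_disc * std_dev([r[1] for r in rows]) / sd_sales

print(f"change in sales per unit of temperature: {b_temp:.2f}")
print(f"change in sales per unit of discount:    {b_disc:.2f}")
print(f"standardized coefficients: {beta_temp:.2f}, {beta_disc:.2f}")
print(f"R-squared: {r2:.3f}")
```

The raw coefficients depend on the units in which each attribute is measured, so the sketch also prints standardized coefficients, which are one common (though not the only) way to compare the relative contribution of attributes measured on different scales.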

Nevertheless, whenever possible and appropriate, a measure of how much a particular attribute contributes to a pattern should be reported. Doing so makes it easier to interpret and act upon what's discovered. If such a measure showed that braided hair or tattoos really did contribute, considering them in the context of football wouldn't seem so odd after all.

---

Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

