
WHEN SMALL CHANGES HAVE LARGE EFFECTS
by Ed Colet, Virtual Gold, Inc.


Data mining software can process large amounts of data and detect patterns that people did not know existed. For the most part, this is done by discovering unexpected associations between attributes (variables). The unexpected association between sales of beer and diapers is an example that has been cited ad nauseam in anything written about data mining (and my apologies for now being guilty of doing it myself). An association between variables may be useful knowledge by itself, and it can be even more useful if it turns out to reflect a causal relationship. This column addresses the usefulness of discovering subtle patterns - those characterized by small changes in one attribute having a significant effect upon another.

Consider two attributes, referred to as 'X' and 'Y'. 'Y' is the measure or variable that the business or organization wants to maximize (e.g., profit or sales). In more formal terms, 'Y' is the dependent variable, and 'X' is one of many independent variables. A subtle but important pattern is one in which a small change in 'X' has a large and significant effect on 'Y'. The pattern is subtle because changes in 'X' are so small that they are not considered important and are therefore overlooked. It is also subtle because the relationship between 'X' and 'Y' may not be known or expected. Some examples can further clarify this:

In the economic/political domain, consider "economic growth" to be 'X' and "surplus amount" to be 'Y'. What is not immediately obvious is that a subtle change in 'X' can have a large effect upon 'Y'. For example, the current Clinton budget proposal assumes a projected economic growth rate of 1.7%. If this were just 0.5% higher, and compounded over a few years, the result would be trillions (!) of dollars more in surplus, which could then be used for purposes such as additional funding for economic programs or tax cuts. This half-percent increase in economic growth represents a small change because economic growth of 1.7%, and possibly even up to 2.2%, is by historical standards very low. The point is that a small change in 'X' (growth rate) can have a surprisingly large and meaningful effect on 'Y' (surplus). The size of the effect on 'Y' is what is surprising - although the association between 'X' and 'Y' may have been known in this case.
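The compounding behind this example can be illustrated with a minimal sketch. It does not model the budget surplus itself; it only shows how the gap between two growth paths widens over time. The baseline of 100 units, the chosen year counts, and the function name `compound` are assumptions made for illustration. Only the two growth rates (1.7% and 2.2%) come from the text.

```python
def compound(base, rate, years):
    """Value of `base` after growing at `rate` per year for `years` years."""
    return base * (1 + rate) ** years


# Hypothetical baseline of 100 units (an assumption, not a figure from the text).
base = 100.0

for years in (1, 5, 10, 20):
    low = compound(base, 0.017, years)   # growth rate assumed in the proposal
    high = compound(base, 0.022, years)  # the same rate plus half a percent
    gap = high - low                     # how far apart the two paths have drifted
    print(f"{years:>2} years: 1.7% -> {low:7.2f}   2.2% -> {high:7.2f}   gap = {gap:5.2f}")
```

The gap is half a unit after one year and keeps widening with every additional year, which is the sense in which a small change in 'X' becomes a large change in 'Y' once it is compounded.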

Consider a second example drawn from the sports domain, where an association between an 'X' and a 'Y' may not have been known. An NFL team's coaches were wondering why a player they had drafted was not as successful in the NFL as he had been in the collegiate ranks. A common measure for players at the wide-receiver position is "yards gained after a catch". This represents our 'Y' variable, and the coaching staff was wondering why it wasn't higher. From the videotapes, the coaching staff noticed a subtlety: the player had a habit of leaping slightly whenever he made a catch. Because he left his feet, the split second he spent in the air before landing gave the defensive players just enough time to prevent him from gaining any yards after the catch. This wasn't a problem in the collegiate game, but it was a problem against the better defensive players in the NFL. Compounded over the course of a game, and a season, the yardage that could have been gained was significant. So, a small and subtle change (how the ball is caught) has a large effect on another variable (yards gained).

Current data mining software already does an adequate job of detecting associations, but does less well at detecting what I have referred to as subtle patterns. This is due to two requirements. First, among the myriad changes in attribute values, the software must know which changes are the "right ones" to track. Second, the subtle effect must be consistent.

Given the large number of attributes in a data set, software must have some way of knowing which of the subtle changes in attribute values are ultimately important - i.e., which changes in 'X' are worth attending to? Because this is not routinely done, subtle changes in 'X' which have causal effects on 'Y' may still be overlooked. There are at least two approaches to rectify this. One is that if an association between attributes is detected, it should be presented in such a way that the subtle effect of 'X' upon 'Y' is made clearly apparent. This can be achieved through an automated re-scaling of relevant measures to present or report the change in 'X' in terms of 'Y'. For example, if a seemingly innocuous change of a fraction of a percent in 'X' results in a 3-fold increase in 'Y', the pattern should be reported in exactly those terms, so that the relationship is made clear to the user. The second approach to marking subtle changes is not automated, but relies on the expertise of the domain expert/analyst. Simply knowing what possible changes may be important, and providing tools that explicitly track the effect of changes in a subset of attributes, can go a long way towards detecting such patterns.
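One way to picture the first approach is a small reporting helper that states the change in 'X' as a percentage and the accompanying change in 'Y' as a fold-change. This is only a sketch under stated assumptions: the function name `describe_effect`, its parameters, and the sample values are all hypothetical, and no particular data mining product is known to work this way.

```python
def describe_effect(x_name, x_before, x_after, y_name, y_before, y_after):
    """Report a change in X as a percentage and the change in Y as a fold-change."""
    if x_before == 0 or y_before == 0:
        raise ValueError("baseline values must be non-zero to compute relative change")
    x_change_pct = (x_after - x_before) / x_before * 100  # relative change in X, in percent
    y_ratio = y_after / y_before                          # how many times larger Y became
    return (f"A {x_change_pct:+.2f}% change in {x_name} "
            f"accompanies a {y_ratio:.1f}-fold change in {y_name}")


# Hypothetical values chosen to mirror the "fraction of a percent, 3-fold" example.
print(describe_effect("X", 200.0, 201.0, "Y", 10.0, 30.0))
```

The wording "accompanies" is deliberate: a report like this describes an association in the data and does not by itself establish that the change in 'X' caused the change in 'Y'.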

In the previous examples, the small changes in 'X' are ultimately manifested as a large change in 'Y' in part due to compounding over time. This implies that for data mining software to detect such subtle and causal associations, the patterns must be consistently present over time - i.e., a small change in 'X' must not be a random and spurious occurrence. If the patterns are in fact consistent, and there is a sufficiently large amount of data (as is usually the case in data mining), then current data mining tools that use "support" and "confidence" criteria are already well equipped to detect them.
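For readers unfamiliar with the "support" and "confidence" criteria, the following sketch computes both under their usual association-rule definitions: support is the fraction of records containing a set of items, and confidence is the support of the whole rule divided by the support of its left-hand side. The transaction data is made up for illustration, and the function names are assumptions rather than the API of any specific tool.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)  # subset test per transaction
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    ante = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante if ante else 0.0


# Made-up market baskets; each transaction is a set of purchased items.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

print("support(diapers, beer)    =", support(transactions, {"diapers", "beer"}))
print("confidence(diapers->beer) =", confidence(transactions, {"diapers"}, {"beer"}))
```

A pattern that appears only sporadically yields low support and unstable confidence, which is why consistency over a large amount of data matters for these criteria.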

The end result is that even small but significant patterns can be brought to the attention of the user for better decision making.

---

For more information, see http://www.virtualgold.com.

