DETECTING MEANINGFUL SUBSTITUTION PATTERNS
by Ed Colet
Consumer purchase behaviors are always linked to choices that are made. Substituting one product or item for another is a frequent choice that is made. Apples may be substituted for pears, or chicken for beef. Although current data mining algorithms can readily find associations among items, such as milk and cereal, what may also be important is discovering items that are substitutes for each other. Compared to identifying associations, identifying substitutions are not as straightforward to detect. Yet finding substitutions is important for several reasons; and detectable with current analytical approaches.
A substitution can be defined as the choice made by a consumer to replace the purchase of one product for another, given that this differs from their history of prior purchases. A straightforward example of a substitution would be the choice to purchase apples rather than pears. A less obvious example, but just as valid might be the choice to purchase English muffins, rather than yogurt.
There are several reasons why it is important to detect substitutions of products. To continue our supermarket example, a retailer could use knowledge of substitution patterns to manage inventory. If two products are essentially substitutes, the retailer could stock up on the more favored item.
In the arena of online transactions via websites, detecting substitution behaviors is also important. Given the plethora of web sites on the web, differentiating your site favorably is important in retaining users and sustaining visits. If consumers have a history of visiting your site, and then switch to a competitors', this has implications for brand loyalty and the differentiation of your site relative to your competitors. Detecting such substitution patterns, and the underlying reasons for them are important.
Substitution patterns are also directly relevant to churn management. Churn is defined as the rate of turnover of customers to a competitor. Churn is essentially the manifestation of consumers' substitution patterns. Regardless of how the rate is computed, a low rate is desirable and can be achieved in part through the detection of substitution patterns and reducing their occurrence.
Detecting substitution patterns are also important in economic indicators. The popular Consumer Price Index (CPI) is a monthly indicator of inflation. It measures what consumers pay for a given basket of items. As such, if consumers choose to substitute a cheaper item for a more expensive one this behavior isn't reflected in the CPI, and therefore the inflation indicator is not truly accurate. A less common measure, called the Personal Consumption Expenditure (PCE) deflator does allow for substitutions, and helps qualify the CPI measure.
Given the definition of what might constitute a substitution, and various ways that they're important, the practical question is how to reliably detect substitution patterns. There are at least three ways that can commonly be used to begin the detection of substitutions: (1) statistically with a negative correlations; (2) through modified association rules, and (3) sequential patterns over a time sequence.
Statistically speaking, a negative correlation value can indicate substitutions. For example, high sales of chicken and low sales of beef may indicate a substitution pattern for a meal choice -- especially if there is a corresponding way to account for this such as the rise in the price of beef. If an "odd" pair of products (ones which don't share an overt similarity feature), such as English muffins and yogurt, are in fact substituted for one another this should still be evident if it occurs in a large enough sample of cases.
Discovering Association Rules is a data mining function to detect patterns of items purchased together. These associations are reported in the form of rules (A->B). In some cases it will be readily apparent that it is the negation of the rule that expresses a stronger pattern. Then given a negative association, we can then continue on to infer substitutions.
A third way to discover substitutions is to conduct a time series analysis looking for sequential patterns. This would indicate changes and shifts from a history of a consumer's behavior that may reflect substitution choices.
The use of clustering in conjunction with the above techniques could also be helpful in detecting substitutions. By looking for similar features that relate two items of a suspected pair of substituted products we can infer the reasons for their substitution. For obvious items, apples for pears, they can both substitute for each other as fruit items. For less obvious items, clustering may reveal that English muffins and yogurt may be similar after all in that they're both possible healthy breakfast items.
So, substitutions are important patterns that can be meaningful in a variety of ways, and possible to detect using current analytical tools. The interpretation of such patterns, like any other data mining results would then lead to insights.
Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.
For more information, see http://www.virtualgold.com.