DETECTING THE PRESENCE OF FRAUD
by Ed Colet

Data mining technologies are among the tools in the arsenal for battling fraud. The task is to use data mining software to sift through large amounts of data and separate legitimate behavior from fraudulent behavior. In many cases and across different domains, the same approach of detecting and analyzing outliers has proved effective. What is more difficult is detecting fraudulent activity that does not show up in the form of outliers. This column discusses both problems.

In many data mining routines, outliers and deviant data points are treated as suggestive of fraudulent activity. This type of analysis has proved beneficial in several domains, such as the telecommunications and financial credit industries. For example, cell phone calls are usually shorter than land-line calls because of higher per-minute costs and/or the accumulation of roaming charges. Calls of unusually long duration, given the pattern of a person's calls, can therefore be suggestive of fraudulent activity. The same approach can be used to detect fraudulent credit card activity, as when an unusually large amount is charged to a card relative to a pattern of smaller charges. A long cell phone call or a large credit card charge may well be legitimate, but in the interests of fraud detection and reduction, such deviations demand attention.
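As a rough illustration of flagging a deviation relative to a person's own pattern, the following sketch marks new calls whose duration lies far above that person's historical average. The function name, the sample durations, and the three-standard-deviation threshold are assumptions chosen for the example; the column does not prescribe a specific rule.

```python
from statistics import mean, stdev

def flag_unusual(history, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` sample standard
    deviations above the mean of `history` (hypothetical rule)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return []  # no variation in the history, so no scale to judge against
    return [v for v in new_values if (v - mu) / sigma > threshold]

# Hypothetical call durations, in minutes, for one subscriber.
call_minutes = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 1.5, 2.0, 3.0, 2.5]
new_calls = [3.0, 45.0, 2.0]

flagged = flag_unusual(call_minutes, new_calls)
print(flagged)
```

A flagged call is only a candidate for review; as noted above, it may turn out to be legitimate.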

Technically speaking, a frequency distribution that contains outliers is referred to as a compound distribution (statisticians also call this a mixture distribution), because it is the result of two operative processes. The first process is the regular, legitimate behavior or activity (calls or purchases). The second process, fraud, generates the outlying data points. Because the second process is typically less frequent and contributes fewer data points, its distribution can have a larger variance than that of the regular process. Plotted together in a single frequency distribution, the two processes can appear as a main peak plus a group of deviant points separated from the main distribution. Fortunately, outlying points of this kind are easy to detect.

A subtler and more difficult problem in fraud detection is detecting fraudulent behavior that is hidden within what looks like a normal frequency distribution. If fraud is present, then there are at least two processes operating: a regular process generates a distribution of legitimate activity, and a second process generates a distribution of fraudulent activity. Thus, we again have a compound distribution, but one whose components are not readily apparent. Instead of a true bell-shaped normal distribution, you have a compound distribution in which the deviant process adds its extra observations at the mode of the regular process.

An effective way to detect and decompose such a compound distribution is to examine the peak of the distribution. If a deviant process were operating and contributing additional observations centered around the mean or mode of a normal distribution, then the peak of the distribution would be higher and steeper than expected. There are two approaches to examining the peak of a distribution. The first is to compute the kurtosis of the distribution and compare it to that of a normal distribution. Kurtosis is a statistical measure of the "peakedness" of a distribution; a normal distribution has a kurtosis of 3 (an excess kurtosis of 0), so a markedly larger value points to a sharper peak than expected.
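The kurtosis comparison can be sketched as follows. This is a simulation under stated assumptions, not an analysis of real fraud data: the two Gaussian processes, their parameters, the sample sizes, and the function name `excess_kurtosis` are all hypothetical. The sketch computes excess kurtosis (kurtosis minus 3) for a simulated legitimate process alone, and again after a second process adds observations packed tightly around the same center.

```python
import random
from statistics import fmean

def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment divided by the
    squared variance, minus 3 (so a normal distribution scores about 0)."""
    mu = fmean(values)
    m2 = fmean([(v - mu) ** 2 for v in values])
    m4 = fmean([(v - mu) ** 4 for v in values])
    return m4 / (m2 ** 2) - 3.0

rng = random.Random(42)  # fixed seed so the illustration is repeatable

# Hypothetical legitimate activity: amounts spread widely around 100.
legitimate = [rng.gauss(100.0, 15.0) for _ in range(20000)]
# Hypothetical second process: extra observations concentrated at the mode.
hidden = [rng.gauss(100.0, 3.0) for _ in range(4000)]

clean_kurtosis = excess_kurtosis(legitimate)
mixed_kurtosis = excess_kurtosis(legitimate + hidden)

print(f"legitimate only: {clean_kurtosis:.3f}")
print(f"with hidden process: {mixed_kurtosis:.3f}")
```

Under these assumptions the first value should sit near zero and the second should be clearly positive. Kurtosis also responds to heavy tails, so an elevated value is a prompt for closer inspection rather than proof of a second process.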

A second approach is to examine the change in the slope of the distribution. In a normal distribution, the typical "bell-shaped" curve, the points of inflection are located one standard deviation on either side of the mean. A point of inflection can be located by examining the slope of the frequency distribution and/or by calculating first and second derivatives: it is where the second derivative changes sign. A distribution in which the point of inflection lies closer to the mean than one standard deviation may suggest the presence of fraud. Distributions with higher than expected kurtosis values and/or closer than expected points of inflection can be flagged and subjected to closer analysis and interpretation.
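The inflection-point idea can be illustrated on idealized density curves. The sketch below is not a procedure for raw data, where a histogram would first need smoothing; it simply scans rightward from the peak of a density and reports the first point where the numerical second difference stops being negative. The mixture weights, the parameters, and the helper names are assumptions made for the example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def first_inflection_distance(density, peak, max_distance, step=0.01):
    """Scan rightward from `peak`; return the distance to the first point
    where the second difference of `density` is no longer negative."""
    steps = int(max_distance / step)
    for i in range(1, steps):
        x = peak + i * step
        second_diff = density(x - step) - 2.0 * density(x) + density(x + step)
        if second_diff >= 0.0:
            return x - peak
    return None  # no inflection found within the scanned range

MU = 100.0
W_REGULAR, W_HIDDEN = 5.0 / 6.0, 1.0 / 6.0  # hypothetical mixture weights
S_REGULAR, S_HIDDEN = 15.0, 3.0             # hypothetical standard deviations

def regular(x):
    return normal_pdf(x, MU, S_REGULAR)

def mixed(x):
    return (W_REGULAR * normal_pdf(x, MU, S_REGULAR)
            + W_HIDDEN * normal_pdf(x, MU, S_HIDDEN))

# Both components share one mean, so the mixture variance is the
# weighted average of the component variances.
mixed_sd = math.sqrt(W_REGULAR * S_REGULAR ** 2 + W_HIDDEN * S_HIDDEN ** 2)

regular_distance = first_inflection_distance(regular, MU, 60.0)
mixed_distance = first_inflection_distance(mixed, MU, 60.0)

print(f"regular: inflection at {regular_distance:.2f}, sd {S_REGULAR:.2f}")
print(f"mixed:   inflection at {mixed_distance:.2f}, sd {mixed_sd:.2f}")
```

For the single normal curve the inflection falls at about one standard deviation from the mean, as the text states. For the assumed mixture it falls well inside one standard deviation of the combined distribution, which is the signature described above.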


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com.
