DETECTING THE PRESENCE OF FRAUD
by Ed Colet
Data mining technologies are one of the tools in the arsenal for battling
fraud. The task is to use data mining software to sift through large amounts
of data and separate legitimate behaviors from fraudulent ones. In
many cases and across different domains, the same approach of detecting and
analyzing outliers has proved effective. What is more difficult is detecting
fraudulent activity that does not show up in the form of outliers. In
this column, both approaches are discussed.
In many data mining routines, outliers and deviant data points are treated as
suggestive of fraudulent activity. This type of analysis has proved beneficial
in several domains, such as the telecommunications and financial credit
industries. For example, cell phone calls are usually shorter than land-line
calls because of higher per-minute costs and/or the accumulation of roaming
charges. Calls of an unusually long duration, given the pattern of a person's
calls, can therefore be suggestive of fraudulent activity. The same type of
approach can be used to detect fraudulent credit card activity, such as an
unusually large amount being charged to a credit card relative to a pattern of
smaller charge amounts.
While a long cell phone call or a large credit card charge may indeed be
legitimate, in the interests of fraud detection and reduction, such deviations
demand attention.
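As a rough illustration of flagging a value that deviates from a person's
usual pattern, the sketch below scores each call duration with a modified
z-score built from the median and the median absolute deviation. The choice of
this statistic, the 3.5 cutoff (a commonly used convention), the function name
flag_outliers, and the sample durations are all assumptions made for
illustration; the column does not prescribe a specific test.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value does not distort the
    baseline it is compared against.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread at all around the median: nothing can be scored.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical call durations in minutes for one subscriber.
call_minutes = [2, 3, 1, 4, 2, 3, 5, 2, 3, 95]
print(flag_outliers(call_minutes))  # [95]
```

A flagged value is only a candidate for review. As noted above, a long call or
a large charge may turn out to be legitimate.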
Technically speaking, a frequency distribution in which there are outliers
is referred to as a compound distribution. This is because it is the result
of two operative processes. The first process is the regular and legitimate
behavior or activity (calls or purchases), which generates most of the data.
The second process (the existence of fraud) generates the outlying data
points. The second process typically has a smaller sample size (i.e. it is
less frequent, with fewer data points), and its distribution can have a
larger variance than the other distribution. Plotted together in a single
frequency distribution, the two processes can show up as a main peak plus a
group of deviant points separated from the main distribution. Fortunately,
outlying points of this kind are easy to detect.
A subtler and more difficult problem in fraud detection is detecting
fraudulent behavior that is hidden in a normal frequency distribution. If
fraud is present, then there are at least two processes operating. A regular
process generates a distribution of legitimate activity, and a second process
generates a distribution of fraudulent activity. Thus, we again have a
compound distribution -- but one whose components are not readily apparent.
Instead of a normal bell-shaped distribution, you have a compound distribution
in which the deviant process adds its extra observations at the mode of the
regular process.
An effective way to detect and decompose a compound distribution is to
examine the peak of the distribution. If a deviant process were operating
and contributing additional observations centered around the mean or mode of a
normal distribution, then the peak of this distribution would be higher and
steeper than expected. There are two approaches to examining the peak of a
distribution. The first is to compute the kurtosis value of the distribution
and compare it to that of a normal distribution. Kurtosis is a statistical
measure of the "peakedness" of a distribution. A normal distribution has a
kurtosis of 3, so a markedly higher value points to a sharper peak than
expected.
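The kurtosis comparison can be sketched with simulated data. The example below
computes excess kurtosis (kurtosis minus 3, so a normal distribution scores
about 0) for a regular process alone and for the same process with a tighter
deviant process added at its mode. The function name excess_kurtosis, the
means, the standard deviations, and the sample sizes are hypothetical values
chosen only to make the effect visible.

```python
import random

def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment / variance^2, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0

random.seed(0)  # fixed seed so the simulated data are reproducible

# Hypothetical regular process: amounts spread widely around 100.
regular = [random.gauss(100, 20) for _ in range(5000)]
# Hypothetical deviant process: extra observations packed tightly at the same mode.
deviant = [random.gauss(100, 3) for _ in range(2000)]
mixed = regular + deviant

print(round(excess_kurtosis(regular), 2))  # should be close to 0
print(round(excess_kurtosis(mixed), 2))    # should be clearly above 0
```

In practice the acceptable range around 0 depends on the sample size, so the
cutoff for "higher than expected" would have to be chosen for the data at hand.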
A second approach is to examine the change in the slopes of the
distribution. In a normal distribution, or typical "bell-shaped" curve, the
points of inflection are located one standard deviation on either side of the
mean. A point of inflection can be determined by examining the slope of the
frequency distribution, and/or by calculating first and second derivatives. A
distribution in which the point of inflection occurs less than one standard
deviation from the mean may suggest the presence of fraud. Distributions with
higher than expected kurtosis values, and/or closer than expected points of
inflection, can be detected and subjected to closer analysis and interpretation.
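The inflection-point check can be illustrated on idealized curves. The sketch
below evaluates a normal curve and a compound curve on an even grid,
approximates the second derivative with second differences, and reports how
far the first inflection point lies from the peak in units of the curve's
standard deviation. The helper names, the grid, and the mixture weights are
assumptions for illustration. A real histogram is noisy and would likely need
smoothing before this kind of differencing is meaningful.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def curve_sd(xs, ys):
    """Standard deviation of a curve sampled on an even grid, using ys as weights."""
    total = sum(ys)
    mean = sum(x * y for x, y in zip(xs, ys)) / total
    var = sum((x - mean) ** 2 * y for x, y in zip(xs, ys)) / total
    return math.sqrt(var)

def first_inflection(xs, ys):
    """First x right of the peak where the second difference stops being negative."""
    peak = max(range(len(ys)), key=ys.__getitem__)
    for i in range(max(peak, 1), len(ys) - 1):
        if ys[i - 1] - 2.0 * ys[i] + ys[i + 1] >= 0.0:
            return xs[i]
    return None  # no change of curvature found on the grid

def inflection_ratio(xs, ys):
    """Distance from the peak to the first inflection point, in units of the SD."""
    peak_x = xs[max(range(len(ys)), key=ys.__getitem__)]
    return (first_inflection(xs, ys) - peak_x) / curve_sd(xs, ys)

xs = [i * 0.1 for i in range(-1000, 1001)]  # grid from -100 to 100
regular = [normal_pdf(x, 0.0, 20.0) for x in xs]
# Hypothetical compound curve: a tight deviant process added at the mode.
compound = [(5 / 7) * normal_pdf(x, 0.0, 20.0) + (2 / 7) * normal_pdf(x, 0.0, 3.0)
            for x in xs]

print(round(inflection_ratio(xs, regular), 2))   # about 1 for the normal curve
print(round(inflection_ratio(xs, compound), 2))  # well below 1 for the compound curve
```

A compound curve can have more than one inflection point on each side of the
peak. The sketch reports only the one nearest the peak, since that is the one
pulled inward by the deviant process.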
Ed Colet is the Acting Director of Research at Virtual Gold Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com.