TIME SERIES PART II: THE DATA MINING APPROACH
Ed Colet
The goal of data mining is to discover meaningful patterns in timeseries data. Given a set of time-series data, there are two general approaches used in data mining. One is to utilize a template to find matching segments in the data that match this template. The other is to use one series to find similar subsequences that may exist in another. Examples of both are outlined here.
Dynamic time warping is a relatively recent template matching technique applied to time-series analysis (although it's derived from the speech recognition research area). Mathematical details are omitted here, but the approach is intuitively simple. Imagine that one can stretch and/or compress the time dimension of the template until the template matches the time-series data. The operation proceeds by creating a two dimensional grid. Each grid point defined by (x, y) coordinates corresponds to an alignment between the elements of the time-series data and elements of the template. As an extreme example, if there's a perfect 1:1 correspondence between the data time series and the template series, then the resulting x, y coordinates will form a diagonal straight line because each x is equal to each y value. But perfect matches are the exception rather than the rule, so aligning the data points usually results in some sort of wavering path rather than a straight line. Dynamic time warping compresses and/or stretches the time dimension of the template so that the distance of this line or warping path is minimized.
Dynamic time warping has some nice properties, but some cautions may also be warranted. Because the time axis of one's template can be stretched and compressed in various places it has the benefit that the same template can be used to match a variety of time series patterns - all one needs to do is find the appropriate warping operations. But by warping the time dimension, you're also altering what may be the most important aspect that drives the data. Compressing it here, and stretching it there "unsystematically" changes the nature of the data movement in ways that are likely to be independent of the process that generated the data one is trying to understand. One could also argue that the whole notion of template matching could limit knowledge discovery because you're limited to looking for patterns that are previously defined, and therefore already known. Consequently, discovering potentially interesting patterns for which pre-defined templates don't exist is difficult.
Relying on a template isn't necessary for the data mining task of discovering subsequences in a time series. Imagine that there is one data series A, and another data series B. Subsequence matching looks for matching segments between series A and B. Hidden and unknown patterns in the time series can be discovered and matched based on values for four parameters: (1) Epsilon, (2) Gap, (3) Window size, and (4) Matching length. (These are terms used in one particular product and not standard across products).
The four parameters are defined as follows. Epsilon is the relative tolerance range for matching values in one series with another. Think of it as an interval around each value of series A, for which values of series B can vary, yet still be considered a close enough match. Values beyond the epsilon parameter are outliers. Gap is the number of consecutive time units for which outliers are allowed to exist. If the gap is set too low, possibly similar series aren't discovered. Set the gap too high, and different series are erroneously considered similar. Window size is the atomic sequence for matching. Outliers are not allowed within this window size. Longer subsequences can be discovered by "stitching together" these atomic sequences. Matching length is the shortest length of subsequences that are to be output as matches. Set this too low, and many similar sequences may output. Set it too high, and less similar time sequences may be output.
These 4 parameters are mutually dependent and setting their values appropriately is critical. Changing these parameters can lead to vastly different results. It is recommended that the user-analyst first use a well-known data set and interactively adjust the parameters until a desired fit is achieved. These parameter values can then be applied to the analysis of a new data set. This is similar to the approach of partitioning a portion of the data to use as a training-set and then applying the results from the training set to the remainder of the data - popular for developing neural network models. But unlike neural network models, the training session is not automated but requires that the analyst/user closely interact with the data. As such substantial analytical expertise as well as knowledge of the domain are required.
Ultimately, it's possible that a data mining approach to time-series analysis can lead to the discovery of hidden but interesting patterns in time-series data.
Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.
For more information, see http://www.virtualgold.com.