TIME SERIES ANALYSIS: THE STATISTICAL APPROACH, PART I
by Ed Colet


Time series analysis is an analytical technique that applies to a wide variety of domains. Wherever data are collected at specific (usually equally spaced) intervals, a time series analysis can reveal useful patterns and trends related to time. Industry and government agencies alike rely on time series analysis both for historical understanding and for forecasting and predictive modeling. In this week's column, the basics of the statistical approach to time series analysis are presented. Next week, in part II, we'll look at what commercial data mining applications offer with regard to time series analyses.

A time series is mathematically defined as a set of observed values taken at specified times. The set of values is typically denoted "Y", and the set of times "t1, t2, t3, ..." and so on. In other words, Y is a function of "t", and the goal of a time series analysis is to find a function that describes the movement of the data. A time series is often graphed, so that trends and patterns become visually apparent; the statistical approach describes those visually apparent trends formally and quantitatively.

A time series analysis is often referred to as "time series decomposition". As the phrase suggests, this means that the time series is decomposed into its component parts. There are typically four main components and together these components sufficiently describe the variations of data over time. These four components are (1) the long-term trend, "T"; (2) seasonal variations, "S"; (3) cyclical patterns, "C"; and (4) irregularities or noise, "I". As an equation, the time series is generally described either as a multiplicative relationship where

          Y = T x C x S x I,

or alternatively as an additive model where

          Y = T + C + S + I.

In practice, the choice between the two models is decided by the fit to the data rather than by a priori justifications.
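
To make the two models concrete, here is a minimal Python sketch that composes a short series under each of them. All component values are hypothetical and chosen only for illustration; in particular, converting each ratio into an offset from the trend for the additive case is an assumption of this sketch, not a standard rule. The last lines check that taking logarithms turns the multiplicative model into an additive one.

```python
import math

# Hypothetical component values for four consecutive periods (illustrative only).
trend     = [100.0, 102.0, 104.0, 106.0]  # T, in the units of Y
cyclical  = [1.00, 1.02, 1.03, 1.02]      # C, a ratio around 1
seasonal  = [0.90, 1.10, 0.95, 1.05]      # S, a ratio around 1
irregular = [1.01, 0.99, 1.00, 1.02]      # I, a ratio around 1

# Multiplicative model: Y = T x C x S x I
y_mult = [t * c * s * i
          for t, c, s, i in zip(trend, cyclical, seasonal, irregular)]

# Additive model: Y = T + C + S + I, with C, S and I in the units of Y.
# Assumption of this sketch: each ratio r becomes the offset (r - 1) * T.
y_add = [t + (c - 1) * t + (s - 1) * t + (i - 1) * t
         for t, c, s, i in zip(trend, cyclical, seasonal, irregular)]

# Taking logs of the multiplicative model gives an additive one:
# log Y = log T + log C + log S + log I
log_y = [math.log(y) for y in y_mult]
log_parts = [math.log(t) + math.log(c) + math.log(s) + math.log(i)
             for t, c, s, i in zip(trend, cyclical, seasonal, irregular)]

print(y_mult)
print(y_add)
```

When the seasonal swings grow in proportion to the level of the series, the multiplicative form tends to fit better; when the swings stay roughly constant in size, the additive form may be the more natural choice.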

The long-term trend, "T", refers to the general direction of the series over a long interval of time. It can be determined by several methods. If the data appear to be linear, fitting a regression line with the method of least squares is a popular approach. Other approaches capture and summarize the long-term trend by computing a moving average or a semi-average. Each of these techniques serves to "smooth out" much of the short-term variation in the data, so that only the long-term trend remains apparent. If the long-term trend is linear, then a straight-line regression equation can describe it. If the trend is curvilinear, curve-fitting techniques can be used to formulate a suitable equation that describes the curve.
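
Two of the trend estimates mentioned above, the least-squares line and the moving average, can be sketched in a few lines of Python. The function names and the sample data are hypothetical, and the time index is assumed to be 0, 1, 2, ... for equally spaced observations.

```python
def least_squares_line(y):
    """Fit y = a + b*t by ordinary least squares, with t = 0, 1, ..., n-1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    b = sxy / sxx              # slope: average change per time step
    a = y_mean - b * t_mean    # intercept: fitted value at t = 0
    return a, b


def moving_average(y, window):
    """Simple moving average; returns len(y) - window + 1 values."""
    return [sum(y[i:i + window]) / window
            for i in range(len(y) - window + 1)]


# Hypothetical observations at eight equally spaced times.
sales = [12, 15, 14, 18, 19, 23, 22, 26]

a, b = least_squares_line(sales)
trend_line = [a + b * t for t in range(len(sales))]  # fitted trend values
smoothed = moving_average(sales, 3)                  # 3-period moving average

print(a, b)
print(smoothed)
```

A wider window smooths more aggressively but yields fewer values, since the moving average is undefined where the window runs off the ends of the series.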

Seasonal variations are the almost identical patterns that a series follows during particular periods (e.g. months) of successive years. These variations are due to recurring events, such as the Christmas shopping period in a time series of retail sales spanning several years. Seasonal variations are captured by formulating a seasonal index, and not surprisingly there are various methods for computing it. The purpose of the index is to estimate how the data in the series vary from month to month in a typical year; that is, it shows the relative values of the variable during the months of the year. Dividing the original monthly data by the corresponding seasonal index values (under the multiplicative model) yields data that are adjusted for seasonal variation.
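
As a rough illustration, the Python sketch below computes one of the simplest possible seasonal indexes: the average value for each period of the year divided by the overall average. This is only one of the various methods, chosen here for brevity; it ignores any trend in the data, which methods such as ratio-to-moving-average are designed to handle. The function names are hypothetical, and the data are made-up quarterly (rather than monthly) figures to keep the example short.

```python
def seasonal_index(y, period):
    """Average of each period position divided by the overall average.

    Only complete cycles are used, so the index values average to 1.
    """
    n_cycles = len(y) // period
    usable = y[:n_cycles * period]
    overall = sum(usable) / len(usable)
    index = []
    for p in range(period):
        values = usable[p::period]  # e.g. every first quarter
        index.append((sum(values) / len(values)) / overall)
    return index


def deseasonalize(y, index):
    """Divide each observation by the index for its position in the cycle."""
    period = len(index)
    return [v / index[i % period] for i, v in enumerate(y)]


# Hypothetical quarterly sales for three years (Q1..Q4 repeated).
quarterly_sales = [80, 120, 90, 110,
                   84, 126, 95, 115,
                   88, 132, 99, 121]

index = seasonal_index(quarterly_sales, 4)
adjusted = deseasonalize(quarterly_sales, index)

print(index)
print(adjusted)
```

An index value above 1 marks a period that typically runs above the yearly average, and a value below 1 marks one that runs below it.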

Cyclical patterns refer to long-term swings about a trend line or curve. These movements are considered cyclical only if they recur after time intervals of more than a year. A familiar example is the business cycle of prosperity, recession, depression, and recovery. Typically, after the long-term trend has been identified and the data have been adjusted for seasonality, the remaining variations in the data are essentially due to cyclical changes or irregularities. In other words, "Y/TS = CI" (assuming a multiplicative model). The irregularities can then be smoothed out with a smoothing operation (e.g. a moving average, as mentioned earlier). This random noise is usually assumed to follow an approximately Gaussian distribution, in which large random deviations are quite rare while small random perturbations are more frequent. At this point, only cyclical variations are left in the data, and these can be studied in detail if necessary.
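
The step "Y/TS = CI" followed by smoothing can be sketched in Python as follows. This assumes the multiplicative model and assumes that the trend values and a seasonal index have already been estimated; every number and function name below is hypothetical, and the 3-period window is an arbitrary choice for illustration.

```python
def cyclical_irregular(y, trend, seasonal):
    """Remove trend and seasonality: CI = Y / (T x S)."""
    period = len(seasonal)
    return [v / (t * seasonal[i % period])
            for i, (v, t) in enumerate(zip(y, trend))]


def centered_moving_average(x, window):
    """Centered moving average for an odd window; drops window // 2 points per end."""
    half = window // 2
    return [sum(x[i - half:i + half + 1]) / window
            for i in range(half, len(x) - half)]


# Hypothetical components for eight quarters, used only to build sample data.
trend = [100 + 2 * t for t in range(8)]                         # T
seasonal = [0.90, 1.10, 0.95, 1.05]                             # S, one value per quarter
cycle = [1.00, 1.02, 1.04, 1.03, 1.00, 0.97, 0.96, 0.98]        # C
noise = [1.01, 0.99, 1.00, 1.02, 0.98, 1.00, 1.01, 0.99]        # I

y = [t * seasonal[k % 4] * c * n
     for k, (t, c, n) in enumerate(zip(trend, cycle, noise))]

ci = cyclical_irregular(y, trend, seasonal)   # what remains: C x I
c_est = centered_moving_average(ci, 3)        # smoothing leaves an estimate of C
i_est = [ci[k + 1] / c_est[k]                 # remaining ratio estimates I
         for k in range(len(c_est))]

print(c_est)
print(i_est)
```

In real data the cycle and the noise are not known separately, so the smoothed series is only an estimate of the cyclical component, and its quality depends on the window chosen.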

To summarize, the statistical approach of time series decomposition is a sequence of operations that identify the variations in the data over time. The full description of the series covers movements at various levels of detail - from a high-level trend, to cyclical and seasonal variations, to random perturbations.

In part II, we'll look at some interesting similarities and differences afforded by a data mining approach to a time series analysis.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

