
DECISION TECHNOLOGIES IN DATABASE MARKETING: PART IX
by Gene M. Ferruzza, Senior VP, Decision Technologies


Data Representation

Most data-mining application software packages do not use the data-mart environment directly. They usually operate on flat ASCII files or in proprietary binary environments. So once a sample is extracted, the dataset usually needs to be converted into a format compatible with the modeling tool. For most modeling technologies, the data need to be represented in a numeric format. Unfortunately, the data mart rarely stores all data numerically. The modeler must make sure that each customer characteristic is represented properly for use in modeling.

The most common issues in data representation are the treatment of categorical variables and the treatment of dates. Categorical variables, such as state, ZIP Code, vehicle type, product, gender, and marital status, are converted to what are known as "1-of-N" codes. For example, the marital status for an individual record in the data mart may take on the value of MARRIED, SINGLE, DIVORCED, or NOT KNOWN. The resulting representation for modeling purposes is 1-0-0-0, 0-1-0-0, 0-0-1-0, or 0-0-0-1, respectively. Each individual record is converted to a pattern of 1s and 0s representing the individual's marital status. Dates often are converted to a Julian-day representation. A common Julian representation is to convert all dates to the number of days since January 1, 1900. This representation resolves inconsistencies in date formats and allows the modeler to compare any date with any other numerically.
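The two conversions described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names are hypothetical, the category order follows the marital-status example, and the day count assumes that January 1, 1900 itself maps to 0 (some systems start the count at 1).

```python
from datetime import date

# Category order follows the marital-status example in the text.
MARITAL_STATUSES = ["MARRIED", "SINGLE", "DIVORCED", "NOT KNOWN"]

# Reference date named in the text; mapping it to day 0 is an assumption.
EPOCH = date(1900, 1, 1)


def one_of_n(value, categories):
    """Return a list with 1 in the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unexpected category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def days_since_epoch(d):
    """Return the number of days from January 1, 1900 to the date `d`."""
    return (d - EPOCH).days


if __name__ == "__main__":
    print(one_of_n("SINGLE", MARITAL_STATUSES))   # pattern 0-1-0-0
    print(days_since_epoch(date(2000, 1, 1)))     # days elapsed since 1900-01-01
```

Once every date is a plain integer, comparing two dates or computing the interval between them reduces to subtraction.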

Exploratory Data Analysis

Exploratory data analysis (EDA) is the process of manipulating and analyzing data to uncover information or transform the data into a form of information easily understood and used by the modeler (and the modeling process). Most of the time and expertise in any modeling project is consumed in EDA. The appropriate EDA and the specific methods used may differ from one modeling technology or methodology to another--and from one modeler to another.

Various modeling technologies require that the data be transformed from their original representation. In logistic regression, for example, it is common for a continuous numeric variable to be "binned" (i.e., sorted into categories, or "bins"). So a customer characteristic such as INCOME may need to be transformed into an income category. The categories may be based on statistical tests or domain expertise (e.g., industry standards). Through statistical tests, the modeler may find that the target behavior is related in some way to whether annual income is below $20,000 or at or above $75,000. This categorization would create three bins: $0 to $19,999, $20,000 to $74,999, and $75,000 or more. The resulting representations of the bins for income would be 0-0-1, 0-1-0, and 1-0-0, respectively.
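As a rough sketch of the income example, the binning step might look like the following in Python. The function and constant names are hypothetical, the thresholds are the two from the example, and the top bin is assumed to include $75,000 itself.

```python
from bisect import bisect_right

# Lower bounds of the second and third bins, taken from the example.
INCOME_BIN_EDGES = [20_000, 75_000]


def bin_index(value, edges):
    """Return 0 for the lowest bin, 1 for the next, and so on."""
    return bisect_right(edges, value)


def income_bin_code(income, edges=INCOME_BIN_EDGES):
    """Return the 1-of-N pattern for `income`.

    The lowest bin is coded 0-0-1 and the highest 1-0-0, matching the
    order used in the text.
    """
    if income < 0:
        raise ValueError("income must be non-negative")
    n_bins = len(edges) + 1
    code = [0] * n_bins
    code[n_bins - 1 - bin_index(income, edges)] = 1
    return code


if __name__ == "__main__":
    for income in (12_500, 48_000, 75_000):
        print(income, income_bin_code(income))
```

Keeping the thresholds in a single list means the bins can be revised after further statistical testing without touching the encoding logic.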

The modeler has available a large suite of statistical and data-transformation operations that can be used to expose information in the data and to present the data in the best way for the modeling technology being used. Each modeler usually forms a personal approach to EDA and refines it over time and through experience.

In addition to uncovering information in the data, the modeler must consider the need for normalization and simplification of the data.

Many customer characteristics must be normalized to eliminate bias resulting from outliers (isolated extreme values) in the data. The effect of an outlier is apparent when we consider a customer characteristic like INCOME. For example, consider a data mart with a million customers, one of whom is Bill Gates. The average value of the INCOME attribute is $34,000 annually, but Mr. Gates's income is far larger.

We often scale variables like INCOME to a range conducive to the modeling technology being used. If a neural network algorithm is used to develop the model, INCOME typically needs to be scaled to a range of -1.0 to 1.0. If the scale were linear, virtually all customers would scale to approximately -0.9999, and Mr. Gates would scale to 1.0. Most information about income across the customer base would be lost. Many think the world consists of Bill Gates and then the rest of us; however, this view is not useful for modeling. Normalization functions, such as the Z score or a sigmoid function, can scale the data so that the results are not skewed by outliers.
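The contrast between linear and sigmoid scaling can be illustrated with a small Python sketch. The incomes below are made-up values, and the function names are hypothetical. One substitution should be noted plainly: the sigmoid version centers on the median and divides by the median absolute deviation rather than using the mean and standard deviation of a Z score, because the mean and standard deviation are themselves pulled upward by the outlier. That choice is an assumption of this sketch, not something the article specifies.

```python
import math
from statistics import median


def linear_scale(values, lo=-1.0, hi=1.0):
    """Min-max scale `values` linearly onto the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [(lo + hi) / 2.0 for _ in values]
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]


def sigmoid_scale(values):
    """Squash `values` into [-1, 1] with tanh, a sigmoid-shaped function.

    Center and spread come from the median and the median absolute
    deviation (an assumption of this sketch), so a single extreme
    value does not set the scale for everyone else.
    """
    center = median(values)
    spread = median(abs(v - center) for v in values) or 1.0
    return [math.tanh((v - center) / spread) for v in values]


if __name__ == "__main__":
    # Illustrative incomes only; the last value plays the outlier.
    incomes = [20_000, 30_000, 40_000, 50_000, 1_000_000_000]
    print(linear_scale(incomes))   # all but the outlier crowd together near -1
    print(sigmoid_scale(incomes))  # ordinary incomes stay spread out
```

Under linear scaling the four ordinary incomes are nearly indistinguishable, while the sigmoid version keeps them well separated and simply saturates the outlier near 1.0.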

Part X of this series will appear in the next edition of D S * .

Gene Ferruzza may be contacted at gmf@cmsnet.com

