
DECISION TECHNOLOGIES IN DATABASE MARKETING: PART III
by Gene M. Ferruzza, Senior VP, Decision Technologies


DEFINITION: DATA MINING AS A DISCIPLINE

Based on the current commercial use of the term "data mining," it is difficult to answer this multiple-choice question:

What is data mining?

As one might guess, the answer is "F. all of the above." In reality, data mining is a professional practice, analogous to medical or legal practice, where data miners apply a repertoire of processes, employing a variety of technologies or tools, to meet the needs of their clients. Furthermore, just as in other professional practices, any individual or firm engaging in data mining continually improves, refines, or changes methods and approaches, as a result of accumulated experience and advances in technology. Finally, to take the medical analogy further, there are frequent disputes about which methods are most effective.

Data mining is also an art. Although data mining deliverables are highly measurable, many different approaches can often be used to arrive at the same deliverable. In interpreting the results of experiments, the analyst often must weigh additional factors, such as intuition, experience with the dynamics of the business, and the results of other analyses. Data miners commonly follow a personal approach that combines a systematic method with an expert's "gut feeling."

What enables the data miner to operate from a combination of instinct and experience is the availability of dozens of defined techniques or processes. These proven methodologies use one or more technologies to transform data from one form to another. Such transformations range from a simple re-representation of the same data (e.g., converting customers' incomes into decile ranges) to a complex mapping of data to newly derived information (e.g., developing an algorithm that takes customer data as input and returns the customer's probability of purchasing a new product).
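
The simpler kind of transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `decile_ranks` and the income figures are hypothetical, and ties are broken by position in the list, which a production routine would probably handle differently.

```python
def decile_ranks(values):
    """Return a decile (1 = lowest tenth, 10 = highest tenth) for each value."""
    n = len(values)
    # Positions sorted by value, so rank 0 belongs to the smallest income.
    order = sorted(range(n), key=lambda i: values[i])
    deciles = [0] * n
    for rank, i in enumerate(order):
        # Map rank 0..n-1 onto deciles 1..10.
        deciles[i] = rank * 10 // n + 1
    return deciles


# Hypothetical customer incomes, in dollars.
incomes = [18_000, 25_000, 31_000, 40_000, 47_000,
           52_000, 63_000, 78_000, 95_000, 140_000]
income_deciles = decile_ranks(incomes)
```

The derived decile carries less detail than the raw income but is often easier to compare across customers, which is the sense in which the same data is given a different representation.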

Finally, data mining is possible only with the proper tools. As technologies mature, so do their associated tools. When a new data-mining technology arises, existing software companies usually race to deliver the new technology within existing software products; in addition, new companies often spring up to provide products offering just the new technology. Unfortunately, despite this proliferation of tools, no single company produces a package of all the technologies, processes, and data operations needed for data mining. In fact, no one software package currently even comes close.

DEFINITION: DATA MINING TECHNIQUES AND TECHNOLOGIES

Data mining is the process of identifying new and useful information in data. I propose we use the following question as a criterion for whether an activity constitutes data mining: Does the process uncover any useful information from the data? This approach will help define the scope of data mining and extend it to include domains not currently thought of as data mining.

Data mining plays a significant role in data-mart development and maintenance and in strategy development and evaluation through decision systems. The range of data mining techniques and technologies is best illustrated by a walkthrough of the process of preparing decision technologies for database marketing program management.

Data Mining for Data Marts

It is rare for the data-mart developer to have clean, accurate, and consistent data with which to start the development process. Preparing and maintaining the data for a data mart involves many data-mining processes. Very basic, yet important, information can and should be uncovered to enhance the data's integrity and usefulness. Data mining for data marts involves two main activities: data cleansing and data representation.

By definition, the centralization of customer-related data in a data mart requires collection of data from a number of sources. For a data mart to be useful, the data need to be standardized, cleansed, and properly represented. These processes are performed first when the data mart is created and thereafter under a regular database maintenance program. To forecast the magnitude of the initial operations, a representative data sample often is extracted from the database. Data-mining processes are used to analyze this sample in order to estimate the time, cost, and resources required to develop and maintain the data mart.
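
One way to picture this forecasting step is to draw a random sample, measure the share of sampled records that need cleansing, and extrapolate that share to the whole database. The sketch below assumes a simple random sample is representative and that records are non-empty Python dictionaries; `estimate_defects`, `missing_zip`, and the fixed seed are hypothetical names and choices made for illustration.

```python
import random


def estimate_defects(records, is_defective, sample_size, seed=0):
    """Estimate the defect rate and defect count from a random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    rate = sum(1 for rec in sample if is_defective(rec)) / len(sample)
    # Extrapolate the sampled rate to the full set of records.
    return rate, round(rate * len(records))


def missing_zip(record):
    """Hypothetical defect rule: the ZIP code field is empty or absent."""
    return not record.get("zip")


# Hypothetical data: three records lack a ZIP code, one has it.
customers = [{"zip": ""}, {"zip": None}, {}, {"zip": "10001"}]
rate, estimated_count = estimate_defects(customers, missing_zip, sample_size=10)
```

The estimated count can then be multiplied by an assumed per-record cleansing cost to give a rough budget, with the usual caveat that a small sample gives a wide margin of error.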

Data representation can be closely coupled with data cleansing. For instance, customer addresses must be represented in a standard format so that duplicate customer records can be identified and removed (a data cleansing process). On the other hand, some data representation processes are used solely for knowledge discovery, such as identification of unique customers within one household. To cleanse data, we must first identify the magnitude of corrupt, missing, duplicated, and non-standardized data. Such analyses can be quite complex and time-consuming, but they are necessary if we are to assess the integrity of the data and determine what processes will be needed to improve and maintain the data.
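
A first pass at measuring the magnitude of missing and duplicated data might look like the following. It is a minimal sketch, assuming records are dictionaries of string fields and that "duplicate" means identical values in every listed field; the function name `profile_records` and the sample records are hypothetical.

```python
from collections import Counter


def profile_records(records, required_fields):
    """Count missing values per field and exact duplicate records."""
    missing = Counter()
    for rec in records:
        for field in required_fields:
            # Treat absent, None, empty, and whitespace-only values as missing.
            if not (rec.get(field) or "").strip():
                missing[field] += 1
    # Group records by their values in the required fields.
    keys = Counter(
        tuple(rec.get(field) or "" for field in required_fields)
        for rec in records
    )
    # Every copy beyond the first in a group counts as one duplicate.
    duplicates = sum(count - 1 for count in keys.values() if count > 1)
    return {"total": len(records), "missing": dict(missing), "duplicates": duplicates}


# Hypothetical customer records.
sample = [
    {"name": "Ann Lee", "address": "12 Oak St", "zip": "10001"},
    {"name": "Ann Lee", "address": "12 Oak St", "zip": "10001"},
    {"name": "Bob Roy", "address": "", "zip": "60601"},
    {"name": "Cy Dunn", "address": "PO Box 9", "zip": None},
]
report = profile_records(sample, ["name", "address", "zip"])
```

Exact matching understates the true duplication rate, since it misses records that differ only in spelling or format; that is why standardization usually comes before de-duplication.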

Name and address standardization often is the starting point in the cleansing process. Data-mining analysis provides address characteristics, such as percentages of consumer vs. business addresses, number of address lines, number of P.O. boxes, and name length ranges. This information then is used to convert addresses to a standard or new format. The uniform address format enables more accurate de-duplication of records.
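
The kind of address profile described above, followed by a simple standardization step, can be sketched as below. The record layout (`name`, `address_lines`), the P.O. box pattern, and the three-entry abbreviation table are all assumptions made for illustration; real standardization relies on far larger reference tables and postal rules.

```python
import re

# Matches common spellings such as "P.O. Box", "PO Box", and "P O BOX".
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*BOX\b", re.IGNORECASE)

# Illustrative abbreviation table only; not a complete postal standard.
ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}


def address_profile(records):
    """Summarize P.O. box share, address line counts, and name lengths."""
    po_boxes = sum(
        1 for rec in records
        if any(PO_BOX.search(line) for line in rec["address_lines"])
    )
    name_lengths = [len(rec["name"]) for rec in records]
    return {
        "pct_po_box": 100 * po_boxes / len(records),
        "max_address_lines": max(len(rec["address_lines"]) for rec in records),
        "name_length_range": (min(name_lengths), max(name_lengths)),
    }


def standardize_address(line):
    """Uppercase, drop periods and commas, collapse spaces, abbreviate."""
    words = re.sub(r"[.,]", "", line.upper()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


# Hypothetical records.
records = [
    {"name": "Ann Lee", "address_lines": ["12 Oak Street"]},
    {"name": "Acme Corp", "address_lines": ["Attn: Sales", "P.O. Box 9"]},
]
profile = address_profile(records)
standard = standardize_address("12  Oak Street.")
```

After standardization, "12  Oak Street." and "12 oak st" reduce to the same string, so a later matching step can compare them directly.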

Duplication of records can occur for a number of reasons. Sometimes the cause is incorrect input, as when users have bypassed active search algorithms when entering customer records into a computer system. Sometimes the search does not find a customer, or the search results are not reviewed adequately. Duplication of names and addresses often results from the merging of different product databases. In such cases, matching at the customer level may be difficult because of variants of a customer's name or address. Such variation may be the result of data-entry errors, the lack of enterprise-wide data entry standards, endemic system problems, or receipt of varying information from customers.
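
Matching at the customer level despite name variants can be approximated with a string-similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library and compares only names within the same ZIP code. The 0.85 threshold, the field names, and the function name `likely_duplicates` are assumptions for illustration, not a recommended matching rule.

```python
from difflib import SequenceMatcher


def likely_duplicates(records, threshold=0.85):
    """Return (i, j, score) for record pairs that look like the same customer."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = records[i], records[j]
            # Only compare customers who share a ZIP code.
            if a["zip"] != b["zip"]:
                continue
            score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if score >= threshold:
                pairs.append((i, j, round(score, 2)))
    return pairs


# Hypothetical customer records merged from two product databases.
customers = [
    {"name": "Jon Smith", "zip": "10001"},
    {"name": "John Smith", "zip": "10001"},
    {"name": "Mary Jones", "zip": "10001"},
    {"name": "John Smith", "zip": "94105"},
]
candidates = likely_duplicates(customers)
```

Comparing every pair grows quadratically with the number of records, so in practice candidates are usually restricted to groups (here, the shared ZIP code) before scoring, and the flagged pairs are reviewed rather than merged automatically.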

---

Part IV of this series will be published in next week's edition of DS*.

---

Contact Gene Ferruzza at gmf@cmsnet.com

