WHAT IS THE ORIGIN OF DATA MINING?
by Patricia Carbone, The Edge
Statisticians have been using computers for decades as a means to prove or
disprove hypotheses on collected data. Linear regressions, nearest neighbors,
and other types of analyses are common. Applications dependent on statistical
analyses during the 1970s, 1980s, and 1990s include the drug approval process
for the Food and Drug Administration and the creation of credit approval
curves for credit card companies and banks. New statistical methods, including
fuzzy logic and other non-linear means of analysis, have evolved. However, the
use of statistics continues to assume that an analyst will start with a
hypothesis about the relationships among the data attributes, then use the
tool to validate or disprove that hypothesis. With data that exhibit dozens of
attributes, the hypothesize-and-test paradigm becomes time-consuming, because
the analyst must propose and check each candidate relationship by hand.
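The cost of hypothesizing by hand can be put in rough numbers. With n
attributes there are n(n-1)/2 distinct attribute pairs, each a candidate
relationship. The short Python sketch below counts them; the function name is
illustrative, and limiting the count to pairs is a simplifying assumption,
since real analyses may also involve interactions among three or more
attributes.

```python
from math import comb

def pairwise_hypotheses(num_attributes: int) -> int:
    """Count the distinct attribute pairs an analyst could hypothesize about."""
    return comb(num_attributes, 2)

# Show how quickly the number of candidate pairs grows with attribute count.
for n in (5, 12, 24, 100):
    print(f"{n} attributes -> {pairwise_hypotheses(n)} candidate pairs")
```

Two dozen attributes already yield 276 pairs to consider, before any
higher-order relationships are counted.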
Another element in the development of data mining is the increasing ability
of the computer to store vast amounts of data. In the 1970s, most data storage
depended upon COBOL programs and data structures--not entirely conducive to
highly interactive analyses. Organizations can now store and query terabytes
and even petabytes of data in sophisticated database management systems. In
addition, the development of multidimensional data models such as those used
in data warehouses has allowed users to move from a transaction-oriented way
of thinking to a more analytical way of viewing the data. The human capacity to
analyze all this data manually, however, is extremely limited, even with
sophisticated visualization mechanisms or online analytical processing (OLAP)
tools.
A final thread in data mining's development is artificial intelligence (AI);
its capabilities to analyze data were first touted during the 1970s. During
the 1980s, with continued development of AI algorithms designed to enable a
machine to learn, machine learning algorithms became realistic tools, and the
idea of pushing them to deal with larger data sets became feasible. As opposed
to statistical techniques that require the user to have a hypothesis in mind
first, these algorithms automatically analyze data and identify relationships
among attributes and entities in the data to build models that allow domain
experts (i.e., non-statisticians) to understand the relationship between the
attributes and the class. The "hypothesize-and-test" paradigm has therefore
given way to a "test-and-hypothesize" paradigm.
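A minimal sketch can make the "test-and-hypothesize" idea concrete. The code
below is not taken from any particular data mining tool. It is a simple
one-rule learner, in the spirit of the OneR classifier, that tries every
attribute, builds a rule mapping each attribute value to its most common
class, and reports the attribute whose rule fits the data best. The credit
approval records, the attribute names, and the function name are all
hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict

def one_rule(rows, attributes, target):
    """Return (attribute, rule, accuracy) for the best single-attribute rule.

    For each attribute, the rule maps every observed value to the majority
    class among rows with that value; accuracy is measured on the same rows.
    """
    best = None
    for attr in attributes:
        # Tally class labels separately for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c.most_common(1)[0][1] for c in counts.values())
        accuracy = correct / len(rows)
        if best is None or accuracy > best[2]:
            best = (attr, rule, accuracy)
    return best

# Hypothetical toy records; no real credit data is implied.
rows = [
    {"income": "high", "owns_home": "yes", "approved": "yes"},
    {"income": "high", "owns_home": "no",  "approved": "yes"},
    {"income": "low",  "owns_home": "yes", "approved": "no"},
    {"income": "low",  "owns_home": "no",  "approved": "no"},
    {"income": "high", "owns_home": "yes", "approved": "yes"},
    {"income": "low",  "owns_home": "yes", "approved": "no"},
]

attribute, rule, accuracy = one_rule(rows, ["income", "owns_home"], "approved")
print(attribute, rule, accuracy)
```

No hypothesis is supplied in advance: the program tests each attribute and
only then proposes the relationship (here, that income predicts approval),
which a domain expert can read directly from the rule. A real tool would
also validate the rule on data it was not built from.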
As a result of these developments, data mining flowered during the late
1990s. Retail companies eagerly applied complex analytical capabilities to
their data to increase their customer base. The financial community found
trends and patterns to predict fluctuations in interest rates, stock prices,
and economic demand. The Financial Crimes Enforcement Network (FinCEN) in the
Treasury Department built an application using sophisticated link analysis
visualization tools that has helped analysts identify over $1 trillion in
laundered money during the past eight years. These successes have contributed
to the overall popularity of data mining. They demonstrate that, rather than
requiring a human to attempt to deal with tens or hundreds of attributes, data
mining allows automatic analysis of data and the recognition of trends and
patterns.
As described in this brief history, data mining is actually the synthesis of
several technologies, including data management, statistics, machine learning
(which can include pattern recognition techniques), and visualization. Today,
data mining tools are capable of classifying data sets, associating certain
attributes or entities with other attributes or entities, segmenting the data
into similar clusters, and identifying outliers in the data. The entire
process of knowledge discovery in databases (KDD) includes collection,
abstraction, and cleansing of the data, use of data mining tools to find
patterns, validation and verification of the patterns, visualization of the
developed models, and refinement of the collection process.
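Two of these steps, cleansing and outlier identification, can be sketched in
a few lines of Python. This is only an illustration under stated
assumptions: the transaction amounts are made up, the function names are
hypothetical, and the z-score rule with a threshold of 2.0 is one simple
choice among many outlier tests, not the method of any specific KDD tool.

```python
from statistics import mean, pstdev

def cleanse(raw_values):
    """Cleansing step: drop missing entries (represented here as None)."""
    return [v for v in raw_values if v is not None]

def find_outliers(values, threshold=2.0):
    """Mining step: flag values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # All values identical, so nothing stands out.
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one missing entry.
raw = [10, 12, None, 11, 13, 12, 11, 95]
clean = cleanse(raw)
outliers = find_outliers(clean)
print(clean)
print(outliers)
```

In a full KDD process, any flagged value would then go through validation
and verification before being treated as a meaningful pattern.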
The number of commercial off-the-shelf (COTS) data mining tool vendors is
shrinking, as larger, more stable companies buy out the start-ups. At the same
time, the tools are absorbing each other (e.g., a decision tree tool absorbs a
clustering technique into a tool that then provides both decision trees and
clusters). Although the tools that remain have similar capabilities, their
usability varies greatly because of differences in the user interface, the
visualization of output patterns, and the ease of adjusting the parameters of
each data mining technique. The smaller, less capable PC versions of the
tools are truly "user friendly"; still, the user must understand the basics
of the entire KDD process, especially validation of the rules. With
unstructured data, such as text, imagery, video, and audio, COTS
tools are very limited. They cannot accommodate, for example, newer technology
for text summarization and text mining, which seeks to integrate information
retrieval and language understanding techniques with machine learning and
statistical techniques to obtain a summary of one or more documents.
For more information, contact
Patricia Carbone at 703-883-5523 or carbone@mitre.org
MITRE is an independent, not-for-profit company that provides technical
support to the government. Chartered to work in the public interest, it
operates Federally Funded Research and Development Centers for the Department
of Defense, the Federal Aviation Administration and the Internal Revenue
Service.