Analysis & Commentary:
DATA PREPARATION AND DATA ANALYSIS
by Ed Colet
Underlying a successful data analytic or data mining application are two
general phases. Data preparation is a phase that addresses the collection and
storage of data. Data analysis involves the analysis of data once the storage
and representation issues have been resolved. These two phases require
distinct sets of technical skills. Surprisingly enough, despite their
interdependence the required skills often requires different and separate
teams of people, that can introduce unnecessary delays unless they work in
concert. But perhaps there is a better approach.
Data preparation is a critical part of the process before any data mining or
data analysis can occur. Often times, this is where the bulk of development
time associated with developing a data mining application. Data preparation
involves taking raw data from the operational or transactional systems and
moving it to data storage systems such as a large scale warehouse. The data
in the storage systems are usually further processed to create the appropriate
sets of relational tables, and logical views.
The particular architecture for data storage is driven primarily by two
factors: One factor is based on the subsequent analysis that will be
conducted on the data. An example of this is the design of OLAP cubes for the
"slicing and dicing" of data typical of an OLAP analysis. A poor design means
the data and the dimensions are not represented optimally for the appropriate
types of summarization. In the worst case, summary computations and results
are not possible -- and how the data are stored has to be reworked. In other
cases, it's processing time and response times that suffer but summary reports
are eventually output.
Processing time and response time is another factor that can drive the design
of the database in the data preparation phase. Deciding to implement several
tables as opposed to a single table containing all data, deciding how distinct
tables are to be related to each other, what attributes should serve as
primary keys within each of the tables, what tables should be created as
logical views, etc. etc. are all design issues that a database designer works
through. Oftentimes, the motivation for such design decisions is to minimize
the amount of disk space and/or query response time. Optimization of query
response time is based on designing for the most frequent queries that can be
expected which may not be the same as the processing associated with
sophisticated data analysis for model building.
After the data preparation and storage issues have been worked out, the data
in a position to be analyzed. The objective of data analysis is to look for
systematic patterns in the data that can impact the organization's business
practices. Data analysis can take many forms from simple queries that count
totals and sub-totals to more sophisticated model building analysis such as
regression equations.
While SQL is the language commonly used to store and manipulate the data
during the data preparation phase, SQL is also the language used for certain
types of data analysis -- such as frequency counts and sub-totals. But it is
generally acknowledged that SQL has performance limitations when used to
analyze large datasets -- even if the underlying analysis is a simple
frequency count. In order to be able to develop more sophisticated models on
large data sets, specialized statistical analytic tools are often employed.
Data analysts often use other programming languages such as SAS or S-plus,
rather than SQL for statistical model building. If the data are still kept
within the data tables, it is necessary for the analyst to fully understand
the database architecture and the underlying tables. Or, the data analyst may
extract and export the data from the tables into a separate statistical
analytic application such as SAS for statistical analysis.
Herein lies a potential problem. If the data are exported to a statistical
application for analysis, much of the careful work carried out during the
database design and data storage phase does not help during the data analytic
phase. If on the other hand, the data remain within the data tables for
analysis but the analyst does not have the necessary DBA skills, then two
persons (or two teams of people) are required to build the data
analytic/mining application. Data preparation/data storage skills are often
separate from the data analytic skills. Because of this disjoint there can be
productivity delays and disruptions in the application building process.
This separation of skill sets can be bridged by ensuring that DBA's can
themselves build the sophisticated statistical models by acquiring the
necessary skills in statistical analysis. Alternatively, the analysts could
acquire the skills of the DBA's so that they're familiar with the database
tables and can manipulate and create necessary views to support their
analysis. But in practical terms, it's unlikely to expect that DBAs become
equally adepts as statisticians; it is also unlikely to expect that analysts
become equally familiar with database architectures and database design
issues.
A more realistic solution would be to ensure that the application building
tools are easy enough to use by the end-user (whether that end-user be the
DBA, the analyst, or someone else) so that both the data preparation aspects
and the data analytic operations can be carried out in concert with each
other. This means that the representation and storage of data be optimized
for system performance while being compatible with subsequent data analytic
routines. Virtual Gold's VirtualMiner Enterprise (VME) for application
building is an approach designed in part along these lines.
Ed Colet is the Acting Director of Research at Virtual Gold
Inc.,
responsible for developing analytical methods for data mining and for
investigating human factors and usability issues of business intelligence
systems. At present, he is in the final stage of completing a doctoral
dissertation in the Cognition and Perception program at New York
University's Department of Psychology. Ed has also worked for IBM Research
at the T.J. Watson Research Center. At IBM, Ed was a member of the group
that developed Advanced Scout, the data mining application for NBA teams.
His research interests focus on statistical methods and human factors.
For more information, see www.virtualgold.com.
|