Next Article Table of Contents Previous Article

Analysis & Commentary:

DATA PREPARATION AND DATA ANALYSIS
by Ed Colet

Underlying a successful data analytic or data mining application are two general phases. Data preparation is a phase that addresses the collection and storage of data. Data analysis involves the analysis of data once the storage and representation issues have been resolved. These two phases require distinct sets of technical skills. Surprisingly enough, despite their interdependence the required skills often requires different and separate teams of people, that can introduce unnecessary delays unless they work in concert. But perhaps there is a better approach.

Data preparation is a critical part of the process before any data mining or data analysis can occur. Often times, this is where the bulk of development time associated with developing a data mining application. Data preparation involves taking raw data from the operational or transactional systems and moving it to data storage systems such as a large scale warehouse. The data in the storage systems are usually further processed to create the appropriate sets of relational tables, and logical views.

The particular architecture for data storage is driven primarily by two factors: One factor is based on the subsequent analysis that will be conducted on the data. An example of this is the design of OLAP cubes for the "slicing and dicing" of data typical of an OLAP analysis. A poor design means the data and the dimensions are not represented optimally for the appropriate types of summarization. In the worst case, summary computations and results are not possible -- and how the data are stored has to be reworked. In other cases, it's processing time and response times that suffer but summary reports are eventually output.

Processing time and response time is another factor that can drive the design of the database in the data preparation phase. Deciding to implement several tables as opposed to a single table containing all data, deciding how distinct tables are to be related to each other, what attributes should serve as primary keys within each of the tables, what tables should be created as logical views, etc. etc. are all design issues that a database designer works through. Oftentimes, the motivation for such design decisions is to minimize the amount of disk space and/or query response time. Optimization of query response time is based on designing for the most frequent queries that can be expected which may not be the same as the processing associated with sophisticated data analysis for model building.

After the data preparation and storage issues have been worked out, the data in a position to be analyzed. The objective of data analysis is to look for systematic patterns in the data that can impact the organization's business practices. Data analysis can take many forms from simple queries that count totals and sub-totals to more sophisticated model building analysis such as regression equations.

While SQL is the language commonly used to store and manipulate the data during the data preparation phase, SQL is also the language used for certain types of data analysis -- such as frequency counts and sub-totals. But it is generally acknowledged that SQL has performance limitations when used to analyze large datasets -- even if the underlying analysis is a simple frequency count. In order to be able to develop more sophisticated models on large data sets, specialized statistical analytic tools are often employed. Data analysts often use other programming languages such as SAS or S-plus, rather than SQL for statistical model building. If the data are still kept within the data tables, it is necessary for the analyst to fully understand the database architecture and the underlying tables. Or, the data analyst may extract and export the data from the tables into a separate statistical analytic application such as SAS for statistical analysis.

Herein lies a potential problem. If the data are exported to a statistical application for analysis, much of the careful work carried out during the database design and data storage phase does not help during the data analytic phase. If on the other hand, the data remain within the data tables for analysis but the analyst does not have the necessary DBA skills, then two persons (or two teams of people) are required to build the data analytic/mining application. Data preparation/data storage skills are often separate from the data analytic skills. Because of this disjoint there can be productivity delays and disruptions in the application building process.

This separation of skill sets can be bridged by ensuring that DBA's can themselves build the sophisticated statistical models by acquiring the necessary skills in statistical analysis. Alternatively, the analysts could acquire the skills of the DBA's so that they're familiar with the database tables and can manipulate and create necessary views to support their analysis. But in practical terms, it's unlikely to expect that DBAs become equally adepts as statisticians; it is also unlikely to expect that analysts become equally familiar with database architectures and database design issues.

A more realistic solution would be to ensure that the application building tools are easy enough to use by the end-user (whether that end-user be the DBA, the analyst, or someone else) so that both the data preparation aspects and the data analytic operations can be carried out in concert with each other. This means that the representation and storage of data be optimized for system performance while being compatible with subsequent data analytic routines. Virtual Gold's VirtualMiner Enterprise (VME) for application building is an approach designed in part along these lines.


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see www.virtualgold.com.

Top of Page


Previous Article  |  Table of Contents  |  Next Article