
THE DISJOINT BETWEEN DATA STORAGE AND DATA MINING
by Ed Colet


How data are stored and represented determines which analyses are subsequently possible. Despite this inextricable link between data storage and data analysis, the nature of the relationship is often overlooked. Data mining analyses are often conducted on data stored within relational database management systems (RDBMSs), yet doing so remains complex and difficult. As a result, the effective use of data mining may be limited.

In industrial settings, the most common form of data storage and representation is the RDBMS. An RDBMS is a system in which data are stored as a logical structure of tables, and these tables are implicitly related to each other via shared fields that contain matching values. Structured Query Language (SQL) is the standard language for accessing and retrieving information from RDBMSs. In a well-designed system, the accuracy and consistency of information are ensured at the field, table, table-relationship, and business levels.
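As a rough illustration of tables related through a shared field, the sketch below uses Python's built-in sqlite3 module. The table names, column names, and values (customers, orders, customer_id) are hypothetical and chosen only for this example; they do not come from any system discussed in this article.

```python
import sqlite3

# Hypothetical schema: two tables related through the shared field customer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)])

# SQL retrieves information by matching values in the shared field.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.name
""").fetchall()

for name, total in totals:
    print(name, total)
```

The relationship between the two tables is never stored as a separate object; it exists only because the customer_id values match, which is what "implicitly related" means here.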

In research settings, where data mining researchers are actively working on better ways to find patterns hidden in large amounts of data, these data storage issues are rarely given a second thought: the data are simply assumed to be stored "adequately".

In practice, the majority of time and resources is spent pre-processing the data to prepare it for data mining. This is symptomatic of a productivity and efficiency gap between data storage systems and data mining applications.

In deploying a solution, the data mining application has to be able to connect with a distinct data storage system. Having a prerequisite RDBMS installed enables the data mining application to safely assume that data are stored in a well-defined and well-understood structural framework (of relational tables).

Currently, this separation of the application that uses the data from the application that stores the data is viewed as an advantage, namely data independence: changes in the underlying database design do not affect the applications that depend on the data. At a systems-design level, this separation is indeed an advantage.

From a systems level, the data mining application expects data to be stored in tables. But the data within a table have to be represented and structured in a particular way for a particular analysis. For example, a standard statistical analysis such as a Pearson correlation usually expects the data to be represented across several variables (columns), and the analysis looks for associations across those variables. An association rules analysis (loosely, a form of correlation), on the other hand, can only look for associations among values within a single variable. If the data are spread across columns, the original data must be transposed so that associations within a variable can be found. This transpose is conceptually straightforward, but when large amounts of data need to be transposed, the operation may be far from trivial.
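The transpose described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the field names (basket_id, item_1, item_2, item_3), the sample values, and the helper functions wide_to_long and pair_counts are all hypothetical, and counting co-occurring pairs is just the first step of an association rules analysis, not a full implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical wide layout: one row per basket, items spread across columns.
wide_rows = [
    {"basket_id": 1, "item_1": "bread", "item_2": "milk", "item_3": None},
    {"basket_id": 2, "item_1": "bread", "item_2": "milk", "item_3": "eggs"},
    {"basket_id": 3, "item_1": "milk", "item_2": "eggs", "item_3": None},
]

def wide_to_long(rows, id_field):
    """Unpivot wide rows into (id, value) pairs, skipping empty cells."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_field and value is not None:
                long_rows.append((row[id_field], value))
    return long_rows

def pair_counts(long_rows):
    """Count how many baskets contain each unordered pair of items."""
    baskets = {}
    for basket_id, item in long_rows:
        baskets.setdefault(basket_id, set()).add(item)
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts

# After the transpose, all items live in a single variable (the second field).
long_rows = wide_to_long(wide_rows, "basket_id")
counts = pair_counts(long_rows)
print(counts.most_common())
```

The sketch holds everything in memory, which is exactly what stops being practical once the table is large; at that scale the transpose becomes a data management problem in its own right.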

From an end-user perspective, usability can suffer if the task of manipulating data is too separate from the task of analyzing data. Often, data manipulations are done by the user from within the RDBMS application (using SQL), while the particular settings for an analysis (thresholds, identification of dependent and independent variables, etc.) are chosen from within the data mining application. From a usability perspective, shifting between two distinct applications to perform related tasks is unnecessarily complicated.

In current data mining deployments, this disjunction can adversely affect productivity. There are alternative approaches. One is to conduct both the data manipulation and the data analysis tasks from within the same environment. This is reminiscent of the use of SAS programs, which both manipulate and analyze data via DATA and PROC steps (though this requires learning to write SAS programs). Early data mining approaches were characterized by attempts to implement data mining analyses as SQL statements, but it was discovered that there were severe performance limitations in trying to extend SQL to do this. A more recent trend is to move data mining capabilities into the database engine, and therefore into the RDBMS itself. NCR's TeraMiner approach moves core data mining capabilities into its Teradata database, and Computer Reseller News reported that IBM has plans to put some statistical functions from Intelligent Miner into DB2.
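To give a sense of what expressing a mining step as a SQL statement can look like, here is a minimal sketch that counts co-occurring item pairs with a self-join, again using Python's built-in sqlite3 module. The table basket_items, its columns, the sample rows, and the MIN_SUPPORT threshold are hypothetical; this is not the method used by any of the products named above, and it assumes each (basket_id, item) pair appears at most once.

```python
import sqlite3

# Hypothetical long-format table: one row per (basket, item).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE basket_items (basket_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO basket_items VALUES (?, ?)", [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"),
])

MIN_SUPPORT = 2  # hypothetical threshold: minimum number of baskets

# Self-join on basket_id pairs up items bought together; a.item < b.item
# keeps each unordered pair once and excludes an item paired with itself.
frequent_pairs = conn.execute("""
    SELECT a.item, b.item, COUNT(*) AS support
    FROM basket_items AS a
    JOIN basket_items AS b
      ON a.basket_id = b.basket_id AND a.item < b.item
    GROUP BY a.item, b.item
    HAVING COUNT(*) >= ?
    ORDER BY a.item, b.item
""", (MIN_SUPPORT,)).fetchall()

for first, second, support in frequent_pairs:
    print(first, second, support)
```

The self-join produces a number of intermediate rows that grows roughly with the square of the basket size, and larger item sets need further joins, which gives some intuition for why a pure SQL formulation can run into performance trouble.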


Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

