COLLECTING AND STORING DATA
By Ed Colet, DSstar


Data collection is an obvious prerequisite for data mining. Decisions about what information is collected, how it is collected, and how the data are stored can all affect the data mining that follows. This column looks at some of the important data collection and data storage issues that must be addressed to ensure accurate data analysis and data mining.

Collecting data correctly can be time-consuming and difficult, but it is necessary for valid results. If the domain and the planned analysis are well understood, as in a scientific experiment that tests a specific hypothesis, then it is easier to decide what data to collect and how to collect them. In other cases, where the domain is less well understood and hypotheses may not be clearly specified, it is difficult to apply the same rigorous data collection methods. The net effect is that a substantial amount of effort has to be devoted to data preparation.

There are several ways to acquire data, although in current organizations the problem is an abundance rather than a scarcity of data. Data acquisition methods can be placed on a continuum ranging from active data collection to incidental data collection. Active data collection is the collection of data explicitly for a specific analysis that tests a specific hypothesis. An example is a scientific experiment in which the variables are defined, the collection mechanisms are designed, and the data analysis procedures are known ahead of time. In marketing, an example of active data collection is a specifically designed survey questionnaire administered to a well-specified subpopulation. Active data collection also typically involves ensuring that the data are "clean" - i.e., statistical measures of reliability and validity are assessed and their values are deemed acceptable.
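
As a rough illustration of what assessing reliability can look like for survey data, the sketch below computes Cronbach's alpha, one commonly used measure of internal consistency. The choice of this particular measure, the function name, and the survey values are assumptions made for illustration; the column does not prescribe a specific statistic, and what counts as an acceptable value depends on the field.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents' item scores.

    `responses` is a list of rows, one per respondent, each holding the
    scores for the same k questionnaire items (k >= 2).
    """
    k = len(responses[0])
    items = list(zip(*responses))  # regroup scores: one tuple per item
    sum_item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical 5-point survey answers: 5 respondents, 3 items.
survey = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
    [3, 3, 3],
]
alpha = cronbach_alpha(survey)
print(f"Cronbach's alpha: {alpha:.3f}")
```

Values closer to 1 indicate that the items vary together across respondents, which is one piece of evidence that the questionnaire measures something consistently.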

Incidental data collection refers to the acquisition of data that were originally collected for one purpose but are being analyzed for another. A lot of the data within organizations are of this nature. For example, the responses to surveys may be "reused" for other analyses. The common practice of purchasing data from third-party sources is another example of incidental data collection; in practice, many direct marketing campaigns are based on such purchased data. Relative to active data collection, it is more difficult to ensure that data collected incidentally are clean, because the "history" of such data (i.e., whether and how the data have been transformed or aggregated) may not always be known. This can affect analysis: for example, a linear trend may appear in the data but actually be an artifact of a transformation applied to an underlying process that is non-linear.
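
The transformation artifact described above can be reproduced with a small sketch. It assumes a hypothetical exponential process whose values were log-transformed by a supplier before delivery; the growth rate, the number of periods, and all names are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical underlying process: exponential growth over 20 periods.
periods = list(range(1, 21))
raw_values = [100.0 * math.exp(0.3 * t) for t in periods]

# Assumption: the supplier log-transformed the values before delivery
# and did not document that step.
delivered_values = [math.log(v) for v in raw_values]

r_raw = pearson_r(periods, raw_values)
r_delivered = pearson_r(periods, delivered_values)
print(f"linear correlation, raw data:       {r_raw:.3f}")
print(f"linear correlation, delivered data: {r_delivered:.3f}")
```

The delivered series lies on a straight line, so an analyst who does not know about the transformation could report a linear trend for a process that in fact grows exponentially.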

Unfortunately, with active data collection methods and especially with incidental data collection methods, it's easy and convenient to assume that data are clean. Assuming this in error can be costly, but undertaking processes to ensure data are clean can also be expensive and time-consuming.

Because data mining involves large amounts of data, it's not enough to consider only what data are collected and how. Storage must also be addressed because it affects the performance of data mining analyses on the data set. Relational databases can store and manage data efficiently, but large-scale analysis often requires data to be represented in specialized ways. An example is the use of OLAP (Online Analytical Processing) cubes. OLAP is a technology for generating reports about data. In an OLAP system, data are represented as a cube, which is essentially a multi-dimensional database (MDD). This enables rapid processing, so that response times are in seconds rather than hours, and affords additional analytical capabilities that are not possible with straight SQL (Structured Query Language) queries on data in a relational database.

Designing OLAP cubes is a non-trivial task. First, one must identify the dimensions of a cube. The types of queries that are expected determine which attributes are translated into the cube's dimensions. The resulting sub-cubes represent combinations of particular ranges of those attributes. An OLAP query can then efficiently find the relevant sub-cubes and use them as the building blocks of the reported output, which is much more efficient than having an SQL query access individual tables to find the information. If the dimensions are defined incorrectly or inappropriately, both performance and the range of possible queries are limited.
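
The idea of answering queries from pre-aggregated sub-cubes can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular OLAP product is implemented: the fact rows, the three dimensions, and the `build_cube` and `query` helpers are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, quarter, sales amount).
FACTS = [
    ("East", "Widget", "Q1", 120.0),
    ("East", "Widget", "Q2", 80.0),
    ("East", "Gadget", "Q1", 50.0),
    ("West", "Widget", "Q1", 200.0),
    ("West", "Gadget", "Q2", 70.0),
]
DIMENSIONS = ("region", "product", "quarter")

def build_cube(facts, dimensions):
    """Pre-aggregate the measure for every combination of dimensions.

    Returns a dict mapping a tuple of dimension names to a dict of
    {tuple of dimension values: summed measure}.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), size):
            cells = defaultdict(float)
            for row in facts:
                key = tuple(row[i] for i in dims)
                cells[key] += row[-1]  # last field is the measure
            cube[tuple(dimensions[i] for i in dims)] = dict(cells)
    return cube

def query(cube, **fixed):
    """Look up a pre-computed total for the given dimension values."""
    dims = tuple(d for d in DIMENSIONS if d in fixed)
    key = tuple(fixed[d] for d in dims)
    return cube[dims].get(key, 0.0)

cube = build_cube(FACTS, DIMENSIONS)
east_total = query(cube, region="East")
widget_q1 = query(cube, product="Widget", quarter="Q1")
grand_total = query(cube)
print(east_total, widget_q1, grand_total)
```

Each query is a single dictionary lookup because the totals were computed when the cube was built. The same sketch also shows the design risk: a question about an attribute that was not chosen as a dimension (for example, a hypothetical sales channel) cannot be answered from the cube at all.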

Only after data collection and data storage issues have been addressed can data mining be performed. As we've seen, both data collection and data storage require careful consideration to ensure that subsequent data mining is accurate.

---

Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

