DATA MINING BESIDE A WAREHOUSE
by Dr. Kamran Parsaye, Information Discovery, Inc.
Introduction
Data warehousing, OLAP and data mining have often been viewed as related activities. Yet, as I showed in New Realms of Analysis (Database Programming and Design, April 1996), they work on different computational spaces. Data access operations such as query and reporting deal with the data space, OLAP uses the multi-dimensional space and data mining takes place in the influence space. Four spaces form the basis of decision support: the spaces for data, aggregation, influence and variation. A fifth space based on geographic relationships may also be used for some analyses.
A data warehouse is thus the natural place for storing the "data space". It is where we store base level data elements that are later analyzed to deliver information. And, just as OLAP is no longer viewed as a pure warehousing effort, a datamine is where we perform analyses to deal with the "influence space".
When using a datamine, we bring a segment (and not a sample) of data from a warehouse (or other sources) to the datamine and perform discovery or prediction. The process of mining this data segment is called an Analysis Session. For example, we may want to predict the response to a proposed direct mail campaign by analyzing previous campaigns for similar products, or we may want to know how customer retention has varied over various geographic regions, etc.
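The difference between a segment and a sample can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the campaign records, the field names (product_line, responded) and both helper functions are hypothetical, and a segment is read here as every record that meets a business criterion, while a sample is a random fraction of all records.

```python
import random

# Hypothetical campaign records; the field names are illustrative only.
campaigns = [
    {"id": i,
     "product_line": "credit_card" if i % 4 == 0 else "mortgage",
     "responded": i % 8 == 0}
    for i in range(1000)
]

def select_segment(rows, predicate):
    """Return every row that satisfies the business criterion (a segment)."""
    return [row for row in rows if predicate(row)]

def select_sample(rows, fraction, seed=0):
    """Return a random fraction of all rows, regardless of content (a sample)."""
    rng = random.Random(seed)
    return rng.sample(rows, int(len(rows) * fraction))

# The segment holds all previous campaigns for one product line.
segment = select_segment(campaigns, lambda r: r["product_line"] == "credit_card")

# The sample has the same size here but mixes product lines.
sample = select_sample(campaigns, 0.25)

# Response rate measured on the complete segment, not on an estimate.
segment_rate = sum(r["responded"] for r in segment) / len(segment)
```

An Analysis Session would then run on the segment, so any pattern found covers all the data relevant to that business objective.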
The Warehouse, the Mart and the Mine
There are three separate components to an enterprise-wide decision support system: the data warehouse, the data mart and the data mine.
While the data structures used within the warehouse and the mart may be similar, the data structures used within the datamine are significantly different. The data mine differs from the data warehouse not just in the size of data it manages, but in the structure of the data. The content of the data in the mine is also often different from the data in the warehouse, because it is often enriched by additional external data not found within the warehouse. However, content aside, the key issue about data mining architecture is that the existing theories of data structuring do not apply to it.
Data Mining Above, Beside and Within the Warehouse
Once we accept the fact that the data mine is distinct from the data warehouse, the next logical question is: "Where does the datamine actually exist? Is it a separate repository next to the warehouse, a set of views above the warehouse, or just part of the warehouse?" We can answer this question in each of these three ways and get a different architecture for the datamine.
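The datamine can exist in three basic forms: above the warehouse, as a set of views; beside the warehouse, as a separate repository; and within the warehouse, as part of the warehouse itself.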
Datamining "above the warehouse" provides a minimal architecture for discovery and analysis. It is suitable only in cases where data mining is not a key objective for the warehouse. In this approach, SQL statements are used to build a set of conceptual views above the warehouse tables. Additional external data from other tables may also be merged as part of the views.
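As a minimal sketch of such a view, the following Python fragment uses the standard sqlite3 module with an in-memory database. The table names (sales, demographics), the columns and the view name mining_view are all hypothetical; a real warehouse would use its own schema and DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical warehouse table
    CREATE TABLE sales (customer_id INTEGER, region TEXT, amount REAL);
    INSERT INTO sales VALUES (1, 'East', 120.0), (2, 'West', 80.0), (1, 'East', 40.0);

    -- Hypothetical external data that is not part of the warehouse
    CREATE TABLE demographics (customer_id INTEGER, income_band TEXT);
    INSERT INTO demographics VALUES (1, 'high'), (2, 'medium');

    -- Conceptual view "above" the warehouse, merging the external data
    CREATE VIEW mining_view AS
        SELECT s.customer_id, s.region, d.income_band,
               SUM(s.amount) AS total_amount
        FROM sales AS s
        JOIN demographics AS d ON d.customer_id = s.customer_id
        GROUP BY s.customer_id, s.region, d.income_band;
""")

# Each query against the view re-runs the join and the aggregation.
rows = conn.execute(
    "SELECT customer_id, region, income_band, total_amount "
    "FROM mining_view ORDER BY customer_id"
).fetchall()
```

The view stores no data of its own, so the warehouse tables do the work every time the view is queried.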
The views built above the warehouse may either be materialized (i.e. saved to disk as new tables), or not. Therein lies the fundamental problem (if not contradiction) built into this approach. If the views are not of significant size, then serious data mining cannot take place. However, if the views are of a significant size, then without materialization the effort of computing them again and again will require very large amounts of processing power -- in some cases significantly affecting the availability of the warehouse resources and interfering with other applications performing indexed retrievals.
On the other hand, if the views are of significant size and they are materialized, we are no longer datamining "above" the warehouse and will be using a disorganized form of the third approach, i.e. data mining within the warehouse. If the views are materialized, the third approach will almost always work better, because it can utilize a suitable data distribution approach and a specific processor allocation strategy, as well as different data structures for data mining. Without these precautions, the number of potential pitfalls increases rapidly, sacrificing both performance and functionality.
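The cost difference between a view that is recomputed and one that is materialized can be illustrated with a small sqlite3 sketch. Everything here is an assumption made for illustration: the sales table, the view and table names, and the traced function, which merely stands in for an expensive derivation and counts how often it runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Allow application-defined functions inside views (the default in most builds).
conn.execute("PRAGMA trusted_schema = ON")

calls = {"n": 0}

def traced(value):
    """Stand-in for an expensive derivation; counts how often it runs."""
    calls["n"] += 1
    return value * 2

conn.create_function("traced", 1, traced)
conn.execute("CREATE TABLE sales (amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?)",
                 [(float(i),) for i in range(100)])

# Non-materialized view: the derivation is recomputed on every query.
conn.execute(
    "CREATE VIEW derived_view AS SELECT traced(amount) AS derived FROM sales")
for _ in range(3):
    view_total = conn.execute(
        "SELECT SUM(derived) FROM derived_view").fetchone()[0]
view_calls = calls["n"]

# Materialized: computed once, saved as a new table, then only read.
calls["n"] = 0
conn.execute(
    "CREATE TABLE derived_table AS SELECT traced(amount) AS derived FROM sales")
for _ in range(3):
    table_total = conn.execute(
        "SELECT SUM(derived) FROM derived_table").fetchone()[0]
table_calls = calls["n"]
```

Both forms return the same totals, but the derivation runs again for every query against the view and only once when the table is built. The price of the materialized form is the extra storage, which is what turns it into a datamine inside the warehouse.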
Hence data mining above the warehouse should be restricted to applications in which data mining is only of peripheral business interest, and not a key objective. However, holding this view is often a big business mistake in itself -- i.e. why have so much data in a warehouse and not understand it?
In most cases, data mining is effectively performed beside the warehouse, with data structures that lend themselves to detailed analyses. And, in most cases additional data suitable for the analyses is merged with the warehoused data in order to perform specific analyses for focused business needs.
Data mining beside the warehouse both overcomes and sidesteps several problems at once. To begin with, it allows data mining to be done with the right data structures, avoiding the problems associated with the structures of the data space. Moreover, the paradox of warehouse patterns (see the next issue of DS*) can be avoided by selecting specific data segments, corresponding to specific business objectives. And, the interactive exploratory analyses that are often performed in the datamine with "wild rides" through the data no longer interfere with the warehouse resources that are responsible for routine processes such as query and reporting.
In fact, different business departments can use their own datamines that address their specific needs, e.g. direct marketing vs. claim analysis. The data is then moved from the large warehouse to the mine, restructured during the transformation and analyzed. It is, however, important to design the transfer and transformation methods carefully, in order to allow for optimal "refresh methods" that require minimal computing. For instance, as we bring new data into the datamine every day or every week, the overhead for re-aggregation should be minimized.
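One way to keep the re-aggregation overhead low is to store running aggregates and fold each new batch into them, instead of re-scanning all the data on every refresh. The following Python sketch shows the idea for per-key counts and sums. The class name IncrementalAggregate and the (key, value) row layout are hypothetical and are not taken from any particular product.

```python
from collections import defaultdict

class IncrementalAggregate:
    """Keeps per-key count and sum so that a refresh only touches new rows."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def refresh(self, new_rows):
        """Fold a batch of (key, value) pairs into the stored aggregates."""
        for key, value in new_rows:
            self.count[key] += 1
            self.total[key] += value

    def mean(self, key):
        """Average value for a key, derived from the stored aggregates."""
        return self.total[key] / self.count[key]

agg = IncrementalAggregate()
agg.refresh([("East", 100.0), ("West", 50.0)])  # initial load from the warehouse
agg.refresh([("East", 200.0)])                  # later refresh: new rows only
```

The cost of each refresh depends on the size of the new batch rather than on the size of the datamine. Aggregates that cannot be updated this way, such as medians, would need a different refresh method.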
In some cases, where the warehouse is a very large, massively parallel processing (MPP) computer, the data mine may actually reside as a separate repository within the large warehouse. This is very similar to a data mine beside the warehouse, where the mine uses a portion of the physical warehouse, but is independent of the warehouse structures, in effect being a "republic within a republic".
In this scenario, the disk space and the processors for the datamine are specifically allocated and separately managed. For instance, on a "shared nothing" MPP machine with 32 processors, the disk space for the data mine is separately allocated on 8 of the 32 nodes and 8 processors are dedicated to data mining, while the other 24 processors manage the rest of the warehouse. And, when needed, additional processing capability may be directed towards the needs of data mining.
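The allocation in this example can be written down as a small Python sketch. It only models the bookkeeping. The function names allocate_nodes and lend_to_mine are hypothetical, and a real MPP system would manage nodes through its own administration tools.

```python
def allocate_nodes(total_nodes, mining_nodes):
    """Split node ids into a dedicated mining partition and a warehouse partition."""
    if not 0 < mining_nodes < total_nodes:
        raise ValueError("mining_nodes must be between 1 and total_nodes - 1")
    nodes = list(range(total_nodes))
    return {"mine": nodes[:mining_nodes], "warehouse": nodes[mining_nodes:]}

def lend_to_mine(allocation, extra):
    """Temporarily direct `extra` warehouse nodes towards data mining."""
    if not 0 <= extra < len(allocation["warehouse"]):
        raise ValueError("the warehouse must keep at least one node")
    borrowed = allocation["warehouse"][:extra]
    return {
        "mine": allocation["mine"] + borrowed,
        "warehouse": allocation["warehouse"][extra:],
    }

# 8 of the 32 nodes are dedicated to the datamine, 24 to the warehouse.
allocation = allocate_nodes(total_nodes=32, mining_nodes=8)

# When needed, additional processing capability is lent to data mining.
boosted = lend_to_mine(allocation, extra=4)
```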
Although this idea may sound attractive based on arguments for centralization and scalability, in practice it usually leads to loss of flexibility, without providing any significant benefits for data mining. In most cases, when we consider the technical, marketing and business issues, the problems with mining within the warehouse multiply quite rapidly, and the datamines planned for use within the warehouse will eventually find themselves beside it.
The key point is that the likelihood of serving the needs of many people within the data space is much higher than the likelihood of serving their needs within the multi-dimensional and influence spaces. While the data elements may be almost the same for several departments, the dimensions, the influence relationships and the predictive models they all need will vary far more than their simple data needs. Hence, the datamine within the warehouse will soon become the lowest common denominator of all designs.
Therefore, while the design of the data space may be subject to compromises to please the various user groups, there should be no compromises in design of the data mine where serious and detailed analyses take place. The data mine should be optimized to deliver effective results by focusing on specific business needs, because influence analysis is so much harder than data access.
References:
Parsaye, K., "New Realms of Analysis". Database Programming & Design, April 1996.
"Large Scale Data Mining in Parallel". DBMS Magazine, March 1995.
Parsaye, K., Chignell, M.H., "Intelligent Database Tools and Applications". New York: John Wiley and Sons, 1993.
Parsaye, K., "Data Mines for Data Warehouses". Database Programming & Design, September 1996.
Parsaye, K., "OLAP and Data Mining: Bridging the Gap". Database Programming & Design, February 1997.
Parsaye, K., "Machine-Man Interaction". DM Review, September 1997.
For more information, see http://www.datamining.com