WHERE WILL MODELS LIVE?
by John K. Thompson
Now that the standards camps are beginning to form, many questions will be debated at conferences, at trade shows, in marketing collateral such as product slicks and other glossies, and, of course, in the most popular weapon in the high-tech marketplace, the white paper.
One question that has been consistently discussed over the past 12 to 18 months is: where will the models that are produced in the data mining phase of the process reside? Will the predictive and descriptive models reside in the relational database management systems (RDBMS), or will they reside in another location? One location that has been put forward as a possibility is the class of systems being designed to store and manage XML documents. I am certain that there are other ideas in the marketplace that I am not privy to, and those will come to light in the near future, but for now let's confine this article to the two options mentioned.
First, let's consider the RDBMS option. Storing and maintaining models in the RDBMS leverages the existing infrastructure that virtually all corporations have in place today. With the extension of the RDBMS engine to include Object/Relational storage and management capabilities, these systems are certainly capable of storing models as objects and managing that class of objects effectively.
Herb Edelstein, of Two Crows, noted in the most recent issue of Data Mining News that storing models in the RDBMS improves performance, that improved performance leads to better models, and that better models are the bottom line. This line of reasoning assumes that data mining will be performed against data that is resident in data warehouses or other data collections maintained in merchant databases. I think that this is a good assumption, but we must be careful about when it will actually become the norm. Today, mining data in place in the database is more marketing hype than reality.
This approach is one that will feel more comfortable to the personnel in the Information Systems (IS) departments. Models can be built against the databases or data warehouses and then stored in the same RDBMS as the data that produced the model. This environment does not require new skills or new personnel, but extends the use of existing technologies and leverages the existing skill base.
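To make the idea concrete, here is a minimal sketch of a model being built against a table and then stored in the same database as the data that produced it. Everything in it is an assumption made for illustration: Python's built-in sqlite3 module stands in for a corporate RDBMS, the `sales` and `models` tables are hypothetical, and the model is a simple least-squares line serialized as JSON text rather than a true Object/Relational type.

```python
import json
import sqlite3


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return {"type": "linear", "slope": slope, "intercept": mean_y - slope * mean_x}


# An in-memory database stands in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, definition TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)],
)

# Build the model against the data where it already lives.
rows = conn.execute("SELECT ad_spend, revenue FROM sales").fetchall()
model = fit_line(rows)

# Store the model next to the data that produced it.
conn.execute(
    "INSERT INTO models VALUES (?, ?)",
    ("revenue_by_ad_spend", json.dumps(model)),
)
conn.commit()

# Later: load the stored model back into memory and apply it.
row = conn.execute(
    "SELECT definition FROM models WHERE name = ?",
    ("revenue_by_ad_spend",),
).fetchone()
stored = json.loads(row[0])
prediction = stored["slope"] * 4.0 + stored["intercept"]
```

The point of the sketch is only that ordinary SQL skills cover both the data and the model; a production system would also need the access controls and metadata that a real RDBMS deployment demands.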
RDBMS products are expanding to include a variety of different analytical resources. Recently they have been expanded to include the multi-dimensional data cubes widely used in On-Line Analytical Processing (OLAP) systems. SPSS recently announced an interface from its statistical tools into the multi-dimensional data stored in the Oracle Express product. The RDBMS is being viewed by many in the Business Intelligence software marketplace as the central storage mechanism for all of the data and analysis that is used and produced by an organization's analytical community.
This approach is sensible, leverages and extends existing resources, and is being incrementally realized by the individual contributions made by a number of independent software vendors.
Second, let's consider the option of storing and managing models as XML documents. Storing and maintaining models in these emerging repositories gives corporations a compelling reason to consider implementing XML repositories. The proposed repositories are being built with the document management paradigm in mind. This is a concept that is simple to understand and easy to use. The user creates a model and stores the model in the repository. The model is then refined a day or a month or a year later, and the new version is stored along with the old version.
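The store-then-refine cycle described above can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's design: the `ModelRepository` class, its `store` and `fetch` methods, and the XML element names are all hypothetical, and an in-memory dictionary stands in for a real document repository.

```python
import xml.etree.ElementTree as ET


class ModelRepository:
    """Hypothetical repository that keeps every version of each model."""

    def __init__(self):
        self._versions = {}  # model name -> list of XML strings, oldest first

    def store(self, name, model_element):
        """Append a new version and return its 1-based version number."""
        xml_text = ET.tostring(model_element, encoding="unicode")
        self._versions.setdefault(name, []).append(xml_text)
        return len(self._versions[name])

    def fetch(self, name, version=None):
        """Return the requested version, or the latest if none is given."""
        history = self._versions[name]
        xml_text = history[-1] if version is None else history[version - 1]
        return ET.fromstring(xml_text)


def make_model(threshold):
    """Build a toy rule model as an XML element (element names are made up)."""
    model = ET.Element("model", {"type": "threshold-rule"})
    ET.SubElement(model, "field").text = "monthly_spend"
    ET.SubElement(model, "threshold").text = str(threshold)
    return model


repo = ModelRepository()
first = repo.store("churn", make_model(100))   # the original model
second = repo.store("churn", make_model(85))   # the refined model, stored alongside it

latest_threshold = repo.fetch("churn").findtext("threshold")
original_threshold = repo.fetch("churn", version=1).findtext("threshold")
```

Because the old version is never overwritten, anyone checking out the model can compare the refined version with the original.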
This mode of operation makes it easy for the people who are actually generating models to manage the portfolio of models themselves. The majority of personnel who are currently generating models for organizations are employed outside of the IS organization and, as such, probably do not have read/write access to any part of the RDBMS environments maintained by the organization. Database Administrators (DBAs) are loath to give end-users write access to their environments. Who knows what they will store in there? But in the end, isn't that the idea: to allow the modelers the freedom to build as many models as it takes to achieve the maximum possible level of descriptive or predictive power? RDBMS products are powerful and useful tools, but they are often controlled by the IS staff. Using repositories outside of the RDBMS eliminates some of the restrictions that would inhibit the productivity of the people actually building models.
Storing models in stand-alone repositories allows a wide variety of personnel to view, check out, and use the models through an easy-to-use, document-style search and retrieval system.
The performance of these models should not be significantly different from that of models stored in the RDBMS. Models need to be loaded into memory and executed in either scenario. What does make a significant difference in performance are the access methods employed, the data organization, and the algorithm implementations, but these are considerations in both scenarios.
Storing models in these new facilities does bring one serious consideration to mind, although in reality the issue must be dealt with in either implementation: the creation, storage, and management of metadata. This has been a thorn in the side of the Data Warehouse and Business Intelligence community for quite some time now. No one has come up with a convincing and compelling solution to the problem. Each OLAP, ad hoc query, production reporting, and data warehouse creation and management system, and now each data mining system, is out there with its own proprietary implementation of metadata management. This issue is complex in its current form and will become even more complicated with the entry of data mining tools into the equation.
At this point I am agnostic about which approach has more merit. I can argue convincingly for both approaches. From a quick review of the players in the field, it is my observation that more and larger players are aligning themselves with the RDBMS-resident camp, but that makes sense because RDBMS vendors and their partners are entrenched in the market and have much to gain from winning this debate. The players that are advocating the document management approach are newer to the analytical marketplace, but they bring a compelling argument to the game. The basic proposition is that the models should be stored in a format that is an open standard, and that the models should be managed, in an easy-to-use manner, by the people who create them. They go on to conclude that the content of the models should not matter so much as the ability to access and use the model simply and quickly. I don't fully agree with this last point, but I agree with the premise that the models should be easy to manage and access.
It's early in the game. Time will tell where the models will go to live.
---
John Thompson, Vice President - Marketing, Magnify, Inc.
I'd like to hear your thoughts. You can reach me at jkt@magnify.com