THE STANDARDS RACE BEGINS IN DATA MINING
by John K. Thompson
It's official! The Data Mining Industry has decided that it can benefit from standards. The most important question is: What should be standardized and who should control "the standard"? The wrangling has begun.
Technologies for use in analytical applications have gone through this cycle at least twice in the past 15 years. First it was Relational Database Management Systems (RDBMS) and Structured Query Language (SQL); most recently, On-Line Analytical Processing (OLAP) systems went through the standards gauntlet. Both technologies have matured, and vendors and consumers have benefited greatly from standardization. Standardization has allowed developers to write access, transformation, analysis, and presentation logic with relative confidence that what they have written will work on multiple systems across a wide variety of platforms.
The same cannot be said for any part of the data mining world today. At present, data mining is akin to the Wild West: pretty much anything goes, and nothing works in conjunction with technologies that have been implemented in related systems. Each vendor has its own prescribed processes fitting into its own overarching methodology. No matter how many methodologies and subsequent processes are polished and presented, at implementation time the result is the same: the data must be extracted, transformed, mined, and interpreted.
This is, of course and by necessity, a high-level and simplistic view, but it is the essence of the issue. Data Mining systems on offer today are hard to use, require specialized skills, and do not interoperate well with existing analysis systems. For these reasons and more, vendors have decided that it's time to start the standards efforts.
I have extracted the following quote from the June 23, 1998 issue of KDNuggets, an electronic newsletter focused on Knowledge Discovery. To me, the quote illustrates an interesting viewpoint. The authors are referring to the broader field of Knowledge Discovery (KD), which encompasses Data Mining:
"The field is in the early adoption phase in the market, and we expect that within about 3-5 years, commercial products and the vendors will start entering the maturation phase. Within the next 10 years, in some form, the technology of data mining and knowledge discovery in data will become an integral part of the client/server enterprise information technology."
This quote would indicate that the KD industry, and therefore Data Mining, is still in the garage stage: its authors believe that the field is just on the cusp of being acceptable to innovators. I am starting to stray from the topic at hand, so let me conclude this point. Discussing market timing is always interesting and fun, but if the industry is truly this far from mainstream adoption, then only the large, well-funded corporations and the truly frugal smaller players will be around to see widespread adoption.
Back to the topic at hand: standards. There are three standards efforts currently underway: SIGKDD, CRISP-DM, and the DMG.
Quite a collection of approaches. SIGKDD is broad-ranging and has support from some very large organizations. CRISP-DM is promising and has the backing of the EU. The DMG is a targeted effort around building reliable and robust portable models. From the outside, these efforts look to be egalitarian and focused on bringing value to the consumers of data mining systems and technologies. In the end, in part they are, but in the middle of the process is a great deal of political maneuvering and positioning. The reason to work through this process is the promise of market share, dominance, and, of course, profits.
Standards formulation can be a tenuous business. Look at what happened to the OLAP Council. The council worked and argued its way to a standard API for accessing multi-dimensional databases, only to be usurped by Microsoft's announcement of its own competing API. Almost all vendors are supporting the Microsoft API, with a significant number supporting both APIs, but this will not last. Software vendors are not interested in supporting two standards. The market will vote with its wallet as to which standard it wishes to use. If recent history is any indicator, the council's work may be for naught.
It's early in the standards process for data mining. If the people at KDD are right, we have about 3 years to wait and see who wins. I don't agree with that timeline. I think that the parties that hold the purse strings to the investment vital to the industry are looking for a return in a shorter time frame. Corporate managers and venture capitalists are looking to place their money where it will perform today, not 5 to 10 years from now.
Remember, this is the fun part. The uncertainty of it all. The building of an industry from the ground up. Currently, diversity and raw approaches are springing up everywhere you look. Soon, the slow contraction around superior ideas and approaches will begin. These superior ideas and approaches will be embodied in the competing standards. Then, it will be time to cross the chasm and go for the widespread adoption by mainstream organizations, and the fun will be over.
But for now: It's time to go for the win! May the best approaches and standards win.
---
John Thompson, Vice President - Marketing, Magnify, Inc.
I'd like to hear your thoughts. You can reach me at jkt@magnify.com