DATA, DATA WHO HAS THE DATA?
by John K. Thompson
In the past 20 years the computing power and infrastructure installed in companies, governments, and society in general have exploded. We, the people in the software industry, are quite excited about this. But as I look back on my 15+ years in the software and services business, I am struck by two things: first, that we are again asking our clients to create yet another collection of data from existing data elements, and second, how we are going about it.
A bit of history always helps to clarify. We are all acquainted with the stovepipe phenomenon that was, and is, a fact of life in On-Line Transaction Processing (OLTP) systems: multiple systems, each with its own data files, storage requirements, and data organization schemes, with little to no interaction and no value added by leveraging the data across functional boundaries. Then along came relational database management systems (RDBMS), structured query language (SQL), and the rise of Decision Support Systems (DSS). All we had to do was extract, transform, move, and load the data from the transactional formats and systems into an RDBMS whose tables were time variant and subject oriented, suited to asking "what if" type questions. We now know this activity as Data Warehousing (DW). As mentioned, the applications for accessing the DW were called DSS in the beginning; then the term Business Intelligence (BI) came to be, and much derision was made of the oxymoronic nature of the term; and now we have arrived at advanced analytical applications (AAA or AA). Both acronyms are already widely recognized for non-computing uses, and I am certain that the industry in general will reject them in favor of one of their predecessors or a new term yet to be coined by one of our favorite analyst groups. Business Performance Management (BPM) is being offered up by the META Group, but the vote hasn't come in on whether the term will be accepted.
In a typical organization, this is what you would expect to experience. The OLTP systems are running smoothly, handling billing, shipping, production forecasting, and management reporting; or, if they are not Y2K compliant, scalable, or flexible, they are being replaced by an Enterprise Resource Management (ERM) system from one of the leading vendors. The transactional data is stored in a myriad of legacy file structures (e.g., IDMS, ADABAS, VSAM, ISAM, BDAM) or in a proprietary or semi-closed file system provided as the underlying data management facility of the ERM system.
Upon allocating budget for the construction of the DW and the associated AAA, the organization's information systems (IS) department either:
In the end, the organization has the legacy or ERM systems running the OLTP, or day-to-day, transactions, with the DW being fed from the transactional file systems for analysis of trends, verification of program impact on sales volumes, production variation, and other human-driven analysis.
Again, a bit of history. The construction of databases containing information extracted from transactional systems for use in analysis was not received with enthusiasm when it was first proposed by pundits and the vendors that paid them. Most firms that I visited as a young systems engineer responded that they didn't have the time, and that their end users didn't need that kind of access or information. That is one of the primary reasons vendors invented end-user selling techniques that effectively bypassed the IS department.
It is by now empirically clear that the DW and AAA (whatever name and acronym sticks) are here to stay. Analyst groups size the market for software and services well into the billions of dollars. So it is safe to say that in the past decade the reluctance to build analytically oriented databases and applications has been effectively squashed. It was squashed not by the vendors' desire to sell, but by the clear and overwhelming return on investment and the competitive advantage gained by firms that pressed ahead and implemented the technologies in a manner that enables them to understand the market better.
So, this brings me to my point. The data mining industry, a grouping that includes software firms, consultants, and analysts, is out describing how to mine desktop and large-scale data sources. Let's simply drop the desktop thread from the discussion because it is not germane to the focal point. I have heard vendors, practitioners, and analysts espouse diametrically opposed views on whether data resident in an RDBMS-based DW can be mined. I am certain that the firms and people evaluating data mining products and services are interested in, and possibly a bit confused about, whether they should be attempting to mine their data in the DW.
I have come to the conclusion that, at this point in time, it is probably premature to expect to perform large-scale data mining operations on existing DWs. I have reached this conclusion for several reasons. Among them:
Existing DWs were built with the goal of understanding and explaining trends and activities within well-known business functions, and as such they contain data that is probably not predictive, or even descriptive, when it comes to discovering new insights into the market. Derived data in a DW consists mainly of levels of aggregation. Levels of aggregation help when you are comparing events, but aggregation is not enough for data mining. The derived attributes for data mining will consist of the log function of a data element, or a moving average, or some other more mathematically intensive function rather than a simple addition.
When tests are run on large amounts of data (low terabytes) stored in schemes optimized for data mining, and the run times are compared with those for the same data stored in a commercial RDBMS, the RDBMS is slower by a factor of 100 to 1,000. Most organizations will not accept that as the way they want to run their analyses.
IS professionals and end users of the DW & the associated AAA want to view results that are easy to read and understand in the context of the business decision they are empowered to make. The results produced by the majority of data mining tools are not easily read or interpreted, and therefore are unsuitable for casual users.
I know from experience that when a feature common to a wide range of products appears in all of the vendors' slideware as a critical item to be delivered in the near future, beware: it is still hype at the worst and vapor at the best.
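To make the first of these reasons concrete, here is a minimal sketch in Python of the difference between the additive aggregates a DW typically holds and the derived attributes a data mining exercise tends to want. Everything in it is an assumption for illustration: the monthly_sales figures are invented, and the function names total, log_transform, and moving_average are hypothetical helpers, not part of any product mentioned here.

```python
import math

# Hypothetical monthly sales for one customer; the figures are invented
# purely for illustration and do not come from any real warehouse.
monthly_sales = [120.0, 135.0, 90.0, 160.0, 155.0, 210.0]


def total(values):
    """Warehouse-style derived data: a simple additive aggregate."""
    return sum(values)


def log_transform(values):
    """Natural log of each value; assumes every value is strictly positive."""
    return [math.log(v) for v in values]


def moving_average(values, window):
    """Trailing moving average, one result per complete window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


if __name__ == "__main__":
    print("aggregate:", total(monthly_sales))
    print("log attributes:", log_transform(monthly_sales))
    print("3-month moving average:", moving_average(monthly_sales, 3))
```

The aggregate collapses six months into a single number, which is fine for comparing one period with another. The log and moving-average attributes keep a value for each month (or each window), and that per-record detail is what the aggregate throws away.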
So in the end, am I struck that we are asking firms to store their data in yet another form, optimized for discovery-based analytical applications? No, because the payoff is clearly there, and attainable. Am I struck by the convoluted and cloaked manner in which the industry is approaching the issue? Yes.
I will paraphrase, because I can't find the exact quote, but it is similar in words and exact in sentiment: Those who are ignorant of history are doomed to relive it.
---
John Thompson, Vice President - Marketing, Magnify, Inc. I'd like to hear your thoughts. You can reach me at jkt@magnify.com