DATA EXTRACTION: FACT AND FALLACY
AN INTERVIEW WITH TORBEN MOLLER, IA CORP
by Alan Beck, editor in chief
D S * : Please differentiate the notion of data extraction from standard data preparation and cleansing.
MOLLER: "There is an inherent structure in complex reports, statements, etc. By reverse engineering, one may recreate that structure, so that data may be extracted in a form that is much closer to actual use than is the data taken out via conventional data cleaning and similar methods."
D S * : Why is it closer?
MOLLER: "Because these data have already gone through processes clearing them for print-out, to be used by customers or clerks for day-to-day business."
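As a rough illustration of the idea of recovering structure from a print-ready report, the sketch below reads column positions off a dashed rule line and uses them to pull out the detail rows. It is not IA Corp's method. The report layout, the sample values and the names `REPORT`, `column_spans` and `extract_records` are all hypothetical, invented for this example.

```python
import re


def _row(account, customer, balance):
    """Format one fixed-width print line (hypothetical layout: 10/15/9 columns)."""
    return f"{account:<10} {customer:<15} {balance:>9}"


# Stand-in for a spooled, print-ready report. All values are invented.
REPORT = "\n".join([
    "ACME BANK        STATEMENT        PAGE 1",
    _row("ACCOUNT", "CUSTOMER", "BALANCE"),
    _row("-" * 10, "-" * 15, "-" * 9),
    _row("0001234567", "SMITH, J", "1,204.50"),
    _row("0007654321", "JONES, A. B.", "17.00"),
    _row("TOTAL", "", "1,221.50"),
])


def column_spans(rule_line):
    """Return (start, end) offsets for each run of dashes in the rule line."""
    return [m.span() for m in re.finditer(r"-+", rule_line)]


def extract_records(report):
    """Recover detail rows as dicts, using the dashed rule to locate columns."""
    lines = report.splitlines()
    # The rule is the first non-blank line made only of dashes and spaces.
    rule_index = next(
        i for i, line in enumerate(lines)
        if line.strip() and set(line) <= {"-", " "}
    )
    spans = column_spans(lines[rule_index])
    # Column titles sit on the line directly above the rule.
    names = [lines[rule_index - 1][s:e].strip().lower() for s, e in spans]
    records = []
    for line in lines[rule_index + 1:]:
        fields = [line[s:e].strip() for s, e in spans]
        # Assumption: a detail line has an all-digit account number in
        # the first column; totals and blank lines are skipped.
        if not fields[0].isdigit():
            continue
        records.append(dict(zip(names, fields)))
    return records
```

Calling `extract_records(REPORT)` yields one dictionary per detail line, keyed by the column titles, and skips the page header and the TOTAL line. The values stay in the formatted form in which they were printed, which is the point Moller makes about such data being close to actual use.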
D S * : How should data extraction and other similar tools be integrated with data mining and knowledge discovery?
MOLLER: "Such tools, and especially data extraction tools, should enable quicker access to data that are effectively lost within databases. Now it is clear that the best of all possible worlds -- which no one has -- is a global enterprise with a global data schema, with everything existing in grand DB2 databases so that everyone can do effective SQL queries. Although that's where we'd all like to be, we'll never get there.
"In reality, one typically has a customer master-file in a database, because that was necessary for the data to be set up, updated and accessed quickly. The transactional data exist in batch or log files. You spend quite a lot of time, for marketing and internal efficiency reasons, creating extracts of these data for printing as hardcopy. In turn, these may take the form of a report distribution system on a 3270 screen, COLD system or whatever.
"But if I can take this and grab some data from it, I can get there, maybe a little faster -- although perhaps it won't be better than the whole process taken to the end. I see it as an opportunistic tool, if 'opportunistic' is taken in the positive sense."
D S * : How much should we reasonably expect from data mining technology?
MOLLER: "How successful you are with such tools depends upon how good you are at delineating and setting boundaries for your project. Those who believe they can do everything for everybody end up on a death march. You must clearly define your need to get something done and then do it. So I can use data extraction and data mining tools and be very successful -- and I can also use them and get lost. It basically comes down to common sense and good management."
D S * : So how can executives delineate projects more effectively?
MOLLER: "I'm fond of saying, perhaps somewhat flippantly, that any project that cannot be defined on a single page is not worth doing. In The Mythical Man-Month, an absolutely seminal book by Frederick Brooks dealing with the issue of how projects fail, there is a discussion of the arguments that went on between two opposing groups about how OS/360 was to be designed and built. But the larger of these groups was really just in search of something to do. It's clear that projects require a small, efficient team and a clear understanding of goals. There must also be well-defined milestones that are measured along the way. The most important thing is: decide how much you're going to do up front, and don't try to be all things for all people."
D S * : Is there a serious gap between business and technical staffs?
MOLLER: "Very much so. The conventional joke runs: IT people have no time. Currently, that joke has been modified to: IT people have even less time than before, because they now have to deal with the Year 2000 problem.
"It is very difficult to find anyone who speaks both IT and business. IT people think in terms of files, codes and schedules. Business people think in terms of business goals which, if you're lucky, are well-defined. In data mining, we simply see the same difficulties of interaction between the two groups that we see everywhere else.
"We've all seen functional requirements documents get thrown over the wall to IT people. They do what they do best: come up with some kind of design, build something and throw it back. But it doesn't match what the business people wanted. So solid project management and analysis can only be implemented with participants from both sides. The best of both worlds would be possible if we could take advantage of the fact that we now have so many MIPS, bytes and bauds that we arrive at something the end-user can employ by him- or herself without too much involvement of the IT staff."
D S * : Do you see much more powerful technologies on the horizon that stand to change the current situation?
MOLLER: "Technologies such as IA's are, of course, innovative and very helpful. They acknowledge reality. They're useable. But in the grander scheme of things -- no! I think we have a long way to go before something exciting comes up. We need a whole new view of the user interface. Although SQL is wonderful, it doesn't really reflect how we look at data or how a business looks at data.
"Some cycles must really be spent to consider user interfaces, so that the user can get in there and work. That will take a while. And I don't see anything exciting coming down the pike. There's a lot of work to be done, and clearly there are some people engaged in doing it. But there's still a lot of slogging to do."
---
Alan Beck is editor in chief of D S * and vice president of publications for Tabor Griffin Communications. Comments are always welcome and should be emailed to alan@tgc.com