AN INTERVIEW WITH TOM VAN HORN, INFOMODELERS 11.04.97 by Alan Beck, editor in chief D S *
D S * : What advice can you offer to CIOs and similar executives who must ferret out profitable behaviors from complex profiles via data mining?
HORN: "First, they must keep conventional wisdom in mind. That is, an enormous amount of information that could assist the decision-making process is buried in existing data. However, getting at that data so as to glean conclusions from it is exceedingly difficult. Partly, that's because the data exists in many different places, and partly that's because it is not in a consistent format. In addition, once you've collected all the data, the conclusions that can be reached from it may not be clear.
"That's why putting such information into a data warehouse and obtaining tools that allow you to perform intelligent queries and data mining against it are very important.
"But simply putting information in one place, say a data mart or warehouse, is just the starting point. A lot of CIOs and business users who are somewhat familiar with these technologies believe that once they review results, conclusions will be obvious. However, when you move data into a warehouse and utilize various data mining tools to extract information and draw conclusions, you must keep in mind that typically this is legacy data of some sort. When the databases in which this data currently resides were built, the intent at that time was not to structure the data in such a way that you would pull it out via these tools. Many of these databases are years -- if not decades -- old. The kinds of techniques employed for analyses today were not available then.
"Most of the data that is analyzed is transaction data. It was originally structured to effect transactions quickly and reliably. So one of the most important things to note is that information in the data warehouse must be modeled to ultimately be extracted correctly and meaningfully. This is the primary functional area for our products, i.e. data warehouse design and modeling.
"The correct modeling of information, when it is put into a data warehouse, facilitates correct gleaning of knowledge when data mining and/or query tools are used against it."
D S * : What are the guidelines for data modeling?
HORN: "There are two primary goals with respect to data modeling. Although the first is obvious to database designers, it is not necessarily done well. You want the data warehouse to be modeled for performance -- in this case not transaction performance but query performance. So you try to identify the queries, or the kinds of analysis via various data mining tools, that will be done regularly against the data warehouse.
"For example, if you have retail data, and you're trying to find certain trends tied to certain demographics, then you'll put some retail and customer data in the warehouse in order to search for trends where, say, certain kinds of customers buy certain kinds of items at various retail locations. So you attempt to design the warehouse in such a way that the queries you expect to be performed frequently will be easy to do.
"The second goal is to design the warehouse in general so that it performs well simply from a speed standpoint. Now those two factors are linked -- and there's another piece: Database designers are familiar with SQL and various kinds of modeling techniques. They can do a good job putting them together, if they have some decent modeling tools to work with. However, 99% of those who are going to use the information, whether directly through some analysis or query tool or merely through receiving a report, know nothing whatever about database design or modeling. As a result, when you interview them -- because they are the ones who must tell you what they want to get out of the system -- you have to do it in a language that's not SQL- or computer-based. You must interview them in a language they understand, such as English. InfoModelers, for example, has a tool that can capture data warehouse requirements -- actual design rules -- with objects and verbs, or relationships between objects, in such a way that the kind of queries or reports needed can be identified. And at year's end, we'll have a tool that allows queries themselves to be tendered in ordinary English."
D S * : What's the best way to ascertain and test database dimensions for most effective profiling?
HORN: "This depends largely on what the users of the data warehouse are looking for. Typically, you capture the queries that are formulated for certain objects. For example, if you have a query defined as the list of best customers, the actual algorithm for determining the best customers may change over time. So you must define it through various objects and filters which are, in turn, determined over time: it might be most sales or most repetitive sales or most sales in a given time period, etc. But you must first capture the kind of information that business users themselves use to define their top customers."
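
As a rough illustration of the answer above, a "best customers" query can be kept stable while the rule behind it changes. The sketch below is not InfoModelers' product logic; the sample data and the names SALES, total_sales, repeat_purchases, in_period and best_customers are all hypothetical, chosen only to show a scoring rule and a time-period filter being swapped independently.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample transactions: (customer, sale date, amount).
SALES = [
    ("acme", date(1997, 1, 15), 500.0),
    ("acme", date(1997, 6, 2), 250.0),
    ("bolt", date(1997, 3, 9), 900.0),
    ("cody", date(1997, 2, 1), 100.0),
    ("cody", date(1997, 4, 1), 100.0),
    ("cody", date(1997, 9, 1), 100.0),
]

def total_sales(rows):
    """Score each customer by the sum of their sale amounts."""
    totals = defaultdict(float)
    for customer, _, amount in rows:
        totals[customer] += amount
    return dict(totals)

def repeat_purchases(rows):
    """Score each customer by how many separate sales they made."""
    counts = defaultdict(int)
    for customer, _, _ in rows:
        counts[customer] += 1
    return dict(counts)

def in_period(start, end):
    """Build a filter that keeps sales dated within [start, end]."""
    def keep(row):
        return start <= row[1] <= end
    return keep

def best_customers(rows, score, keep=None, top=1):
    """Rank customers by a pluggable scoring rule, after an optional filter."""
    if keep is not None:
        rows = [r for r in rows if keep(r)]
    scores = score(rows)
    # Highest score first; ties are broken alphabetically by customer name.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:top]
```

Calling best_customers(SALES, total_sales) and best_customers(SALES, repeat_purchases) gives different answers from the same data, which is the reason the business users' own definition of "best" has to be captured first.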
D S * : Do you use fuzzy logic in your products?
HORN: "I'm familiar with it, but we don't build fuzzy logic into our products now. We use various estimation algorithms to find what's called the shortest join path across the database, which helps establish what the customer really wants. Pieces of data may be related in many different ways. There are basically two ways to identify the best path in a query across the database. One is proximity: for example, if two columns are in the same table, they're just one jump apart; you don't need to do a join. That's helpful both in the construction of the query and in its eventual performance.
"There are also primary and foreign key relationships within the database that show direct linkages from one object to another. You can use those as kind of a weighting algorithm to figure out the best path across a very large database for a complex query or to develop an algorithm to search for patterns, as is done with various data mining tools."
D S * : How much should corporations expect from data mining technology now? Given a well-organized corporate database, how much important information can really be revealed through data mining and knowledge discovery?
HORN: "This must be examined on a case-by-case basis. There are two primary areas where people can find value: First, there are plenty of examples with terabyte or multiterabyte databases. That represents so much data that there is no simple way to weed through or do simple queries against it. Over the next few years, for large telephone companies, retailers, banks, credit card processors -- folks who deal with many millions of customers and lots of information on each one -- the problem is just going to become harder. Some sort of automated intelligence will be required to sift through all those patterns. That's why data mining is at the forefront. Here you're trying to find value by capturing the golden nuggets, the low-hanging fruit; finding new things about customers that allow you to increase revenue.
"In addition, there are cost-reduction opportunities that have received much less attention. People are capturing more and more data about processes. For example, the aircraft manufacturer Boeing, particularly in its merger with McDonnell Douglas, is integrating its processes to broadly capture possibilities for cost reduction. They want ways to analyze data and process elements across various manufacturing processes to identify best practices. It might be possible to determine trends for cutting costs that lie in divergent parts of manufacturing -- or perhaps in a commercial division versus a defense division. A group of human analysts would take years to accomplish this. So there are massive opportunities for cost reduction.
"You can identify a lot of potential value that can be saved. And if that potential is captured, these projects will pay for themselves. But the fact of the matter is this: only one-third, at best, of these projects are successful, in terms of getting done on time, on budget, and producing quantifiable results that significantly exceed the costs involved. So we really need to improve the hit rate.
"A major problem in doing these projects right is that the requirements are not captured up front. That's why we emphasize proper business and data modeling up front, so that you can make sure the design is done right the first time."
For more information, see http://www.infomodelers.com
---
Alan Beck is editor in chief of D S * and vice president of publications for
Tabor Griffin Communications. Comments are always welcome and should be
emailed to alan@tgc.com