HISTORICAL DATA: THE FOUNDATION OF DATA MINING, PART I
by W H Inmon
The data warehouse consists of historical data, and that foundation serves as a basis for data mining. This statement appears to be simple and straightforward. But the relationship turns out to be fraught with complications when the layers of the onion are peeled back.
Historical data appears to be static, but it is not. On close examination, historical data is constantly changing, and that constant change presents the data miner with innumerable problems.
The following example shows some of the aspects of data that change over time.
AN EXAMPLE OF THE CHALLENGE
Consider an organization that measures sales by product by zip code, which is in reality a very common practice.
In 1996, sales for zip code 80110 for product QWE are $25,000. In 1997 sales for the zip code are shown to be $125,000. In 1998 sales come in at $35,000. A manager who looked only at the raw data would conclude that sales in zip code 80110 had swung wildly, first up and then sharply down. But a closer examination of the underlying meaning of the data yields some interesting observations.
While it is true that from 1996 to 1997 sales went from $25,000 to $125,000, in actuality there was only a minor increase in sales, because between 1996 and 1997 product QWE was redefined to be a much broader product. In 1997 products WER, ERT, and RTY were reclassified as subproducts of product QWE. The 1996 sales reflected only sales for product QWE. The 1997 sales reflected sales for product QWE and all of its subproducts. Therefore the increase in sales from 1996 to 1997 is misleading.
By the same token, the drop in sales from 1997 to 1998, from $125,000 to $35,000, looks like bad news. But a closer examination shows that zip code 80110 was split in late 1997 into two zip codes, 80110 and 80111. There were no sales for zip code 80111 in 1996 and 1997 because there was no zip code 80111. But in 1998 there were sales of $130,000 for zip code 80111. Taken together, the two zip codes account for $165,000 in 1998, compared with $125,000 for the undivided zip code in 1997. In actuality, the sales for product QWE are doing just fine, even though the numbers tell a different tale when split out by zip code.
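The zip code arithmetic can be sketched in a few lines of Python. This is only an illustration, not a description of how any particular warehouse handles the problem. It assumes that the $130,000 reported for 1998 belongs to the new zip code 80111, and the lineage table ZIP_PREDECESSOR and the function restate_by_original_zip are hypothetical names introduced here. The idea is to roll each year's sales back up to the territory as it was originally defined, so that the years can be compared on the same basis.

```python
from collections import defaultdict

# Sales for product QWE by (year, zip code), taken from the example above.
# Assumption: the $130,000 reported for 1998 belongs to the new zip code 80111.
sales = {
    (1996, "80110"): 25_000,
    (1997, "80110"): 125_000,
    (1998, "80110"): 35_000,
    (1998, "80111"): 130_000,
}

# Hypothetical lineage table: each new zip code maps to the zip code
# it was carved out of.
ZIP_PREDECESSOR = {"80111": "80110"}


def restate_by_original_zip(sales_by_year_zip, predecessor):
    """Roll sales up to the original territory so that years are comparable."""
    totals = defaultdict(int)
    for (year, zip_code), amount in sales_by_year_zip.items():
        # Follow the lineage chain back to the earliest zip code.
        while zip_code in predecessor:
            zip_code = predecessor[zip_code]
        totals[(year, zip_code)] += amount
    return dict(totals)


restated = restate_by_original_zip(sales, ZIP_PREDECESSOR)
for (year, zip_code), amount in sorted(restated.items()):
    print(year, zip_code, amount)
```

On this restated basis the 1998 figure for the original territory is the sum of both zip codes rather than $35,000 alone. The same approach would not undo the 1997 product reclassification, because the example does not break the $125,000 down by subproduct.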
Part II of this article will appear in the next edition of DS*.
For more information, see http://www.pine-cone.com