THE BASIC REQUIREMENTS FOR A SUCCESSFUL DATA WAREHOUSE
by Zak Pines
Before a business can perform data mining on its data to gain strategic
insights, the company's data warehouse must be in proper shape. These
warehouse requirements can be broken down into two fundamental necessities -
the data has to be accessible, and the data has to be able to support the
business view.
For the data to be accessible, it must be saved in a format that is
relatively open. Such databases as DB2, Oracle, Sybase, or SQL Server would
meet this criterion. If the data is saved in a closed format, another program
must be developed to open it - not the worst situation in the world, but one
which would require time and money.
The first requirement is straightforward - a member of the company's IT
staff will know immediately how the data is formatted. The second criterion,
that the data must support the business view, is slightly more involved, and
may require some closer scrutiny of the data itself.
For example, if a bank is interested in analyzing customer retention, then
data must be collected on a customer-to-customer basis - not an
account-to-account basis.
The account-to-account collection is incomplete if one is interested in
analyzing conditions for customers renewing vs. canceling their accounts. If a
customer moves from one city to another, and cancels his account at the branch
in his old city and starts an account at the branch in his new city, the
system would record this data as losing a current customer and bringing in a
new customer. But in reality, the customer is being retained and is simply
making an adjustment to his account status with the bank.
Even if the company collects the name of the account holder, the two
accounts cannot necessarily be linked. It is possible that two different
account holders have the same first and last name; it is also possible that
one person's capital is being used in accounts under two different names - a
personal checking account under the individual's name and an investment
account under his broker's name.
If a problem such as this does develop, in which the data being collected
does not serve to address the business need, the company must make some
changes. First of all, all future data collection should be adjusted
accordingly - data should be collected on every customer, where every account
that the customer holds can be indexed by a unique customer identification
number.
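As a rough illustration of the point above, the following Python sketch
indexes accounts by a customer identification number and compares the
account view with the customer view of attrition. The record layout, field
names, and sample data are hypothetical and exist only for illustration;
they are not taken from any particular bank's system.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Account:
    account_id: str
    customer_id: str        # hypothetical unique customer identification number
    branch: str
    opened: int             # year the account was opened (illustrative)
    closed: Optional[int]   # year the account was closed, or None if still open


def accounts_by_customer(accounts):
    """Index every account under the customer who holds it."""
    index = defaultdict(list)
    for acct in accounts:
        index[acct.customer_id].append(acct)
    return dict(index)


def is_open_at_end(acct, year):
    """True if the account exists and is still open at the end of `year`."""
    return acct.opened <= year and (acct.closed is None or acct.closed > year)


def closed_accounts(accounts, year):
    """Account view: every account closed in `year` looks like a loss."""
    return [a.account_id for a in accounts if a.closed == year]


def lost_customers(accounts, year):
    """Customer view: lost only if an account closed and none remains open."""
    lost = []
    for customer_id, accts in accounts_by_customer(accounts).items():
        closed_this_year = any(a.closed == year for a in accts)
        still_active = any(is_open_at_end(a, year) for a in accts)
        if closed_this_year and not still_active:
            lost.append(customer_id)
    return sorted(lost)


accounts = [
    Account("A1", "C1", "Boston", 1995, 1998),   # C1 closes the old-city account...
    Account("A2", "C1", "Chicago", 1998, None),  # ...and opens one in the new city
    Account("A3", "C2", "Boston", 1996, 1998),   # C2 actually leaves the bank
]

account_view = closed_accounts(accounts, 1998)
customer_view = lost_customers(accounts, 1998)
```

In this sample, the account view reports two closures, while the customer
view reports only one lost customer, because the customer who moved still
holds an open account.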
This, however, will not solve the problem for data that has already
been collected - and the bank cannot afford to simply ignore all of its past
data. Instead, the IT staff can make educated guesses in translating the
account data into customer data, but this new data may not be completely
accurate.
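One possible form of such an educated guess is sketched below in Python.
It assumes, purely for illustration, that the legacy records also carry a
date of birth, and it treats two accounts as probably belonging to the
same customer when the name and birth date match and the new account was
opened within a short window of the old one closing. The field names and
the 60-day window are assumptions, not a recommended rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LegacyAccount:
    account_id: str
    holder_name: str
    birth_date: date         # assumed to be on file; not guaranteed in practice
    opened: date
    closed: Optional[date]   # None if the account is still open


def normalize(name):
    """Lower-case the name and collapse extra whitespace before comparing."""
    return " ".join(name.lower().split())


def likely_same_customer(old, new, max_gap_days=60):
    """Heuristic guess that a closed account and a new one share a holder."""
    if old.closed is None:
        return False
    if normalize(old.holder_name) != normalize(new.holder_name):
        return False
    # A name alone is not enough, so require a second matching attribute.
    if old.birth_date != new.birth_date:
        return False
    gap_days = abs((new.opened - old.closed).days)
    return gap_days <= max_gap_days


old_acct = LegacyAccount("A1", "John  Smith", date(1960, 5, 1),
                         date(1995, 1, 10), date(1998, 3, 1))
moved = LegacyAccount("A2", "john smith", date(1960, 5, 1),
                      date(1998, 3, 20), None)
namesake = LegacyAccount("A3", "John Smith", date(1972, 9, 9),
                         date(1998, 3, 5), None)

same_person = likely_same_customer(old_acct, moved)
different_person = likely_same_customer(old_acct, namesake)
```

A rule like this can still be wrong in both directions, and it cannot
detect an account held under a broker's name, which is why the translated
data may not be completely accurate.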
Of course, if the company's data has been collected along the appropriate
lines in the first place, then this problem does not exist, and data mining
can begin without a hitch.
If problems are evident, altering the old data and transforming the data
collection process could be difficult. But it is a step that will
be rewarded down the line when a data mining system can be set up to draw
strategic insights from the data to help support the company's business view.
Zak Pines is an Analyst and Special Operation associate for Virtual Gold,
Inc., an industry leader in intuitive data mining software. Pines is involved
in developing end-to-end data mining solutions in various industries. Prior to
joining Virtual Gold in 1998, he worked at the IBM T.J. Watson Research Center
(Hawthorne, NY), from 1995 to 1997. While at IBM, Pines helped develop
Advanced Scout, a data mining program used extensively by coaches of the
National Basketball Association to devise new strategies based on the
automatic identification of hidden patterns in game data and video. Pines is a
graduate of Yale University with a B.A. in economics.
For more information, see www.virtualgold.com