Data mining is vaunted as the solution to all market awareness problems, the means to a better corporate future, the uncoverer of endless profitable pieces of knowledge and, worst of all, everyone's user-friendly application. This image is powered by vendors' marketing drives, which aim to ensure that their slightly dated relational or multi-dimensional databases and supporting products still have a market to compete in. It is further enhanced by journalists jumping on the 'data mining cures all' bandwagon. With this amount of hype, is it possible to uncover the realities?
Intelligent software requires intelligent users
Let us be clear from the start: data mining is a technology that can give organisations a real competitive edge. It can improve internal productivity and can be applied to a myriad of problems. It is not, however, a simple technology whose graphical user interface allows untrained personnel to grasp the nuggets of information that could increase competitiveness or wipe out the opposition. It is a complicated software tool, an intellectual giant in the burgeoning data access and decision support software market.
The more technologically advanced tools require an expert user. These tools use complex machine learning algorithms such as inductive logic programming, neural networks and fuzzy logic. These algorithms have their history in the research departments of universities. Until recently these approaches to studying data were pursued mainly as PhD research or post-doctoral study. Now more powerful computers and software technology are creating opportunities to use these algorithms on huge corporate databases. Newer tools do offer more sophisticated front ends that allow users to access, manipulate and visualise data from the corporate desktop. They allow untrained or inexperienced workers to view data, but the true uncovering and understanding of significant relationships can still only be done by trained experts.
Data mining process
The various stages in the data mining process also show the need for trained personnel to ensure organisations gain the most from their data mining investments. Data mining may be regarded as taking place in four main stages:
Data pre-processing is concerned with data cleansing and reformatting, so that the data is held in a form which is appropriate to the mining algorithms and facilitates the use of efficient methods. Reformatting typically involves handling missing values and presenting the data in multi-dimensional views suitable for the multi-dimensional servers used in data warehousing.
In exploratory data analysis (EDA), the miner has a preliminary look at the data to determine which attributes and which technologies should be utilised. Summarisation and visualisation methods are typically adopted at this stage.
When considering data selection, there is a choice between focusing on certain groups of attributes and, for large amounts of data, choosing random tuples. The sparsity or clustered nature of your data may dictate which approach will maximise results.
Finally, the glamorous act of knowledge discovery can be carried out on this sample of data. Numerous different approaches can be used in this stage, so it is important to select the one most appropriate to your data. To ensure the best results from the knowledge discovery process, it may be important to iterate round the final three steps to realise more detailed and therefore more useful patterns.
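To make the four stages above concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the customer records, the field names, the choice of mean imputation for missing values, and the function names preprocess, explore, select and discover are all hypothetical. The "discovery" step is a deliberately trivial group average standing in for a real mining algorithm.

```python
import random
from statistics import mean

# Hypothetical customer records; None marks a missing value.
RECORDS = [
    {"age": 34, "spend": 120.0, "region": "north"},
    {"age": None, "spend": 80.0, "region": "south"},
    {"age": 51, "spend": None, "region": "north"},
    {"age": 45, "spend": 200.0, "region": "south"},
]


def preprocess(records, numeric_fields):
    """Stage 1: fill missing numeric values with the column mean."""
    means = {
        f: mean(r[f] for r in records if r[f] is not None)
        for f in numeric_fields
    }
    return [
        {k: (means[k] if k in means and v is None else v) for k, v in r.items()}
        for r in records
    ]


def explore(records, numeric_fields):
    """Stage 2: summarise each numeric attribute as (min, mean, max)."""
    summary = {}
    for f in numeric_fields:
        values = [r[f] for r in records]
        summary[f] = (min(values), mean(values), max(values))
    return summary


def select(records, attributes, sample_size, seed=0):
    """Stage 3: keep the chosen attributes and draw a random sample of tuples."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    return [{a: r[a] for a in attributes} for r in sample]


def discover(records, group_by, target):
    """Stage 4: a stand-in 'pattern', the mean of target within each group."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_by], []).append(r[target])
    return {g: mean(vs) for g, vs in groups.items()}


clean = preprocess(RECORDS, ["age", "spend"])
summary = explore(clean, ["age", "spend"])
sample = select(clean, ["region", "spend"], sample_size=4)
patterns = discover(sample, "region", "spend")
print(summary)
print(patterns)
```

Even in this toy form, the choices that matter are made by the person, not the program: which fields to impute, which attributes to keep, how large a sample to draw and what to group by. Iterating round the final three stages amounts to calling explore, select and discover again with different arguments.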
None of these steps are difficult, although knowing what to look for and where to sample your data set to ensure fast returns requires experience. How many executives and decision-makers could honestly say they would know what to look for and know where to look for it? In a world where time is a precious commodity, it is inconceivable that these people could afford the time to do the required investigations and go through the iterative processes described above.
Final note of caution
There is no point in buying a data mining product on the presumption that you can just load the data, leave the tool to run and wait for answers of great significance to pop out. Data mining algorithms, while very clever, need considerable guidance. It is important to use them in conjunction with previously formulated business goals. What do you want this tool to achieve? Where do you want it to search? You must know about your data before you can ask the right questions or even know what you want to find out.
This appears to fly in the face of knowledge discovery, but I would argue: how do you know you have found something of significance if you do not know what something of significance would look like? There is no point fooling ourselves that the answer is waiting to pop out of the database: it needs to be found and it will be found with help from data mining products, not by data mining products alone.
Data mining is a technology which will make real differences for every industry in most market sectors; of that there is no dispute. Its ability to uncover the useful information hidden in huge databases with the minimum of intervention and fuss, however, is a popular myth which must be dismissed if organisations are to get the most from their data mining and data warehousing investments.
---
Stuart Haire is manager of data warehousing for Smith System Engineering Ltd. For more information, see http://www.smithsys.co.uk/ or email sahaire@smithsys.co.uk