
BARRIERS TO DATA MINING: THE 3 PH.D. PROBLEM
By Ed Colet, Virtual Gold Inc.


A high-ranking executive at a large corporation characterized the typical implementation of a data mining project as a "three Ph.D. setup". At the risk of oversimplifying, the executive described it as follows.

The first Ph.D., or Ph.D. equivalent, is your business expert - the person who understands the business issues facing the industry in general and the company in particular. The second Ph.D. is a quantitative or statistical expert. This is the person who knows which statistical analyses are appropriate and under what circumstances. This may also be the person who builds the mathematical models that are then applied to the business problem(s) at hand. The third Ph.D. is a computer science expert. This role has recently become more important as ever-increasing amounts of data are stored in databases. This is the person who ensures that the data to be analyzed is accessible from the databases, and that system response time and other performance characteristics are acceptable.

Meeting these requirements often means building new technical infrastructure. Large amounts of data are typical in data mining, and it is the job of these three Ph.D.'s to ensure that business issues are defined in actionable ways, that quantitative models are valid and testable against data stores of up to terabytes in size, and that whatever technical infrastructure data mining requires is available or, if not, can be built as a development project. But this type of implementation can present barriers that limit the effectiveness of data mining.

The approach above is successful as long as the expertise is available and the competitive advantages gained through data mining (as measured in terms of return on investment) are measurable and significant. But therein lies a potential problem. Expert skills, or an equivalent level of sophistication, are not easy to come by and, when available, can be costly to maintain. Ph.D.'s don't come cheaply, nor should they.

In extreme cases, internal corporate data mining groups have been disbanded because of the high cost of supporting the group and the relatively low or unclear return on investment of its quantitative models. In more typical cases, the assembled data mining group breaks up. The loss of one of the Ph.D.'s is costly because his or her expertise is hard to replace. Ph.D.'s aren't necessarily interchangeable - the very process of earning the Ph.D. marks the contribution of new knowledge in a specific area. It's unlikely that another Ph.D. has the identical expertise needed to step in and continue the group's work seamlessly. Turnover is disruptive, and such disruptions rarely improve the return on investment.

Another potential problem with the "3 Ph.D.'s" situation is that these analysts are typically separated from the front-line managers - the ones with direct access to customers on a daily basis. Consequently, the people who interact with customers, and who can have a real impact on the business, are often unaware of patterns that could be beneficial. The analysts may also be separated from the high-level executives who are ultimately responsible for making the business decisions. As a result, the best decision based on the most valid analysis is not always the decision that is made. These situations can further erode the potential advantages found through data mining.

The situation above is characterized by rare skill sets, high costs, results that are difficult to act upon, and, if software development is necessary, long development times. All of these represent barriers to successful implementations of data mining. Eliminating them can result in a better data mining application. The solution is not to get rid of the Ph.D.'s and their expertise but to capture and use that expertise more effectively.

At Virtual Gold Inc., eliminating such barriers is consistent with our mission. The VirtualMiner Framework is one of our products designed to do this. It provides database developers with the tools they need to build sophisticated data mining applications, which eliminates the problem of finding developers with data mining expertise. The VirtualMiner Framework can also integrate existing analytical tools and techniques, making effective use of the expertise of the quantitative expert(s). Because the resulting applications are usable by a wide variety of personnel, both front-line personnel and high-level decision makers can be equipped to benefit from data mining.

---

Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com.

