A REPORT FROM KDD-98, THE 4TH INTERNATIONAL CONFERENCE
ON KNOWLEDGE DISCOVERY AND DATA MINING
by Ed Colet
KDD-98, http://www.kdnuggets.com/kdd98/index.html , was held in New York City from August 27-31. In keeping with its history of co-locating with conferences of related disciplines, this year's conference was co-located with VLDB-98 (Very Large Databases). At the three previous annual conferences that I attended, there was some concern that the incredibly rapid growth and hype associated with data mining might backfire, and that the promise of data mining would diminish to insignificance. This doesn't appear to be the case: the field appears to be thriving, with a healthy amount of research activity. Paper submissions increased by 54% (up to 250 from 162) over the previous year, and 27% of submissions were accepted. Attendance seems to have grown as well, with a mix of academic and industrial researchers, applied practitioners, and miscellaneous folks such as consultants and press. What follows is an overview of this year's conference and some personal impressions.
Conference Overview
This year's conference featured 8 half-day tutorials on these topics: (1) database methods for data mining, (2) data reduction, (3) high performance data mining, (4) fraud detection and discovery, (5) new-wave non-parametric regression methods for KDD, (6) smoothing methods for learning from data, (7) evaluating knowledge discovery and data mining, and (8) a comparison of leading data mining tools.
In addition to the tutorials, there were 67 papers. Eighteen plenary papers were grouped into 6 themes, with 2-3 papers presented for each theme. Themes included: (1) classification, (2) association rules, (3) clustering, (4) theory of KDD, (5) discovery in time, and (6) applications (focused on direct marketing, insurance, chemical compounds, and mining of audit data). The best paper award was given to Pedro Domingos for "Occam's two razors: the sharp and the blunt". The key idea is that Occam's razor is applicable only if simplicity (of models) is a desired goal in itself; the notion that the simpler model is also more accurate is not correct and will fail in most domains -- and thus significant knowledge can potentially fail to be discovered. The best paper in an applied domain was awarded to Luc Dehaspe, Hannu Toivonen, and Ross Donald King for "Finding frequent substructures in chemical compounds". Their paper described a data mining approach to predicting chemical carcinogenicity by finding common substructures and properties in chemical compounds. There were 49 more papers presented as posters in the poster sessions. There was also a set of invited and exhibit session talks. These were more general than the plenary or poster papers, and covered data mining (1) in the "real world", (2) on the WWW, (3) on the Internet, (4) tools, and (5) opportunities and challenges in mining for dollars.
Winners of the KDD-Cup were also announced at the conference. The KDD-Cup is a tools competition in which participants have approximately two months before the conference to analyze a data set. This year's data was provided by the Paralyzed Veterans of America (PVA), one of the largest direct mail fund raisers in the US. The task was to analyze the results of one of PVA's recent fund raising appeals, sent to 3.5 million people, and to classify those people most likely to donate. The data set contained information about PVA's promotion and giving history, the respondents' donated dollar amount, and demographic information. Winners were determined by the performance of their tool on a hold-out or validation sample. This year's winner was "GainSmarts" from Urban Science Applications, Inc. Second and third place went to SAS's "Enterprise Miner" and Quadstone Ltd's "Decisionhouse", respectively.
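To make the hold-out idea concrete, here is a minimal sketch in Python. It is not the KDD-Cup's actual procedure: the report does not say how performance was scored, so plain accuracy is used as a stand-in, and the function names, the 30% hold-out fraction, and the toy records are all assumptions made for illustration.

```python
import random

def split_holdout(records, holdout_fraction=0.3, seed=0):
    """Shuffle the records and set aside a fraction that the model never sees."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - n_holdout
    return shuffled[:cut], shuffled[cut:]

def holdout_accuracy(model, holdout):
    """Fraction of hold-out records whose label the model predicts correctly."""
    correct = sum(1 for features, label in holdout if model(features) == label)
    return correct / len(holdout)

# Toy data (hypothetical): one numeric feature, label True means "donated".
records = [(i, i >= 50) for i in range(100)]
train, holdout = split_holdout(records)

# A stand-in "model"; a real entry would be fitted on `train` only.
def toy_model(feature):
    return feature >= 50

score = holdout_accuracy(toy_model, holdout)
```

The point of the design is that the score is computed only on records kept out of model building, so a tool cannot win simply by memorizing the data it was tuned on.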
Some personal impressions:
There is plenty of both research activity and product development. The majority of papers both in the plenary and poster sessions focused on improvements to algorithms and techniques that promise to be better at discovering hidden patterns in data. The fact that the work seems to be a continuation-derivation-expansion of earlier work suggests that the research front is advancing via incremental steps building on earlier work and established foundations. The fact that a large number of the exhibitors demonstrating products were relatively new companies (established vendors such as Silicon Graphics, SPSS, and SAS excepted) indicates that research ideas are also making their way into products. Although there didn't seem to be a significant research breakthrough this year (or if there was, then I'm sorry to have missed it), the field seems to be evolving nicely.
So there's research activity, and there are products -- how does this play out in the "real world"? It became apparent in Gordon Linoff's talk, "Data mining in the real world", that there's a gap between what applied practitioners actually do and what researchers are actively working on. For example, according to Gordon Linoff, approximately 80% of the time, effort, or resources during a customer/client engagement is devoted to data preparation, and the actual underlying algorithm(s) or products used for data mining/analysis are, for the most part, only of secondary importance. At KDD, by contrast, the majority of research is focused on algorithms and techniques.
One can then wonder, is there an overabundance of effort on developing underlying techniques and tools? Should more consideration be devoted to issues of deployment and integration with other aspects of the business setting? Also, many of the new algorithms are tested on data drawn from the machine learning data repository at UCI. Although these data sets are "real data" in terms of their content, they're not "real" in the sense that the data are removed from the business (or scientific) contexts from which they came.
A quip from John Elder: "when a statistician has an idea, he writes a paper ... when a computer scientist has an idea, he starts a company." Compared to previous years, there seemed to be a greater awareness of the potential importance of statistical techniques. The tutorial presentation by David Banks and Mark Levenson (both of whom are statisticians) advocated working with smaller data sets narrowed down to 12 or fewer attributes or dimensions, "otherwise one's doing witchcraft" as opposed to "modeling" (but whether very large data sets can or should be narrowed down this way will be an ongoing debate between traditional statisticians and database miners). Ultimately, statistical models based on regression still remain refreshingly powerful, elegant and interpretable.
Another of the ideas that can probably be traced to statistics is the notion of "bundling". It was recommended that, rather than relying on a single technique or approach, one combine models (bundling). The idea is that rather than constructing a single model, you construct several varied models, combine their estimates using a type of "voting" strategy, and then draw conclusions. I'm reminded of the approach already common in statistics called "meta-analysis" -- in which one reviews several studies, each of which uses a different model, and then combines the results in order to draw an overall conclusion.
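As a rough illustration of the "voting" strategy, here is a minimal Python sketch of majority voting over several models. The talks may well have covered more elaborate combination schemes; the three threshold "models", the single numeric input, and the labels below are hypothetical and exist only to show the mechanics.

```python
from collections import Counter
from typing import Callable, Sequence

Model = Callable[[float], str]

def bundle_vote(models: Sequence[Model], x: float) -> str:
    """Return the label that the largest number of models predict for x."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Three deliberately different hypothetical models of the same question.
def threshold_low(x: float) -> str:
    return "donor" if x > 10 else "non-donor"

def threshold_high(x: float) -> str:
    return "donor" if x > 50 else "non-donor"

def band(x: float) -> str:
    return "donor" if 20 <= x <= 80 else "non-donor"

models = [threshold_low, threshold_high, band]

# For x = 30 the models vote donor, non-donor, donor, so the bundle says "donor".
prediction = bundle_vote(models, 30.0)
```

Using an odd number of models with two possible labels avoids ties. The bundle can still be wrong, but a single model's idiosyncratic error gets outvoted when the other models disagree with it.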
So what might one conclude about KDD-98? KDD remains the premier technical conference for finding out about the latest developments on the research front. Some years, and some papers, are more significant than others. But inside the conference rooms, the technical content of presented papers is always rich. And out in the hallways and during the coffee breaks, one can often meet someone from a different discipline with a fresh perspective on a shared problem, and ultimately discover a key insight.
---
For more information, see http://www.virtualgold.com