CRISP-DM: A PROPOSED GLOBAL STANDARD FOR DATA MINING
by Gregory Piatetsky-Shapiro, Editor, KDNuggets
CRISP-DM (CRoss-Industry Standard Process for Data Mining) is a project developing an industry-neutral and tool-neutral Data Mining process model. CRISP-DM is partially funded by the European Commission under the ESPRIT Program and is sponsored by NCR, ISL (now part of SPSS), DaimlerChrysler, and a Dutch insurance company, OHRA. There are currently approximately 180 members of SIG CRISP worldwide. See http://www.crisp-dm.org for full information.
CRISP-DM has held three workshops (Amsterdam, November 1997; London, May 1998; New York, September 1, 1998) and recently concluded a fourth in Brussels on March 18, 1999. All of the workshops received much attention, and each had more than 30 participants from diverse industry sectors and research institutes, representing the whole range from tool vendors to end users.
The purpose of the workshops was to inform participants about the ongoing progress in developing the standard process, and to get feedback and input for improving each draft, which was made publicly available before the workshop.
In Brussels, which I attended, there was overall acceptance of the CRISP-DM process, and all participants expressed their interest in pushing forward these efforts to define a standard process for data mining.
The consortium members have developed a very impressive model and methodology for the data mining process. At a high level, the process model has 6 phases (see http://www.ncr.dk/CRISP/process2.htm).
For each of these phases there is a list of tasks, e.g. Data Understanding consists of 2.1 Collect Initial Data, 2.2 Describe Data, 2.3 Explore Data, and 2.4 Verify Data Quality, with each task having a specific output (such as a report or another dataset).
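The phase, task, and output structure described above can be pictured as a small data structure. The Python sketch below is illustrative only: the class names `Task` and `Phase`, the helper `outline`, and the output descriptions are assumptions made for this example and are not taken from the CRISP-DM documents. Only the four Data Understanding task names and numbers come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    number: str  # task number as used in the process model, e.g. "2.1"
    name: str
    output: str  # hypothetical label for what the task produces


@dataclass
class Phase:
    number: int
    name: str
    tasks: List[Task] = field(default_factory=list)


# The output labels ("report", "dataset") are placeholders for illustration.
data_understanding = Phase(
    number=2,
    name="Data Understanding",
    tasks=[
        Task("2.1", "Collect Initial Data", "dataset"),
        Task("2.2", "Describe Data", "report"),
        Task("2.3", "Explore Data", "report"),
        Task("2.4", "Verify Data Quality", "report"),
    ],
)


def outline(phase: Phase) -> List[str]:
    """Return one line per task, pairing each task with its output."""
    return [f"{t.number} {t.name} -> {t.output}" for t in phase.tasks]


for line in outline(data_understanding):
    print(line)
```

Keeping the output next to each task mirrors the idea that every task in the model ends in a concrete deliverable.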
This process model is both generic and designed to be customizable. It can naturally be specialized for specific data mining tasks, such as CRISP-DM for classification and CRISP-DM for clustering; for specific business problems, e.g. CRISP-DM for attrition modeling; and even for specific business problems within an industry, e.g. CRISP-DM for attrition modeling in telecommunications.
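One way to picture this layered specialization is as successive overrides of generic task guidance. The following Python sketch assumes a hypothetical representation in which each model is a dictionary mapping task names to guidance text. The function `specialize` and all guidance strings are invented for illustration and are not part of CRISP-DM itself.

```python
from typing import Dict


def specialize(base: Dict[str, str], overrides: Dict[str, str]) -> Dict[str, str]:
    """Return a new model: a copy of `base` with guidance replaced or added."""
    specialized = dict(base)  # copy, so the more generic model stays unchanged
    specialized.update(overrides)
    return specialized


# Hypothetical guidance text for two tasks of the generic model.
generic = {
    "Collect Initial Data": "acquire the data listed in the project plan",
    "Verify Data Quality": "check completeness and correctness",
}

# CRISP-DM for attrition modeling (illustrative override).
attrition = specialize(
    generic,
    {"Collect Initial Data": "acquire customer records with cancellation dates"},
)

# CRISP-DM for attrition modeling in telecommunications (illustrative override).
telecom_attrition = specialize(
    attrition,
    {"Verify Data Quality": "check call records for missing billing periods"},
)

for task, guidance in telecom_attrition.items():
    print(f"{task}: {guidance}")
```

Each level inherits everything it does not override, so the telecommunications model keeps the attrition-specific data collection guidance while refining the quality check.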
Two of the project sponsors, OHRA (Netherlands) and DaimlerChrysler, successfully used the CRISP-DM model to develop specialized process models for their own practice.
The current model may still need some fine-tuning (e.g. some participants suggested a separate phase for monitoring after deployment, and a privacy impact statement as part of phase 1), but in my opinion it meets industry needs.
The advantages of having a standard industry model are many. It will make large data mining projects faster, cheaper, more reliable, and more manageable. Even small-scale data mining investigations will benefit from using CRISP-DM.
SPSS is planning to make CRISP-DM a part of Clementine, and some consulting companies (OpenMIND, Two Crows) have already started using the CRISP-DM model.
The funding for CRISP from the European Commission ends by April 30, 1999. Next steps include writing a book about it and promoting its use in industry. There are also plans under consideration to form an industry consortium to promote CRISP-DM standards worldwide.
One specific suggestion I made is to develop standards for meta-data and CRISP-DM reports using XML. This would help to improve interoperability of different tools and stimulate innovation.
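To make the suggestion concrete, here is a minimal Python sketch of what an XML task report might look like. No CRISP-DM XML schema is defined in this article, so the element and attribute names (`crispdm-report`, `task`, `summary`), the function `build_report`, and the sample summary text are all hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def build_report(phase: str, number: str, name: str, summary: str) -> str:
    """Serialize one task report as XML (element names are hypothetical)."""
    root = ET.Element("crispdm-report")
    task = ET.SubElement(
        root, "task", {"phase": phase, "number": number, "name": name}
    )
    ET.SubElement(task, "summary").text = summary
    return ET.tostring(root, encoding="unicode")


# The summary below is placeholder text, not real project data.
xml_text = build_report(
    phase="Data Understanding",
    number="2.2",
    name="Describe Data",
    summary="placeholder description of the collected data",
)
print(xml_text)

# Any other tool can read the same report back with a standard XML parser.
parsed = ET.fromstring(xml_text)
print(parsed.find("task").get("name"))
```

Because the report is plain XML, a tool from one vendor could write it and a tool from another could read it, which is the interoperability benefit the suggestion is aiming at.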
In the project meeting after the Brussels workshop, it was decided to go forward in building a more "global" consortium, and both SPSS and NCR confirmed their intention to be part of the consortium (provided they are not the only two members).
The next CRISP meeting is tentatively planned for San Diego, in conjunction with KDD-99.
If you are interested in participating in setting the global standards for the data mining process, please contact Thomas Reinartz, DaimlerChrysler, 89081 Ulm, Germany, email: thomas.reinartz@daimlerchrysler.com. You can also email the list crisp.sig@dbag.ulm.daimlerbenz.com for discussions.