
DATA MINING WITH JAVA: AN INTERVIEW WITH JOSE MOREIRA
by Evan Hoovler, assistant editor, DSstar


Portland, Oregon -- At SuperComputing99, Jose Moreira presented a technical paper entitled "Data Mining Using The Array Package For Java". To find out more about this approach, DSstar interviewed its author.

DSstar: Please summarize your paper.

MOREIRA: Following suggestions by the referees, we actually changed the name to "High Performance Computing with the Array Package for Java: A Case Study using Data Mining". The goal of the paper is to show that you can implement the computational part of a high performance data mining code entirely in Java, and still achieve performance that is comparable to what can be obtained with a highly tuned Fortran code. Using the Array package for Java, we achieve 90% of Fortran performance on a single processor. Furthermore, we automatically exploit multiple processors in a single machine. We actually achieve 292 Mflops on a 4-processor machine.

The particular data mining computation we perform is an example of an irregular application, in which sparse matrices constitute the main data structures. We had already demonstrated the applicability of the Array package for regular applications, with dense matrices. This data mining computation was a good test for the applicability of the Array package in irregular applications as well.
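
To illustrate why sparse matrices make a computation "irregular", here is a minimal sketch of a sparse matrix-vector product in plain Java. It is not code from the paper and does not use the Array package's API. The class name CsrSketch and the compressed sparse row layout are assumptions chosen for illustration. The inner loop reads x through an index array, so memory accesses depend on the data rather than on a fixed stride, which is what distinguishes this from the dense case.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the Array package.
final class CsrSketch {
    private final int rows;
    private final int[] rowStart;   // rowStart[i]..rowStart[i+1]-1 index the nonzeros of row i
    private final int[] colIndex;   // column of each stored nonzero
    private final double[] value;   // value of each stored nonzero

    CsrSketch(int rows, int[] rowStart, int[] colIndex, double[] value) {
        if (rowStart.length != rows + 1 || colIndex.length != value.length) {
            throw new IllegalArgumentException("inconsistent CSR arrays");
        }
        this.rows = rows;
        this.rowStart = rowStart;
        this.colIndex = colIndex;
        this.value = value;
    }

    // Computes y = A * x, touching only the stored nonzeros.
    double[] multiply(double[] x) {
        double[] y = new double[rows];
        for (int i = 0; i < rows; i++) {
            double sum = 0.0;
            for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
                sum += value[k] * x[colIndex[k]];   // indirect, data-dependent access
            }
            y[i] = sum;
        }
        return y;
    }

    public static void main(String[] args) {
        // The 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]] stored as five nonzeros.
        CsrSketch a = new CsrSketch(3,
                new int[] {0, 2, 3, 5},
                new int[] {0, 2, 1, 0, 2},
                new double[] {2, 1, 3, 4, 5});
        System.out.println(Arrays.toString(a.multiply(new double[] {1, 2, 3})));
        // prints [5.0, 6.0, 19.0]
    }
}
```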

DSstar: Does Java stand to become the data mining tool of choice?

MOREIRA: Java has the opportunity to become the data mining tool of choice. It is a great language for developing complex applications, thanks to its portability, safety features, and clean object oriented model. Complex data mining applications are likely to involve a variety of platforms, and the benefits that come from Java's portability are tremendous. The major obstacle that Java needs to overcome is performance. Our work shows that we can fix the performance issues in the computation part of a data mining code. Other people at IBM have shown that you can do the same for the data access (file or database connection) part.

DSstar: How suitable is Java for parallel applications? Also, please comment on the scalability of Java for data mining.

MOREIRA: The built-in thread model of Java provides the means to write portable parallel codes. For the particular data mining computation we consider, we took two different approaches to exploiting parallelism, both using Java threads and both fully portable. In one approach we exploited parallelism exclusively inside the Array package. That means the application code remains completely sequential, and all parallelism is exploited transparently to the applications. This is the approach that achieved 292 Mflops on four processors (a speedup of 2.7).
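
As an illustration of the first approach (not code from the paper, and not the Array package's actual interface), the sketch below hides Java threads inside a library-style method. The class HiddenParallelDot, the method dot, and the choice of a dot product as the operation are all hypothetical. The point is only that the caller in main contains no thread code at all.

```java
// Hypothetical illustration class; the real Array package API may look quite different.
final class HiddenParallelDot {

    // "Library" side: splits the dot product across nThreads Java threads.
    static double dot(double[] a, double[] b, int nThreads) {
        if (a.length != b.length || nThreads < 1) {
            throw new IllegalArgumentException("bad arguments");
        }
        double[] partial = new double[nThreads];       // one slot per worker, no sharing
        Thread[] workers = new Thread[nThreads];
        int chunk = (a.length + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            final int lo = Math.min(a.length, id * chunk);
            final int hi = Math.min(a.length, lo + chunk);
            workers[t] = new Thread(() -> {
                double sum = 0.0;
                for (int i = lo; i < hi; i++) {
                    sum += a[i] * b[i];
                }
                partial[id] = sum;
            });
            workers[t].start();
        }
        double total = 0.0;
        for (int t = 0; t < nThreads; t++) {
            try {
                workers[t].join();                     // join makes partial[t] visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
            total += partial[t];
        }
        return total;
    }

    // "Application" side: purely sequential code that just calls the library.
    public static void main(String[] args) {
        double[] a = new double[1000];
        double[] b = new double[1000];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            b[i] = 2.0;
        }
        System.out.println(dot(a, b, 4));   // prints 999000.0
    }
}
```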

In the other approach we explicitly parallelized the data mining computation, at the application level. This required a bit more work but also delivered better performance, 344 Mflops on four processors (a speedup of 3.1). Either way, it was not too hard to get very good performance and decent scalability from the computation.
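
By contrast, the second approach puts the threads in the application itself. The following is a minimal sketch under assumptions, not the authors' code: the class AppLevelParallelSketch, the row-block partitioning, and the use of a sparse matrix-vector product as the workload are all hypothetical. Each thread runs an ordinary sequential kernel on its own block of rows and writes a disjoint slice of the result, so no locking is needed.

```java
import java.util.Arrays;

// Hypothetical illustration class; not code from the paper.
final class AppLevelParallelSketch {

    // Sequential kernel: computes y[i] for rows rowLo (inclusive) to rowHi (exclusive).
    static void multiplyRows(int[] rowStart, int[] colIndex, double[] value,
                             double[] x, double[] y, int rowLo, int rowHi) {
        for (int i = rowLo; i < rowHi; i++) {
            double sum = 0.0;
            for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
                sum += value[k] * x[colIndex[k]];
            }
            y[i] = sum;
        }
    }

    // Application-level parallelism: the application decides how rows map to threads.
    static double[] parallelMultiply(int[] rowStart, int[] colIndex, double[] value,
                                     double[] x, int nThreads) {
        if (nThreads < 1) {
            throw new IllegalArgumentException("need at least one thread");
        }
        int rows = rowStart.length - 1;
        double[] y = new double[rows];
        Thread[] workers = new Thread[nThreads];
        int chunk = (rows + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int lo = Math.min(rows, t * chunk);
            final int hi = Math.min(rows, lo + chunk);
            workers[t] = new Thread(
                    () -> multiplyRows(rowStart, colIndex, value, x, y, lo, hi));
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();   // wait for every block before returning y
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return y;
    }

    public static void main(String[] args) {
        // The 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]] in compressed sparse row form.
        double[] y = parallelMultiply(
                new int[] {0, 2, 3, 5},
                new int[] {0, 2, 1, 0, 2},
                new double[] {2, 1, 3, 4, 5},
                new double[] {1, 2, 3},
                2);
        System.out.println(Arrays.toString(y));   // prints [5.0, 6.0, 19.0]
    }
}
```

For reference, dividing the quoted speedups by the four processors gives a parallel efficiency of roughly 68% (2.7 / 4) for the library-level approach and roughly 78% (3.1 / 4) for the application-level one.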

DSstar: Is there anything else that you would like to add?

MOREIRA: This was a good team effort, involving people with knowledge of compilers (myself, Sam Midkiff, and Manish Gupta) and data mining (Rick Lawrence). Our joint enthusiasm for Java made us succeed in this endeavor. More information about our work can be found at our web site: http://www.research.ibm.com/ninja. More information and downloadables for the Array package can be found at http://www.alphaWorks.ibm.com/tech/ninja.


Evan Hoovler is assistant editor for DSstar. Comments are always welcome and should be emailed to evan@tgc.com

