  For more information on Cutter Consortium's Business Intelligence advisory service, please contact Dennis Crowley at +1 781 641 5125, or e-mail dcrowley@cutter.com.

22 March 2005

PRACTICAL GRID-BASED DATA MINING

Grid computing represents the natural evolution of distributed computing and parallel-processing technologies. In essence, grid computing employs groups of locally or remotely networked machines that work together on specific computational tasks, harnessing the combined power of many computers in a network. The primary aim of grid computing is to give IT organizations and application developers the ability to create distributed computing environments that can utilize computing resources on demand. In practice, grid computing can leverage the processing capacity of hundreds, or even thousands, of computers. By distributing workloads and making better use of available resources, it can reduce data-processing time, allowing users to achieve much faster results on large operations at lower cost.

The development of practical grid computing techniques will have a profound impact on the way data is analyzed. In particular, grid-based data mining applications are very appealing to organizations that want to analyze data distributed across geographically dispersed, heterogeneous platforms. Grid-based data mining would allow companies to distribute compute-intensive analytic processing among different resources. Moreover, it might eventually lead to new integration and automated analysis techniques that would allow companies to mine data where it resides. This is in contrast to the current practice of having to extract and move data into a centralized location for mining -- processes that are becoming more difficult to conduct because data is increasingly geographically dispersed, and because of security and privacy considerations.
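
To make the idea of mining data where it resides more concrete, the following minimal Python sketch counts item frequencies separately at each site and then merges only the compact summaries. The site names, the sample records, and the functions `mine_locally` and `merge_summaries` are hypothetical, and in-memory lists stand in for remote data sources; it illustrates the general idea only and is not DataMiningGrid code.

```python
from collections import Counter

# Hypothetical data held at three separate sites; in a real grid these
# would be remote data sources, not in-memory lists.
SITE_DATA = {
    "site_a": [["bread", "milk"], ["bread", "eggs"]],
    "site_b": [["milk", "eggs"], ["bread", "milk", "eggs"]],
    "site_c": [["bread"], ["milk"]],
}


def mine_locally(transactions):
    """Count how many transactions at one site contain each item."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))
    return counts, len(transactions)


def merge_summaries(summaries):
    """Combine per-site counts into global support values (0..1)."""
    total_counts = Counter()
    total_transactions = 0
    for counts, n in summaries:
        total_counts.update(counts)
        total_transactions += n
    return {item: c / total_transactions for item, c in total_counts.items()}


# Only the compact summaries leave each site; the raw records stay put.
summaries = [mine_locally(rows) for rows in SITE_DATA.values()]
global_support = merge_summaries(summaries)
```

Because each site returns only counts, the raw records never have to be extracted or moved, which is the property that matters when data is dispersed or subject to privacy constraints.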

Several major issues, however, stand in the way of practical grid-based data mining. For one, most of the current crop of commercial data mining tools are suited primarily for use in homogeneous and localized computing environments. In addition, no standard framework currently exists for deploying distributed data mining applications in a grid environment.

To address these issues, the European Commission is sponsoring the Data Mining Tools and Services for Grid Computing Environments project. The goal of this ambitious two-year effort -- referred to as "DataMiningGrid" -- is to develop a formal framework for deploying data mining applications in grid environments. It also seeks to develop generic and industry-independent data mining tools and services for grid applications.

Current members of the DataMiningGrid project include DaimlerChrysler, the University of Ulster (UK), the University of Ljubljana (Slovenia), the Fraunhofer Institute for Intelligent Systems (Germany), and the Israel Institute of Technology (Israel).

The aim of the DataMiningGrid project is to extend data mining technologies so that traditional data mining approaches can operate in distributed environments. To that end, the project has set a number of key development objectives, including:

  • Grid interfaces that will allow data mining tools and data sources to interoperate within distributed grid computing environments

  • Grid-based text mining tools, services, and interfaces

  • Testing environments for demonstrating applications in various industry sectors, including bioinformatics, healthcare, and automotive

  • Alignment and integration of DataMiningGrid technologies with emerging grid standards and infrastructures

The project will also seek to develop Grid Data Mining Analysis Services, which will allow organizations to implement data mining tasks as grid-enabled Web services rather than as specific algorithms. In addition, Grid Data Mining Analysis Services will be packaged in the form of a workflow-based framework that will provide users with a seamless method for implementing distributed data mining analyses. Finally, the resulting DataMiningGrid tools and technologies will be made available as open source software.
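
As a rough illustration of what it means to treat data mining tasks as named services chained in a workflow, consider the minimal Python sketch below. Everything in it is an assumption made for illustration: the `SERVICES` registry, the `service` decorator, the two example tasks, and `run_workflow` are hypothetical, and local function calls stand in for grid-enabled Web services. It does not reflect the project's actual interfaces.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping service names to callables; a real
# deployment would invoke remote Web services instead.
SERVICES: Dict[str, Callable[[list], list]] = {}


def service(name: str):
    """Register a function under a service name."""
    def register(func):
        SERVICES[name] = func
        return func
    return register


@service("clean")
def drop_missing(rows: list) -> list:
    """Discard records with missing (None) values."""
    return [r for r in rows if r is not None]


@service("normalize")
def scale_to_unit(rows: list) -> list:
    """Divide each value by the largest one, so the maximum becomes 1."""
    top = max(rows)
    return [r / top for r in rows]


def run_workflow(steps: List[str], data: list) -> list:
    """Run the named services in order, passing each result onward."""
    for step in steps:
        data = SERVICES[step](data)
    return data


result = run_workflow(["clean", "normalize"], [2, None, 4, 8])
```

The point of the design is that the workflow refers to tasks by name, so the implementation behind each name could be replaced or relocated without changing the workflow itself.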


-- Curt Hall, Senior Consultant, Cutter Consortium
