NIMBLE: A toolkit for the implementation of parallel data mining and machine learning algorithms on MapReduce

Amol Ghoting; Prabhanjan Kambadur; Edwin Pednault; Ramakrishnan Kannan

doi:10.1145/2020408.2020464

KDD 2011

Conference paper

21 Aug 2011

Abstract

In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for implementing these ML-DM algorithms. The MR programming paradigm has been of particular interest as it gracefully handles large datasets and has built-in resilience against failures. However, the existing parallel programming paradigms are too low-level and ill-suited for implementing ML-DM algorithms. To address this deficiency, we present NIMBLE, a portable infrastructure that has been specifically designed to enable the rapid implementation of parallel ML-DM algorithms. The infrastructure allows one to compose parallel ML-DM algorithms using reusable (serial and parallel) building blocks that can be efficiently executed using MR and other parallel programming models; it currently runs on top of Hadoop, which is an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation. Copyright 2011 ACM.
