Abstract
1- Introduction
2- Distributed classifier training: benefits and pitfalls
3- The Label-Aware Distributed Ensemble Learning (LADEL) model
4- Evaluation
5- Conclusions and future work
References
Abstract
Label-Aware Distributed Ensemble Learning (LADEL) is a programming model and an associated implementation for distributing the training of any classifier to handle Big Data. It only requires users to specify the training data source, the classification algorithm and the desired parallelization level. First, a distributed stratified sampling algorithm is proposed to generate stratified samples from large, pre-partitioned datasets in a shared-nothing architecture. It executes in a single pass over the data and minimizes inter-machine communication. Second, training of the specified classification algorithm is parallelized and executed on any number of heterogeneous machines. Finally, the trained classifiers are aggregated to produce the final classifier. Data miners can use LADEL to run any classification algorithm on any distributed framework, without any expertise in parallel and distributed systems. The proposed LADEL model can be implemented on any distributed framework (Drill, Spark, Hadoop, etc.) to speed up the development of its data mining capabilities. It is also generic and can be used to distribute the training of any classification algorithm of any sequential single-node data mining library (Weka, R, scikit-learn, etc.). Distributed frameworks can implement LADEL to distribute the execution of existing data mining libraries without rewriting the algorithms to run in parallel. As a proof-of-concept, the LADEL model is implemented on Apache Drill to distribute the training execution of Weka’s classification algorithms. Our empirical studies show that LADEL classifiers achieve accuracy similar to, and sometimes better than, single-node classifiers, with significantly faster training and scoring times.
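The three LADEL stages outlined above (stratified sampling, parallel training, aggregation) can be illustrated with a toy single-process sketch in Python using scikit-learn. This is only an analogue of the idea, not the Apache Drill/Weka implementation described later; in particular, majority voting is assumed here as the aggregation step for illustration.

```python
# Toy sketch of the three LADEL stages on one machine; in LADEL, each
# sample would live on a different node and training would run in parallel.
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stage 1: draw stratified samples (class proportions preserved per sample).
splitter = StratifiedShuffleSplit(n_splits=4, train_size=0.5, random_state=0)
samples = [train_idx for train_idx, _ in splitter.split(X, y)]

# Stage 2: train one classifier per sample (one per machine in LADEL).
classifiers = [
    DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    for idx in samples
]

# Stage 3: aggregate -- here, score a point by majority vote (an assumed
# aggregation strategy for this sketch).
def predict(x):
    votes = Counter(int(clf.predict([x])[0]) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

Because each classifier sees only a stratified fraction of the data, the per-machine training cost drops while the class distribution each learner observes stays representative of the whole dataset.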
Introduction
Data mining is the process of discovering hidden patterns in data and using these patterns to predict the likelihood of future events. Several problems can be addressed using data mining:
• Classification: predict the category (discrete) of a new data point.
• Regression: predict the value (continuous) of a new data point.
• Clustering: split data points into categories.
• Association Rules: find relationships between attributes.
In this work, we focus on the classification problem and ways of making it Big Data ready. Classification is a supervised learning approach consisting of two phases: (1) Training: a classifier is built from historical labeled data (i.e., data with known categories); and (2) Scoring: the trained classifier is used to predict the category of new data points (i.e., with unknown categories). With the large volume of Big Data, classifier training time and memory requirements are a real challenge. Scalable distributed data mining libraries like Apache Mahout [1], Cloudera Oryx [2], 0xdata H2O [3], MLlib [4] [5] and Deeplearning4j [6] implement distributed versions of classification algorithms to run on Hadoop [7] and Spark [8]. Distributing classifier training significantly reduces training time and enables processing of Big Data. However, this approach requires rewriting the classification algorithms to execute in parallel. The rewriting process is complex and time-consuming, and the quality of the modified algorithm depends entirely on the contributors’ expertise. Thus, scalable libraries fail to support as many algorithms as sequential single-node libraries like R [9], Weka [10], scikit-learn [11] and RapidMiner [12].
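The two classification phases described above can be sketched with scikit-learn, one of the single-node libraries cited in the text. This is a minimal illustration of the training/scoring split, using the Iris dataset and a decision tree as placeholder choices:

```python
# Minimal sketch of the two classification phases: train on labeled data,
# then score (predict categories for) previously unseen data points.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Phase 1 -- Training: build a classifier from historical labeled data.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Phase 2 -- Scoring: predict the category of new, unlabeled data points.
predictions = clf.predict(X_test)
```

On a single node this is straightforward; the challenge the paper addresses is that `fit` over Big Data volumes exceeds one machine's time and memory budget.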