Label-Aware Distributed Ensemble Learning: A Simplified Distributed Classifier Training Model for Big Data

Article title: Label-Aware Distributed Ensemble Learning: A Simplified Distributed Classifier Training Model for Big Data
Journal/Conference: Big Data Research
Related fields of study: Computer Engineering, Information Technology Engineering
Related specializations: Artificial Intelligence, Algorithms and Computation Engineering, Cloud Computing
Keywords: Big Data, Analytics, Distributed, Machine Learning, Classification
Manuscript type: Research Article
Indexed in: Scopus, Master Journals List, JCR
Digital Object Identifier (DOI): https://doi.org/10.1016/j.bdr.2018.11.001
Affiliation: School of Computing, Queen's University, Kingston, ON, Canada
English article pages: 12
Publisher: Elsevier
Presentation type: Journal
Paper type: ISI
Publication year: 2019
Impact Factor: 3.643 (2018)
H-index: 16 (2019)
SJR: 0.984 (2018)
ISSN: 2214-5796
Quartile: Q1 (2018)
English article format: PDF
Translation status: Not translated
Price of English article: Free
Is this a base (benchmark) article: No
Does this article include a conceptual model: No
Does this article include a questionnaire: No
Does this article include variables: No
Product code: E11523
References: cited within the text and listed at the end of the article
Table of contents (English)

Abstract

1- Introduction

2- Distributed classifier training: benefits and pitfalls

3- The Label-Aware Distributed Ensemble Learning (LADEL) model

4- Evaluation

5- Conclusions and future work

References

Excerpt from the article (English)

Abstract

Label-Aware Distributed Ensemble Learning (LADEL) is a programming model and an associated implementation for distributing any classifier training to handle Big Data. It only requires users to specify the training data source, the classification algorithm and the desired parallelization level. First, a distributed stratified sampling algorithm is proposed to generate stratified samples from large, pre-partitioned datasets in a shared-nothing architecture. It executes in a single pass over the data and minimizes inter-machine communication. Second, the specified classification algorithm's training is parallelized and executed on any number of heterogeneous machines. Finally, the trained classifiers are aggregated to produce the final classifier. Data miners can use LADEL to run any classification algorithm on any distributed framework, without any experience in parallel and distributed systems. The proposed LADEL model can be implemented on any distributed framework (Drill, Spark, Hadoop, etc.) to speed up the development of its data mining capabilities. It is also generic and can be used to distribute the training of any classification algorithm of any sequential single-node data mining library (Weka, R, scikit-learn, etc.). Distributed frameworks can implement LADEL to distribute the execution of existing data mining libraries without rewriting the algorithms to run in parallel. As a proof-of-concept, the LADEL model is implemented on Apache Drill to distribute the training execution of Weka's classification algorithms. Our empirical studies show that LADEL classifiers achieve accuracy similar to, and sometimes even better than, the single-node classifiers, with significantly faster training and scoring times.
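The abstract's first step, single-pass label-aware stratified sampling over pre-partitioned data, can be illustrated with per-label reservoir sampling: each worker scans its own partition once and keeps a fixed-size sample per class without knowing class counts in advance, so no inter-machine communication is needed during the pass. This is a hedged sketch, not the paper's actual algorithm; the record layout (label as the last field) and the `per_label` parameter are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(partition, per_label, rng=None):
    """Single-pass, per-label reservoir sampling over one data partition.

    Keeps up to `per_label` records for each class label without knowing
    label counts up front, so each partition samples independently with
    no inter-machine communication (a simplified stand-in for LADEL's
    distributed stratified sampling step).
    """
    rng = rng or random.Random(0)
    reservoirs = defaultdict(list)   # label -> sampled records
    seen = defaultdict(int)          # label -> records seen so far
    for record in partition:
        label = record[-1]           # assumption: label is the last field
        seen[label] += 1
        if len(reservoirs[label]) < per_label:
            reservoirs[label].append(record)
        else:
            # Standard reservoir replacement: keep each record with
            # probability per_label / seen[label].
            j = rng.randrange(seen[label])
            if j < per_label:
                reservoirs[label][j] = record
    return reservoirs

# Each worker would run this on its own partition; a coordinator then
# merges the per-partition samples before training.
partition = [(x, x % 3) for x in range(1000)]   # toy records: (feature, label)
sample = stratified_sample(partition, per_label=5)
```

In a real deployment each partition lives on a different machine and only the small per-label reservoirs travel over the network, which is what keeps the communication cost low.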

Introduction

Data mining is the process of discovering hidden patterns in data and using these patterns to predict the likelihood of future events. Several problems can be addressed using data mining, such as:

• Classification: predict the category (discrete) of a new data point.
• Regression: predict the value (continuous) of a new data point.
• Clustering: split data points into categories.
• Association Rules: find relationships between attributes.

In this work, we focus on the classification problem and ways of making it Big Data ready. Classification is a supervised learning approach consisting of two phases: (1) training, in which a classifier is built from historical labeled data (i.e., data with known categories), and (2) scoring, in which the trained classifier is used to predict the category of new data points (i.e., with unknown categories). Given the large volume of Big Data, classifier training time and memory requirements are a real challenge. Scalable distributed data mining libraries like Apache Mahout [1], Cloudera Oryx [2], Oxdata H2O [3], MLlib [4] [5] and Deeplearning4j [6] implement distributed versions of the classification algorithms to run on Hadoop [7] and Spark [8]. Distributing classifier training significantly reduces the training time and enables digesting of Big Data. However, the approach used by scalable libraries requires rewriting the classification algorithms to execute in parallel. The rewriting process is complex and time-consuming, and the quality of the modified algorithm depends entirely on the contributors' expertise. Thus, scalable libraries fail to support as many algorithms as sequential single-node libraries like R [9], Weka [10], scikit-learn [11] and RapidMiner [12].
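The two classification phases described above, training on labeled data and then scoring unlabeled points, can be made concrete with a deliberately tiny classifier. The nearest-centroid model below is a hypothetical stand-in chosen for brevity, not an algorithm from the paper; any of the libraries named in the text would supply far richer models with the same two-phase interface.

```python
from collections import defaultdict

def train(labeled_points):
    """Training phase: build a nearest-centroid classifier from labeled data.

    Each class is summarized by the mean of its (one-dimensional) feature
    values; this plays the role of the historical-data training step.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for x, label in labeled_points:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def score(classifier, x):
    """Scoring phase: predict the label of a new, unlabeled point by
    picking the class whose centroid is closest to it."""
    return min(classifier, key=lambda label: abs(x - classifier[label]))

# Phase 1: train on historical labeled data (toy one-feature records).
history = [(1.0, "low"), (2.0, "low"), (10.0, "high"), (12.0, "high")]
model = train(history)           # centroids: low -> 1.5, high -> 11.0

# Phase 2: score a new, unlabeled point.
prediction = score(model, 9.0)   # -> "high" (9.0 is nearer 11.0 than 1.5)
```

The split matters for Big Data because the two phases scale differently: training must digest the full historical dataset, while scoring touches only one point at a time, which is why the paper targets the training phase for distribution.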