دانلود مقاله کاربرد روش SVM موازی ارتقاء یافته مبتنی بر هستی شناسی
ترجمه نشده

دانلود مقاله کاربرد روش SVM موازی ارتقاء یافته مبتنی بر هستی شناسی

عنوان فارسی مقاله: کاربرد روش SVM موازی ارتقاء یافته مبتنی بر هستی شناسی در آموزش فیلترینگ (فیلتر کردن) مقیاس پذیر هرزنامه (اسپم)
عنوان انگلیسی مقاله: An ontology enhanced parallel SVM for scalable spam filter training
مجله/کنفرانس: Neurocomputing
رشته های تحصیلی مرتبط: مهندسی کامپیوتر
گرایش های تحصیلی مرتبط: مهندسی نرم افزار، معماری کامپیوتر
کلمات کلیدی فارسی: فیلترینگ (فیلتر کردن) هرزنامه، ماشین بردار پشتیبان (SVM)، رایانش (محاسبات) موازی، طبقه بندی، نگاشت کاهش
کلمات کلیدی انگلیسی: Spam filtering, Support vector machine, Parallel computing, Classification, MapReduce
نوع نگارش مقاله: مقاله پژوهشی (Research Article)
شناسه دیجیتال (DOI): https://doi.org/10.1016/j.neucom.2012.12.001
دانشگاه: School of Engineering and Design, Brunel University, Uxbridge, Middlesex, UB8 3PH, UK
صفحات مقاله انگلیسی: 13
ناشر: الزویر - Elsevier
نوع ارائه مقاله: ژورنال
نوع مقاله: ISI
سال انتشار مقاله: 2013
ایمپکت فاکتور: 3.779 در سال 2017
شاخص H_index: 100 در سال 2019
شاخص SJR: 1.073 در سال 2017
شناسه ISSN: 0925-2312
شاخص Quartile (چارک): Q1 در سال 2017
فرمت مقاله انگلیسی: PDF
وضعیت ترجمه: ترجمه نشده است
قیمت مقاله انگلیسی: رایگان
آیا این مقاله بیس است: خیر
کد محصول: E11912
فهرست مطالب (انگلیسی)

Abstract

1. Introduction

2. Related work

3. Parallelizing SVM with MapReduce

4. Ontology augmentation

5. Experimental results

6. Conclusions and future work

References

بخشی از مقاله (انگلیسی)

Abstract

Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart.

Introduction

The overall email ecosystem has become the most ubiquitous modern day communication tool. Its popularity as well as widespread availability have also created an opportunity for a lucrative business model based on unsolicited bulk email, or rather spam. The proliferation of spam has reached widespread proportions. Spam damages enabling communication infrastructures as well as lumbers consumers and service providers in its trail. Such impact and subsequently the importance to identify better ways to control and mitigate its consequences are reflected in the attention that spam continuously gets from various perspectives. Machine learning techniques such as Support Vector Machines (SVMs) have been applied in spam filtering [1–5]. SVM training is a computationally intensive process primarily due to its convex quadratic programming challenges associated with the dense Hessian Matrix involved during optimization. Numerous SVM formulations, solvers and architectures for improving SVM performance have been explored and proposed [6,7] including the Message Passing Interface (MPI) based parallel approaches [8–10]. SVM decomposition is another technique for improving SVM training performance [11–13]. Less conventional approaches include the utilization of specialized processing hardware such as Graphics Processing Units (GPUs) [14,15]. A widespread practice is to split the training data and use a number of SVMs to process the individual data chunks. This approach typically splits the training data set into a number of smaller fragments. This in turn reduces the individual training set and the overall training time. Most forms of decomposition which are based on a data splitting strategy approach tend to suffer from issues including convergence and accuracy. Challenges related to chunk aliasing, outlier accumulation and data imbalance/distribution tend to intensify problems in a parallel SVM context. This in turn impacts generalization, accuracy and convergence. Techniques such as random sampling have also been shown to exhibit similar accuracy challenges because of the induced probability distribution variance [16]. On the other hand, selective sampling techniques applied to SVMs try to sample the training data intelligently to maximize the performance of SVMs. However these normally require many scans of the entire data set which incurs high overhead in computation.