Improving Classification Performance Using a Feature Selection Algorithm for Data Mining

Persian article title: Improving Classification Performance Using the Random Subset Feature Selection Algorithm for Data Mining
English article title: Classification Performance Improvement Using Random Subset Feature Selection Algorithm for Data Mining
Journal/Conference: Big Data Research
Related fields of study: Computer Engineering
Related specializations: Algorithms and Computation Engineering
Persian keywords: Random forest, subset feature selection, dimensionality reduction, scientific data, stability
English keywords: Random forest, Subset feature selection, Dimensionality reduction, Scientific data, Stability
Article type: Research Article
Digital Object Identifier (DOI): https://doi.org/10.1016/j.bdr.2018.02.007
University: Department of IT, Anurag Group of Institutions, Hyderabad, India
Number of pages (English paper): 32
Publisher: Elsevier
Presentation type: Journal
Paper type: ISI
Publication year: 2018
Impact Factor: 7.184 (2017)
H-index: 12 (2019)
SJR: 0.757 (2017)
ISSN: 2214-5796
Quartile: Q1 (2017)
English paper format: PDF
Translation status: Not translated
English paper price: Free
Is this a base article: No
Product code: E11092
Table of Contents (English)

Abstract

1- Introduction

2- Dimensionality reduction techniques related work

3- About the existing RSFS algorithm

4- Proposed RSFS algorithm

5- Experiments and results

6- Conclusion

References

Excerpt from the Article (English)

Abstract

This study focuses on feature subset selection from high-dimensional databases and presents a modification of the existing Random Subset Feature Selection (RSFS) algorithm for the random selection of feature subsets and for improving stability. A standard k-nearest-neighbor (kNN) classifier is used for classification. The RSFS algorithm, which is based on the random forest algorithm, reduces the dimensionality of a data set by selecting useful novel features. The current implementation suffers from poor dimensionality reduction and low stability when the database is very large. In this study, an attempt is made to improve the existing algorithm's dimensionality reduction performance and to increase its stability. The proposed algorithm was applied to scientific data to test its performance. With 10-fold cross-validation and the modified algorithm, classification accuracy is improved. The applications of the improved algorithm are presented and discussed in detail. From the results, it is concluded that the improved algorithm is superior in reducing dimensionality and improving classification accuracy when used with a simple kNN classifier. The data sets are selected from a public repository; they are scientific in nature and mostly used in cancer detection. From the results, it is concluded that the algorithm is highly recommended for dimensionality reduction when extracting relevant data from scientific data sets.
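
To make the random-subset idea concrete, the following is a minimal Python sketch of the kind of selection scheme RSFS builds on, paired with a scikit-learn kNN classifier as in the paper. The subset size, iteration count, relevance-update rule, and z-score threshold shown here are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of RSFS-style random subset feature selection with a kNN
# classifier. Hyperparameters and the relevance-update rule are illustrative
# assumptions, not the exact settings used in the paper.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def rsfs_sketch(X_train, y_train, X_dev, y_dev,
                n_iters=1000, subset_size=None, z_thresh=2.0, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X_train.shape[1]
    if subset_size is None:
        # a common heuristic: draw subsets of about sqrt(d) features
        subset_size = max(1, int(np.sqrt(n_features)))

    relevance = np.zeros(n_features)  # cumulative credit per feature
    counts = np.zeros(n_features)     # how often each feature was drawn
    history = []                      # classifier performance per iteration

    for _ in range(n_iters):
        subset = rng.choice(n_features, size=subset_size, replace=False)
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(X_train[:, subset], y_train)
        acc = accuracy_score(y_dev, knn.predict(X_dev[:, subset]))
        history.append(acc)
        # features in a better-than-average subset gain credit,
        # features in a worse-than-average subset lose it
        relevance[subset] += acc - np.mean(history)
        counts[subset] += 1

    # a non-informative feature's credit behaves like a random walk, so
    # keep only features whose credit deviates significantly from zero
    sigma = np.std(history) * np.sqrt(np.maximum(counts, 1))
    z = relevance / np.maximum(sigma, 1e-12)
    return np.where(z > z_thresh)[0]
```

Under this scheme, informative features accumulate positive credit over many random draws while irrelevant ones hover near zero, and the final z-score test separates the two populations.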

Introduction

Data mining, the extraction of useful hidden features from large databases, is a powerful new technology with great potential to help organizations focus on developing business strategies. Tools developed for mining data anticipate future trends and behaviors, permitting organizations to make proactive, knowledge-driven decisions. Many data mining tools can address business challenges more effectively than traditional query- or report-based tools, whose performance degrades badly with the large quantities of data involved. However, large quantities of data can also hurt data analytics applications themselves: most data mining algorithms are implemented column-wise, which makes them slower as the number of features increases. When the quantity of collected data is very large, mining for relevant data becomes a challenge, a problem known as the "curse of dimensionality" [1, 2, 3, 4]. Hence, there is a need to reduce the dimensionality of data without compromising its intrinsic geometric properties. Several methods have been developed to address this challenge, as shown in Figure 1.

In fields such as biomedical engineering, drug testing, and cancer research in particular, the quantities of data involved are huge and collecting them is very expensive. The data generated from experiments in these fields are popularly known as scientific data. Such scientific data tend to be noisy and sparse in nature [5, 6]; because of this, standard data mining tools often do not perform efficiently when applied to them. In this paper, an attempt is made to improve the existing Random Subset Feature Selection (RSFS) algorithm for better dimensionality reduction when applied to scientific data. Scientific data sets result from extensive research in fields such as cancer research, bioinformatics, medical diagnosis, genetic engineering, and weather studies, and they are sparse in nature. For example, cancer, also called malignancy, is an abnormal growth of cells; depending on the severity of the disease, its treatment may require chemotherapy, radiation, and/or surgery. In this study, we attempted to reduce the number of features needed to aid in the detection of cancer, saving both time and lives.

In this paper, we propose a dimensionality reduction of features applied to cancer data sets. We describe and evaluate our approach in four phases: (1) improvement of the RSFS algorithm; (2) a two-sample t-test to ascertain whether the difference between the existing and proposed algorithms is significant; (3) a box plot comparing the proposed algorithm's performance with that of the existing algorithm on both two-class and multi-class labeled data sets; and (4) stability enhancement for a stable feature subset. This paper is organized into six sections: 1. Introduction (this section), 2. Dimensionality Reduction Techniques Related Work, 3. About the Existing RSFS Algorithm, 4. Proposed RSFS Algorithm (present work), 5. Experiments and Results, and 6. Conclusion.
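
Phase (2) of the evaluation relies on a two-sample t-test comparing the existing and proposed algorithms. The sketch below shows how such a test could be carried out on per-fold accuracies from the 10-fold cross-validation; the two accuracy vectors are hypothetical placeholders, not results taken from the paper.

```python
# Hedged sketch of the phase-(2) significance check: a two-sample t-test on
# per-fold accuracies. The accuracy values below are hypothetical
# placeholders, NOT figures from the paper's experiments.
from scipy import stats

existing_acc = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80, 0.79, 0.82]
proposed_acc = [0.86, 0.88, 0.85, 0.87, 0.86, 0.89, 0.84, 0.87, 0.88, 0.86]

# Independent two-sample t-test across the 10 cross-validation folds
t_stat, p_value = stats.ttest_ind(existing_acc, proposed_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("The difference between the two algorithms is significant at the 5% level.")
```

A box plot of the same per-fold accuracies (for example, via matplotlib's boxplot) would correspond to the comparison described in phase (3).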