Abstract
1. Introduction
2. Dimensionality reduction techniques related work
3. About the existing RSFS algorithm
4. Proposed RSFS algorithm
5. Experiments and results
6. Conclusion
References
Abstract
This study focuses on feature subset selection from high-dimensionality databases and presents modifications to the existing Random Subset Feature Selection (RSFS) algorithm that improve both the random selection of feature subsets and the stability of the selected subset. A standard k-nearest-neighbor (kNN) classifier is used for classification. The RSFS algorithm, which is based on the random forest algorithm, reduces the dimensionality of a data set by selecting useful novel features. The current implementation suffers from poor dimensionality reduction and low stability when the database is very large. In this study, an attempt is made to improve the existing algorithm's dimensionality reduction performance and to increase its stability. To test its performance, the proposed algorithm was applied to scientific data sets selected from a public repository, mostly used in cancer detection. With 10-fold cross-validation and the proposed modifications, classification accuracy is improved. The applications of the improved algorithm are presented and discussed in detail. From the results it is concluded that the improved algorithm is superior in reducing dimensionality and improving classification accuracy when used with a simple kNN classifier, and that it is highly recommended for dimensionality reduction when extracting relevant data from scientific data sets.
Introduction
Data mining, the extraction of useful hidden features from large databases, is an effective new technology with great potential to help organizations focus on developing business strategies. The tools developed for mining data anticipate future patterns and behaviors, permitting organizations to make proactive, knowledge-driven decisions. Many data mining tools can address business challenges more effectively than traditional query- or report-based tools, whose performance is very poor on the large quantities of data involved. However, large quantities of data can also degrade the performance of data analytics applications themselves: most data mining algorithms are implemented column-wise, which makes them slower as the number of features increases. When the quantity of collected data is very large, mining for relevant data is a challenge known as the "curse of dimensionality" [1, 2, 3, 4]. Hence, there is a need to reduce the dimensionality of data without compromising its intrinsic geometric properties, and several methods have been developed to address this challenge, as shown in Figure (1). In fields such as bio-medical engineering, drug testing, and cancer research in particular, the data quantities involved are huge and collecting them is very expensive. The data generated from experiments in these fields are popularly known as scientific data. Such scientific data tend to be noisy and sparse in nature [5, 6]; because of this, standard data mining tools often do not perform efficiently when applied to them. In this paper, an attempt is made to improve the existing random subset feature selection (RSFS) algorithm for better dimensionality reduction when applied to scientific data. Scientific data sets result from extensive research in fields such as cancer research, bio-informatics, medical diagnosis, genetic engineering, and weather studies, and they are sparse in nature.
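The core idea behind random-subset feature selection with a kNN scorer can be sketched as follows. This is a minimal illustration only, not the implementation evaluated in this paper: the leave-one-out scoring, the subset size, and the relevance-update rule (crediting each feature in a subset with the subset's accuracy gain over the majority-class baseline) are simplifying assumptions made for the sketch.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((train_X[i][j] - x[j]) ** 2
                                     for j in range(len(x))))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def rsfs_relevance(X, y, n_iters=100, subset_size=2, seed=0):
    """Accumulate a per-feature relevance score: each iteration draws a
    random feature subset, scores it with leave-one-out kNN accuracy, and
    credits every feature in the subset with the accuracy gain over the
    majority-class baseline (illustrative update rule)."""
    rng = random.Random(seed)
    n_feats = len(X[0])
    relevance = [0.0] * n_feats
    baseline = max(Counter(y).values()) / len(y)  # majority-class accuracy
    for _ in range(n_iters):
        subset = rng.sample(range(n_feats), subset_size)
        Xs = [[row[j] for j in subset] for row in X]
        correct = sum(
            knn_predict(Xs[:i] + Xs[i + 1:], y[:i] + y[i + 1:], Xs[i]) == y[i]
            for i in range(len(Xs)))
        gain = correct / len(Xs) - baseline
        for j in subset:
            relevance[j] += gain
    return relevance

# Demo on synthetic data: feature 0 carries the class label, features 1-4
# are pure noise, so feature 0 should accumulate the highest relevance.
data_rng = random.Random(1)
X, y = [], []
for i in range(40):
    label = i % 2
    X.append([label + 0.1 * data_rng.random()]
             + [data_rng.random() for _ in range(4)])
    y.append(label)

rel = rsfs_relevance(X, y)
```

In the full algorithm the relevance statistics are compared against a random-walk baseline to decide which features are truly useful; the sketch above only shows how random subsets and a kNN scorer combine into per-feature scores.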
For example, cancer, also called malignancy, is an abnormal growth of cells. Its treatment may require chemotherapy, radiation, and/or surgery, according to the severity of the disease. In this study, we attempt to reduce the number of features needed to detect cancer, leading to time savings and saved lives, and we propose a dimensionality reduction of the feature space of cancer data sets. We describe and evaluate our approach in four phases: (1) improvement of the random subset feature selection (RSFS) algorithm; (2) a two-sample t-test to ascertain whether the difference between the existing and proposed algorithms is significant; (3) a box plot comparing the proposed algorithm's performance with that of the existing algorithm on both two-class and multi-class labeled data sets; and (4) stability enhancement for a stable feature subset. This paper is organized into six sections: 1. Introduction (this section), 2. Dimensionality Reduction Techniques Related Work, 3. About the Existing RSFS Algorithm, 4. Proposed RSFS Algorithm (present work), 5. Experiments and Results, and 6. Conclusion.
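The second phase, the two-sample t-test, can be sketched with Welch's unequal-variance form, computed here from per-fold cross-validation accuracies. The accuracy numbers below are hypothetical placeholders, not results from this study, and the choice of Welch's variant (rather than the pooled-variance test) is an assumption of the sketch.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and degrees of freedom for
    comparing two sets of per-fold accuracy scores."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)          # sample variances
    se2 = va / na + vb / nb                    # squared standard error
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical per-fold accuracies from 10-fold cross-validation
# (illustrative numbers only).
proposed = [0.90, 0.92, 0.91, 0.93, 0.89, 0.91, 0.90, 0.92, 0.88, 0.94]
existing = [0.85, 0.86, 0.84, 0.87, 0.85, 0.83, 0.86, 0.85, 0.84, 0.86]
t, df = welch_t(proposed, existing)
```

A |t| value large relative to the critical value at the resulting degrees of freedom indicates that the accuracy difference between the two algorithms is unlikely to be due to chance.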