Abstract
Many approaches have been proposed to recognize clusters in subspaces of high-dimensional data. However, their performance is highly sensitive to the values of their input parameters, whose purpose and expected ranges may not be available to a non-expert user. The parameter setting that produces optimal results can be found only by repeatedly executing the clustering process, each time with a different setting, which is very time consuming. Moreover, most existing algorithms exhibit high runtimes because of excessive data scans. In this work, we propose a subspace clustering technique that automatically estimates the distance threshold parameter from the data for each attribute and performs single-linkage clustering in a bottom-up, greedy fashion. Experimental results show that the algorithm produces optimal results in a single run, without accepting any input from the user or any tuning of parameter values, and achieves up to 10 times better runtime along with improved accuracy.
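To make the high-level description above concrete, the following is a minimal sketch, not the authors' exact procedure, of how a per-attribute distance threshold might be estimated from the data and then used for greedy, bottom-up single-linkage grouping along one attribute. The function names and the estimation rule (mean gap between sorted adjacent values) are illustrative assumptions.

import numpy as np

def estimate_threshold(values):
    """Illustrative rule: mean gap between sorted adjacent attribute values."""
    return np.diff(np.sort(values)).mean()

def single_linkage_1d(values):
    """Greedy bottom-up grouping: adjacent values within the threshold merge."""
    order = np.argsort(values)
    tau = estimate_threshold(values)
    labels = np.empty(len(values), dtype=int)
    cluster = 0
    labels[order[0]] = cluster
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > tau:  # gap exceeds threshold: new cluster
            cluster += 1
        labels[cur] = cluster
    return labels

x = np.array([1.0, 1.1, 1.2, 5.0, 5.2, 9.9])
print(single_linkage_1d(x))  # -> [0 0 0 1 1 2]

Because the threshold is derived from the data itself, no parameter has to be supplied or tuned by the user, which is the key property the proposed technique aims for.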
Introduction
Many recent applications capture huge volumes of data that are large in both directions, i.e., in terms of both objects and attributes. Comprehending such huge datasets is beyond human capability and requires powerful, automated data mining tools. This wealth of data is of no use unless the knowledge embedded within it is uncovered by applying appropriate data mining algorithms. Clustering is an unsupervised data mining task: no class labels are present in the data, except in test data, where labels can be used to verify the quality of clustering. The attributes describing the objects are also called dimensions or variables, and the objects represent points in the multi-dimensional space described by these attributes. In the literature, datasets having more than ten dimensions are referred to as high-dimensional data (Han, Kamber, & Pei, 2012). Conventional clustering algorithms behave incorrectly on such datasets because they compute inter-object similarity over the full attribute space. In high dimensions, such distance calculations yield dissimilarity values that are nearly equal for all object pairs, so all data points appear equally similar or dissimilar, an effect known as the "curse of dimensionality" (Bellman, 1961). Consequently, a clustering algorithm relying on these similarity values cannot recover the natural grouping of the objects and fails to mine meaningful clusters in high-dimensional data spaces.
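The distance-concentration effect described above is easy to observe empirically. The short experiment below (our own illustration, not from the paper) draws random points in the unit hypercube and reports the relative contrast between the largest and smallest pairwise Euclidean distances; as the dimensionality grows, this contrast shrinks toward zero, meaning all pairs become nearly equidistant.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))   # 500 random points in [0, 1]^d
    dists = pdist(points)           # all pairwise Euclidean distances
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:4d}   relative contrast = {contrast:.3f}")

The steadily decreasing contrast is precisely why full-space distance measures lose their discriminative power in high dimensions, motivating subspace clustering approaches that measure similarity only over relevant subsets of attributes.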