Greedy Subspace Clustering

Persian article title: Estimating the distance threshold for greedy subspace clustering
English article title: Estimating distance threshold for greedy subspace clustering
Journal/Conference: Expert Systems with Applications
Related fields of study: Computer Engineering
Related specializations: Algorithms and Computation Engineering
Persian keywords: greedy subspace clustering, parameter estimation, single linkage clustering
English keywords: Greedy subspace clustering, Parameter estimation, Single linkage clustering
Article type: Research Article
DOI: https://doi.org/10.1016/j.eswa.2019.06.011
Affiliation: Department of Computer Science & Engineering, Sanjay Ghodawat University, Atigre, Kolhapur, Maharashtra 416118, India
Pages (English article): 18
Publisher: Elsevier
Presentation type: Journal
Paper type: ISI
Publication year: 2019
Impact factor: 5.891 (2018)
H-index: 162 (2019)
SJR: 1.190 (2018)
ISSN: 0957-4174
Quartile: Q1 (2018)
English article format: PDF
Translation status: not translated
Price of English article: free
Is this a base article: No
Does this article have a conceptual model: No
Does this article have a questionnaire: No
Does this article have variables: No
Product code: E13566
References: includes in-text citations and an end-of-paper reference list
Table of Contents (English)

Abstract

1. Introduction

2. Review of approaches for subspace clustering

3. Parameter-free subspace clustering

4. Proposed algorithm

5. Experimental results

6. Conclusion and future scope

Funding

Declaration of Competing Interest

CRediT authorship contribution statement

References

Excerpt from the Article (English)

Abstract

Many approaches have been proposed to recognize clusters in subspaces. However, their performance is highly sensitive to input parameter values, and the purpose and expected ranges of these parameters may not be available to a non-expert user. The parameter setting that produces optimal results can only be found by repeatedly executing the clustering process, each time with a different parameter set, which is very time consuming. Most existing algorithms also show high runtimes due to excessive data scans. In this work, we propose a subspace clustering technique that estimates the distance threshold parameter automatically from the data for each attribute and works on the basis of single linkage clustering, in a bottom-up, greedy fashion. The experimental results show that the algorithm produces optimal results without accepting any input from the user, achieving up to 10 times better runtime and improved accuracy in a single run, without requiring any tuning of parameter values.
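The abstract's idea of a per-attribute, data-driven distance threshold can be sketched in miniature. For a single attribute, single-linkage clustering reduces to sorting the values and starting a new cluster wherever the gap between consecutive values exceeds a threshold. The mean-gap estimator below is a hypothetical stand-in for illustration only; it is not the estimator proposed in the paper.

```python
import numpy as np

def cluster_attribute(values):
    """Single-linkage clustering of one attribute's values (1-D case).

    On a line, single-linkage merging with distance threshold tau is
    equivalent to splitting the sorted values at every gap larger
    than tau.
    """
    order = np.argsort(values)
    gaps = np.diff(values[order])
    # Hypothetical data-driven threshold: the mean gap between
    # consecutive sorted values (NOT the paper's actual estimator).
    tau = gaps.mean()
    labels = np.empty(len(values), dtype=int)
    cid = 0
    labels[order[0]] = cid
    for i, gap in enumerate(gaps):
        if gap > tau:          # gap too wide: start a new cluster
            cid += 1
        labels[order[i + 1]] = cid
    return labels

# Two well-separated groups on one attribute
print(cluster_attribute(np.array([1.0, 1.1, 1.2, 9.0, 9.1])))
```

Because the threshold is derived from the data itself, no user-supplied parameter is needed, which is the property the abstract emphasizes.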

Introduction

Many recent applications capture huge volumes of data that are big in both directions, i.e. in terms of objects and attributes. Comprehending such huge datasets is now beyond human capability and requires powerful, automated data mining tools. This wealth of data is of no use unless the knowledge embedded within it is uncovered by applying appropriate data mining algorithms. Clustering is an unsupervised data mining task: no class labels are present in the data, except for test data, where labels can be used to verify the quality of the clustering. The attributes describing the objects are also called dimensions or variables, and the objects represent points in the multi-dimensional space described by these attributes. In the literature, datasets having more than ten dimensions are called high-dimensional data (Han, Kamber, & Pei, 2012). Conventional clustering algorithms show unusual and incorrect behavior on such datasets because they compute similarity over the full attribute space. Such distance calculations yield a dissimilarity value that is nearly equal for all object pairs, so all data points appear equally similar or dissimilar, an effect called the "Curse of Dimensionality" (Bellman, 1961) observed in high-dimensional data spaces. Hence the underlying clustering algorithm cannot find a natural grouping of the objects based on this similarity value and fails to mine meaningful clusters in high-dimensional data spaces.
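The distance-concentration effect described above is easy to reproduce empirically: for uniform random points, the relative contrast (d_max - d_min) / d_min of pairwise Euclidean distances shrinks sharply as dimensionality grows, so full-space distances stop discriminating between pairs. The sketch below assumes nothing from the paper; the function name and parameters are illustrative.

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """Relative contrast of pairwise distances for uniform random points.

    A small value means distances have concentrated and all point
    pairs look roughly equally (dis)similar.
    """
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, dim))
    # Squared pairwise distances via the Gram-matrix identity
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    d = np.sqrt(np.clip(d2, 0.0, None))
    d = d[np.triu_indices(n_points, k=1)]   # distinct pairs only
    return (d.max() - d.min()) / d.min()

# Contrast collapses as the number of dimensions grows
for dim in (2, 10, 100, 1000):
    print(dim, relative_contrast(200, dim))
```

This is why the introduction argues that full-space distance computations cannot separate clusters in high-dimensional data, motivating subspace clustering.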