Abstract
1. Introduction
2. Antecedents
3. Related works
4. Methodology
5. Empirical research
6. Conclusion
CRediT authorship contribution statement
Declaration of Competing Interest
References
Abstract
Clustering is the most widely used unsupervised machine learning technique, with extensive applications in statistical analysis. Multiple clustering algorithms are available in theory, and many more implementations are available in practice. A substantial body of literature examines the quality of clustering algorithms using various internal and external evaluation techniques. The motivation behind this work is the scarcity of literature dealing with the performance of clustering algorithms in terms of turnaround time. This paper summarizes an experimental analysis of the performance of multiple clustering algorithms as a function of cardinality and dimensionality. The analysis is performed in R, a free and open-source programming language used mainly for statistical computing. This work evaluates nine key algorithms from the partitioning, hierarchical, density-based, and model-based clustering approaches using different social media data sets. We captured the performance trends of these algorithms in terms of turnaround time by varying the cardinality and dimensionality of the data sets. In our experiments, the CLARA, CLARANS, and k-means algorithms demonstrate the best performance with varying cardinality. We also observe that changes in dimensionality do not affect the execution time of hierarchical clustering approaches, whereas execution time is positively related to dimensionality for the partitioning, density-based, and model-based clustering approaches.
Introduction
Data mining [1] is the process of extracting meaningful information from raw data, revealing underlying patterns and relationships. These revelations form useful knowledge that can be applied in various scientific, educational, and industrial scenarios. Depending on the type of patterns to be processed, we can adopt appropriate data mining strategies, which include, but are not limited to, classification, clustering, association, and regression. Clustering is the machine learning technique used to create logical groups of similar entities from a data set. The aim of the clustering process is to create distinct groups of elements such that entities in the same group have similar properties, whereas entities in different groups have dissimilar properties. It is an unsupervised learning technique that is widely used for performing statistical analysis of data. Since the volume of data being processed increases daily, clustering is extensively applied in almost all industrial segments. This work covers an empirical analysis of the performance of nine different clustering algorithms [2]. We captured the average processing time of each algorithm in two settings: a varying number of records (cardinality) with a constant number of attributes (dimensionality), and a varying number of attributes with a constant number of records. The experiments were conducted using two distinct social media data sets.
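To make the measurement protocol concrete, the following is a minimal sketch of how average processing time can be recorded while one of cardinality or dimensionality is varied and the other is held constant. It is an illustration under stated assumptions, not the setup used in this study: it is written in Python with the standard library only rather than in R, it uses a plain Lloyd's k-means as a stand-in for the evaluated algorithms, it runs on synthetic uniform data rather than the social media data sets, and the names `kmeans`, `make_data`, `mean_time`, and `sweep` are hypothetical helpers introduced here.

```python
import random
import statistics
import time


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach every point to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def make_data(n_records, n_attributes, seed=0):
    """Synthetic stand-in data (assumption): uniform random values in [0, 1)."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(n_attributes)) for _ in range(n_records)]


def mean_time(cluster_fn, data, repeats=3):
    """Average wall-clock seconds over `repeats` runs of cluster_fn(data)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        cluster_fn(data)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def sweep(cluster_fn, cardinalities, dimensionalities, repeats=3):
    """Time cluster_fn over a grid; returns {(n_records, n_attributes): seconds}."""
    return {
        (n, d): mean_time(cluster_fn, make_data(n, d), repeats)
        for n in cardinalities
        for d in dimensionalities
    }


if __name__ == "__main__":
    run = lambda data: kmeans(data, k=3)
    # Vary cardinality while dimensionality stays constant.
    by_cardinality = sweep(run, [200, 400, 800], [4])
    # Vary dimensionality while cardinality stays constant.
    by_dimensionality = sweep(run, [400], [2, 4, 8])
    for key, secs in {**by_cardinality, **by_dimensionality}.items():
        print(key, f"{secs:.4f}s")
```

Averaging over several repeats dampens run-to-run noise in wall-clock time. Any other clustering routine that accepts the data as its single argument can be passed to `sweep` in place of the k-means stand-in.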