Abstract
1- Introduction
2- Related work
3- TFL strategy
4- HIL Strategy
5- Multiple GPUs and cost model
6- Performance evaluation
7- Summary
References
Abstract
Frequent itemset mining is widely used as a fundamental data mining technique. However, as the data size increases, the relatively slow performances of the existing methods hinder its applicability. Although many sequential frequent itemset mining methods have been proposed, there is a clear limit to the performance that can be achieved using a single thread. To overcome this limitation, various parallel methods using multi-core CPU, multiple machine, or many-core graphic processing unit (GPU) approaches have been proposed. However, these methods still have drawbacks, including relatively slow performance, data size limitations, and poor scalability due to workload skewness. In this paper, we propose a fast GPU-based frequent itemset mining method called GMiner for large-scale data. GMiner achieves very fast performance by fully exploiting the computational power of GPUs and is suitable for large-scale data. The method performs mining tasks in a counterintuitive way: it mines the patterns from the first level of the enumeration tree rather than storing and utilizing the patterns at the intermediate levels of the tree. This approach is quite effective in terms of both performance and memory use in the GPU architecture. In addition, GMiner solves the workload skewness problem from which the existing parallel methods suffer; as a result, its performance increases almost linearly as the number of GPUs increases. Through extensive experiments, we demonstrate that GMiner significantly outperforms other representative sequential and parallel methods in most cases, by orders of magnitude on the tested benchmarks.
Introduction
As a fundamental data mining technique, frequent itemset mining is widely used in a wide range of disciplines such as market basket analysis, web usage mining, social network analysis, intrusion detection, bioinformatics, and recommendation systems. However, the deluge of data generated by automated systems for diagnostic or analysis purposes makes it difficult or even impossible to apply mining techniques in many realworld applications. The existing methods often fail to find frequent itemsets in such big data within a reasonable amount of time. Thus, in terms of computational time, itemset mining is still a challenging problem that has not yet been completely solved. Many sequential frequent itemset mining methods such as Apriori [2], Eclat [35], FP-Growth [14], and LCM [30] use a single CPU thread. However, these singlethreaded applications all have a fundamental mining performance limit because CPU clock speed is generally no longer increasing. To overcome the single-thread performance limit, multiple parallel frequent itemset mining methods have been proposed. These methods can be categorized into three main groups: (1) (CPU-based) multithreaded methods, (2) distributed methods, and (3) graphic processing unit (GPU)- based methods. We omit the term ”multi-thread” from the GPU-based methods because they are obviously multi-threaded. The first group focuses on accelerating the performance of the single-threaded methods by exploiting multi-core CPUs [20, 24, 26, 27, 29], while the second group tries to accelerate the performance by exploiting multiple machines [12, 17, 19]. Details about these methods are available in recent survey studies.