Abstract
1. Introduction
2. Preliminaries
3. The Proposed Approach
4. Experimental Setup and Results
5. Conclusions
6. Acknowledgement
References
Abstract
In this paper, a big data analytic framework is introduced for processing high-frequency data streams. The framework architecture combines an advanced evolving learning algorithm, namely the Parsimonious Network based on Fuzzy Inference System (PANFIS), with MapReduce parallel computation, where PANFIS is capable of processing data streams in large volume. Big datasets are learnt chunk by chunk by processors in the MapReduce environment, and the results are fused by a rule-merging method that reduces the complexity of the rule base. Performance measurements show that the MapReduce framework combined with the PANFIS evolving system reduces processing time by around 22 percent on average in comparison with standalone PANFIS, without any loss of accuracy.
Introduction
The rapid growth of data generated through the Internet, which gives rise to big data, attracts great attention from many stakeholders. This phenomenon takes place in many real-life areas such as business, management, medicine, government, and public administration. Big data is unique in its 4V characteristics: volume, velocity, variety, and veracity. Volume relates to the amount of data held in storage, which is associated with the scale of the data. Velocity indicates the flow rate of continuously arriving data, which is associated with data streams. Variety is characterized by the number of different formats of the data, whereas veracity reflects the uncertainty of data, where the data sources need to be validated (trustworthiness, accuracy, and data quality). Big data provides enormous opportunities for governments and organizations to discover and extract valuable information and knowledge about their systems, which can benefit their decision-making processes. In the business area, for example, Wal-Mart collaborated with Hewlett-Packard to trace every purchase record belonging to their customers from point-of-sale terminals, where transactions reach around 267 million per day. These valuable transaction data become a key basis for the company to improve its profits through pricing strategies and advertising campaigns [10, 6]. In this case, decisions can be supported by techniques such as data mining, which is extensively used for decision-making problems in many real-life applications. However, discovering meaningful insights from big data is challenging due to its 4V characteristics, which create difficulties in data capture, data storage, data analysis, and data visualization [37, 6]. Therefore, advanced data mining techniques and technologies are highly necessary to process and analyze big data.
Big data is often stored in the cloud to support the extensibility and scalability of local storage, which relates to one characteristic of big data, namely volume. To extract valuable information from big data efficiently, there is an urgent demand to modify existing data mining techniques so that they scale to large datasets. This issue leads to the need to develop distributed or parallelized scenarios for processing big data. In addition, big data are also generated by the continuous arrival of new instances, either in batches or one by one, known as data streams, emerging from real-world applications [9, 33]. Therefore, it is necessary for machine learning algorithms to adapt to rapidly changing non-stationary data streams. Note that stream processing/mining in the web news domain has been conducted in [39] using eT2Class [31], which is able to handle streaming data [4]. This phenomenon triggers the development of evolving learning algorithms, which are able to learn big data continuously [13] by evolving their model to adjust to the shift and drift of big data patterns.
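The chunk-by-chunk learning and rule-merging scheme outlined above can be sketched in MapReduce terms: a map step learns local rules from each data chunk, and a reduce step fuses the resulting rule sets, merging similar rules to keep the rule base compact. The sketch below is a toy illustration only, not the PANFIS algorithm itself — the rule representation (one centre/spread/weight triple per rule), the `learn_chunk` and `merge_rules` helpers, and the similarity tolerance `tol` are all simplifying assumptions for exposition.

```python
from functools import reduce

# Toy rule: (centre, spread, weight). Real PANFIS rules carry full
# fuzzy premises and consequents; this is a placeholder structure.
def learn_chunk(chunk):
    """Map step: fit one local rule per chunk (toy: rule centre = chunk mean)."""
    centre = sum(chunk) / len(chunk)
    return [(centre, 1.0, len(chunk))]

def merge_rules(rules_a, rules_b, tol=0.5):
    """Reduce step: fuse rule sets, merging rules whose centres are close."""
    merged = list(rules_a)
    for c, s, w in rules_b:
        for i, (mc, ms, mw) in enumerate(merged):
            if abs(mc - c) <= tol:          # similar rules -> weighted merge
                total = mw + w
                merged[i] = ((mc * mw + c * w) / total, max(ms, s), total)
                break
        else:
            merged.append((c, s, w))        # distinct rule kept as-is
    return merged

# Simulated stream split into fixed-size chunks, as in the framework.
stream = [0.1, 0.2, 0.15, 3.0, 3.1, 2.9, 0.12, 0.18]
chunks = [stream[i:i + 2] for i in range(0, len(stream), 2)]
rule_base = reduce(merge_rules, map(learn_chunk, chunks))
```

In this sketch the four chunk-level rule sets collapse into three merged rules, mirroring how rule merging bounds the complexity of the fused model regardless of how many chunks the mappers process in parallel.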