Abstract
1- Introduction
2- Big data and social media mining
3- Bayesian networks: An introduction
4- Data integration methodology of social media with survey data
5- Case study: the airport passengers datasets
6- Application of the big data integration methodology
7- Discussion and conclusions
References
Abstract
In recent years, information has been generated in every sector at high speed, in huge amounts, and in a wide variety of forms and formats, on an unprecedented scale. The ability to harness big data is an opportunity to obtain more accurate analyses and to improve decision-making in industry, government and many other organizations. However, handling big data can be challenging, and proper data integration is a key dimension in achieving high information quality. In this paper, we propose a novel approach to data integration that calibrates online-generated big data with interview-based customer survey data. A common issue with customer surveys is that responses are often overly positive, making it difficult to identify areas of weakness in organizations. Online reviews, on the other hand, are often overly negative, hampering an accurate evaluation of areas of excellence. The proposed methodology calibrates the levels of unbalanced responses in the different data sources via resampling and performs data integration using Bayesian networks to propagate the new, rebalanced information. We show, with a case study, how this approach allows businesses and organizations to obtain a bias-corrected appraisal of the level of satisfaction of their customers. The application is based on the integration of online data from review blogs and customer satisfaction surveys from the San Francisco airport. We illustrate how this integration enhances the information quality of the data analytic work in four InfoQ dimensions, namely Data Structure, Data Integration, Temporal Relevance, and Chronology of Data and Goal.
Introduction
The growing availability of massive amounts of data in every sector, including business, government and health care, poses new analytic and statistical challenges. These data may come from many sources, such as posts on social media sites, digital pictures and videos, cell phone GPS signals, purchase transaction records and sensors used to gather climate information, to name a few. Such data are known as big data and are characterized by high volume, variety and velocity of collection. Large quantities of mostly unstructured information are generated by social media every minute. On the web, billions of individuals around the globe simultaneously produce, share and consume content generated by the users themselves. Through social media, people express their opinions and sentiments about specific topics, products and services, and the analysis of this information (called social media mining or sentiment analysis) may be key for organizations and businesses that want to monitor customer satisfaction, plan business initiatives or design new products and services. In recent years, the literature on big data analysis has advanced significantly. Among recent contributions to sentiment analysis, Stander, Dalla Valle, and Cortina Borja (2016a) and Stander, Dalla Valle, Eales, Baldino, and Cortina Borja (2016b) extracted Facebook data to analyze sentiment scores and voting patterns relating to the June 2016 EU referendum in the UK. Zhang, Fuehres, and Gloor (2011) used sentiment analysis techniques on Twitter data to predict stock market indicators. Asur and Huberman (2010) predicted box-office movie revenues by analyzing the sentiment of comments posted on social media.