Abstract
1- Introduction
2- Background and related work
3- Current practices in Big Data analytics
4- Semantic model
5- Validation
6- Discussions
7- Conclusions
References
Abstract
Knowledge extraction and incorporation are currently considered beneficial for efficient Big Data analytics. Knowledge can take part in workflow design, constraint definition, parameter selection and configuration, and human interaction and decision-making strategies. This paper proposes BIGOWL, an ontology to support knowledge management in Big Data analytics. BIGOWL is designed to cover a wide vocabulary of terms concerning Big Data analytics workflows, including their components and how they are connected, from data sources to the visualization of the analytics. It also takes into consideration aspects such as parameters, restrictions and formats. This ontology defines not only the taxonomic relationships between the different concepts, but also instances representing specific individuals to guide users in the design of Big Data analytics workflows. For testing purposes, two case studies are developed: first, real-world stream processing with Spark of traffic Open Data, for route optimization in the urban environment of New York City; and second, data mining classification of an academic dataset on local/cloud platforms. The analytics workflows resulting from the BIGOWL semantic model are validated and successfully evaluated.
Introduction
According to a recent Gartner report (footnote 2), an emerging challenge in Big Data is to construct data-driven intelligent applications that capture domain knowledge, including context, and inject it into the analytical processes in a standardized format. Context refers to all the relevant (meta)-information that supports the analysis and helps to interpret its results. This will facilitate standardized integration with third parties' data, algorithms, business intelligence (BI) and visualization services. The use of semantics as contextual information will enhance the analytical power of the algorithms, as well as the reuse of single components in data analytics workflows (Ristoski & Paulheim, 2016). Therefore, ways of making the domain knowledge explicit and usable are needed to improve data processing and analysis tasks. Semantic Web technologies can be used to annotate not only the knowledge domain of the data, but also the analytics' meta-data (Keet et al., 2015), including algorithms' parameters, input variables, tuning experiences, expected behaviors and taxonomies. This will facilitate the proper reuse and composition of Big Data analytics, and will enhance the quality of the data consumed and produced.
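As an illustration of how such annotations could support the composition of analytics components, the following minimal sketch stores meta-data as subject-predicate-object triples and checks whether two components can be connected. It is not the BIGOWL vocabulary: every identifier in it (the `ex:` prefix, `ex:producesFormat`, `ex:acceptsFormat`, `ex:hasParameter`, and the component names) is a hypothetical name chosen for this example, and plain Python tuples stand in for a real RDF store.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """A subject-predicate-object statement, as in RDF."""
    subject: str
    predicate: str
    obj: str


# Hypothetical annotations; none of these names are taken from BIGOWL.
ANNOTATIONS: Set[Triple] = {
    Triple("ex:CsvReader", "rdf:type", "ex:DataSource"),
    Triple("ex:CsvReader", "ex:producesFormat", "ex:Tabular"),
    Triple("ex:DecisionTree", "rdf:type", "ex:Algorithm"),
    Triple("ex:DecisionTree", "ex:acceptsFormat", "ex:Tabular"),
    Triple("ex:DecisionTree", "ex:hasParameter", "ex:maxDepth"),
    Triple("ex:GraphPlot", "rdf:type", "ex:Visualization"),
    Triple("ex:GraphPlot", "ex:acceptsFormat", "ex:Graph"),
}


def objects(subject: str, predicate: str) -> Set[str]:
    """Return every object asserted for the given subject and predicate."""
    return {
        t.obj
        for t in ANNOTATIONS
        if t.subject == subject and t.predicate == predicate
    }


def can_connect(producer: str, consumer: str) -> bool:
    """True if the producer emits at least one format the consumer accepts."""
    produced = objects(producer, "ex:producesFormat")
    accepted = objects(consumer, "ex:acceptsFormat")
    return bool(produced & accepted)


if __name__ == "__main__":
    print(can_connect("ex:CsvReader", "ex:DecisionTree"))
    print(can_connect("ex:CsvReader", "ex:GraphPlot"))
    print(sorted(objects("ex:DecisionTree", "ex:hasParameter")))
```

Under these assumptions, a workflow editor could reject the connection from the reader to the graph plot before execution, because the annotated formats do not match, and could list the parameters an algorithm expects when the user configures it.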