Abstract
This article describes the development of an automated configuration of a software platform for data analytics that supports horizontal and vertical elasticity in order to meet a specific deadline. It specifies all the components, software dependencies and configurations required to build the cluster, and analyses the deployment times of different instances, as well as the horizontal and vertical elasticity. The approach builds self-managed hybrid clusters that can deal with different workloads and network requirements. The article describes the structure of the recipes, points to the public repositories where the code is available, and discusses the limitations of the approach as well as the results of several experiments.
Introduction
The need for data analytics platforms has risen in recent years, in parallel with the increase in computing and data storage requirements, in order to tackle the challenges of data processing. Configuring and operating such platforms is not straightforward and requires non-trivial system administration skills. Data analytics platforms involve multiple components and resources, which must be appropriately linked and cross-configured. In addition, dealing with unpredictable workloads is an operationally complex task that requires dynamically readjusting the resources and reconfiguring them on the fly. To address these issues, this article presents a set of tools and configuration recipes for deploying a virtual self-managed cluster of computing nodes. The cluster can scale horizontally (in and out), by adding and removing computing resources and reconfiguring them according to the workload, and vertically (up and down), by dynamically readjusting the resources assigned to individual jobs to satisfy a given Quality of Service (QoS). This paper introduces the problem, the software architecture, the automatic deployment tools and recipes, the elasticity mechanism and the experiments, and discusses the results obtained.
The remainder of the paper is structured as follows. Section 2 examines the requirements of a data analytics platform and reviews the state of the art related to the work presented in the paper. Section 3 presents the proposed architecture of the platform used to perform data analytics and the mechanisms involved in elasticity management, together with a brief analysis of each component of the architecture. Section 4 describes the most relevant metrics obtained from the deployment of the self-managed virtual cluster and from the execution of several test cases that validate the horizontal and vertical elasticity. Section 5 discusses the main developments and improvements presented in this work in comparison with the state of the art. Finally, Section 6 summarizes the main results, concludes the paper and points to future work.
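The two elasticity dimensions introduced above can be illustrated with a minimal sketch. It is not the mechanism implemented in the platform: the `Job` fields, the function names and the ideal linear speed-up model are all assumptions made purely for illustration. Vertical elasticity is reduced here to computing how many CPUs a job needs to finish before its deadline, and horizontal elasticity to computing how many nodes are needed to host the aggregate demand.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    """Hypothetical job description; the fields are assumptions of this sketch."""
    remaining_work: float  # CPU-seconds of work still to be done
    seconds_left: float    # time remaining until the job's deadline


def cpus_needed(job: Job) -> float:
    """Vertical elasticity: CPUs required to finish the job by its deadline.

    Assumes ideal linear speed-up, which real workloads rarely achieve.
    """
    if job.seconds_left <= 0:
        raise ValueError("deadline already passed")
    return job.remaining_work / job.seconds_left


def nodes_needed(jobs: List[Job], cpus_per_node: int) -> int:
    """Horizontal elasticity: nodes required to host the total CPU demand."""
    total_cpus = sum(cpus_needed(job) for job in jobs)
    return math.ceil(total_cpus / cpus_per_node)
```

For example, a job with 1200 CPU-seconds left and 600 seconds to its deadline would need 2 CPUs under this model; comparing `nodes_needed` with the current cluster size would then indicate whether to scale out or in.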