Abstract
1. Introduction
2. The PreMiSE approach
3. Offline model training
4. Online failure prediction
5. Evaluation methodology
6. Experimental results
7. Related work
8. Conclusions
Declaration of Competing Interest
Acknowledgments
Appendix A. KPI List
References
Abstract
Many applications are implemented as multi-tier software systems and are executed on distributed infrastructures, such as cloud infrastructures, to benefit from the cost reduction that derives from dynamically allocating resources on demand. In these systems, failures are becoming the norm rather than the exception, and predicting their occurrence and locating the responsible faults are essential enablers of preventive and corrective actions that can mitigate the impact of failures and significantly improve the dependability of the systems. Current failure prediction approaches suffer from either false positives or limited accuracy, and do not produce enough information to effectively locate the responsible faults. In this paper, we present PreMiSE, a lightweight and precise approach to predict failures and locate the corresponding faults in multi-tier distributed systems. PreMiSE blends anomaly-based and signature-based techniques to identify multi-tier failures that affect performance indicators, with high precision and a low false positive rate. The experimental results that we obtained on a Cloud-based IP Multimedia Subsystem indicate that PreMiSE can indeed predict and locate possible failure occurrences with high precision and low overhead.
Introduction
Multi-tier distributed systems are composed of several distributed nodes organized in layered tiers. Each tier implements a set of conceptually homogeneous functionalities that provide services to the tier above in the layered structure, while using services from the tier below. The distributed computing infrastructure and the connections among the vertical and horizontal structures make multi-tier distributed systems extremely complex and difficult to understand, even for those who developed them. Indeed, runtime failures are becoming the norm rather than the exception in many multi-tier distributed systems, such as ultra large systems [1], systems of systems [2, 3], and cloud systems [4, 5, 6]. In these systems, failures become unavoidable due to both their characteristics and the adoption of commodity hardware. The characteristics that increase the chances of failures are the increasing size of the systems, the growing complexity of the system–environment interactions, the heterogeneity of the requirements, and the evolution of the operating environment. The adoption of low-quality commodity hardware is becoming common practice in many contexts, notably in cloud systems [7, 8], and further reduces the overall system reliability. Limiting the occurrences of runtime failures is extremely important in many common applications, where runtime failures and the consequent reduced dependability negatively affect the expectations and the loyalty of customers. It becomes a necessity in systems with strong dependability requirements, such as the telecommunication systems that telecom companies are migrating to cloud-based solutions [7].