CloudRanger: Root cause identification for cloud native systems
Abstract
As more and more systems are migrating to cloud environment, the cloud native system becomes a trend. This paper presents the challenges and implications when diagnosing root causes for cloud native systems by analyzing some real incidents occurred in IBM Bluemix (a large commercial cloud). To tackle these challenges, we propose CloudRanger, a novel system dedicated for cloud native systems. To make our system more general, we propose a dynamic causal relationship analysis approach to construct impact graphs amongst applications without given the topology. A heuristic investigation algorithm based on second-order random walk is proposed to identify the culprit services which are responsible for cloud incidents. Experimental results in both simulation environment and IBM Bluemix platform show that CloudRanger outperforms some state-of-the-art approaches with a 10% improvement in accuracy. It offers a fast identification of culprit services when an anomaly occurs. Moreover, this system can be deployed rapidly and easily in multiple kinds of cloud native systems without any predefined knowledge.