Localization of Operational Faults in Cloud Applications by Mining Causal Dependencies in Logs Using Golden Signals
Abstract
Cloud based microservice architecture has become a powerful mechanism in helping organizations to scale operations by accelerating the pace of change at minimal cost. With cloud based applications being accessed from diverse geographies, there is a need for round-the-clock monitoring of faults to prevent or to limit the impact of outages. Pinpointing source(s) of faults in cloud applications is a challenging problem due to complex interdependencies between applications, middleware, and hardware infrastructure all of which may be subject to frequent and dynamic updates. In this paper, we propose a light-weight fault localization technique, which can reduce human effort and dependency on domain knowledge for localizing observable operational faults. We model multivariate error-rate time series using minimal runtime logs to infer causal relationship among the golden signal errors (error rates) and micro-service errors to discover ranked list of possible faulty components. Our experimental results show that our system can localize operational faults with high accuracy (F1 = 88.4%) underscoring the effectiveness of using golden signal error rates in fault localization.