Online anomaly prediction for robust cluster systems
Abstract
In this paper, we present a stream-based mining algorithm for online anomaly prediction. Many real-world applications such as data stream analysis requires continuous cluster operation. Unfortunately, today 's large-scale cluster systems are still vulnerable to various software and hardware problems. System administrators are often overwhelmed by the tasks of correcting various system anomalies such as processing bottlenecks (i.e., full stream buffers), resource hot spots, and service level objective (SLO) violations. Our anomaly prediction scheme raises early alerts for impending system anomalies and suggests possible anomaly causes. Specifically, we employ Bayesian classification methods to capture different anomaly symptoms and infer anomaly causes. Markov models are introduced to capture the changing patterns of different measurement metrics. More importantly, our scheme combines Markov models and Bayesian classification methods to predict when a system anomaly will appear in the foreseeable future and what are the possible anomaly causes. To the best of our knowledge, our work provides the first stream-based mining algorithm for predicting system anomalies. We have implemented our approach within the IBM System S distributed stream processing cluster, and conducted case study experiments using fully implemented distributed data analysis applications processing real application workloads. Our experiments show that our approach efficiently predicts and diagnoses several bottleneck anomalies with high accuracy while imposing low overhead to the cluster system. © 2009 IEEE.