Problem determination in enterprise middleware systems using change point correlation of time series data
Abstract
Clustered enterprise middleware systems that employ dynamic workload scheduling are susceptible to a variety of application malfunctions that can manifest themselves in counterintuitive ways and cause debilitating damage. Until now, diagnosing problems in this domain has involved investigating log files and configuration settings, and has required in-depth knowledge of the middleware architecture and application design. This paper presents a method for problem determination that uses change point detection techniques together with problem signatures, where each signature is a combination of changes (or absence of changes) in different metrics. We implemented this approach on a clustered middleware system and applied it to the detection of the storm drain condition: a debilitating problem encountered in clustered systems with counterintuitive symptoms. Our experimental results show that the system detects 93% of storm drain faults with no false positives. © 2006 IEEE.
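The idea of matching a problem signature against change points in several metrics can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the sliding-window mean-shift detector stands in for whatever change point technique the authors use, and the metric names (`response_time`, `request_rate`, `cpu_utilization`), the expected directions in `ILLUSTRATIVE_SIGNATURE`, and the parameters `window`, `min_rel_change`, and `max_lag` are all hypothetical.

```python
from statistics import mean


def detect_change(series, window=5, min_rel_change=0.3):
    """Return (index, direction) of the strongest mean shift, or None.

    Stand-in detector (an assumption, not the paper's algorithm): for each
    candidate split, compare the mean of the `window` samples after it with
    the mean of the `window` samples before it, and keep the largest shift.
    """
    best_index, best_diff = None, 0.0
    for i in range(window, len(series) - window + 1):
        diff = mean(series[i:i + window]) - mean(series[i - window:i])
        if abs(diff) > abs(best_diff):
            best_index, best_diff = i, diff
    if best_index is None:
        return None  # perfectly flat series, or series too short
    baseline = abs(mean(series[best_index - window:best_index])) or 1e-9
    if abs(best_diff) / baseline < min_rel_change:
        return None  # shift too small relative to the prior level
    return best_index, ("up" if best_diff > 0 else "down")


# Hypothetical signature: expected change direction per metric.
# "none" encodes the absence of a change, which is also part of a signature.
ILLUSTRATIVE_SIGNATURE = {
    "response_time": "down",
    "request_rate": "up",
    "cpu_utilization": "none",
}


def matches_signature(metrics, signature, max_lag=3):
    """True if every metric shows its expected change (or lack of one) and
    all detected change points fall within `max_lag` samples of each other."""
    change_points = []
    for name, expected in signature.items():
        change = detect_change(metrics[name])
        if expected == "none":
            if change is not None:
                return False
            continue
        if change is None or change[1] != expected:
            return False
        change_points.append(change[0])
    return bool(change_points) and max(change_points) - min(change_points) <= max_lag


# Synthetic data for one server; values are made up for the example.
faulty = {
    "response_time": [100] * 10 + [20] * 10,
    "request_rate": [50] * 11 + [90] * 9,
    "cpu_utilization": [40, 41, 39, 40, 42, 40, 39, 41, 40, 40,
                        41, 39, 40, 40, 42, 41, 40, 39, 40, 41],
}
healthy = dict(faulty, response_time=[100] * 20)

print(matches_signature(faulty, ILLUSTRATIVE_SIGNATURE))
print(matches_signature(healthy, ILLUSTRATIVE_SIGNATURE))
```

Correlating change points in time, rather than thresholding each metric separately, is what lets a signature express a combination such as "this metric dropped while that one rose and a third stayed flat". The `max_lag` tolerance is one simple way to express that the changes should be roughly simultaneous.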