Anomaly Detection on IBM Z Mainframes: Performance Analysis and More
Abstract
Anomalous events can signal a variety of problems in any system. Robust, fast detection of anomalies is therefore important so that issues can be fixed before they cascade into larger problems. In this paper we focus on IBM Z mainframes, although most of the problems addressed and techniques used are broadly applicable. Anomalies can signal issues such as disk malfunctions, slow or unresponsive modules, crashes and latent bugs, lock contention, excessive retries, and the need to allocate more resources to reduce contention. Although there are specific techniques for addressing individual issues, anomaly detection is useful for its broad-spectrum utility and its ability to identify combinations of problems for which no specific approach may have been implemented. In addition, anomaly detection serves as a backstop: truly anomalous events suggest that normal mechanisms did not work.
Our input for detecting anomalies is low-level, summarized information available as time series to the zOS operating system. Although such information lacks some high-level context, it provides operating-system-level awareness that applies universally to any zOS system and any code running on it. The data is also quite rich, with 100 to 100,000 metrics per sample depending on how "metric" is defined. As might be expected, the data contains metrics such as CPU utilization, execution priorities, internal locking behavior, and bytes read and written by an executing process. It also contains higher-level information such as the names of executing processes or transaction identifiers from online transaction processing facilities. Names are useful not only in detecting anomalies, but also in conveying context to users trying to isolate and fix problems.
Our techniques build on KL divergence [21] and learn continuously without supervision and with low overhead. Continuous learning is important. The first instance of an aberrant behavior is an anomaly. The 10th instance probably is not. This point also illustrates the utility of anomaly detection in pinpointing root cause: early detection is essential, and broad-spectrum anomaly detection provides excellent capability to do just that.
This paper outlines these techniques and demonstrates their efficacy in detecting and resolving key problems.
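To make the idea of KL-divergence scoring with continuous learning concrete, the following is a minimal sketch and not the method described in this paper. It assumes each sample has been reduced to a histogram of counts over discrete buckets; the class name DriftDetector, the bucket labels, and the decay and threshold values are all hypothetical choices made for illustration. Each new sample is scored by its KL divergence from a running baseline and is then folded into that baseline, so a behavior that repeats scores lower each time it recurs.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for discrete distributions given as dicts of probabilities."""
    keys = set(p) | set(q)
    # Additive smoothing (an assumption of this sketch) avoids log(0)
    # and division by zero for buckets seen in only one distribution.
    ps = {k: p.get(k, 0.0) + eps for k in keys}
    qs = {k: q.get(k, 0.0) + eps for k in keys}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum(
        (ps[k] / zp) * math.log((ps[k] / zp) / (qs[k] / zq)) for k in keys
    )


class DriftDetector:
    """Hypothetical unsupervised detector with an exponentially decayed baseline."""

    def __init__(self, decay=0.5, threshold=0.5):
        self.decay = decay          # weight kept by the old baseline
        self.threshold = threshold  # score above which a sample is flagged
        self.baseline = {}

    def observe(self, counts):
        """Score one sample (dict of bucket -> count), then learn from it."""
        total = sum(counts.values())
        if total == 0:
            return 0.0, False
        current = {k: v / total for k, v in counts.items()}
        if not self.baseline:
            # The first sample only initializes the baseline.
            self.baseline = dict(current)
            return 0.0, False
        score = kl_divergence(current, self.baseline)
        # Continuous learning: blend the new sample into the baseline.
        keys = set(current) | set(self.baseline)
        self.baseline = {
            k: self.decay * self.baseline.get(k, 0.0)
            + (1.0 - self.decay) * current.get(k, 0.0)
            for k in keys
        }
        return score, score > self.threshold


if __name__ == "__main__":
    detector = DriftDetector()
    normal = {"cpu_low": 90, "cpu_high": 10}
    aberrant = {"cpu_low": 10, "cpu_high": 90}
    for sample in (normal, normal, aberrant, aberrant):
        print(detector.observe(sample))
```

In this sketch the first aberrant sample is flagged, while an identical sample arriving immediately afterward scores lower because the baseline has already absorbed part of the new behavior. A production detector would need to choose the decay rate and threshold per metric; the values here are arbitrary.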