Detection, Diagnosis and Remediation for IT Incidents powered by Generative AI
Abstract
The rapidly growing complexity of modern IT in multi-cloud environments poses unprecedented management challenges for Site Reliability Engineers (SREs), who must meet Service Level Objectives (SLOs) and keep systems running effectively. To put this in perspective, an availability SLO of 99.99% allows for only about 4.3 minutes of downtime per month, which is hardly attainable by simply reacting to incidents. In this demo, we introduce our approach to this challenge: transforming ITOps from reactive to proactive by leveraging large language models and advanced AI capabilities. The main goal of our work is to automate, as much as possible, the implementation of resolutions for emerging IT issues before they turn into outages. Our demo consists of four steps: (1) Issue Detection, where we have developed an unsupervised methodology for detecting issues via an ensemble of anomaly detectors, and compare our methods with the state-of-the-art techniques implemented in the Salesforce Merlion library; (2) Issue Diagnosis, where we have developed a language-model-based log data representation and built an AI system for probable cause identification using novel causal analysis and reinforcement learning, complemented with LLM-based summarization techniques that ease consumption of diagnosis results by SREs and by downstream issue resolution analytics, and compare our methods with the state-of-the-art techniques implemented in the Salesforce PyRCA library; (3) Action Recommendation, which leverages state-of-the-art generative AI techniques to produce actionable recommendations; and (4) Automation, where action recommendation outputs are transformed into code that can be executed to resolve the incidents.
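The downtime figure quoted for a 99.99% availability SLO can be checked with a short calculation. The sketch below is illustrative only and is not part of the demo system; the function name `downtime_budget_minutes` and the 30-day month are assumptions made for this example.

```python
def downtime_budget_minutes(slo: float, days: float = 30.0) -> float:
    """Allowed downtime in minutes for an availability SLO over `days` days.

    `slo` is a fraction (0.9999 for 99.99%). The 30-day default is an
    assumption; a different window changes the budget proportionally.
    """
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    total_minutes = days * 24 * 60  # minutes in the measurement window
    return (1.0 - slo) * total_minutes


if __name__ == "__main__":
    # Compare the error budget across a few common availability targets.
    for slo in (0.999, 0.9999, 0.99999):
        print(f"{slo:.3%} availability: {downtime_budget_minutes(slo):.2f} min per 30 days")
```

For 99.99% over 30 days this gives roughly 4.3 minutes, matching the figure above. Each additional "nine" shrinks the budget tenfold, which is why purely reactive incident handling becomes impractical at high availability targets.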