Log Anomaly to Resolution: AI Based Proactive Incident Remediation
Abstract
According to the 2020 SRE report, 80% of SREs work on postmortem analysis of incidents due to a lack of provided information, and 16% of toil comes from investigating false positives/negatives. As a cloud service provider, we want to proactively identify signals that can help reduce outages and/or reduce the mean time to resolution. By leveraging AI for Operations (AIOps), this work proposes a novel methodology for proactively identifying log anomalies and their resolutions by sifting through log lines. Typically, the information needed to retrieve resolutions corresponding to logs is spread across multiple heterogeneous corpora that exist in silos, such as historical ticket data, historical log data, and symptom resolution available in product documentation. In this paper, we focus on augmented dataset preparation from multiple heterogeneous corpora, metadata selection and prediction, and, finally, using these elements at run-time to retrieve contextual resolutions for signals triggered via logs. For early evaluation, we used logs from a production middleware application server, predicted log anomalies and their resolutions, and conducted a qualitative evaluation with subject matter experts; the accuracies of metadata prediction and resolution retrieval are 78.57% and 65.7%, respectively.