AIOps Essential to Unified Resiliency Management in Data Lakehouses
Abstract
AIOps can provide essential value for data lakehouses, which pose complex operational challenges for Site Reliability Engineers (SREs). This paper proposes that the unified approach of data lakehouses creates a unique opportunity for unified data resiliency management. We focus on AIOps applied to disaster recovery and backup/restore, and in particular on managing data lakehouse hardware resources to ensure that the Recovery Point Objectives (RPOs) of lakehouse data are met with a high degree of accuracy. The goal is to warn an SRE of an impending RPO violation and to suggest how much additional hardware to add, and by what time, to avoid that violation. We claim that AIOps can achieve this goal with an ensemble of machine learning and time series analysis.