Predicting DRAM reliability in the field with machine learning
Abstract
Uncorrectable errors in dynamic random access memory (DRAM) are a common form of hardware failure in server clusters. Such failures are costly in terms of both hardware replacement and service disruption. While a large body of work analyzes DRAM reliability in large production clusters, little has been reported on automatically predicting such errors ahead of time. In this paper, we present a highly accurate predictive model built from daily event logs and sensor measurements collected since 2014 across a large fleet of commodity servers. By correlating correctable errors with sensor metrics, we use ensemble machine learning techniques to predict uncorrectable errors weeks in advance. In addition, we show how such models can be deployed in production and consumed by customer support teams. Our goal is to minimize false positives, since healthy DRAMs should not be replaced, while accounting for common limitations such as missing data points and the rarity of uncorrectable errors.
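To make the false-positive goal concrete, the following minimal sketch shows one common way to turn a model's risk scores into replacement decisions: choose the lowest score threshold that still meets a required precision, so that as many failing DRAMs as possible are flagged without exceeding the tolerated share of healthy ones. It is an illustration only, not the method used in this paper; the function name `threshold_for_precision`, the example scores, and the precision targets are all hypothetical.

```python
from typing import Optional, Sequence


def threshold_for_precision(
    scores: Sequence[float],
    labels: Sequence[int],
    min_precision: float,
) -> Optional[float]:
    """Return the lowest threshold t such that flagging every DRAM with
    score >= t yields precision >= min_precision, or None if no threshold does.

    scores: hypothetical per-DRAM risk scores from some predictive model.
    labels: 1 if the DRAM later had an uncorrectable error, else 0.
    """
    # Walk candidates from the highest score down, tracking running counts.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = false_pos = 0
    best: Optional[float] = None
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Consume all items tied at this score so the threshold is well defined.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        # Keep lowering the threshold as long as precision stays acceptable.
        if true_pos > 0 and true_pos / (true_pos + false_pos) >= min_precision:
            best = current
    return best


# Toy data (made up): six DRAMs, three of which later failed.
example_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
example_labels = [1, 1, 0, 1, 0, 0]
print(threshold_for_precision(example_scores, example_labels, 0.75))
```

Raising `min_precision` pushes the threshold up and flags fewer DRAMs, which trades missed failures for fewer unnecessary replacements. Because uncorrectable errors are rare, a threshold of this kind would normally be chosen on held-out data rather than on the data used to train the model.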