Characterization and exploration of latch checkers for efficient RAS protection
Abstract
Reliability has been, and continues to be, a key consideration in the design of the IBM Z mainframe processors, and has resulted in industry-leading performance with little-to-no downtime. In this paper, we analyze the various hardware reliability mechanisms that make the processor resilient to transient errors, and the checker architecture that enables their detection and correction. We characterize the error-checking logic in the processor based on a detailed analysis of the actual design. Based on hardware measurements on a real Z processor, we then determine which error checkers are critical from a timing standpoint when the supply voltage is scaled. We propose algorithms that optimize checker selection without affecting RAS coverage or the detection of errors induced by both SER and voltage scaling. Finally, we examine further potential optimizations of checkers based on logic utilization in representative benchmarks.