Predictive Guardbanding: Program-Driven Timing Margin Reduction for GPUs
Abstract
The energy efficiency of GPU architectures has emerged as an essential aspect of computer system design. In this article, we explore the energy benefits of reducing the GPU chip's voltage to the safe limit, i.e., the V_{\min} point, using predictive software techniques. We perform such a study on several commercial off-the-shelf GPU cards. We find that there is an approximately 20% voltage guardband on those GPUs spanning two architectural generations, which, if 'eliminated' entirely, can yield up to 25% energy savings on one of the studied GPU cards. Our measurement results reveal program-dependent V_{\min} behavior across the studied applications, and the exact magnitude of the improvement depends on the program's available guardband. We make fundamental observations about this program-dependent V_{\min} behavior. We experimentally determine that voltage noise has a more substantial impact on V_{\min} than process and temperature variation, and that activity during kernel execution causes large voltage droops. From these findings, we show how to use a kernel's microarchitectural performance counters to predict its V_{\min} value accurately. The average and maximum prediction errors are 0.5% and 3%, respectively. Accurate V_{\min} prediction opens up new possibilities for a cross-layer dynamic guardbanding scheme for GPUs, in which software predicts and manages the voltage guardband, while functional correctness is ensured by a hardware safety-net mechanism.
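To make the counter-based prediction idea concrete, the following is a minimal sketch (not the authors' actual model): it fits a simple linear mapping from per-kernel performance-counter values to measured V_{\min} and uses it to predict V_{\min} for a new kernel. The counter features and numeric values are hypothetical placeholders.

```python
# Minimal sketch: predicting a kernel's V_min from microarchitectural
# performance counters with a least-squares linear fit.
# The specific counters and data below are assumptions, not the paper's.
import numpy as np

# Each row: per-kernel counter values (e.g., normalized issue rate,
# memory-access intensity, occupancy) -- hypothetical features.
counters = np.array([
    [0.82, 0.31, 0.75],
    [0.45, 0.66, 0.50],
    [0.91, 0.12, 0.88],
    [0.30, 0.80, 0.40],
])
# Measured V_min per kernel, in volts (hypothetical values).
vmin_measured = np.array([0.86, 0.91, 0.84, 0.93])

# Fit V_min ~ w . counters + b via least squares (bias column appended).
X = np.hstack([counters, np.ones((counters.shape[0], 1))])
coef, *_ = np.linalg.lstsq(X, vmin_measured, rcond=None)

def predict_vmin(kernel_counters):
    """Predict V_min for a new kernel from its counter vector."""
    return float(np.dot(np.append(kernel_counters, 1.0), coef))

print(predict_vmin([0.70, 0.40, 0.65]))
```

In a cross-layer scheme as described above, such a predicted V_{\min} would set the operating voltage, with a hardware safety net catching any residual margin violations.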