Kenneth L. Clarkson, Elad Hazan, et al.
Journal of the ACM
Energy efficiency is a pressing concern in today’s cloud datacenters. Various power management strategies, such as oversubscription, power capping, and dynamic voltage and frequency scaling, have been proposed and are used by datacenter operators to better control power consumption at any management unit (e.g., node level or rack level) without exceeding power budgets. In addition, by gaining more control over different management units within a datacenter (or across datacenters), operators can shift energy consumption either spatially or temporally to optimize their carbon footprint based on the spatio-temporal patterns of carbon intensity. The drive for automation has led to the exploration of learning-based resource management approaches. In this work, we first systematically investigate the impact of power capping on datacenter workloads and on learning-based resource management solutions (specifically, reinforcement learning, or RL). We show that power capping leads to an 18% degradation in resource management effectiveness (as defined by an RL reward function) and thus 50% higher application latency. We then propose PALM, an adaptive resource allocation framework that provides a graceful, performance-preserving transition under power capping for latency-critical workloads. Evaluation results show that PALM achieves a 10.2–99.3% improvement in SLO preservation under power capping while saving 3.1–5.8% utilization.
Yuankai Luo, Veronika Thost, et al.
NeurIPS 2023
Aditya Malik, Nalini Ratha, et al.
CAI 2024
Stephen Obonyo, Isaiah Onando Mulang’, et al.
NeurIPS 2023