Modelling Performance and Energy Efficiency of AI and HPC Workloads in Heterogeneous Environments
Abstract
Rapid advances in artificial intelligence (AI) rely on heterogeneous high-performance computing (HPC) resources. HPC systems are expensive and power-hungry, and are typically used for both AI and traditional HPC workloads. Being able to model and predict the performance and energy efficiency of these workloads in such complex environments is essential for the co-design of cost-efficient and sustainable applications and systems. We present our ongoing work on modelling the performance and energy efficiency of AI and HPC workloads in heterogeneous environments using a data-driven approach. We are developing a software toolkit that collects runtime performance and power consumption metrics of a workload, combines them with information about system and application hyperparameters, and uses a deep learning (DL) regression model to predict and optimize the workload's performance and energy efficiency. The toolkit is based on several open-source technologies and is designed for deployment on hybrid cloud. It offers several unique capabilities compared to other efforts in the area. Some preliminary results on modelling the performance and energy efficiency of HPC and AI workloads are included.