Optimizing GPU Multiplexing for Efficient and Cost-Effective Access to Diverse Large Language Models in GPU Clusters
Abstract
Large Language Models (LLMs) are a cornerstone of modern artificial intelligence research, and their adoption is growing across diverse domains. Growing interest among researchers drives the need for a dedicated GPU cluster that serves an extensive and varied collection of LLMs for experimental exploration and innovation in AI research. This requirement poses significant challenges for GPU multiplexing. The primary problem is keeping a wide variety of fast-changing LLMs accessible to researchers while minimizing GPU cost and idle time. Fluctuating model popularity complicates this further: at any given time only a few models are in high demand, which leaves expensive GPU resources underutilized. To address this challenge, this paper presents a methodology that profiles the performance of various LLMs across different Multi-Instance GPU (MIG) slice configurations under varying loads. By analyzing the trade-offs between GPU utilization and LLM inference performance on a vLLM server, we identify the optimal MIG partitioning strategies that adapt dynamically to changing model popularity and usage patterns while retaining the latency SLA (50 ms/token), thereby reducing GPU cost by 50% and improving energy efficiency by 35% in our GPU cluster.
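The selection step implied by this methodology can be illustrated with a minimal sketch: given profiled latency per model, MIG profile, and load level, pick the smallest slice that still meets the 50 ms/token SLA. This is not the paper's implementation. The `ProfilePoint` record, the function name, the A100-40GB-style profile names, and every latency figure below are hypothetical placeholders chosen for illustration, not measurements from this work. The sketch also assumes latency does not decrease as load increases, so a slice that meets the SLA at a higher measured load is taken to meet it at a lower one.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

SLA_MS_PER_TOKEN = 50.0  # latency SLA stated in the abstract

# Compute slices (out of 7) for A100-40GB-style MIG profiles.
# The GPU model is an assumption; the abstract does not name one.
MIG_SLICES = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3, "4g.20gb": 4, "7g.40gb": 7}


@dataclass(frozen=True)
class ProfilePoint:
    """One hypothetical profiling measurement."""
    model: str
    mig_profile: str
    load_rps: float       # offered load during the measurement
    ms_per_token: float   # observed per-token latency


def smallest_profile_meeting_sla(
    points: Iterable[ProfilePoint],
    model: str,
    load_rps: float,
    sla: float = SLA_MS_PER_TOKEN,
) -> Optional[str]:
    """Return the MIG profile with the fewest compute slices that met the SLA
    at a measured load >= load_rps, or None if no measurement qualifies."""
    candidates = [
        p for p in points
        if p.model == model and p.load_rps >= load_rps and p.ms_per_token <= sla
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: MIG_SLICES[p.mig_profile]).mig_profile


# Placeholder data: illustrative numbers only, not results from the paper.
points = [
    ProfilePoint("model-a", "1g.5gb", 1.0, 42.0),
    ProfilePoint("model-a", "1g.5gb", 4.0, 95.0),
    ProfilePoint("model-a", "2g.10gb", 4.0, 47.0),
    ProfilePoint("model-a", "7g.40gb", 4.0, 18.0),
]

low_demand = smallest_profile_meeting_sla(points, "model-a", 1.0)
high_demand = smallest_profile_meeting_sla(points, "model-a", 4.0)
unprofiled = smallest_profile_meeting_sla(points, "model-a", 8.0)
```

With these placeholder inputs, a rarely used model stays on the smallest slice and moves to a larger one only when its load rises, which is the behavior that lets idle capacity be reclaimed for other models.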