GPU OPTIMIZATIONS FOR EFFICIENT AND COST-EFFECTIVE ACCESS TO DIVERSE LARGE LANGUAGE MODELS IN A RESEARCH CLUSTER
Abstract
Large Language Models (LLMs) are a cornerstone of modern artificial intelligence research, gaining popularity and seeing adoption across diverse domains. The burgeoning interest among researchers in experimenting with these models drives the need for a dedicated GPU cluster that serves an extensive and varied collection of LLMs and is designed to facilitate experimental exploration and innovation in AI research. However, this requirement poses significant challenges in GPU resource optimization. The primary problem is ensuring that a wide variety of fast-changing LLMs remain accessible to researchers while minimizing GPU costs and reducing idle time. The fluctuating popularity of models further complicates this: at any given time, only a few models may be in high demand, leading to under-utilization of expensive GPU resources. To address this challenge, this paper explores spatial GPU multiplexing and temporal GPU sharing to improve GPU resource efficiency. On the spatial side, we find that dynamically adapting MIG (Multi-Instance GPU) partitions to changing model popularity and usage patterns can maintain a latency Service Level Agreement (SLA) of 50 ms/token while reducing GPU costs by 50\% and increasing energy efficiency by 35\% in our research cluster. On the temporal side, we investigate GPU time-sharing and model-swapping techniques that allow multiple LLMs to share the same GPU: dynamic model swapping combined with intelligent request routing serves more than ten models on a single GPU. This approach further minimizes costs and maximizes GPU utilization by allocating resources to models only when they are needed, considerably reducing GPU idle time.
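To make the spatial-multiplexing idea concrete, the sketch below shows one way a controller could repartition an A100-class GPU as model popularity shifts, shelling out to the `nvidia-smi mig` commands that create and destroy MIG instances. This is a minimal sketch, not the system evaluated in the paper: the profile IDs follow NVIDIA's documented A100 40GB layout (19 = 1g.5gb, 14 = 2g.10gb, 9 = 3g.20gb), while the request-rate threshold, model names, and layout policy in `choose_layout` are illustrative assumptions.

```python
import subprocess
from collections import Counter

# MIG profile IDs for an NVIDIA A100 40GB, as listed by `nvidia-smi mig -lgip`.
PROFILE_SMALL = 19   # 1g.5gb
PROFILE_MEDIUM = 14  # 2g.10gb
PROFILE_LARGE = 9    # 3g.20gb

HOT_THRESHOLD = 10.0  # requests/sec above which a model counts as "hot" (assumed value)

def choose_layout(request_rates: Counter) -> list[int]:
    """Map observed per-model demand to a MIG partition layout.

    Illustrative policy: hot models get larger slices; otherwise the GPU
    is split into many small slices to serve the long tail of models.
    """
    hot = [m for m, r in request_rates.items() if r > HOT_THRESHOLD]
    if len(hot) >= 2:
        # Two or more busy models: medium slices plus small slices for the tail.
        return [PROFILE_MEDIUM, PROFILE_MEDIUM,
                PROFILE_SMALL, PROFILE_SMALL, PROFILE_SMALL]
    if len(hot) == 1:
        # One dominant model: a large slice plus small slices for the tail.
        return [PROFILE_LARGE,
                PROFILE_SMALL, PROFILE_SMALL, PROFILE_SMALL, PROFILE_SMALL]
    # No hot model: maximize the number of concurrently resident models.
    return [PROFILE_SMALL] * 7

def repartition(gpu_id: int, layout: list[int]) -> None:
    """Tear down and recreate MIG instances on one GPU.

    Assumes MIG mode is enabled, no workloads are running on the GPU,
    and the caller has administrative privileges.
    """
    # Destroy existing compute instances, then GPU instances.
    # check=False tolerates a GPU that is already empty.
    subprocess.run(["nvidia-smi", "mig", "-i", str(gpu_id), "-dci"], check=False)
    subprocess.run(["nvidia-smi", "mig", "-i", str(gpu_id), "-dgi"], check=False)
    # Create the new GPU instances with default compute instances (-C).
    profiles = ",".join(str(p) for p in layout)
    subprocess.run(["nvidia-smi", "mig", "-i", str(gpu_id),
                    "-cgi", profiles, "-C"], check=True)

# Example: per-model request rates as sampled from a serving frontend (values assumed).
rates = Counter({"llama-3-8b": 14.2, "mistral-7b": 0.8, "gemma-2b": 0.3})
repartition(gpu_id=0, layout=choose_layout(rates))
```

In practice a controller like this would first drain in-flight requests from the affected slices, since MIG instances cannot be destroyed while processes still hold them; the request router described above provides a natural place to pause and redirect traffic during repartitioning.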