EuroSys 2024
Workshop paper
Towards Pareto Optimal Throughput in Small Language Model Serving
Abstract
Large language models (LLMs) have revolutionized the state of the art across many natural language processing tasks, showing impressive zero-shot and few-shot capabilities in a wide range of applications. Although deploying language models remains computationally and memory-intensive, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who can now serve small models with state-of-the-art performance. SLMs also introduce a unique scenario in which a single accelerator can meet the memory requirements of storing large batches. Increasing the batch size has previously been associated with compute-bound scenarios, but there is little experimental support for this intuition, primarily because prior work has focused on LLMs, where large batch sizes are rarely reached. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of both performance and energy. Our analysis offers a new perspective on serving and opens new doors for multi-model scheduling. Additionally, we provide a first set of results on how model replication can effectively improve resource utilization.
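To make the abstract's central measurement concrete, below is a minimal sketch (not the paper's benchmark harness) of how one might measure decode throughput of a small language model across batch sizes, which is the kind of experiment that reveals where serving shifts from memory-bound to compute-bound. It assumes PyTorch and Hugging Face `transformers`; `"gpt2"` is a stand-in for an arbitrary SLM, and the batch sizes are illustrative.

```python
# Minimal throughput sketch: tokens/s of a small causal LM vs. batch size.
# Assumptions: PyTorch + Hugging Face transformers installed; "gpt2" is a
# placeholder SLM, not the model used in the paper.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small language model
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # required for batched inputs
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

prompt = "Small language models can be served efficiently because"
new_tokens = 64

for batch_size in (1, 4, 16, 64):
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt",
                       padding=True).to(device)
    with torch.no_grad():
        # Warm-up run to exclude one-time allocation costs.
        model.generate(**inputs, max_new_tokens=8,
                       pad_token_id=tokenizer.eos_token_id)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=new_tokens,
                       pad_token_id=tokenizer.eos_token_id)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    # Aggregate throughput grows with batch size until compute saturates.
    print(f"batch={batch_size:3d}  "
          f"throughput={batch_size * new_tokens / elapsed:8.1f} tok/s")
```

On an accelerator with enough memory for the largest batch, throughput per batch-size step typically grows nearly linearly while decoding is memory-bandwidth-bound and flattens once the workload becomes compute-bound, which is the regime boundary the paper's experiments probe for SLMs.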