Towards Pareto Optimal Throughput in Small Language Model Serving

Pol G. Recasens; Yue Zhu; Chen Wang; Eun Kyung Lee; Olivier Tardieu; Alaa Youssef; Jordi Torres; Josep Ll. Berral

doi:10.1145/3642970.3655832

EuroMLSys 2024

Conference paper

22 Apr 2024

Abstract

Large language models (LLMs) have advanced the state of the art in many natural language processing tasks. Although serving LLMs is computationally and memory intensive, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who can now serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of performance and energy. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs allows them to reach Pareto-optimal throughput within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization when serving SLMs.
