Achieving Platform Portability for vLLM by Using Triton Autotuning and Remembering It
Abstract
vLLM (https://github.com/vllm-project/vllm) aims to become the de-facto industry standard for serving Large Language Models. It is increasingly being adopted in production and can be executed on NVIDIA GPUs, AMD GPUs, and custom accelerators such as AWS Inferentia. However, vLLM's state-of-the-art performance largely depends on a number of hand-written CUDA kernels. These kernels have typically been carefully optimized for a specific GPU platform and may pose a serious obstacle to the portability of vLLM across different hardware.

OpenAI Triton (https://github.com/triton-lang/triton) recently emerged as a promising open-source alternative to writing custom CUDA kernels. It allows GPU kernels to be written in simple Python code, and Triton kernels have been shown to be both highly performant and portable across different GPU platforms. For this reason, Triton is growing in popularity, and vLLM already includes several kernels written in Triton. Triton ships with a built-in autotuner, which is crucial for performance portability. However, the autotuner adds significant overhead to kernel launches, on top of the just-in-time compilation: for every variation in the kernel parameters, the autotuner has to determine which kernel configuration performs best. The resulting high variance in latency is unacceptable for serving applications in production. Consequently, the Triton autotuner is usually not used in vLLM today. Yet without it, portability is limited, because the performance of a Triton kernel can differ by more than an order of magnitude across platforms.

To solve this problem, we have developed a "dejavu" mechanism for the Triton autotuner. Our goal was to let the autotuner "remember" kernel executions that happened before the lifetime of the current deployment. This dejavu mechanism reduces the runtime overhead of the Triton autotuner to zero and therefore enables its use in production. In addition, it allowed us to develop smarter algorithms for exploring the space of possible autotuner configurations, leading to even better performance. Our early results show that using Triton with our dejavu autotuner (1) yields speed-ups of more than 100% for some kernels, (2) enables competitive performance on different platforms using the same code, and (3) reduces the external dependencies of vLLM, further future-proofing the project. This talk will also include a demo of some Triton-only vLLM deployments.
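To make the autotuning step concrete, below is a minimal, illustrative Triton kernel that uses the built-in autotuner. The kernel, the configuration values, and the tuning key are assumptions chosen for illustration; they are not taken from vLLM.

# Minimal sketch of Triton's built-in autotuner (illustrative, not from vLLM).
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 128}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 256}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 1024}, num_warps=8),
    ],
    key=['n_elements'],  # re-benchmark whenever this argument changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

Each time n_elements takes a value the autotuner has not seen, all listed configurations are benchmarked before the first result is returned, which is the source of the launch-time latency variance described above.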
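One way to picture the dejavu idea, without claiming to reproduce our actual implementation, is a small persistent cache that records the winning configuration for each tuning key and is consulted at the start of the next deployment, so no benchmarking has to happen at serving time. The cache file, environment variable, and helper functions below are hypothetical.

# Hypothetical sketch of persisting autotuning results across deployments.
import json
import os

CACHE_FILE = os.environ.get("TRITON_DEJAVU_CACHE", "triton_dejavu.json")

def _load_cache():
    # Load the persisted tuning results, or start empty on first run.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def lookup(kernel_name, key):
    # key: the tuple of argument values the autotuner keys on, e.g. (n_elements,)
    return _load_cache().get(kernel_name, {}).get(str(key))

def record(kernel_name, key, best_config):
    # best_config: a serializable description of the winning configuration
    cache = _load_cache()
    cache.setdefault(kernel_name, {})[str(key)] = best_config
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f, indent=2)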