Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024
Vela is a cloud-native system designed for LLM training workloads, built from off-the-shelf hardware, Linux KVM-based virtualization, and a virtualized RDMA over Converged Ethernet (RoCE) network. Vela virtual machines (VMs) support peer-to-peer DMA between the GPUs and the SR-IOV-based network interfaces.
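To make the peer-to-peer DMA capability above concrete, the sketch below probes which GPU pairs on a node (or inside a VM) can reach each other directly and enables the path where supported. It uses only standard CUDA runtime calls (cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess); it is an illustrative probe, not code from the Vela paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative sketch (not from the paper): enumerate GPU pairs visible to
// this process and report whether peer-to-peer DMA is available between them,
// enabling it where it is. Inside a Vela-style VM, P2P support between
// passed-through GPUs is the capability the abstract describes.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, src, dst);
            printf("GPU %d -> GPU %d: P2P %s\n",
                   src, dst, ok ? "supported" : "unsupported");
            if (ok) {
                cudaSetDevice(src);                 // P2P is enabled per source device
                cudaDeviceEnablePeerAccess(dst, 0); // flags must be 0
            }
        }
    }
    return 0;
}
```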
In this paper, we share Vela's key architectural aspects, with details from an NVIDIA A100 GPU-based deployment in one of our data centers. Throughout the paper, we share insights and experiences from designing, building, and operating the system over a ~2.5-year timeframe, highlighting the capabilities of readily available software and hardware technologies as well as improvement opportunities for future AI systems, thereby making AI infrastructure more accessible to a broader community. When we evaluated the system's performance at ~1500-GPU scale, we achieved ~80% of the ideal throughput while training a 50-billion-parameter decoder model with model parallelism, and ~70% of the per-GPU FLOPS of a single VM on the High-Performance Linpack benchmark.
Deming Chen, Alaa Youssef, et al.
arXiv
Jose Manuel Bernabé Murcia, Eduardo Canovas Martinez, et al.
MobiSec 2024
Sahil Suneja, Yufan Zhuang, et al.
ACM TOSEM