Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024
Vela is a cloud-native system designed for LLM training workloads, built from off-the-shelf hardware, Linux KVM-based virtualization, and a virtualized RDMA over Converged Ethernet (RoCE) network. Vela virtual machines (VMs) support peer-to-peer DMA between the GPUs and the SR-IOV-based network interfaces.
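To make the peer-to-peer DMA capability above concrete, the sketch below probes which GPU pairs on a node (or inside a VM) can reach each other directly and enables the path where supported. It uses only standard CUDA runtime calls (cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess); it is an illustrative probe, not code from the Vela paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative sketch (not from the paper): enumerate GPU pairs visible to
// this process and report whether peer-to-peer DMA is available between them,
// enabling it where it is. Inside a Vela-style VM, P2P support between
// passed-through GPUs is the capability the abstract describes.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, src, dst);
            printf("GPU %d -> GPU %d: P2P %s\n",
                   src, dst, ok ? "supported" : "unsupported");
            if (ok) {
                cudaSetDevice(src);                 // P2P is enabled per source device
                cudaDeviceEnablePeerAccess(dst, 0); // flags must be 0
            }
        }
    }
    return 0;
}
```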
In this paper, we share Vela's key architectural aspects, with details from an NVIDIA A100 GPU-based deployment in one of our data centers. Throughout the paper, we share insights and experiences from designing, building, and operating the system over a ~2.5-year timeframe, highlighting the capabilities of readily available software and hardware technologies as well as improvement opportunities for future AI systems, thereby making AI infrastructure more accessible to a broader community. When we evaluated the system's performance at ~1500-GPU scale, we achieved ~80% of the ideal throughput while training a 50-billion-parameter decoder model with model parallelism, and ~70% of the per-GPU FLOPS of a single VM on the High-Performance Linpack benchmark.
Deming Chen, Alaa Youssef, et al.
arXiv
Jose Manuel Bernabé Murcia, Eduardo Canovas Martinez, et al.
MobiSec 2024
Sahil Suneja, Yufan Zhuang, et al.
ACM TOSEM