Publication
ISCA 2023
Workshop paper
To virtualize or not to virtualize AI Infrastructure: A perspective
Abstract
Modern data-driven applications (such as AI training and inference) are powered by Artificial Intelligence (AI) infrastructure. AI infrastructure is often available as bare-metal machines (BMs) in on-premise clusters but as virtual machines (VMs) in most public clouds. Why does this dichotomy of BMs on-prem and VMs in public clouds exist? What would it take to deploy VMs on AI systems while delivering bare-metal-equivalent performance? We answer these questions based on our experience building and operationalizing a large-scale AI system called Vela in IBM Cloud. Vela is built on open-source Linux KVM and QEMU technologies and delivers near-bare-metal performance (within 5% of BM) inside VMs. VM-based AI infrastructure not only affords BM performance but also provides cloud characteristics such as elasticity and flexibility in infrastructure management.
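The abstract does not spell out the VM configuration used in Vela; as an illustrative sketch only, near-bare-metal performance in KVM/QEMU guests is typically approached by passing the host CPU model through, backing guest memory with huge pages, pinning vCPUs, and assigning accelerators directly to the guest via VFIO PCI passthrough. The device address and sizes below are placeholders, not values from the paper:

```shell
# Hypothetical QEMU invocation sketch (not Vela's actual configuration):
# - "-cpu host" exposes the host CPU features to the guest (no emulation overhead)
# - hugepage-backed memory reduces TLB pressure for large AI workloads
# - "-device vfio-pci" assigns a GPU (placeholder PCI address) directly to the VM
qemu-system-x86_64 \
  -machine q35,accel=kvm \
  -cpu host \
  -smp 16 \
  -m 128G \
  -object memory-backend-file,id=mem0,size=128G,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem0 \
  -device vfio-pci,host=0000:3b:00.0 \
  -nographic
```

In practice, vCPU-to-physical-core pinning and NUMA-aware placement (e.g., via libvirt or taskset) are also part of closing the remaining gap to bare metal.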