Poster

DEFT: SLO-Driven Preemptive Scheduling for Containerized DNN Serving

Abstract

GPU servers are increasingly shared by containerized DNNs whose inference-latency SLOs differ widely. We observe an emerging need for a scheduler that, without changing the container applications, dynamically estimates the remaining time of each DNN job in order to determine which kernel calls should preempt the incumbent DNN inference on a shared GPU. This project presents DEFT, such a scheduler built on top of Kubernetes. Our preliminary results show that DEFT reduces SLO violations compared to existing solutions, because (1) it allows a DNN inference to be preempted at the kernel level rather than treating the inference as an indivisible unit, and (2) it makes preemption decisions based on the remaining time of each competing DNN job, rather than on a static per-job weight or the duration of individual kernel calls.
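To make the two design points concrete, the following minimal sketch simulates a scheduler that re-decides at every kernel boundary and always runs the job with the least slack, where slack is the SLO deadline minus the current time minus the estimated remaining time. This is an illustrative assumption, not DEFT's actual policy or estimator, which the abstract does not specify. The names `Job`, `pick_next`, and `simulate`, and the per-kernel duration estimates, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    arrival_ms: float        # time at which the inference request arrives
    slo_deadline_ms: float   # absolute deadline implied by the job's SLO
    kernel_ms: List[float]   # assumed per-kernel duration estimates, in launch order
    next_kernel: int = 0     # index of the next kernel to launch

    def done(self) -> bool:
        return self.next_kernel >= len(self.kernel_ms)

    def remaining_ms(self) -> float:
        # Estimated remaining time: sum of the kernels not yet executed.
        return sum(self.kernel_ms[self.next_kernel:])

    def slack_ms(self, now_ms: float) -> float:
        # Time to spare before the SLO is violated if the job ran alone from now on.
        return self.slo_deadline_ms - now_ms - self.remaining_ms()


def pick_next(jobs: List[Job], now_ms: float) -> Optional[Job]:
    """Return the arrived, unfinished job with the least slack, or None."""
    runnable = [j for j in jobs if not j.done() and j.arrival_ms <= now_ms]
    if not runnable:
        return None
    return min(runnable, key=lambda j: j.slack_ms(now_ms))


def simulate(jobs: List[Job]) -> Dict[str, float]:
    """Run one kernel at a time on a single GPU; return each job's finish time."""
    now = 0.0
    finish: Dict[str, float] = {}
    while any(not j.done() for j in jobs):
        job = pick_next(jobs, now)
        if job is None:
            # GPU is idle: jump to the next arrival.
            now = min(j.arrival_ms for j in jobs if not j.done())
            continue
        # Kernels themselves are not interrupted; preemption happens between them.
        now += job.kernel_ms[job.next_kernel]
        job.next_kernel += 1
        if job.done():
            finish[job.name] = now
    return finish


if __name__ == "__main__":
    long_job = Job("A", arrival_ms=0.0, slo_deadline_ms=200.0, kernel_ms=[10.0] * 10)
    short_job = Job("B", arrival_ms=15.0, slo_deadline_ms=40.0, kernel_ms=[5.0, 5.0])
    print(simulate([long_job, short_job]))
```

In this toy setting, job B arrives while job A is mid-inference and takes over the GPU at the next kernel boundary, finishing at 30 ms, within its 40 ms deadline. If A's inference were treated as a whole, B could not start until 100 ms and would miss its SLO. A still finishes at 110 ms, well before its 200 ms deadline.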
