DEFT: SLO-Driven Preemptive Scheduling for Containerized DNN Serving
Abstract
GPU servers are increasingly shared by containerized DNNs with highly diverse inference-latency SLOs. We observe an emerging need for a scheduler that, without modifying container applications, can dynamically estimate the remaining execution time of each DNN job and thereby determine which kernel calls should preempt the incumbent DNN inference on a shared GPU. This project presents such a scheduler, DEFT, built on top of Kubernetes. Our preliminary results show that, compared to existing solutions, DEFT reduces SLO violations because (1) it preempts a DNN inference at the kernel level rather than treating the inference as an indivisible whole, and (2) it makes preemption decisions based on the estimated remaining time of each competing DNN job, rather than on a static per-job weight or the duration of individual kernel calls.
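To make the remaining-time-based preemption decision concrete, the following is a minimal illustrative sketch. The `Job` fields and the `should_preempt` policy are assumptions for exposition, not DEFT's actual implementation: at each kernel boundary, the scheduler compares the two possible execution orders against both jobs' SLO deadlines.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    remaining_ms: float   # dynamically estimated remaining execution time
    deadline_ms: float    # time left until this job's SLO deadline

def should_preempt(incumbent: Job, challenger: Job) -> bool:
    """Hypothetical policy: at a kernel boundary, preempt the incumbent
    only if running the challenger first is the order that meets both
    SLOs (or, if neither order can, favor the job with less slack)."""
    # Order 1: challenger runs first, incumbent resumes afterwards.
    challenger_first_ok = (
        challenger.remaining_ms <= challenger.deadline_ms
        and challenger.remaining_ms + incumbent.remaining_ms <= incumbent.deadline_ms
    )
    # Order 2: incumbent finishes, then the challenger runs.
    incumbent_first_ok = (
        incumbent.remaining_ms <= incumbent.deadline_ms
        and incumbent.remaining_ms + challenger.remaining_ms <= challenger.deadline_ms
    )
    if challenger_first_ok and not incumbent_first_ok:
        return True
    if incumbent_first_ok:
        return False
    # Neither order meets both SLOs: run the job with the least slack.
    challenger_slack = challenger.deadline_ms - challenger.remaining_ms
    incumbent_slack = incumbent.deadline_ms - incumbent.remaining_ms
    return challenger_slack < incumbent_slack
```

Note how the decision depends on both jobs' remaining times and deadlines jointly: a challenger with a tight deadline preempts a long-running incumbent with ample slack, but not one whose own SLO would be broken by the delay. A static per-job weight or a per-kernel duration alone cannot express this trade-off.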