About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
IPDPS 2023
Conference paper
Chic-sched: a HPC Placement-Group Scheduler on Hierarchical Topologies with Constraints
Abstract
Efficient placement of advanced HPC and AI workloads with application constraints is raising challenges for resource schedulers on shared infrastructures, such as the Cloud. In this work, we propose a novel Constraints- and Heuristics-based scheduler on HIerarchical Topologies for High-Performance Computing workloads in the Cloud (chic-sched, for short). Our heuristics-based algorithm enables placement across multiple levels in a network hierarchy with loosely specified constraints, and it works without retries by providing suboptimal placements to minimize placement failures. This allows for fast scheduling at scale, and the O(N log N) complexity enables placement decisions within tens of milliseconds for groups of hundreds of virtual machines (VM). We introduce a new and simple metric to quantify the goodness of group placements. With this metric, in terms of deviation from ideal placements, we show that chic-sched is 20-50% better than the common bestFit or worstFit algorithms in all scenarios of two-level placements with spreading and packing constraints. We evaluate chic-sched with publicly available VM-request traces from a production Cloud, and, comparing against bestFit, we show that it achieves 8% lower placement failure rates and more than 40% better placement locality. Finally, to quantify the goodness of constraints-based placements, we conduct experiments with a realistic MPI workload on synthetically allocated VM clusters in a public cloud. We measure a 9% performance improvement over an adverse placement in a scenario where our heuristics-based scheduler would return a good, but not perfect, placement.