Publication
SC 2023
Workshop paper

Sunfish: An Open Centralized Composable HPC Management Framework

Abstract

Traditional HPC systems are provisioned with static, fixed quantities of memory, storage, accelerators, and CPU resources to execute requested computation. This is not sufficient for today's datacenters, which run modern dynamic workloads; as a result, workloads execute on systems that are not optimized for their needs. Workloads may require hardware resources, e.g., GPUs, that are present in the datacenter but not on the server on which the workload is executing, leading to resource stranding. Conversely, compute resources on a given server may be underutilized because they are not required by the workload running on that server. Thus, datacenters often over-provision hardware resources in an attempt to enable any workload to run on any server in the cluster. Composable Disaggregated Infrastructure (CDI) enables servers to be composed, as needed, out of physically disaggregated resources to support the requirements of a given workload. The OpenFabrics Alliance, in collaboration with the DMTF, SNIA, and the CXL Consortium, is developing the Sunfish Management Framework and a hardware Composability Manager. Sunfish is an open-source Resource Manager designed for configuring fabric interconnects and managing composable disaggregated resources in dynamic HPC infrastructures using client-friendly abstractions. The goal of Sunfish is to enable interoperability through common interfaces, so that client Managers can efficiently connect workloads with resources in a complex heterogeneous ecosystem without having to worry about the underlying network technology.