Composable systems with OpenShift
Imagine being able to dynamically create a computer with exactly the resources a workload needs: as much CPU power, GPU power, or memory as it requires. This is the vision of composable systems. Instead of buying data center servers with predefined amounts of memory, GPUs, and other resources, operators of a composable data center could assemble the server configurations their users require on the fly from distinct pools of resources.
This ability to compose systems on demand is a powerful one, as it has the potential to address some long-standing problems in data center operation, including over-provisioning and sustainability. With current server designs, compute resources can only be consumed in fixed chunks. If, for example, a data center has four GPUs per server and every server is already running a workload that uses three GPUs, then new workloads needing two, three, or four GPUs can’t run, even though many GPUs are free across the data center. This leads to over-provisioning: Data center operators have to deploy more GPUs than the workloads they serve actually need to ensure that enough GPUs will be available.
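The arithmetic behind this example can be made concrete with a small sketch. The Python below is illustrative only: the function names and the ten-server fleet size are assumptions, not part of any real scheduler. It compares a fixed layout, where a workload must fit within a single server's free GPUs, with a pooled layout, where any free GPU can be assigned.

```python
def fits_fixed(free_per_server, need):
    """Return True if some single server has enough free GPUs for the workload."""
    return any(free >= need for free in free_per_server)


def fits_pooled(free_per_server, need):
    """Return True if the free GPUs, treated as one shared pool, cover the workload."""
    return sum(free_per_server) >= need


# Assumed fleet: 10 servers, 4 GPUs each, every server running a 3-GPU workload.
gpus_per_server = 4
used_per_server = [3] * 10
free_per_server = [gpus_per_server - used for used in used_per_server]

for need in (1, 2, 3, 4):
    print(need, fits_fixed(free_per_server, need), fits_pooled(free_per_server, need))
```

With these assumed numbers, ten GPUs are free in total, yet the fixed layout admits only one-GPU workloads, while the pooled check admits all four sizes.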
Over-provisioning is also a component of the sustainability problem: If data center operators have to buy more resources than they need, they put additional strain on the supply chain for the materials used to make them and require more energy to power them. Upgrade patterns in current data centers add to this problem. For example, if a data center has four-GPU servers but now needs eight-GPU servers, entire new servers, including CPUs, memory, and motherboards, have to be purchased to host the GPUs, even if the other elements of the existing servers meet state-of-the-art processing demands.
Research and development on composable systems has accelerated in recent years, with new systems capable of dynamically composing peripherals, such as GPUs drawn from a shared pool. With these systems, administrators can, for example, create two servers each with four GPUs, or one server with eight GPUs and one with only CPUs, as demand requires. In addition, new protocols and standards, like Compute Express Link (CXL), are being developed that allow for the composition of more complex resources like memory at multiple scales (within a chip, or over a network).
However, important questions remain regarding the software stack for managing composable systems. The benefits of composable systems come with the cost of additional management complexity. Some issues include the need to allocate and track usage of composed resources, as well as the need to integrate with potentially multiple different composition fabrics — the lowest level system layer that actually makes the physical composition happen.
At the recent Red Hat Summit, IBM researchers demonstrated a complete prototype software stack that enables using composable systems on Red Hat’s OpenShift container platform, with GPUs as the initial use case. The stack consists of four layers: At the lowest layer is the proprietary composition fabric, powered by Liqid. Above this is the Sunfish Management Framework (previously called OFMF, OpenFabrics Management Framework), an open-source project from the OpenFabrics Alliance that provides a generic interface to query and manage fabric resources. IBM Research has worked with partners in this project to extend Sunfish with features that allow interacting with composable compute fabrics. The third layer is the composition manager. This new component, which is also part of the Sunfish project and to which IBM Research contributed, enables allocation, creation, and tracking of composed resources via the Sunfish interfaces. The final element is a new OpenShift operator, which allows users to compose resources in OpenShift using familiar Kubernetes patterns.
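The layering described above can be pictured with a toy model. The Python below is a minimal sketch under loud assumptions: every class, method, and identifier in it (InMemoryFabric, FabricInterface, CompositionManager, reconcile, the GPU and node names) is hypothetical and invented for illustration. None of it is the Liqid, Sunfish, or OpenShift API; it only shows how each layer might delegate to the one beneath it.

```python
class InMemoryFabric:
    """Hypothetical stand-in for a composition fabric holding a pool of GPUs."""

    def __init__(self, gpu_ids):
        self.free = set(gpu_ids)
        self.attached = {}  # gpu id -> node name

    def attach(self, gpu_id, node):
        self.free.remove(gpu_id)
        self.attached[gpu_id] = node

    def detach(self, gpu_id):
        del self.attached[gpu_id]
        self.free.add(gpu_id)


class FabricInterface:
    """Hypothetical generic query/manage layer placed above the fabric."""

    def __init__(self, fabric):
        self._fabric = fabric

    def list_free_gpus(self):
        return sorted(self._fabric.free)

    def connect(self, gpu_id, node):
        self._fabric.attach(gpu_id, node)

    def disconnect(self, gpu_id):
        self._fabric.detach(gpu_id)


class CompositionManager:
    """Hypothetical allocator that creates and tracks composed resources."""

    def __init__(self, interface):
        self._interface = interface
        self.allocations = {}  # request name -> list of gpu ids

    def compose(self, name, node, gpu_count):
        free = self._interface.list_free_gpus()
        if len(free) < gpu_count:
            raise RuntimeError("not enough free GPUs in the pool")
        chosen = free[:gpu_count]
        for gpu_id in chosen:
            self._interface.connect(gpu_id, node)
        self.allocations[name] = chosen
        return chosen

    def release(self, name):
        for gpu_id in self.allocations.pop(name):
            self._interface.disconnect(gpu_id)


def reconcile(manager, desired):
    """Operator-style step: compose any request in `desired` that is not yet tracked."""
    for name, spec in desired.items():
        if name not in manager.allocations:
            manager.compose(name, spec["node"], spec["gpus"])


fabric = InMemoryFabric(["gpu-0", "gpu-1", "gpu-2", "gpu-3"])
manager = CompositionManager(FabricInterface(fabric))
reconcile(manager, {"training-job": {"node": "worker-1", "gpus": 2}})
print(manager.allocations)
manager.release("training-job")
print(sorted(fabric.free))
```

In this sketch the operator-style function never touches the fabric directly; it goes through the manager, which goes through the generic interface. The explicit release call reflects that composed GPUs have to be returned to the pool once they are no longer needed.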
The IBM Research prototype demonstrates a standard, open software stack to create composed resources for container workloads deployed in Kubernetes, which is a key step on the journey towards production-ready composable systems. But right now, a user must still manually create the composed resource a workload requires and free it when the workload is done.
Thanks to the technology demonstrated in this proof of concept, users will be able to create policies for dynamically controlling their hardware usage and provide the required hardware to the nodes where computation is happening, without having to install expensive hardware on all nodes. In addition, with the advent of composable memory, this approach will also enable the creation of new patterns of communication for applications leveraging multi-architecture OpenShift clusters.