CLOUD 2020
Conference paper
The design and implementation of a scalable deep learning benchmarking platform
Abstract
The current Deep Learning (DL) landscape is fast-paced and rife with non-uniform models and hardware/software (HW/SW) stacks. Yet there is no DL benchmarking platform to facilitate the evaluation and comparison of DL innovations, be they models, frameworks, libraries, or hardware. As a result, the current practice of evaluating the benefits of proposed DL innovations is both arduous and error-prone, stifling the adoption of the innovations. In this work, we first identify 10 design features that are desirable within a DL benchmarking platform. These features include: performing the evaluation in a consistent, reproducible, and scalable manner; being framework and hardware agnostic; supporting real-world benchmarking workloads; and providing in-depth model execution inspection across the HW/SW stack levels. We then propose MLModelScope, a DL benchmarking platform that realizes these 10 design objectives. MLModelScope introduces a specification to define DL model evaluations and provides a runtime to provision the evaluation workflow using the user-specified HW/SW stack. MLModelScope defines abstractions for frameworks and supports a broad range of DL models and evaluation scenarios. We implement MLModelScope as an open-source project with support for all major frameworks and hardware architectures. Through MLModelScope's evaluation and automated analysis workflows, we perform a case-study analysis of 37 models across 4 systems and show how model, hardware, and framework selection affects model accuracy and performance under different benchmarking scenarios. We further demonstrate how MLModelScope's tracing capability gives a holistic view of model execution and helps pinpoint bottlenecks.
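To make the specification-driven workflow described in the abstract concrete, the following is a minimal sketch of how a model evaluation might be declared and handed to a runtime. It is illustrative only: the EvaluationSpec structure, its field names, and the run_evaluation function are assumptions for this sketch and do not reflect MLModelScope's actual manifest format or API.

from dataclasses import dataclass, field

# Hypothetical evaluation specification, loosely modeled on the paper's idea of
# declaring the model, framework, and HW/SW stack needed to reproduce a run.
# The names and fields here are illustrative assumptions, not MLModelScope's API.
@dataclass
class EvaluationSpec:
    model_name: str                 # e.g. "ResNet50"
    framework: str                  # e.g. "TensorFlow 1.15"
    hardware: str                   # e.g. "NVIDIA V100"
    batch_size: int = 1
    dataset: str = "ImageNet-val"
    trace_levels: list = field(
        default_factory=lambda: ["model", "framework", "library", "hardware"]
    )

def run_evaluation(spec: EvaluationSpec) -> dict:
    """Placeholder runner: a real platform would provision the specified
    HW/SW stack, execute the model, and collect accuracy and latency traces."""
    print(f"Provisioning {spec.framework} on {spec.hardware} ...")
    print(f"Evaluating {spec.model_name} (batch={spec.batch_size}) on {spec.dataset}")
    # Dummy results; a real run would return measured accuracy and per-level timings.
    return {"top1_accuracy": None, "latency_ms": None, "trace": spec.trace_levels}

if __name__ == "__main__":
    results = run_evaluation(EvaluationSpec(model_name="ResNet50",
                                            framework="TensorFlow 1.15",
                                            hardware="NVIDIA V100",
                                            batch_size=8))
    print(results)

In the platform described by the paper, the specification plays a similar role: it pins down the model, framework, and HW/SW stack so that evaluations can be reproduced, scaled, and compared across systems.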