BiScaled-DNN: Quantizing long-tailed datastructures with two scale factors for deep neural networks
Abstract
Fixed-point implementations (FxP) are prominently used to realize Deep Neural Networks (DNNs) efficiently on energy-constrained platforms. The choice of bit-width is often constrained by the ability of FxP to represent the entire range of numbers in a data structure with sufficient resolution. At low bit-widths (< 8 bits), state-of-the-art DNNs invariably suffer a loss in classification accuracy due to quantization/saturation errors. In this work, we leverage a key insight that almost all data structures in DNNs are long-tailed, i.e., a significant majority of the elements are small in magnitude, while a small fraction are orders of magnitude larger. We propose BISCALED-FXP, a new number representation that caters to the disparate range and resolution needs of long-tailed data structures. The key idea is that, while using the same number of bits to represent elements of both large and small magnitude, we employ two different scale factors, viz. scale-fine and scale-wide, in their quantization. Scale-fine allocates more fractional bits, providing finer resolution for small numbers, while scale-wide covers the entire range of large numbers, albeit at a coarser resolution. We develop a BiScaled DNN accelerator that computes on BISCALED-FXP tensors. A key challenge is recording which scale factor was used to quantize each element, since computations whose operands are quantized with different scale factors need to rescale their result. To minimize this overhead, we use a block-sparse format to store only the indices of the scale-wide elements, which are few in number. We also enhance the BISCALED-FXP processing elements with shifters to scale their output when the operands of a computation use different scale factors. We develop a systematic methodology to identify the scale-fine and scale-wide factors for the weights and activations of any given DNN. Across 8 state-of-the-art image recognition benchmarks, BISCALED-FXP reduces the number of computation bits by 2 compared to conventional FxP, while also slightly improving classification accuracy in all cases. Compared to FxP8, the performance and energy benefits range between 1.43×-3.86× and 1.4×-3.7×, respectively.
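To make the two-scale-factor idea concrete, the following is a minimal NumPy sketch of how a long-tailed tensor could be quantized with a scale-fine and a scale-wide factor, based only on the description in the abstract. It is not the paper's implementation: the function names (biscaled_quantize, biscaled_dequantize), the particular fractional-bit choices, and the block size for the sparse index list are all assumptions made for illustration.

```python
import numpy as np

def biscaled_quantize(x, bits=8, fine_frac_bits=7, wide_frac_bits=3, block=16):
    """Hypothetical sketch of BiScaled-FxP quantization (not the paper's code).

    Elements small enough to fit the scale-fine range are quantized with the
    scale-fine factor (more fractional bits, finer resolution); the few
    large-magnitude "tail" elements use the scale-wide factor (fewer
    fractional bits, wider range). Only the block indices of scale-wide
    elements are stored, mimicking a block-sparse index format.
    """
    qmax = 2 ** (bits - 1) - 1                 # signed fixed-point maximum
    fine_scale = 2.0 ** fine_frac_bits         # scale-fine: fine resolution
    wide_scale = 2.0 ** wide_frac_bits         # scale-wide: wide range
    fine_limit = qmax / fine_scale             # largest value scale-fine can hold

    flat = x.ravel()
    use_wide = np.abs(flat) > fine_limit       # long-tail elements
    q = np.where(use_wide,
                 np.clip(np.round(flat * wide_scale), -qmax - 1, qmax),
                 np.clip(np.round(flat * fine_scale), -qmax - 1, qmax)
                 ).astype(np.int8)

    # Block-sparse side information: indices of blocks that contain any
    # scale-wide element (few, because the tail is a small fraction).
    wide_blocks = np.unique(np.nonzero(use_wide)[0] // block)
    return q, use_wide, wide_blocks

def biscaled_dequantize(q, use_wide, fine_frac_bits=7, wide_frac_bits=3):
    """Recover approximate real values by dividing out the per-element scale."""
    scale = np.where(use_wide, 2.0 ** wide_frac_bits, 2.0 ** fine_frac_bits)
    return q.astype(np.float32) / scale
```

In this sketch, a product of a scale-fine operand and a scale-wide operand carries fine_frac_bits + wide_frac_bits fractional bits, so a hardware pipeline would shift the result to bring it back to a common format, which is the role the abstract assigns to the shifters added to the processing elements.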