Improving RNN Transducer Acoustic Models for English Conversational Speech Recognition
Abstract
In this paper we investigate several techniques for improving the performance of RNN transducer (RNNT) acoustic models for conversational speech recognition and report state-of-the-art word error rates (WERs) on the 2000-hour Switchboard dataset. We show that n-best label smoothing and length perturbation, which improve performance on the smaller 300-hour dataset, are also very effective at this larger scale. We further give a rigorous theoretical interpretation of n-best label smoothing based on stochastic approximation for training RNNT models under the maximum likelihood criterion. Random quantization is also introduced to improve the generalization of RNNT models. On the 2000-hour Switchboard dataset, we report single-model WERs of 4.9% and 7.7% on the Switchboard and CallHome portions of NIST Hub5 2000, 7.1% on NIST Hub5 2001, and 6.8% on NIST RT03, without using external language models (LMs).