Publication
INTERSPEECH 2023
Conference paper

Improving RNN Transducer Acoustic Models for English Conversational Speech Recognition

Download paper

Abstract

In this paper we investigate several techniques for improving the performance of RNN transducer (RNNT) acoustic models for conversational speech recognition and report state-of-the-art word error rates (WERs) on the 2000-hour Switchboard dataset. We show that n-best label smoothing and length perturbation which show improved performance on the smaller 300-hour dataset are also very effective on large datasets. We further give a rigorous theoretical interpretation of the n-best label smoothing based on stochastic approximation for training RNNT under the maximum likelihood criterion. Random quantization is also introduced to improve the generalization of RNNT models. On the 2000-hour Switchboard dataset, we report a single model performance of 4.9% and 7.7% WERs on the Switchboard and CallHome portions of NIST Hub5 2000, 7.1% on NIST Hub5 2001 and 6.8% on NIST RT03, without using external LMs.

Date

Publication

INTERSPEECH 2023

Authors

Topics

Resources

Share