Effective Training of RNN Transducer Models on Diverse Sources of Speech and Text Data
Abstract
This paper proposes a novel modeling framework for effective training of end-to-end automatic speech recognition (ASR) models on diverse sources of data: speech paired with clean ground-truth transcripts, speech with noisy pseudo transcripts from semi-supervised decodes, and unpaired text-only data. In our proposed approach, we build a recurrent neural network transducer (RNN-T) model with a shared multimodal encoder, multi-branch prediction networks, and a shared joint network. To train on unpaired text-only data sets alongside transcribed speech data, the shared encoder is trained to process both the speech and text modalities. Differences in data from multiple domains are handled by training the branches of a multi-branch prediction network on the different data sets, after which an interpolation step combines the branches back into a computationally efficient single branch. We show the benefit of the proposed technique on several ASR test sets by comparing our models to those trained by simple data mixing. The technique provides a significant relative improvement of up to 6% over baseline systems operating at a similar decoding cost.
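To make the interpolation step concrete, one natural reading is parameter-space weight averaging: after the branches are trained on their respective data sets, their parameters are combined into a single prediction network so that decoding cost matches a single-branch model. The following is a minimal, hypothetical PyTorch sketch of that idea; the class `PredictionNetwork`, the function `interpolate_branches`, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import copy
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """A single-branch RNN-T prediction network (label-side LSTM).
    Dimensions are illustrative placeholders."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=640):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, labels, state=None):
        # labels: (batch, label_len) token ids from previous outputs
        out, state = self.lstm(self.embed(labels), state)
        return out, state

def interpolate_branches(branches, weights):
    """Combine identically-shaped branch prediction networks into one
    single-branch network by weighted averaging of their parameters
    (an assumed form of the paper's interpolation step)."""
    merged = copy.deepcopy(branches[0])
    merged_state = merged.state_dict()
    with torch.no_grad():
        for name in merged_state:
            merged_state[name] = sum(
                w * b.state_dict()[name] for w, b in zip(weights, branches)
            )
    merged.load_state_dict(merged_state)
    return merged

# Example: three branches (e.g., clean, pseudo-transcript, text-only data)
# merged uniformly into a single prediction network for decoding.
branches = [PredictionNetwork(vocab_size=1000) for _ in range(3)]
merged = interpolate_branches(branches, weights=[1/3, 1/3, 1/3])
```

Under this reading, the merged network has exactly the decoding cost of one branch, which is consistent with the abstract's claim of a computationally efficient single branch at similar decoding cost.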