Vishal Sunder, Samuel Thomas, et al.
ICASSP 2022
End-to-end training of recurrent neural network transducers (RNN-Ts) does not require frame-level alignments between audio and output symbols. As a result, the posterior lattices defined by the predictive distributions of different RNN-Ts trained on the same data can differ substantially, which poses a new set of challenges for knowledge distillation between such models. These discrepancies are especially prominent between an offline model and a streaming model, as expected from the fact that a streaming RNN-T emits symbols later than an offline RNN-T. We propose a method to train an RNN-T so that the posterior peaks at each node of its posterior lattice are aligned with those of a pretrained model on the same utterance. With this method, we can train an offline RNN-T that serves as a good teacher for a student streaming RNN-T. Experimental results on the standard Switchboard conversational telephone speech corpus demonstrate accuracy improvements for a streaming unidirectional RNN-T distilled from an offline bidirectional counterpart.
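The distillation objective described here operates over the full RNN-T posterior lattice, matching the student's output distribution to the teacher's at every lattice node (t, u), rather than along a single aligned path. The sketch below is a minimal illustration of such a node-wise distillation loss, not the authors' released code; the helper name lattice_kd_loss, the (batch, T, U, vocab) logit shape, and the temperature parameter are assumptions for illustration.

import torch
import torch.nn.functional as F

def lattice_kd_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) averaged over all (t, u) lattice nodes.

    Both tensors are assumed to have shape (batch, T, U, vocab): one output
    distribution per acoustic frame t and label position u, as in a
    standard RNN-T joint network.
    """
    # Temperature-scaled log-probabilities for teacher and student.
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Per-node KL divergence, summed over the vocabulary dimension.
    kl = torch.sum(t_log_probs.exp() * (t_log_probs - s_log_probs), dim=-1)
    # Average over batch and all lattice nodes.
    return kl.mean()

In practice such a loss would be combined with the standard RNN-T transducer loss, and the paper's contribution is precisely in making the teacher's posterior peaks align well enough across the lattice for this node-wise matching to help the streaming student.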
Lidia Mangu, George Saon, et al.
ICASSP 2015
George Saon, Daniel Povey, et al.
INTERSPEECH - Eurospeech 2005
Samuel Thomas, Sriram Ganapathy, et al.
ICASSP 2014