Conference paper

Improving End-to-end Mixed-case ASR with Knowledge Distillation and Integration of Voice Activity Cues

Abstract

End-to-end (E2E) mixed-case (MC) ASR is a more challenging task than unicase (UC) ASR because the decoded outputs must be capitalized and punctuated simultaneously. MC models that are simply trained on formatted transcriptions often suffer various negative effects, notably degraded case-and-punctuation-insensitive performance due to the increased learning complexity. In this paper, we propose novel techniques for training E2E MC ASR models that improve both case-and-punctuation-sensitive and -insensitive performance. Our approach incorporates knowledge distillation from a UC teacher model to an MC student model, not only to improve capitalization and punctuation accuracy but also to maximize phone classification capability in MC ASR. Furthermore, we integrate voice activity cues into MC ASR to support the text formatting task. Our proposed method yields a significant improvement of up to 9.2% relative error reduction over baseline models that operate at a similar decoding cost.