Forget a bit to learn better: Soft Forgetting for CTC-based automatic speech recognition
Abstract
Prior work has shown that connectionist temporal classification (CTC)-based automatic speech recognition systems perform well when using bidirectional long short-term memory (BLSTM) networks unrolled over the whole speech utterance, because whole-utterance BLSTMs better capture long-term context. We hypothesize that this reliance on whole-utterance context also leads to overfitting, and propose soft forgetting as a solution. During training, we unroll the BLSTM network only over small non-overlapping chunks of the input utterance, and we randomly pick a chunk size for each batch instead of using a fixed global chunk size. To retain some utterance-level information, we encourage the hidden states of the chunked BLSTM network to approximate those of a pre-trained whole-utterance BLSTM. Our experiments on the 300-hour English Switchboard dataset show that soft forgetting improves the word error rate (WER) over a competitive whole-utterance phone CTC BLSTM by an average of 7-9% relative. We obtain WERs of 9.1%/17.4% using speaker-independent and 8.7%/16.8% using speaker-adapted models, respectively, on the Hub5-2000 Switchboard/CallHome test sets. We also show that soft forgetting improves the WER when the model is used with limited temporal context for streaming recognition. Finally, we present some empirical insights into the regularization and data augmentation effects of soft forgetting.
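The training recipe outlined above can be summarized in a minimal PyTorch sketch: a student BLSTM is unrolled over non-overlapping chunks whose size is drawn at random for each batch, and its hidden states are pulled toward those of a frozen whole-utterance BLSTM via an auxiliary mean-squared-error term added to the CTC loss. The class and function names, the chunk-size range, and the weight `alpha` below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of soft forgetting for a CTC BLSTM (assumed hyper-parameters).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BLSTMEncoder(nn.Module):
    """Bidirectional LSTM encoder producing frame-level hidden states."""

    def __init__(self, input_dim=40, hidden_dim=256, num_layers=4):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                             batch_first=True, bidirectional=True)

    def forward(self, feats):                     # feats: (batch, time, input_dim)
        out, _ = self.blstm(feats)                # out: (batch, time, 2 * hidden_dim)
        return out


def chunked_forward(encoder, feats, chunk_size):
    """Unroll the BLSTM independently over non-overlapping chunks of the utterance."""
    chunks = torch.split(feats, chunk_size, dim=1)
    outputs = [encoder(c) for c in chunks]        # recurrent state resets at each chunk boundary
    return torch.cat(outputs, dim=1)


def soft_forgetting_loss(student, teacher, ctc_head, feats, targets,
                         feat_lens, target_lens, alpha=1.0):
    """CTC loss on chunk-unrolled student states plus an MSE pull toward
    the whole-utterance teacher states (soft forgetting)."""
    # Random chunk size per batch; the range [32, 128] frames is an assumption.
    chunk_size = int(torch.randint(32, 129, (1,)).item())

    student_h = chunked_forward(student, feats, chunk_size)   # chunk-unrolled hidden states
    with torch.no_grad():
        teacher_h = teacher(feats)                             # whole-utterance hidden states

    log_probs = F.log_softmax(ctc_head(student_h), dim=-1)     # (batch, time, vocab)
    ctc = F.ctc_loss(log_probs.transpose(0, 1), targets,
                     feat_lens, target_lens, blank=0, zero_infinity=True)
    mse = F.mse_loss(student_h, teacher_h)                     # hidden-state matching term
    return ctc + alpha * mse
```

At inference time only the student encoder and CTC head are needed; the teacher serves purely as a training-time target for the hidden-state matching term.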