Conference paper
Multi-task CTC training with auxiliary feature reconstruction for end-to-end speech recognition
Abstract
We present a multi-task Connectionist Temporal Classification (CTC) training approach for end-to-end (E2E) automatic speech recognition with input feature reconstruction as an auxiliary task. The main E2E CTC task and the auxiliary reconstruction task share the encoder network, and the auxiliary task tries to reconstruct the input features from the encoded information. In addition to standard feature reconstruction, we distort the input features in the auxiliary reconstruction task only, for example by (1) swapping the former and latter parts of an utterance, or (2) using only part of an utterance by stripping its beginning or end. These distortions intentionally suppress long-span dependencies in the time domain, which helps avoid overfitting to the training data. We trained phone-based and word-based CTC models with the proposed multi-task learning and demonstrated that it improves ASR accuracy on various test sets, both matched and unmatched with the training data.
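To make the setup concrete, the following is a minimal PyTorch sketch of the idea as described in the abstract: a shared encoder feeding a CTC head (main task) and a frame-wise reconstruction head (auxiliary task), with the reconstruction target optionally distorted. All module names, the encoder architecture, the distortion details, and the loss weight are illustrative assumptions, not the authors' implementation; in particular, the "strip" distortion is simplified here to zeroing out the edge frames so that the frame-wise MSE target keeps the same length as the encoder output.

```python
# Hypothetical sketch of multi-task CTC training with auxiliary feature
# reconstruction. Not the paper's implementation; assumes PyTorch.
import torch
import torch.nn as nn


def distort(features: torch.Tensor, mode: str) -> torch.Tensor:
    """Distort an utterance (T, D) for the auxiliary reconstruction target only.

    mode='swap'  : swap the former and latter halves along time
    mode='strip' : zero out the beginning and end frames (a sketch-level
                   simplification so the target length stays T)
    """
    T = features.size(0)
    if mode == "swap":
        mid = T // 2
        return torch.cat([features[mid:], features[:mid]], dim=0)
    if mode == "strip":
        out = features.clone()
        cut = T // 4  # the fraction stripped is an illustrative choice
        out[:cut] = 0.0
        out[-cut:] = 0.0
        return out
    return features


class MultiTaskCTC(nn.Module):
    """Shared encoder with a CTC head (main) and a reconstruction head (auxiliary)."""

    def __init__(self, feat_dim: int, hidden: int, vocab: int):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab)       # main E2E CTC task
        self.recon_head = nn.Linear(hidden, feat_dim)  # auxiliary reconstruction

    def forward(self, x):
        h, _ = self.encoder(x)                 # (B, T, hidden)
        return self.ctc_head(h), self.recon_head(h)


# One illustrative training step (batch size 1 for brevity).
model = MultiTaskCTC(feat_dim=40, hidden=256, vocab=50)
ctc_loss, mse = nn.CTCLoss(blank=0), nn.MSELoss()

x = torch.randn(1, 200, 40)                    # (B, T, D) input features
labels = torch.randint(1, 50, (1, 12))         # target token ids (blank=0 excluded)

logits, recon = model(x)
log_probs = logits.log_softmax(-1).transpose(0, 1)   # (T, B, V) as CTCLoss expects

# Auxiliary target: the distorted input features; weight 0.1 is illustrative.
target = distort(x[0], mode="swap").unsqueeze(0)
loss = ctc_loss(log_probs, labels, torch.tensor([200]), torch.tensor([12])) \
       + 0.1 * mse(recon, target)
loss.backward()
```

Because only the reconstruction target is distorted while the CTC branch still sees the original input, the encoder is discouraged from memorizing long-span temporal patterns of the training utterances without disturbing the main recognition objective.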