M2 ASR: Multilingual Multi-task Automatic Speech Recognition via Multi-objective Optimization
Abstract
To equip speech models with capabilities across multiple languages, training multilingual, multi-task automatic speech recognition (ASR) models has gained growing interest. However, different languages and tasks induce distinct training objectives, which can conflict during training and degrade the model's performance. To overcome this issue, we introduce M$^{2}$ASR, a multilingual, multi-task ASR framework that formulates the problem as constrained multi-objective optimization (MOO): multilingual multi-task supervised training, augmented with speech-to-text translation (S2TT), serves as the set of supervised objectives, subject to constraints on the desired performance of multilingual unsupervised training. We employ MOO techniques to avoid conflicts among the multiple linguistic representations and tasks during training. Extensive experiments demonstrate that M$^{2}$ASR outperforms conventional multilingual ASR models by 28.3\% to 38.6\% across diverse ASR tasks.
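As one plausible reading of this formulation (the notation $\mathcal{L}^{\text{sup}}_{l,t}$, $\mathcal{L}^{\text{unsup}}_{l}$, and $\epsilon_{l}$ is introduced here for illustration, not taken from the paper), the constrained MOO problem can be sketched as
\begin{equation*}
\min_{\theta} \;
\Big\{ \mathcal{L}^{\text{sup}}_{l,t}(\theta) \Big\}_{l \in \mathcal{L},\; t \in \{\text{ASR},\, \text{S2TT}\}}
\quad \text{s.t.} \quad
\mathcal{L}^{\text{unsup}}_{l}(\theta) \le \epsilon_{l}
\quad \forall\, l \in \mathcal{L},
\end{equation*}
where $\theta$ denotes the model parameters, $\mathcal{L}$ the set of languages, each per-language, per-task supervised loss is one objective in the MOO vector, and $\epsilon_{l}$ encodes the desired multilingual unsupervised training performance for language $l$.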