SALSA: Speedy ASR-LLM Synchronous Aggregation
Abstract
Automatic speech recognition (ASR) systems still lag in performance on low-resource languages. The rise of multilingual large language models (LLMs) offers the potential for effective integration with ASR systems to improve their performance on low-resource languages. One major challenge in achieving this goal is that the LLM and the ASR system use different tokenizations. In this work, we propose SALSA, a synchronous, lightweight solution for merging pretrained ASR and LLM systems with differing token vocabularies. The LLM's predictions are re-tokenized with the ASR tokenizer to unroll the ASR decoder; the last ASR decoder state is then mapped through a learnable projection and added as a residual connection to the LLM's representations. SALSA is parameter-efficient, learning projection layers only for a select set of layers in the ASR and LLM decoders. We evaluate SALSA on more than 10 low-resource languages from the FLEURS benchmark, yielding substantial WER reductions of up to 36%.
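To make the fusion step concrete, the following is a minimal sketch (not the authors' released code) of the projection-plus-residual coupling described above, assuming a PyTorch implementation; the module name `SALSAFusion` and the dimensions are illustrative assumptions.

```python
# A minimal sketch of the SALSA coupling, assuming PyTorch.
# Names and dimensions are illustrative, not the authors' implementation.
import torch
import torch.nn as nn

class SALSAFusion(nn.Module):
    """Adds a projected ASR decoder state to an LLM layer's hidden states."""

    def __init__(self, asr_dim: int, llm_dim: int):
        super().__init__()
        # Learnable projection from the ASR decoder space to the LLM space.
        self.proj = nn.Linear(asr_dim, llm_dim)

    def forward(self, llm_hidden: torch.Tensor, asr_state: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (batch, llm_dim) representation at a coupled LLM layer.
        # asr_state:  (batch, asr_dim) last ASR decoder state at the current
        #             step, obtained by re-tokenizing the LLM's partial
        #             hypothesis with the ASR tokenizer and unrolling the
        #             ASR decoder on it.
        return llm_hidden + self.proj(asr_state)  # residual connection

# Usage: fuse a hypothetical 1024-d ASR state into a 4096-d LLM hidden state.
fusion = SALSAFusion(asr_dim=1024, llm_dim=4096)
h = fusion(torch.randn(2, 4096), torch.randn(2, 1024))
print(h.shape)  # torch.Size([2, 4096])
```

Because only these projection layers are trained, and only at a select set of decoder layers, the number of learned parameters stays small relative to the frozen ASR and LLM backbones.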