The integration of large language models (LLMs) with automatic speech recognition (ASR) has gained significant interest in recent years. However, effective LLM-ASR integration for low-resource languages remains challenging. Loose coupling via N-best lists fails due to high ASR error rates, while tight coupling, which treats audio as tokens, requires too much training data. A promising middle ground, SALSA, was recently proposed: it cascades an ASR decoder into an LLM decoder via lightweight projection layers, enabling synchronous decoding despite differing tokenizations. In this paper we show that SALSA fails when the ASR and LLM tokenizations have a large token fertility gap. This problem particularly plagues low-resource languages: the ASR decoder produces far more tokens than the LLM for the same text, starving the LLM decoder of sufficient audio context. To address this, we propose SKIP-SALSA, which adaptively skips ahead and advances the ASR decoder states to synchronize with the LLM; the skip size is learned via a lightweight skip predictor. SKIP-SALSA significantly improves ASR performance on multiple low-resource languages, yielding gains of up to 20% over a strong baseline.
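The abstract does not give implementation details, so the following is only a minimal sketch of one possible reading of the idea, with made-up module and parameter names (SkipSalsaBridge, max_skip, etc.): a lightweight projection layer maps ASR decoder states into the LLM embedding space, and a small skip predictor chooses how many ASR decoder steps to advance per LLM step so the two decoders stay synchronized despite the tokenizer fertility gap.

```python
# Hedged sketch, NOT the authors' code: illustrates a projection layer plus a
# learned skip predictor bridging an ASR decoder and an LLM decoder.
import torch
import torch.nn as nn

class SkipSalsaBridge(nn.Module):
    def __init__(self, asr_dim: int, llm_dim: int, max_skip: int = 4):
        super().__init__()
        self.proj = nn.Linear(asr_dim, llm_dim)            # lightweight projection layer
        self.skip_predictor = nn.Linear(asr_dim, max_skip) # scores skip sizes 1..max_skip
        self.max_skip = max_skip

    def forward(self, asr_states: torch.Tensor, pos: int):
        """asr_states: (T, asr_dim) ASR decoder states; pos: current ASR position.
        Returns the projected context for the next LLM step and the advanced ASR position."""
        # Predict how far to advance the ASR decoder before the next LLM token.
        skip_logits = self.skip_predictor(asr_states[pos])
        skip = int(skip_logits.argmax().item()) + 1
        new_pos = min(pos + skip, asr_states.size(0) - 1)
        # Project the ASR state at the advanced position into the LLM embedding space.
        llm_context = self.proj(asr_states[new_pos])
        return llm_context, new_pos

# Toy usage: 10 ASR decoder states of width 256 feeding an LLM of width 1024.
bridge = SkipSalsaBridge(asr_dim=256, llm_dim=1024)
states = torch.randn(10, 256)
ctx, pos = bridge(states, pos=0)
print(ctx.shape, pos)
```

In this reading, a higher fertility gap simply means the predictor learns larger skips, so the LLM decoder keeps receiving audio-grounded context at roughly its own token rate.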