The integration of large language models (LLMs) with automatic speech recognition (ASR) has gained significant interest in recent years. However, effective LLM-ASR integration for low-resource languages remains challenging. Loose coupling via N-best lists fails due to high ASR error rates, while tight coupling, which treats audio as tokens, requires too much training data. A promising middle ground, SALSA, was recently proposed: it cascades an ASR decoder into an LLM decoder via lightweight projection layers, enabling synchronous decoding despite differing tokenizations. In this paper we show that SALSA fails when the ASR and LLM tokenizations have a large token fertility gap. This problem particularly plagues low-resource languages: the ASR decoder produces far more tokens than the LLM for the same text, starving the LLM decoder of sufficient audio context. To address this, we propose SKIP-SALSA, which adaptively skips ahead and advances the ASR decoder states to synchronize with the LLM; the skip size is learned via a lightweight skip predictor. SKIP-SALSA significantly improves ASR performance on multiple low-resource languages, yielding gains of up to 20% over a strong baseline.
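The abstract does not give implementation details, so the following is only a minimal sketch of one possible reading of the idea, with made-up module and parameter names (SkipSalsaBridge, max_skip, etc.): a lightweight projection layer maps ASR decoder states into the LLM embedding space, and a small skip predictor chooses how many ASR decoder steps to advance per LLM step so the two decoders stay synchronized despite the tokenizer fertility gap.

```python
# Hedged sketch, NOT the authors' code: illustrates a projection layer plus a
# learned skip predictor bridging an ASR decoder and an LLM decoder.
import torch
import torch.nn as nn

class SkipSalsaBridge(nn.Module):
    def __init__(self, asr_dim: int, llm_dim: int, max_skip: int = 4):
        super().__init__()
        self.proj = nn.Linear(asr_dim, llm_dim)            # lightweight projection layer
        self.skip_predictor = nn.Linear(asr_dim, max_skip) # scores skip sizes 1..max_skip
        self.max_skip = max_skip

    def forward(self, asr_states: torch.Tensor, pos: int):
        """asr_states: (T, asr_dim) ASR decoder states; pos: current ASR position.
        Returns the projected context for the next LLM step and the advanced ASR position."""
        # Predict how far to advance the ASR decoder before the next LLM token.
        skip_logits = self.skip_predictor(asr_states[pos])
        skip = int(skip_logits.argmax().item()) + 1
        new_pos = min(pos + skip, asr_states.size(0) - 1)
        # Project the ASR state at the advanced position into the LLM embedding space.
        llm_context = self.proj(asr_states[new_pos])
        return llm_context, new_pos

# Toy usage: 10 ASR decoder states of width 256 feeding an LLM of width 1024.
bridge = SkipSalsaBridge(asr_dim=256, llm_dim=1024)
states = torch.randn(10, 256)
ctx, pos = bridge(states, pos=0)
print(ctx.shape, pos)
```

In this reading, a higher fertility gap simply means the predictor learns larger skips, so the LLM decoder keeps receiving audio-grounded context at roughly its own token rate.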