Saurabh Paul, Christos Boutsidis, et al.
JMLR
Conformer CTC-Encoders have consistently delivered state-of-the-art results in Automatic Speech Recognition (ASR); however, their merits for tasks that demand more semantic and paralinguistic information, such as Automatic Speech Understanding (ASRU), Speech Emotion Recognition (SER) and Speech Translation (ST), still need further investigation. In this paper, we introduce a Speech Large Language Model (SLLM) system based on a Conformer CTC-Encoder and the Granite Large Language Model, which allowed us to perform several experiments on ASR, SER and ST tasks. These experiments not only confirm the strength of Conformer CTC-Encoders for ASR, but also show that the outputs of intermediate Conformer blocks of the Conformer CTC-Encoder carry important information for SER tasks and that the Conformer CTC-Encoder can be efficiently fine-tuned for SER tasks.
C.A. Micchelli, W.L. Miranker
Journal of the ACM
Joxan Jaffar
Journal of the ACM
Nimrod Shabtay, Zvi Kons, et al.
INTERSPEECH 2025