Publication
ICASSP 2023
Conference paper
MODELING TURN-TAKING IN HUMAN-TO-HUMAN SPOKEN DIALOGUE DATASETS USING SELF-SUPERVISED FEATURES
Abstract
Self-supervised pre-trained models have consistently delivered state-of-the-art results in natural language and speech processing. However, we argue that their merits for modeling Turn-Taking in spoken dialogue systems still need further investigation. In this paper, we therefore introduce a modular End-to-End system based on an Upstream + Downstream architecture paradigm, which allows easy integration of a large variety of self-supervised features to model the specific Turn-Taking task of End-of-Turn Detection (EOTD). Several architectures for the EOTD task, using audio-only, text-only and audio+text modalities, are presented, and their performance and robustness are carefully evaluated on three different human-to-human spoken dialogue datasets. The proposed model not only achieves SOTA results for EOTD, but also sheds light on the possibility that powerful, well fine-tuned self-supervised models can be successfully used for a wide variety of Turn-Taking tasks.
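To make the Upstream + Downstream idea concrete, the following is a minimal sketch and not the system described in the paper. It assumes that an upstream is any function mapping raw input to a sequence of feature vectors, and that the downstream is a small classifier over pooled features. The names `toy_audio_upstream`, `toy_text_upstream`, `mean_pool`, `EotdDownstream` and `eotd_probability` are hypothetical, the toy extractors stand in for real self-supervised models, and fusing modalities by concatenation is an assumption made only for illustration.

```python
import math
from typing import List, Sequence

# A feature sequence: one vector per frame (audio) or per token (text).
Features = List[List[float]]


def toy_audio_upstream(samples: Sequence[float], frame: int = 4) -> Features:
    """Hypothetical audio upstream: [mean energy, last sample] per frame.

    A real system would call a self-supervised speech model here.
    """
    feats = []
    for start in range(0, len(samples) - frame + 1, frame):
        chunk = samples[start:start + frame]
        energy = sum(x * x for x in chunk) / frame
        feats.append([energy, chunk[-1]])
    return feats


def toy_text_upstream(tokens: Sequence[str]) -> Features:
    """Hypothetical text upstream: [token length, ends with punctuation]."""
    return [[float(len(t)), 1.0 if t.endswith((".", "?", "!")) else 0.0]
            for t in tokens]


def mean_pool(feats: Features) -> List[float]:
    """Collapse a feature sequence into one fixed-size vector."""
    if not feats:
        raise ValueError("cannot pool an empty feature sequence")
    dim = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]


class EotdDownstream:
    """Hypothetical downstream head: logistic regression on pooled features."""

    def __init__(self, weights: Sequence[float], bias: float = 0.0):
        self.weights = list(weights)
        self.bias = bias

    def predict_proba(self, pooled: Sequence[float]) -> float:
        if len(pooled) != len(self.weights):
            raise ValueError("feature size does not match downstream weights")
        z = self.bias + sum(w * x for w, x in zip(self.weights, pooled))
        return 1.0 / (1.0 + math.exp(-z))


def eotd_probability(audio, tokens, downstream, use_audio=True, use_text=True):
    """Score end-of-turn from the enabled modalities (fusion by concatenation)."""
    pooled: List[float] = []
    if use_audio:
        pooled += mean_pool(toy_audio_upstream(audio))
    if use_text:
        pooled += mean_pool(toy_text_upstream(tokens))
    return downstream.predict_proba(pooled)


# Usage: hand-picked weights, purely illustrative (nothing here is trained).
audio = [0.0, 0.1, -0.1, 0.0, 0.0, 0.0, 0.0, 0.0]
tokens = ["see", "you", "tomorrow."]
head = EotdDownstream(weights=[-1.0, 0.0, 0.0, 2.0])
p_both = eotd_probability(audio, tokens, head)
```

Because the downstream only sees pooled feature vectors, swapping one upstream for another, or switching between audio-only, text-only and audio+text inputs, changes the input size of the head but not the rest of the pipeline. This is the kind of modularity the abstract attributes to the paradigm.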