Gosia Lazuka, Andreea Simona Anghel, et al.
SC 2024
Autoregressive attentive sequence-to-sequence (S2S) speech synthesis is considered state-of-the-art in terms of speech quality and naturalness, as evaluated on a finite set of testing utterances. However, it can occasionally suffer from stability issues at inference time, such as local intelligibility problems or utterance incompletion. Frequently, a model's stability varies from one checkpoint to another, even after the training loss shows signs of convergence, making the selection of a stable model a tedious and time-consuming task. In this work we propose a novel stability metric designed for automatic checkpoint selection, based on counting incomplete utterances within a validation set. The metric relies solely on attention matrix analysis in inference mode and requires no ground-truth output targets. The proposed metric runs 125 times faster than real time on a GPU (Tesla K80), allowing convenient incorporation during training to filter out unstable checkpoints, and we demonstrate, via objective and perceptual metrics, its effectiveness in selecting a robust model that attains a good trade-off between stability and quality.
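To make the idea concrete, the Python sketch below illustrates one way an attention-based incompleteness check of this kind could be implemented. It is an assumption-laden illustration, not the authors' published method: the function names, the argmax-based alignment readout, and the coverage_threshold value are all hypothetical choices for this example.

import numpy as np

def is_incomplete(attention, coverage_threshold=0.9):
    # attention: (decoder_steps, encoder_steps) matrix of attention
    # weights collected during free-running (teacher-forcing-free)
    # inference. No ground-truth targets are needed.
    num_enc = attention.shape[1]
    # Encoder position attended most strongly at each decoder step.
    focus = attention.argmax(axis=1)
    # If decoding stopped before the alignment ever reached the tail
    # of the input (here, the last ~10%), the utterance was likely
    # cut short. The 0.9 threshold is an assumed value.
    return focus.max() < coverage_threshold * (num_enc - 1)

def incomplete_rate(attention_matrices):
    # Fraction of validation utterances flagged as incomplete;
    # a lower rate would indicate a more stable checkpoint.
    flags = [is_incomplete(a) for a in attention_matrices]
    return sum(flags) / len(flags)

Under this sketch, checkpoint selection would amount to running inference on the validation set at each checkpoint and preferring checkpoints with the lowest incomplete-utterance rate, filtering out those above a chosen threshold.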