Kernel methods match deep neural networks on TIMIT
Po-Sen Huang, Haim Avron, et al.
ICASSP 2014
Generating natural and expressive prosodic contours is an important component of a text-to-speech (TTS) system and, in most classical architectures, relies on a text-analysis processor that extracts prosody-predictive features and passes them to a statistical learning model. These features range from basic properties of the input string to rich high-level features that may not always be available when developing a TTS system for a new language with sparse computational resources. In this work, we investigate how the prosody model of a speech-synthesis system performs as a function of different predictive feature sets that each assume access to a certain amount of rich resources. Using objective metrics, we examine the effect of relaxing these assumptions about the input representations for prosody prediction in five languages, and we evaluate the perceptual implications for US English.