Gang Wang, Fei Wang, et al.
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics
Deep Neural Networks (DNNs) have been shown to provide state-of-the-art performance over other baseline models in the task of predicting prosodic targets from text in a speech-synthesis system. However, prosody prediction can be affected by an interaction of short- And long-term contextual factors that a static model that depends on a fixed-size context window can fail to properly capture. In this work, we look at a recurrent formulation of neural networks (RNNs) that are deep in time and can store state information from an arbitrarily large input history when making a prediction. We show that RNNs provide improved performance over DNNs of comparable size in terms of various objective metrics for a variety of prosodic streams (notably, a relative reduction of about 6% in F0 mean-square error accompanied by a relative increase of about 14% in F0 variance), as well as in terms of perceptual quality assessed through mean-opinion-score listening tests.
Gang Wang, Fei Wang, et al.
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics
Atsuyoshi Nakamura, Naoki Abe
Electronic Commerce Research
Kun Wang, Juwei Shi, et al.
PACT 2011
Benny Kimelfeld, Yehoshua Sagiv
ICDT 2013