Statistical text-to-speech synthesis based on segment-wise representation with a norm constraint

Stas Tiomkin; David Malah; Slava Shechtman

doi:10.1109/TASL.2010.2040795

IEEE Transactions on Audio, Speech and Language Processing

Paper

24 Jun 2010

Abstract

In statistical HMM-based text-to-speech systems (STTS), speech feature dynamics are modeled by first- and second-order feature frame differences, which typically do not satisfactorily represent the frame-to-frame feature dynamics present in natural speech. The reduced dynamics result in over-smoothing of the speech features, which is often perceived as muffled synthesized speech. In this correspondence, we propose a method to enhance a baseline STTS system by introducing a segment-wise model representation with a norm constraint. The segment-wise representation provides additional degrees of freedom in determining the speech features. We exploit these degrees of freedom to increase the speech feature vector norm so that it matches a norm constraint. As a result, the statistically generated speech features are less over-smoothed and yield more natural-sounding speech, as judged by listening tests. © 2006 IEEE.
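
The two ideas the abstract relies on, frame-difference dynamics and the loss of feature norm under over-smoothing, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the central-difference window, the moving-average stand-in for over-smoothing, the toy trajectory, and the helper names (`deltas`, `norm`, `rescale_to_norm`) are hypothetical and not taken from the paper. In particular, the uniform rescaling is only a crude stand-in for a norm constraint; it is not the segment-wise method the authors propose.

```python
def deltas(frames):
    """First-order frame differences, 0.5 * (x[t+1] - x[t-1]), with edge frames replicated.

    The window is an assumption; real systems use various regression windows.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([0.5 * (b - a) for a, b in zip(prev, nxt)])
    return out


def moving_average(frames):
    """3-point moving average with edge replication, used here to mimic over-smoothing."""
    n = len(frames)
    out = []
    for t in range(n):
        window = [frames[max(t - 1, 0)], frames[t], frames[min(t + 1, n - 1)]]
        out.append([sum(vals) / 3.0 for vals in zip(*window)])
    return out


def norm(frames):
    """Euclidean norm of the whole feature trajectory."""
    return sum(v * v for f in frames for v in f) ** 0.5


def rescale_to_norm(frames, target):
    """Uniformly scale a trajectory so its norm equals `target`.

    Illustrative only: NOT the segment-wise method described in the abstract.
    """
    current = norm(frames)
    if current == 0.0:
        return [list(f) for f in frames]
    gain = target / current
    return [[gain * v for v in f] for f in frames]


# Toy one-dimensional feature trajectory (hypothetical data).
natural = [[0.0], [2.0], [-2.0], [2.0], [0.0]]
smoothed = moving_average(natural)

delta = deltas(natural)        # first-order differences
delta_delta = deltas(delta)    # second-order differences

restored = rescale_to_norm(smoothed, norm(natural))
print(norm(natural), norm(smoothed), norm(restored))
```

In this toy case the smoothed trajectory has a visibly smaller norm than the original one, which is the effect the norm constraint is meant to counteract. Uniform scaling restores the norm but not the shape of the trajectory, which is one reason a method with additional degrees of freedom, such as the segment-wise representation, is of interest.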
