Semi-parametric concatenative TTS with instant voice modification capabilities
Abstract
Recently, a glottal vocoder was integrated into the IBM concatenative TTS system, and a set of configurable global voice transformations was defined in the vocoder parameter space. The vocoder analysis employs a novel, robust strategy for estimating the glottal source parameters. The vocoder is applied to voiced speech only, while unvoiced speech is left unparameterized, which contributes to the perceived naturalness of the synthesized speech. Because the voice transformations are embedded in the synthesis process, the semi-parametric system enables independent, on-the-fly modification of the glottal source and vocal tract components. The effect of the transformations ranges from a slight alteration of the voice to a complete change of the perceived speaker personality. Pitch modifications enhance these changes. At the same time, the voice transformations are simple enough to be controlled easily from outside the system. This allows users either to fine-tune the sound of a voice or to instantly create multiple distinct virtual voices. In both cases, the synthesis is based on a large and meticulously cleaned concatenative TTS voice with broad phonetic coverage. In this paper we present the system and provide subjective evaluations of its voice modification capabilities. The technology presented in this paper is implemented in the IBM Watson TTS service.
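As a rough illustration of the semi-parametric idea described above, the following minimal sketch applies a global transformation to voiced frames only and passes unvoiced frames through untouched. It is not the system's implementation: the frame classes, the parameter representation (lists of numbers), the `VoiceTransform` fields, and the simple multiplicative scaling are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class VoicedFrame:
    """Hypothetical parametric frame produced by a glottal vocoder."""
    f0_hz: float              # pitch of the frame
    glottal: List[float]      # glottal source parameters (assumed representation)
    vocal_tract: List[float]  # vocal tract resonances in Hz (assumed representation)


@dataclass(frozen=True)
class UnvoicedFrame:
    """Unvoiced speech is kept as raw samples, never parameterized."""
    samples: List[float]


@dataclass(frozen=True)
class VoiceTransform:
    """Hypothetical global controls; a value of 1.0 leaves the component unchanged."""
    pitch_scale: float = 1.0
    glottal_scale: float = 1.0
    tract_scale: float = 1.0


Frame = Union[VoicedFrame, UnvoicedFrame]


def transform_frames(frames: List[Frame], t: VoiceTransform) -> List[Frame]:
    """Apply the transform to voiced frames; pass unvoiced frames through as-is."""
    out: List[Frame] = []
    for frame in frames:
        if isinstance(frame, VoicedFrame):
            # Source, tract, and pitch are modified independently of each other.
            out.append(VoicedFrame(
                f0_hz=frame.f0_hz * t.pitch_scale,
                glottal=[g * t.glottal_scale for g in frame.glottal],
                vocal_tract=[f * t.tract_scale for f in frame.vocal_tract],
            ))
        else:
            out.append(frame)  # unvoiced audio is left untouched
    return out


frames = [
    VoicedFrame(f0_hz=100.0, glottal=[0.5], vocal_tract=[500.0, 1500.0]),
    UnvoicedFrame(samples=[0.1, -0.2]),
]
virtual_voice = transform_frames(frames, VoiceTransform(pitch_scale=1.2, tract_scale=1.1))
```

In this sketch a new virtual voice is simply a different `VoiceTransform` value applied to the same sequence of frames, which mirrors the idea of controlling the transformations from outside the system with a small set of global settings.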