Reconciling pronunciation differences between the front-end and back-end in the IBM speech synthesis system
Abstract
This paper presents methods for reconciling the pronunciations produced by a rule-based front-end with those observed in a database of recorded speech. The methods are applied to the IBM Expressive Speech Synthesis System [1] for both unrestricted and limited-domain text-to-speech synthesis. The first method constructs a multiple-pronunciation lattice for the given sentence and scores it using word and phoneme n-gram statistics computed from the target speaker's database. The second stores observed pronunciations and introduces them as alternates in the search. We compare the strengths and weaknesses of the two methods. Results show improvements in both limited and unrestricted domains, with the largest gains coming in the limited-domain case.
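To make the first method concrete, the sketch below selects among alternate word pronunciations by scoring paths through a pronunciation lattice with a phoneme bigram model estimated from a speaker's recordings. This is only an illustration, not the paper's implementation: the phoneme inventory, pronunciations, and smoothing scheme are invented for the example, and the exhaustive path search stands in for whatever lattice search the system actually uses.

```python
# Illustrative sketch: score alternate pronunciations with phoneme bigram
# statistics estimated from a (toy) target-speaker database. All data and
# probabilities here are made-up assumptions, not values from the paper.
import math
from collections import Counter
from itertools import product

# Toy "database" of phoneme sequences observed in the speaker's recordings.
observed = [
    ["DH", "AH", "D", "EY", "T", "AH"],   # "the data" with /EY/
    ["DH", "AH", "D", "EY", "T", "AH"],
    ["DH", "IY", "D", "AE", "T", "AH"],   # "the data" with /AE/
]

# Estimate add-one-smoothed bigram log-probabilities over phonemes.
unigrams, bigrams = Counter(), Counter()
for seq in observed:
    unigrams.update(seq)
    bigrams.update(zip(seq, seq[1:]))
vocab = len(unigrams)

def bigram_logp(a, b):
    return math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))

# Lattice: one list of alternate pronunciations per word in the sentence.
lattice = [
    [["DH", "AH"], ["DH", "IY"]],                      # "the"
    [["D", "EY", "T", "AH"], ["D", "AE", "T", "AH"]],  # "data"
]

def path_score(path):
    phones = [p for pron in path for p in pron]
    return sum(bigram_logp(a, b) for a, b in zip(phones, phones[1:]))

# Exhaustive search over lattice paths; a Viterbi pass over the lattice
# would scale better for long sentences.
best = max(product(*lattice), key=path_score)
print("best pronunciation:", " ".join(p for pron in best for p in pron))
```

Because the bigram statistics are estimated from the speaker's own data, the path that best matches the pronunciations the speaker actually used scores highest, which is the reconciliation effect the method aims for.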