Incorporating sparse representation phone identification features in automatic speech recognition using exponential families
Abstract
Sparse representation phone identification features (SPIF) is a recently developed technique to obtain an estimate of phone posterior probabilities conditioned on an acoustic feature vector. In this paper, we explore incorporating SPIF phone posterior probability estimates in large vocabulary continuous speech recognition (LVCSR) task by including them as additional features of exponential densities that model the HMM state emission likelihoods. We compare our proposed approach to a number of other well known methods of combining feature streams or multiple LVCSR systems. Our experiments show that using exponential models to combine features results in a word error rate reduction of 0.5% absolute (18.7% down to 18.2%); this is comparable to best error rate reduction obtained from system combination methods, but without having to build multiple systems or tune the system combination weights. © 2010 ISCA.