Murat Saraclar, Abhinav Sethy, et al.
ASRU 2013
We present a deep neural network (DNN) architecture which learns time-dependent offsets to acoustic feature vectors according to a discriminative objective function such as maximum mutual information (MMI) between the reference words and the transformed acoustic observation sequence. A key ingredient in this technique is a greedy layer-wise pretraining of the network based on minimum squared error between the DNN outputs and the offsets provided by a linear feature-space MMI (FMMI) transform. Next, the weights of the pretrained network are updated with stochastic gradient ascent by backpropagating the MMI gradient through the DNN layers. Experiments on a 50-hour English broadcast news transcription task show a 4% relative improvement using a 6-layer DNN transform over a state-of-the-art speaker-adapted system with FMMI and model-space discriminative training.
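The pretraining stage described in the abstract can be sketched in a few lines: a small network maps each acoustic frame x to an offset o(x), the transformed feature is x + o(x), and the network is trained by minimum squared error against target offsets. This is a minimal illustrative sketch, not the paper's implementation: the network size, learning rate, and the random stand-in targets (which in the paper would come from the linear FMMI transform) are all assumptions.

```python
import numpy as np

# Illustrative sketch of MSE pretraining for a feature-offset network.
# In the paper, target offsets come from a linear fMMI transform; here
# they are random stand-ins. Shapes and hyperparameters are assumptions.

rng = np.random.default_rng(0)
dim, hidden, n = 40, 64, 256                 # feature dim, hidden units, frames

X = rng.standard_normal((n, dim))            # acoustic feature frames
T = 0.1 * rng.standard_normal((n, dim))      # stand-in fMMI target offsets

W1 = 0.01 * rng.standard_normal((dim, hidden))
W2 = 0.01 * rng.standard_normal((hidden, dim))

def offsets(X):
    """One tanh hidden layer; linear output layer produces the offsets."""
    return np.tanh(X @ W1) @ W2

mse0 = np.mean((offsets(X) - T) ** 2)        # error before pretraining

lr = 0.1
for _ in range(300):                         # squared-error pretraining loop
    H = np.tanh(X @ W1)                      # hidden activations
    G = 2.0 * ((H @ W2) - T) / n             # gradient of mean squared error
    gW2 = H.T @ G                            # backprop through output layer
    gW1 = X.T @ ((G @ W2.T) * (1 - H**2))    # backprop through tanh layer
    W2 -= lr * gW2
    W1 -= lr * gW1

mse = np.mean((offsets(X) - T) ** 2)         # error after pretraining
transformed = X + offsets(X)                 # features passed to the recognizer
```

After this step, the paper fine-tunes the pretrained weights by backpropagating the MMI gradient through the same layers; that discriminative stage requires lattices and an acoustic model and is not shown here.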
George Saon
ICASSP 2006
George Saon, Samuel Thomas, et al.
INTERSPEECH 2013
Michael Picheny, Zoltan Tuske, et al.
INTERSPEECH 2019