Michael Picheny, Zoltan Tuske, et al.
INTERSPEECH 2019
Mel-filter banks are commonly used in speech recognition because they are motivated by theory of speech production and perception. Although features derived from mel-filter banks are popular, we argue that a fixed filter bank is not an appropriate choice, because it is not learned for the objective at hand, i.e., speech recognition. In this paper, we explore replacing the fixed filter bank with a filter bank layer that is learned jointly with the rest of a deep neural network. The filter bank is thus trained to minimize cross-entropy, which is more closely tied to the speech recognition objective. On a 50-hour English Broadcast News task, the filter bank learning approach achieves a 5% relative improvement in word error rate (WER) over a fixed set of filters. © 2013 IEEE.
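The idea of a filter bank layer trained jointly with the network can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The class name `FilterBankLayer`, the mel initialization, the `exp` parameterization used to keep filter weights positive, the `floor` and `eps` constants, and the layer sizes are all assumptions made for illustration. The sketch shows only the forward pass and the gradient with respect to the filter parameters.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular mel filters, shape (n_filters, n_bins). Used only as a starting point."""
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

class FilterBankLayer:
    """Hypothetical learnable filter bank: log(power_spec @ exp(theta).T + eps)."""

    def __init__(self, n_filters=40, n_bins=257, sample_rate=16000, floor=1e-3):
        init = mel_filterbank(n_filters, n_bins, sample_rate)
        # Assumption: store log-weights so the effective filters exp(theta) stay positive.
        self.theta = np.log(np.maximum(init, floor))
        self.eps = 1e-6

    def forward(self, power_spec):
        # power_spec: (frames, n_bins) non-negative power spectrum
        self.x = power_spec
        self.w = np.exp(self.theta)
        self.energy = power_spec @ self.w.T + self.eps
        return np.log(self.energy)  # (frames, n_filters) log filter bank features

    def backward(self, grad_out):
        # grad_out: gradient of the loss w.r.t. the layer output, (frames, n_filters)
        grad_energy = grad_out / self.energy   # through the log
        grad_w = grad_energy.T @ self.x        # through the linear filtering
        return grad_w * self.w                 # through exp(theta)
```

In a joint training setup, `grad_out` would be the cross-entropy gradient backpropagated through the rest of the network, and `theta` would be updated with the same optimizer step as the other layers, for example `layer.theta -= lr * layer.backward(grad_out)` with a hypothetical learning rate `lr`.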
Po-Sen Huang, Haim Avron, et al.
ICASSP 2014
Hagen Soltau, George Saon, et al.
IEEE Transactions on Audio, Speech and Language Processing
Tara N. Sainath, Avishy Carmi, et al.
ICASSP 2010