Michael Picheny, Zoltan Tuske, et al.
INTERSPEECH 2019
We present three voice activity detection (VAD) algorithms that are suitable for the off-line processing of noisy speech and compare their performance on SPINE-2 evaluation data using speech recognition error rate as the quality metric. One VAD system is a simple HMM-based segmenter that uses normalized log-energy and a degree of voicing measure as raw features. The other two VAD systems focus on frequency-localized temporal information in the speech signal using a TempoRAl PatternS (TRAPS) classifier. They differ only in the processing of the TRAPS output. One VAD system uses median filtering to generate segment hypotheses, while the other is a hybrid system that uses a Viterbi search identical to that used in the HMM segmenter. Recognition on the hybrid HMM/TRAPS segmentation is more accurate than recognition on the other two segmentations by 1% absolute. This difference is statistically significant at a 99% confidence level according to a matched pairs sentence-segment word error test.
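The median-filtering variant described above can be illustrated with a minimal sketch. It assumes the classifier emits one speech posterior per frame; the window width, the 0.5 threshold, and the function names `median_filter` and `speech_segments` are hypothetical choices for illustration and are not taken from the paper.

```python
from statistics import median


def median_filter(values, width):
    """Sliding median over an odd-width window; edges are padded by repeating the end values."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    if not values:
        return []
    half = width // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(values))]


def speech_segments(posteriors, width=5, threshold=0.5):
    """Smooth per-frame speech posteriors, then return (start, end) frame pairs, end exclusive."""
    smoothed = median_filter(posteriors, width)
    segments = []
    start = None
    for i, p in enumerate(smoothed):
        is_speech = p >= threshold  # assumed decision rule, not from the paper
        if is_speech and start is None:
            start = i  # a speech segment opens here
        elif not is_speech and start is not None:
            segments.append((start, i))  # segment closes at the first non-speech frame
            start = None
    if start is not None:
        segments.append((start, len(smoothed)))  # segment runs to the end of the signal
    return segments


# Hypothetical frame posteriors: an isolated spike at frame 1 and a dip at frame 6.
frames = [0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.2, 0.9, 0.9, 0.1, 0.1, 0.1]
print(speech_segments(frames, width=3))
```

In this toy input the median filter suppresses the isolated spike and bridges the one-frame dip, so a single segment is hypothesized. The hybrid system in the abstract replaces this smoothing step with a Viterbi search, which is not shown here.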
Hagen Soltau, George Saon, et al.
IEEE Transactions on Audio, Speech and Language Processing
Jennifer C. Lai, Kwan Min Lee
ICSLP 2002
Jing Huang, Brian Kingsbury
ICASSP 2013