A hybrid HMM/TRAPS model for robust voice activity detection
Abstract
We present three voice activity detection (VAD) algorithms that are suitable for the off-line processing of noisy speech and compare their performance on SPINE-2 evaluation data using speech recognition error rate as the quality metric. One VAD system is a simple HMM-based segmenter that uses normalized log-energy and a degree of voicing measure as raw features. The other two VAD systems focus on frequency-localized temporal information in the speech signal using a TempoRAl PatternS (TRAPS) classifier. They differ only in the processing of the TRAPS output. One VAD system uses median filtering to generate segment hypotheses, while the other is a hybrid system that uses a Viterbi search identical to that used in the HMM segmenter. Recognition on the hybrid HMM/TRAPS segmentation is more accurate than recognition on the other two segmentations by 1% absolute. This difference is statistically significant at a 99% confidence level according to a matched pairs sentence-segment word error test.