Conference paper
Robust detection of visual ROI for automatic speechreading
Abstract
In this paper we present our work on visual pruning in an audio-visual (AV) speech recognition scenario. Visual speech information has been used successfully in circumstances where audio-only recognition suffers (e.g., noisy environments). Tracking and extracting the region of interest (ROI), e.g., the speaker's mouth region, from video is an essential component of such systems. It is important for the visual front-end to handle tracking errors, which result in noisy visual data and hamper performance. We present our robust visual front-end and investigate methods for pruning visual noise and their effect on the performance of AV speech recognition systems. Specifically, we estimate the "goodness of ROI" using Gaussian mixture models, and our experiments indicate that significant performance gains are achieved with good-quality visual data.
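
As a rough illustration of how a "goodness of ROI" score based on Gaussian mixture models could drive pruning, the sketch below scores each frame's ROI feature vector under two diagonal-covariance GMMs and keeps only frames whose log-likelihood ratio exceeds a threshold. The abstract does not specify the scoring rule, the features, or the model parameters, so the two-model likelihood ratio, the function names `gmm_log_likelihood` and `prune_frames`, the threshold, and all numeric values here are assumptions made for illustration only.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def prune_frames(frames, good_gmm, bad_gmm, threshold=0.0):
    """Return indices of frames whose good-vs-bad log-likelihood ratio
    exceeds the threshold (assumed scoring rule, not taken from the paper)."""
    kept = []
    for index, x in enumerate(frames):
        score = gmm_log_likelihood(x, *good_gmm) - gmm_log_likelihood(x, *bad_gmm)
        if score > threshold:
            kept.append(index)
    return kept


# Hypothetical toy models: (weights, means, variances) per GMM.
good_gmm = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
bad_gmm = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])

# Hypothetical 2-D ROI features for three frames; the second mimics a tracking error.
frames = [[0.2, 0.1], [5.1, 4.9], [0.9, 1.2]]
kept = prune_frames(frames, good_gmm, bad_gmm)
print(kept)  # [0, 2]
```

In a real system the GMM parameters would be trained on labeled good and bad ROI examples, and the threshold would be tuned on held-out data; none of those choices are stated in the abstract.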