Improving CNN-Based Viseme Recognition Using Synthetic Data
Abstract
Recently, deep learning-based methods have achieved high accuracy on the problem of Visual Speech Recognition. However, while good results have been reported for words and sentences, recognizing shorter segments of speech, such as phones, has proven to be much more challenging due to the lack of temporal and contextual information. In this work, we address the problem of recognizing visemes, the visual counterparts of phonemes (the smallest distinguishable units of sound in a spoken word). Viseme recognition has applications in tasks such as lip synchronization, but acquiring and labeling a viseme dataset is complex and time-consuming. We tackle this problem by creating a large-scale, automatically labeled synthetic 2D dataset rendered from realistic 3D facial models. We then extract real viseme images from the GRID corpus, using the audio track to locate phonemes via forced phonetic alignment and the registered video to extract the corresponding visemes, and evaluate the applicability of the synthetic dataset to recognizing real-world data.
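To make the real-data extraction step concrete, the sketch below illustrates one way the alignment-driven frame extraction described above could be implemented. It assumes, purely for illustration, that the forced alignment is available as a Praat TextGrid (e.g. produced by a forced aligner), that frames are read with OpenCV, and that a hypothetical many-to-one phoneme-to-viseme map is given; none of these specifics are stated in the abstract.

```python
# Illustrative sketch (assumptions noted above): extract one viseme frame per
# aligned phoneme from a GRID video, using a TextGrid alignment and OpenCV.
import cv2        # pip install opencv-python
import textgrid   # pip install textgrid

# Hypothetical (partial) many-to-one phoneme-to-viseme mapping.
PHONE_TO_VISEME = {"p": "bilabial", "b": "bilabial", "m": "bilabial",
                   "f": "labiodental", "v": "labiodental"}

def extract_visemes(video_path, textgrid_path, out_dir):
    tg = textgrid.TextGrid.fromFile(textgrid_path)
    phones = next(t for t in tg.tiers if t.name.lower() == "phones")
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    for i, interval in enumerate(phones):
        viseme = PHONE_TO_VISEME.get(interval.mark.lower())
        if viseme is None:
            continue  # skip phonemes without a viseme label in the map
        # Grab the frame at the temporal midpoint of the aligned phoneme.
        mid_t = 0.5 * (interval.minTime + interval.maxTime)
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(mid_t * fps))
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(f"{out_dir}/{viseme}_{i:04d}.png", frame)
    cap.release()
```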