A unified approach to multi-pose audio-visual ASR
Abstract
The vast majority of studies in the field of audio-visual automatic speech recognition (AVASR) assume frontal images of a speaker's face, but this cannot always be guaranteed in practice. Hence our recent research efforts have concentrated on extracting visual speech information from non-frontal faces, in particular the profile view. Introducing additional views to an AVASR system increases its complexity, as it must handle the different visual features associated with each view. In this paper, we propose the use of linear regression to find a transformation matrix, estimated from synchronous frontal and profile visual speech data, which is used to normalize the visual speech from each viewpoint into a single uniform view. In our experiments on the task of multi-speaker lipreading, we show that this "pose-invariant" technique reduces the train/test mismatch between visual speech features of different views, and is of particular benefit when there is more training data for one viewpoint than another (e.g. frontal over profile).
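The following is a minimal sketch, not the authors' implementation, of the kind of linear-regression pose normalization the abstract describes: given synchronous frontal and profile feature pairs, a transformation matrix is estimated by least squares and then used to map profile-view features into the frontal feature space. All dimensions, variable names, and the random stand-in data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synchronous frontal/profile feature pairs: T frames, D-dimensional features per view
# (stand-ins for real lip-region features such as DCT or appearance-model coefficients).
T, D = 5000, 30
profile_feats = rng.standard_normal((T, D))
frontal_feats = rng.standard_normal((T, D))

# Augment profile features with a bias term so the learned mapping is affine.
X = np.hstack([profile_feats, np.ones((T, 1))])   # shape (T, D+1)
Y = frontal_feats                                  # shape (T, D)

# Least-squares estimate of the (D+1) x D transformation matrix W, so that Y ≈ X @ W.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def normalize_to_frontal(profile_frame: np.ndarray) -> np.ndarray:
    """Project a single profile-view feature vector into the frontal feature space."""
    return np.append(profile_frame, 1.0) @ W

# After normalization, frontal and transformed-profile features can be fed to one
# lipreading model, reducing the train/test mismatch between views.
mapped = normalize_to_frontal(profile_feats[0])
print(mapped.shape)  # (30,)
```

In this sketch the profile view is normalized toward the frontal view; the same construction applies in the other direction, or toward any chosen reference view, by swapping the roles of the two feature sets.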