The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video.
We propose a two-stream ConvNet architecture in which the mapping between the sound and the mouth images is learnt end-to-end from unlabelled data. The trained network is then used to determine the lip-sync error in a video.
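To make the setup concrete, the sketch below shows one way such a two-stream network could be implemented in PyTorch. The MFCC-style audio input, the stack of grayscale mouth crops, the layer sizes, and the contrastive loss on in-sync versus artificially time-shifted audio-video pairs are all illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioStream(nn.Module):
    """ConvNet embedding a short window of audio features (e.g. MFCCs)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # collapse to (B, 128, 1, 1)
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):                     # x: (B, 1, n_mfcc, n_steps)
        return self.fc(self.conv(x).flatten(1))

class VisualStream(nn.Module):
    """ConvNet embedding a stack of consecutive grayscale mouth crops."""
    def __init__(self, n_frames=5, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_frames, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):                     # x: (B, n_frames, H, W)
        return self.fc(self.conv(x).flatten(1))

def contrastive_loss(a, v, label, margin=1.0):
    """label = 1 for in-sync audio-video pairs, 0 for shifted pairs.

    Pulls genuine pairs together in embedding space and pushes
    out-of-sync pairs apart up to the margin -- one natural
    self-supervised objective for this kind of two-stream setup.
    """
    d = F.pairwise_distance(a, v)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()
```

Given trained embeddings of this form, the lip-sync error of a clip can be estimated by sliding the audio window over a range of temporal offsets and taking the offset that minimises the embedding distance averaged over the clip.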
We apply the network to two further tasks: active speaker detection and lip reading. On both tasks we set a new state of the art on standard benchmark datasets.