Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition

Authors: Jiadong Wang, Zexu Pan, Malu Zhang, Robby T. Tan, Haizhou Li

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our framework is evaluated in terms of Word Error Rate (WER) on the original videos, the videos corrupted by concealed lips, and the videos restored using the framework, with several existing state-of-the-art audio-visual speech recognition methods. Experimental results substantiate that our framework significantly mitigates the performance degradation resulting from lip occlusion.
Researcher Affiliation Academia Jiadong Wang (1,3), Zexu Pan (1), Malu Zhang (2*), Robby T. Tan (1), Haizhou Li (3,1); 1 National University of Singapore; 2 University of Electronic Science and Technology of China; 3 Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; jiadong.wang@u.nus.edu, maluzhang@uestc.edu.cn
Pseudocode No The paper does not contain any pseudocode or algorithm blocks.
Open Source Code No The paper does not include a statement about releasing code or a link to a code repository.
Open Datasets Yes Our AVLR is trained on LRS2 (Afouras et al. 2018), which encompasses 224 hours of audio-visual data along with text annotation. In the evaluation phase, we assess all chosen audio-visual speech recognition methods based on their WER on the LRS2 and LRS3 datasets (Afouras et al. 2018).
Dataset Splits Yes Our AVLR is trained on LRS2 (Afouras et al. 2018), which encompasses 224 hours of audio-visual data along with text annotation. In the evaluation phase, we assess all chosen audio-visual speech recognition methods based on their WER on the LRS2 and LRS3 datasets (Afouras et al. 2018). The selection of the lip-reading expert follows (Wang, Qian, and Li 2022), which chooses the AV-HuBERT base model that is pre-trained on VoxCeleb2 (Chung, Nagrani, and Zisserman 2018) and LRS3 (Afouras et al. 2018), and fine-tuned on LRS2 (Afouras et al. 2018).
Hardware Specification No The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory).
Software Dependencies No The paper mentions software components such as ResNet, AV-HuBERT, and Transformer, but does not specify their version numbers or any other ancillary software with versions.
Experiment Setup Yes Before applying the AVLR procedure, we follow the preprocessing steps outlined in (Prajwal et al. 2020; Wang et al. 2023). This includes utilizing a face detection module to identify faces, cropping them using bounding boxes, and resizing them to dimensions of 96 × 96. We simulate lip occlusion following (Hong et al. 2023), which employs objects from the Naturalistic Occlusion Generation dataset (Voo, Jiang, and Loy 2022). In detail, we add an object to about 30% of the frames by aligning its centre with one of the mouth landmarks. The downsample block in the occluded-frame detection and the matching module reduces the size of input images to 1/4. The discriminator only distinguishes the lower-half faces of I_mat and I_gt. We use an L1 loss to optimize the synthesis module and the matching module, as shown in Eq. 5: L_rec = |I_mat − I_gt| + |I_syn − I_gt|. To ensure the visual realism of I_mat, we apply a GAN loss (Park et al. 2022; Liang et al. 2022; Goodfellow et al. 2020).
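
The following is a minimal PyTorch sketch of the quoted setup: pasting an occluder onto roughly 30% of frames centred on a mouth landmark, and the Eq. 5 reconstruction loss combined with a GAN term. The function names, the random sampling scheme, the non-saturating GAN formulation, and the lambda_gan weight are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F

def simulate_lip_occlusion(frames, occluder, mouth_landmarks, occlude_ratio=0.3):
    """Paste an occluder patch onto roughly `occlude_ratio` of the frames,
    centring it on a randomly chosen mouth landmark (illustrative re-creation
    of the occlusion protocol quoted above; the sampling scheme is assumed).

    frames:          (T, 3, H, W) float tensor of face crops (e.g. 96 x 96)
    occluder:        (3, h, w) float tensor, an object crop (e.g. from the
                     Naturalistic Occlusion Generation dataset)
    mouth_landmarks: (T, K, 2) integer (x, y) coordinates of mouth landmarks
    """
    T, _, H, W = frames.shape
    _, h, w = occluder.shape
    out = frames.clone()
    for t in range(T):
        if torch.rand(()) >= occlude_ratio:
            continue
        k = torch.randint(mouth_landmarks.shape[1], ()).item()
        cx, cy = mouth_landmarks[t, k].tolist()
        # Top-left corner so the occluder centre sits on the chosen landmark,
        # clamped to keep the patch inside the frame.
        x0 = int(min(max(cx - w // 2, 0), W - w))
        y0 = int(min(max(cy - h // 2, 0), H - h))
        out[t, :, y0:y0 + h, x0:x0 + w] = occluder
    return out

def reconstruction_loss(i_mat, i_syn, i_gt):
    """Eq. 5: L_rec = |I_mat - I_gt| + |I_syn - I_gt| as pixel-wise L1 terms."""
    return F.l1_loss(i_mat, i_gt) + F.l1_loss(i_syn, i_gt)

def generator_objective(i_mat, i_syn, i_gt, d_logits_lower_half, lambda_gan=1.0):
    """Reconstruction loss plus a non-saturating GAN term on the lower-half
    face crop judged by the discriminator. `lambda_gan` is an assumed weight;
    the paper does not report the exact GAN formulation or weighting."""
    l_gan = F.binary_cross_entropy_with_logits(
        d_logits_lower_half, torch.ones_like(d_logits_lower_half))
    return reconstruction_loss(i_mat, i_syn, i_gt) + lambda_gan * l_gan

Consistent with the quoted setup, d_logits_lower_half would come from a discriminator that sees only the lower-half face crops of I_mat and I_gt.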