Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition
Authors: Jiadong Wang, Zexu Pan, Malu Zhang, Robby T. Tan, Haizhou Li
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our framework is evaluated in terms of Word Error Rate (WER) on the original videos, the videos corrupted by concealed lips, and the videos restored using the framework, with several existing state-of-the-art audio-visual speech recognition methods. Experimental results substantiate that our framework significantly mitigates performance degradation resulting from lip occlusion. (An illustrative WER scoring sketch appears after this table.) |
| Researcher Affiliation | Academia | Jiadong Wang (1,3), Zexu Pan (1), Malu Zhang (2)*, Robby T. Tan (1), Haizhou Li (3,1). 1: National University of Singapore; 2: University of Electronic Science and Technology of China; 3: Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China. Emails: jiadong.wang@u.nus.edu, maluzhang@uestc.edu.cn |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include a statement about releasing code or a link to a code repository. |
| Open Datasets | Yes | Our AVLR is trained on LRS2 (Afouras et al. 2018) which encompasses 224 hours of audio-visual data along with text annotation. In the evaluation phase, we assess all chosen audio-visual speech recognition methods based on their WER on LRS2 and LRS3 datasets (Afouras et al. 2018). |
| Dataset Splits | Yes | Our AVLR is trained on LRS2 (Afouras et al. 2018), which encompasses 224 hours of audio-visual data along with text annotation. In the evaluation phase, we assess all chosen audio-visual speech recognition methods based on their WER on the LRS2 and LRS3 datasets (Afouras et al. 2018). The selection of the lip-reading expert follows (Wang, Qian, and Li 2022), which chooses the AV-HuBERT base model pre-trained on VoxCeleb2 (Chung, Nagrani, and Zisserman 2018) and LRS3 (Afouras et al. 2018) and fine-tuned on LRS2 (Afouras et al. 2018). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper mentions software components such as ResNet, AV-HuBERT, and Transformer but does not specify their version numbers or any other ancillary software with versions. |
| Experiment Setup | Yes | Before applying the AVLR procedure, we follow the preprocessing steps outlined in (Prajwal et al. 2020; Wang et al. 2023). This includes utilizing a face detection module to identify faces, cropping them using bounding boxes, and resizing them to dimensions of 96×96. We simulate lip occlusion following (Hong et al. 2023), which employs objects from the Naturalistic Occlusion Generation dataset (Voo, Jiang, and Loy 2022). In detail, we add an object to about 30% of frames by aligning its centre with one of the mouth landmarks. The downsample block in the occluded-frame detection and the matching module reduces the size of input images to 1/4. The discriminator only distinguishes lower-half faces between $I_{mat}$ and $I_{gt}$. We use an L1 loss to optimize the synthesis module and the matching module, as shown in Eq. 5: $L_{rec} = \|I_{mat} - I_{gt}\| + \|I_{syn} - I_{gt}\|$. To ensure the visual realism of $I_{mat}$, we apply a GAN loss (Park et al. 2022; Liang et al. 2022; Goodfellow et al. 2020). (Illustrative sketches of the occlusion simulation and the reconstruction loss appear after this table.) |
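
The occlusion simulation described under Experiment Setup (pasting an occluder onto roughly 30% of frames, centred on a mouth landmark) can be sketched as below. This is a minimal illustration under assumed inputs: the function name, the landmark format, and the binary-mask blending are our assumptions, not the authors' code.

```python
# Illustrative sketch of the occlusion simulation step: paste an occluder patch
# (e.g. an object from the Naturalistic Occlusion Generation dataset) onto about
# 30% of the frames, centring it on a randomly chosen mouth landmark.
# Function name, landmark format, and blending details are assumptions.
import random
import numpy as np

def occlude_frames(frames, mouth_landmarks, occluder, occluder_mask, ratio=0.3):
    """frames: list of HxWx3 uint8 arrays; mouth_landmarks: per-frame list of
    (x, y) mouth points; occluder: hxwx3 uint8 patch; occluder_mask: hxw 0/1 mask."""
    occluded = []
    h, w = occluder.shape[:2]
    for frame, landmarks in zip(frames, mouth_landmarks):
        frame = frame.copy()
        if random.random() < ratio:
            # align the occluder centre with one of the mouth landmarks
            cx, cy = random.choice(landmarks)
            x0, y0 = int(cx - w // 2), int(cy - h // 2)
            x1, y1 = x0 + w, y0 + h
            # clip the paste region to the frame boundaries
            fx0, fy0 = max(x0, 0), max(y0, 0)
            fx1, fy1 = min(x1, frame.shape[1]), min(y1, frame.shape[0])
            ox0, oy0 = fx0 - x0, fy0 - y0
            ox1, oy1 = ox0 + (fx1 - fx0), oy0 + (fy1 - fy0)
            # blend the occluder into the frame where the mask is 1
            m = occluder_mask[oy0:oy1, ox0:ox1, None]
            frame[fy0:fy1, fx0:fx1] = (
                m * occluder[oy0:oy1, ox0:ox1] + (1 - m) * frame[fy0:fy1, fx0:fx1]
            ).astype(np.uint8)
        occluded.append(frame)
    return occluded
```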
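
The reconstruction loss in Eq. 5 is a plain sum of two L1 terms over the matched and synthesized frames. A minimal PyTorch sketch, assuming (B, C, H, W) tensors named after the paper's notation; the additional GAN loss on lower-half faces is only noted in the comment, not reproduced:

```python
# Minimal PyTorch sketch of Eq. 5: L_rec = |I_mat - I_gt| + |I_syn - I_gt|.
# The GAN loss applied to lower-half faces is trained separately and is omitted here.
import torch
import torch.nn.functional as F

def reconstruction_loss(i_mat, i_syn, i_gt):
    """i_mat, i_syn, i_gt: (B, C, H, W) tensors for the matched frame,
    the synthesized frame, and the ground-truth frame."""
    return F.l1_loss(i_mat, i_gt) + F.l1_loss(i_syn, i_gt)

# example usage with dummy 96x96 RGB frames
i_mat = torch.rand(2, 3, 96, 96)
i_syn = torch.rand(2, 3, 96, 96)
i_gt = torch.rand(2, 3, 96, 96)
loss = reconstruction_loss(i_mat, i_syn, i_gt)
```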
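
For the WER comparison noted under Research Type, a minimal scoring sketch over the three conditions (original, occluded, restored), using the `jiwer` package as an assumption since the paper does not name its scoring tool:

```python
# Score each recognizer output against the reference transcript for the
# original, occluded, and restored video conditions. Transcripts are dummy
# placeholders; the use of `jiwer` is an assumption.
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypotheses = {
    "original": "the quick brown fox jumps over the lazy dog",
    "occluded": "the quick brown fox jumps over a lazy dock",
    "restored": "the quick brown fox jumps over the lazy dog",
}

for condition, hyp in hypotheses.items():
    print(f"{condition}: WER = {wer(reference, hyp):.3f}")
```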