LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Authors: Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, Ethan Fetaya

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test set and an unrestricted vocabulary. Moreover, our experiments show that the inclusion of the text modality plays a major role in the intelligibility of the produced speech, readily perceptible while listening, and is empirically reflected in the substantial reduction of the word error rate (WER) metric. We demonstrate the effectiveness of LipVoicer through human evaluation, which shows that it produces more natural and synchronized speech signals compared to competing methods.
Researcher Affiliation | Academia | Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, Ethan Fetaya; Faculty of Electrical Engineering, Bar-Ilan University, Israel; {yochai.yemini,aviv.shamsian,brachal,sharon.gannot,ethan.fetaya}@biu.ac.il
Pseudocode | Yes | The inference process is summarized in Algorithm 1 in the Appendix.
Open Source Code | Yes | Project page and code: https://github.com/yochaiye/LipVoicer
Open Datasets | Yes | LipVoicer is compared against the baselines on the highly challenging datasets LRS2 (Afouras et al., 2018a) and LRS3 (Afouras et al., 2018b).
Dataset Splits | No | The paper states: "We train LipVoicer using the pretrain+train splits of LRS2 and LRS3 on each dataset separately, and evaluation is carried out on the full unseen test data splits." While it mentions training and testing splits, it does not explicitly define a separate validation split with specific percentages or counts.
Hardware Specification | Yes | Our implementation was written in PyTorch, and we used 4 NVIDIA GeForce RTX 2080 Ti for our experiments.
Software Dependencies | No | The paper mentions that the implementation was written in PyTorch and that it relies on the publicly available code and pre-trained models released by the authors of Burchi & Timofte (2023), dlib (King, 2009), and DiffWave. However, it does not provide specific version numbers for any of these software components.
Experiment Setup | Yes | The diffusion process has T = 400 steps. We set β1 = 0.0001 and βT = 0.02, and use a linear noise schedule. The number of input and output channels is fixed to 80... We use 12 residual layers with 512 residual channels... We trained on 1,000,000 mini-batches of 16 videos and used the Adam optimizer with a learning rate of 2e-4 without scheduling. For the classifier-free guidance mechanism, we follow Ho & Salimans (2021) by setting the dropout probability on the conditioning to 0.2.
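
For readers who want to reproduce this configuration, below is a minimal PyTorch sketch of the quoted hyper-parameters: the linear β schedule with T = 400, 80 mel channels in and out, Adam at a learning rate of 2e-4 with no scheduling, and the 0.2 conditioning-dropout probability for classifier-free guidance. It is an illustrative assumption, not the authors' code: the one-layer denoiser, the unconditional training step, and the random mel batch are stand-ins for the paper's 12-layer residual network and its video/text conditioning.

```python
# Sketch of the quoted training setup (assumptions, not the released LipVoicer code).
import torch
import torch.nn as nn

T = 400
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule, beta_1=0.0001 ... beta_T=0.02
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product used for the forward process

# Stand-in denoiser with 80 input/output channels; the real model uses
# 12 residual layers with 512 residual channels and takes the timestep
# and the video/lip-reading conditioning as extra inputs.
denoiser = nn.Conv1d(80, 80, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=2e-4)  # no LR scheduling

# Classifier-free guidance: in the full model the conditioning is dropped with
# this probability during training; the stub below is unconditional, so the
# value is shown only for reference.
p_drop_cond = 0.2

def training_step(mel):
    """One denoising-score-matching step on a batch of mel spectrograms (B, 80, frames)."""
    t = torch.randint(0, T, (mel.size(0),))
    noise = torch.randn_like(mel)
    a_bar = alphas_bar[t].view(-1, 1, 1)
    noisy = a_bar.sqrt() * mel + (1 - a_bar).sqrt() * noise
    loss = nn.functional.mse_loss(denoiser(noisy), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: one step on a random batch of 16 (the paper trains on 1,000,000 such mini-batches).
print(training_step(torch.randn(16, 80, 100)))
```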