Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
Authors: Ji-Hoon Kim, Jaehun Kim, Joon Son Chung
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experiments on two datasets, and demonstrate that our method achieves the generation quality close to that of real human utterance, outperforming existing methods in terms of speech naturalness and intelligibility by a large margin. |
| Researcher Affiliation | Academia | Ji-Hoon Kim*, Jaehun Kim*, Joon Son Chung; Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea; {jh.kim, kjaehun, joonson}@kaist.ac.kr |
| Pseudocode | No | The paper describes its model architecture and components in text and diagrams but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | Synthesised samples are available at our demo page: https://mm.kaist.ac.kr/projects/LTBS. (This is a demo page, not source code.) We follow the official implementation for VCA-GAN and MT. (These are for baseline models, not the proposed method's code.) |
| Open Datasets | Yes | GRID (Cooke et al. 2006) is one of the established datasets for lip to speech synthesis in a constrained environment. It contains 33 speakers and 50 minutes of short video clips per speaker. The number of vocabularies in GRID is only 51 and the speakers always face forward with nearly no head movement. In our experiment, the dataset is split into train (80%), validation (10%), and test set (10%) by sampling equally from all speakers. Lip2Wav (Prajwal et al. 2020) is a large-scale benchmark dataset for an unconstrained and large vocabulary lip to speech synthesis. It comprises real-world lecture recordings featuring 5 distinct speakers, with about 20 hours of video for each speaker. Our experiment is conducted on 2 speakers, Lip2Wav-Chemistry and Lip2Wav-Chess, as in Glow LTS (He et al. 2022). The two speakers are trained jointly in a multi-speaker setting, and both are equally divided into 80-10-10% for train, validation, and test sets. |
| Dataset Splits | Yes | In our experiment, the dataset is split into train (80%), validation (10%), and test set (10%) by sampling equally from all speakers. For the Lip2Wav dataset, we sample 75 contiguous frames and train our model for 900 epochs. The two speakers are trained jointly in a multi-speaker setting, and both are equally divided into 80-10-10% for train, validation, and test sets. (See the per-speaker split sketch below the table.) |
| Hardware Specification | Yes | Our model is trained on four NVIDIA A5000 GPUs with a batch size of 64. |
| Software Dependencies | No | The paper mentions software components like 'AdamW optimiser', 'Fre-GAN', 'HuBERT', 'pYIN algorithm', and 'Face Alignment', but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | Our model is trained on four NVIDIA A5000 GPUs with a batch size of 64. We use the AdamW optimiser (Loshchilov and Hutter 2019) with β1 = 0.9, β2 = 0.98, and ϵ = 10⁻⁹. The learning rate is fixed to 2 × 10⁻⁴, and the weight decay is set to 10⁻⁶. For training on the GRID dataset, we randomly sample a consecutive sequence with a length of 50, and the model is trained for 400 epochs. For the Lip2Wav dataset, we sample 75 contiguous frames and train our model for 900 epochs. To prevent overfitting, we apply data augmentations: horizontal flipping with a probability of 50%, and random masking with a fixed position throughout all frames. The masked area is randomly sampled within the range from 10 × 10 to 30 × 30. (See the optimiser and augmentation sketch below the table.) |
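
The per-speaker 80-10-10 split described in the Dataset Splits row can be illustrated with a minimal sketch. Everything below (the `split_per_speaker` helper, the `(speaker_id, clip_path)` list format, and the seed) is an assumption made for illustration, not code released by the authors.

```python
import random
from collections import defaultdict

def split_per_speaker(clips, seed=1234):
    """Split a list of (speaker_id, clip_path) pairs into 80/10/10
    train/val/test sets, sampling equally from every speaker."""
    by_speaker = defaultdict(list)
    for speaker_id, clip_path in clips:
        by_speaker[speaker_id].append(clip_path)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for speaker_id, paths in by_speaker.items():
        rng.shuffle(paths)
        n = len(paths)
        n_train, n_val = int(0.8 * n), int(0.1 * n)
        train += paths[:n_train]                    # 80% per speaker
        val += paths[n_train:n_train + n_val]       # 10% per speaker
        test += paths[n_train + n_val:]             # remaining ~10%
    return train, val, test
```

Because the shuffle and split happen within each speaker's clip list, every speaker contributes proportionally to all three sets, matching the "sampling equally from all speakers" description.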
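
The optimiser settings and frame-consistent masking augmentation from the Experiment Setup row can be sketched in PyTorch as follows. The placeholder model, the `(T, C, H, W)` tensor layout, and the zero-fill masking are assumptions; only the hyper-parameter values come from the paper.

```python
import random
import torch

# Reported hyper-parameters: AdamW with beta1=0.9, beta2=0.98, eps=1e-9,
# fixed learning rate 2e-4, weight decay 1e-6. The model is a stand-in.
model = torch.nn.Linear(512, 80)  # placeholder for the lip-to-speech network
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.9, 0.98), eps=1e-9, weight_decay=1e-6
)

def augment_video(frames, p_flip=0.5):
    """Augment a video clip of shape (T, C, H, W): horizontal flip with 50%
    probability, plus a square mask of side 10-30 px applied at the same
    position in every frame (zero-fill masking is an assumption)."""
    if random.random() < p_flip:
        frames = torch.flip(frames, dims=[-1])  # flip along the width axis
    side = random.randint(10, 30)
    _, _, h, w = frames.shape
    top = random.randint(0, h - side)
    left = random.randint(0, w - side)
    frames[:, :, top:top + side, left:left + side] = 0.0  # same mask on all frames
    return frames
```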