Neural Dubber: Dubbing for Videos According to Scripts

Authors: Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the chemistry lecture single-speaker dataset and the LRS2 multi-speaker dataset show that Neural Dubber can generate speech audio on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.
Researcher Affiliation | Collaboration | Chenxu Hu1, Qiao Tian2, Tingle Li1,3, Yuping Wang2, Yuxuan Wang2, Hang Zhao1,3 — 1IIIS, Tsinghua University; 2ByteDance; 3Shanghai Qi Zhi Institute
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a project website link (https://tsinghua-mars-lab.github.io/NeuralDubber/) but does not explicitly state that the source code for the described methodology is available there, nor does it provide a direct link to a code repository.
Open Datasets | Yes | Single-speaker dataset: "In the single-speaker setting, we conduct experiments on the chemistry lecture dataset from Lip2Wav [35]." Multi-speaker dataset: "In the multi-speaker setting, we conduct experiments on the LRS2 [1] dataset, which consists of thousands of sentences spoken by various speakers on BBC channels."
Dataset Splits | Yes | "Finally, the dataset contains 6,640 samples, with the total video length of approximately 9 hours. We randomly split the dataset into 3 sets: 6240 samples for training, 200 samples for validation, and 200 samples for testing."
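The random split quoted above (6,640 samples into 6240/200/200) could be reproduced with a sketch like the following; the function name, seed, and split order are illustrative assumptions, since the paper does not specify them.

```python
import random

def split_dataset(samples, n_val=200, n_test=200, seed=0):
    """Randomly split samples into train/val/test sets.

    Sketch of the 6240/200/200 split described in the paper; the seed
    and the order in which subsets are carved out are assumptions.
    """
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    test = samples[:n_test]
    val = samples[n_test:n_test + n_val]
    train = samples[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(6640))
# len(train) == 6240, len(val) == 200, len(test) == 200
```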
Hardware Specification | Yes | "We train Neural Dubber on 1 NVIDIA V100 GPU. Each Parallel WaveGAN vocoder is trained on 1 NVIDIA V100 GPU for 1000K steps."
Software Dependencies | No | The paper mentions specific models such as ResNet18 and ResNet50, optimizers such as Adam, and vocoders such as Parallel WaveGAN, and refers to an "open-source grapheme-to-phoneme tool" and an "open-source Tacotron repository," but does not provide version numbers for any of these software dependencies.
Experiment Setup | Yes | "We use the Adam optimizer [25] with β1 = 0.9, β2 = 0.98, ε = 10⁻⁹ and follow the same learning rate schedule in [48]. Our model is optimized with the loss similar to that in [38]. We set the batch size to 18 and 24 on the chem dataset and LRS2 dataset respectively. It takes 200k/300k steps for training until convergence on the chem/LRS2 dataset."
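The quoted setup (Adam with β1 = 0.9, β2 = 0.98, ε = 10⁻⁹ and the learning-rate schedule of [48], i.e. the warmup/inverse-square-root "Noam" schedule from the Transformer paper) could be configured as in this sketch; the `d_model` and `warmup` values are illustrative assumptions not stated in this quote.

```python
# Adam hyperparameters exactly as quoted from the paper.
adam_config = dict(betas=(0.9, 0.98), eps=1e-9)

def noam_lr(step, d_model=256, warmup=4000):
    """Warmup/inverse-sqrt schedule from [48] (Vaswani et al.).

    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    d_model and warmup are assumed values for illustration only.
    """
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The schedule rises linearly for the first `warmup` steps, peaks at `step == warmup`, then decays proportionally to the inverse square root of the step count.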