VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing

Authors: Yihan Wu, Junliang Guo, Xu Tan, Chen Zhang, Bohan Li, Ruihua Song, Lei He, Sheng Zhao, Arul Menezes, Jiang Bian

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We design experiments on four language directions (German → English, Spanish → English, Chinese ↔ English), and the results show that VideoDubber achieves better length control ability on the generated speech than baseline methods.
Researcher Affiliation | Collaboration | Yihan Wu1*, Junliang Guo2, Xu Tan2, Chen Zhang3, Bohan Li3, Ruihua Song1, Lei He3, Sheng Zhao3, Arul Menezes4, Jiang Bian2. 1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Microsoft Research Asia; 3 Microsoft Azure Speech; 4 Microsoft Azure Translation
Pseudocode | No | The paper describes its methods in prose and with diagrams (Figure 2 shows an architecture diagram), but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'Demo samples are available at https://speechresearch.github.io/videodubbing/' and 'The test set can be built following https://speechresearch.github.io/videodubbing/'. These links point to demo samples and instructions for building the test set, not to open-source code for the methodology itself, and there is no explicit statement about a code release.
Open Datasets | Yes | For language directions from Others → English, we use the public speech-to-speech translation dataset CVSS (Jia et al. 2022), which contains multilingual-to-English speech-to-speech translation corpora derived from the CoVoST 2 dataset (Wang et al. 2021). For the direction from English to Chinese, we use the En-Zh subset of MuST-C (Cattoni et al. 2021), an English → Others speech translation corpus built from English TED Talks.
Dataset Splits | No | The paper mentions training on certain datasets and evaluating on test sets, but it does not explicitly provide specific percentages or sample counts for training, validation, and test splits, nor does it reference predefined splits in sufficient detail for reproduction.
Hardware Specification | No | The paper describes model configurations and training settings but does not provide specific hardware details such as GPU models, CPU types, or memory amounts used to run the experiments.
Software Dependencies | No | The paper mentions software such as 'Transformer', 'AdaSpeech 4', and 'Montreal forced alignment (MFA)' but does not provide specific version numbers for these or any other ancillary software dependencies, which are necessary for reproducibility.
Experiment Setup | Yes | For all settings, we set the hidden dimension to 512 for the model and 2048 for the feed-forward layers. For both the encoder and decoder, we use 6 layers and 8 heads for the multi-head attention. The duration predictor consists of a two-layer convolutional network with ReLU activation, each layer followed by layer normalization and dropout. The duration predictor is optimized with the mean squared error (MSE) loss, taking the ground-truth duration as the training target.
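
The paper does not release code, but the Experiment Setup excerpt gives enough detail to sketch the duration predictor. Below is a minimal sketch, assuming PyTorch and a FastSpeech-style layout; the kernel size, dropout rate, and filter dimension are hypothetical, since the excerpt does not state them.

```python
import torch
import torch.nn as nn


class DurationPredictor(nn.Module):
    """Two-layer 1D convolutional duration predictor: each convolution is
    followed by ReLU, layer normalization, and dropout, then a linear
    projection to one scalar duration per token. Kernel size, dropout rate,
    and filter dimension are assumptions, not values from the paper."""

    def __init__(self, hidden_dim: int = 512, filter_dim: int = 512,
                 kernel_size: int = 3, dropout: float = 0.1):
        super().__init__()
        padding = kernel_size // 2
        self.conv1 = nn.Conv1d(hidden_dim, filter_dim, kernel_size, padding=padding)
        self.norm1 = nn.LayerNorm(filter_dim)
        self.conv2 = nn.Conv1d(filter_dim, filter_dim, kernel_size, padding=padding)
        self.norm2 = nn.LayerNorm(filter_dim)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(filter_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim) token-level hidden states
        h = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(self.norm1(torch.relu(h)))
        h = self.conv2(h.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(self.norm2(torch.relu(h)))
        return self.proj(h).squeeze(-1)  # (batch, seq_len) predicted durations


# Training objective from the excerpt: MSE against ground-truth durations
# (e.g., obtained from forced alignment). Dummy tensors below are for illustration.
if __name__ == "__main__":
    predictor = DurationPredictor()
    states = torch.randn(2, 10, 512)          # dummy encoder/decoder states
    target_durations = torch.rand(2, 10) * 5  # dummy ground-truth durations
    loss = nn.MSELoss()(predictor(states), target_durations)
    loss.backward()
```

The hidden dimension of 512 matches the Transformer configuration quoted above; all other hyperparameters in the sketch are placeholders.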