TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Authors: Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Michael Zeng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model. We evaluate the performance of TransVIP for French-English mutual translation using a subset of the CVSS-T test set [8]. The results demonstrate that TransVIP outperforms publicly available SOTA models such as a larger SeamlessExpressive model.
Researcher Affiliation | Collaboration | (1) Shanghai Jiao Tong University, China; (2) Microsoft, USA
Pseudocode | Yes | Appendix D provides pseudocode for the Layer Beam Search algorithm (a generic beam-search sketch is included after the table for reference).
Open Source Code | Yes | The training code and scripts are available at https://github.com/nethermanpro/transvip.
Open Datasets | Yes | We evaluate the performance of TransVIP for French-English mutual translation using a subset of the CVSS-T test set [8], specifically the fr-en test set containing 300 utterances. [...] We utilize these datasets during training as follows: the S2ST dataset comprises quadruples (X, Ts, Tt, Y), where Ts is the source text and Tt is the target text. [...] The ASR dataset, which is highly accessible, consists of pairs (X, Ts). [...] The joint translation model [...] is trained using multiple datasets, including two S2ST datasets, CVSS-T [54] and SeamlessAlign, one internal ST dataset, and one ASR dataset, Common Voice [55] version 15 (English and French subsets). (A sketch of these tuple formats appears after the table.)
Dataset Splits | No | The paper mentions using a 'validation loss curve' and filtering data during training, but it does not provide specific details about the training/validation/test dataset splits (e.g., percentages or sample counts) for its custom setup, beyond using existing test sets such as CVSS-T for evaluation.
Hardware Specification | Yes | All three models within the system are trained using 32 NVIDIA V100 32G GPUs.
Software Dependencies | No | The paper mentions using the Fairseq2 library and the PyTorch Lightning framework but does not provide specific version numbers for these software dependencies (e.g., 'Fairseq2 vX.Y' or 'PyTorch Lightning vX.Y').
Experiment Setup | Yes | For decoding, we employed a beam search algorithm with a beam size set to 5. [...] The beam size is 10, the sampling number is 20 and K is 3. [...] We use at most a 10-second prompt for the joint translation model and a 5-second prompt for the NAR acoustic model. [...] The acoustic encoder is a six-layer standard Transformer encoder with a hidden size of 1024. This 12-layer transformer model is trained from scratch and utilizes mainly two unsupervised corpora. (These hyperparameters are collected into a hedged configuration sketch after the table.)
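
On the Pseudocode entry: the paper's Appendix D gives pseudocode for its Layer Beam Search. The snippet below is only a minimal, generic beam-search skeleton in Python, sketched to show the overall shape of such a decoder; the `expand` and `is_final` callables are hypothetical placeholders, and this is not the paper's exact Layer Beam Search over codec layers.

```python
# Minimal, generic beam-search skeleton (NOT the paper's Layer Beam Search;
# `expand` and `is_final` are hypothetical placeholders for illustration).
from typing import Callable, List, Tuple


def beam_search(
    init_state,
    expand: Callable[[object], List[Tuple[object, float]]],  # state -> [(next_state, log_prob)]
    is_final: Callable[[object], bool],
    beam_size: int = 5,
    max_steps: int = 100,
):
    # Each hypothesis is a (cumulative log-probability, state) pair.
    beams = [(0.0, init_state)]
    finished = []
    for _ in range(max_steps):
        candidates = []
        for score, state in beams:
            if is_final(state):
                finished.append((score, state))
                continue
            for next_state, logp in expand(state):
                candidates.append((score + logp, next_state))
        if not candidates:
            break
        # Keep only the top-`beam_size` partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    # Fall back to the best unfinished hypothesis if nothing terminated.
    finished.extend(beams)
    return max(finished, key=lambda c: c[0])
```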
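
On the Open Datasets entry: the quoted tuple formats can be made concrete with a small sketch. The class and field names below are illustrative assumptions; the paper only specifies that S2ST examples are quadruples (X, Ts, Tt, Y) and ASR examples are pairs (X, Ts).

```python
# Illustrative container types for the dataset tuples quoted above.
# Field names are assumptions; only the tuple contents come from the paper.
from dataclasses import dataclass

import numpy as np


@dataclass
class S2STExample:
    source_audio: np.ndarray   # X: source-language speech
    source_text: str           # Ts: source transcript
    target_text: str           # Tt: target-language translation
    target_audio: np.ndarray   # Y: target-language speech


@dataclass
class ASRExample:
    source_audio: np.ndarray   # X: speech
    source_text: str           # Ts: transcript
```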
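
On the Experiment Setup entry: the decoding and prompting hyperparameters quoted above are collected into a single configuration sketch. The field names and grouping are assumptions made for readability; only the numeric values are taken from the paper.

```python
# Hedged configuration sketch: field names are illustrative; values are those
# quoted from the paper's experiment setup.
from dataclasses import dataclass


@dataclass
class TransVIPEvalConfig:
    # Joint translation model decoding
    beam_size: int = 5                      # beam search width for decoding
    # Layer Beam Search over codec layers (beam size, sampling number, K)
    layer_beam_size: int = 10
    sampling_number: int = 20
    k: int = 3
    # Prompt lengths (seconds)
    max_prompt_sec_joint_translation: float = 10.0
    max_prompt_sec_nar_acoustic: float = 5.0
    # Acoustic encoder architecture
    acoustic_encoder_layers: int = 6
    acoustic_encoder_hidden_size: int = 1024


config = TransVIPEvalConfig()
```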