TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Authors: Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Michael Zeng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model. We evaluate the performance of TransVIP for French-English mutual translation using a subset of the CVSS-T test set [8]. The results demonstrate that TransVIP outperforms publicly available SOTA models such as a larger SeamlessExpressive model.
Researcher Affiliation | Collaboration | (1) Shanghai Jiao Tong University, China; (2) Microsoft, USA
Pseudocode | Yes | Appendix D provides pseudocode for the Layer Beam Search algorithm (a generic beam-search sketch is included after the table for reference).
Open Source Code | Yes | The training code and scripts are available at https://github.com/nethermanpro/transvip.
Open Datasets | Yes | We evaluate the performance of TransVIP for French-English mutual translation using a subset of the CVSS-T test set [8], specifically the fr-en test set containing 300 utterances. [...] We utilize these datasets during training as follows: the S2ST dataset comprises quadruples (X, Ts, Tt, Y), where Ts is the source text and Tt is the target text. [...] The ASR dataset, which is highly accessible, consists of pairs (X, Ts). [...] The joint translation model [...] is trained using multiple datasets, including two S2ST datasets, CVSS-T [54] and SeamlessAlign, one internal ST dataset, and one ASR dataset, Common Voice [55] version 15 (English and French subsets). (A sketch of these tuple formats appears after the table.)
Dataset Splits | No | The paper mentions using a 'validation loss curve' and filtering data during training, but it does not provide specific details about the training/validation/test dataset splits (e.g., percentages or sample counts) for its custom setup, beyond using existing test sets such as CVSS-T for evaluation.
Hardware Specification | Yes | All three models within the system are trained using 32 NVIDIA V100 32G GPUs.
Software Dependencies | No | The paper mentions using the Fairseq2 library and the PyTorch Lightning framework but does not provide specific version numbers for these software dependencies (e.g., 'Fairseq2 vX.Y' or 'PyTorch Lightning vX.Y').
Experiment Setup | Yes | For decoding, we employed a beam search algorithm with a beam size set to 5. [...] The beam size is 10, the sampling number is 20 and K is 3. [...] We use at most a 10-second prompt for the joint translation model and a 5-second prompt for the NAR acoustic model. [...] The acoustic encoder is a six-layer standard Transformer encoder with a hidden size of 1024. This 12-layer transformer model is trained from scratch and utilizes mainly two unsupervised corpora. (These hyperparameters are collected into a hedged configuration sketch after the table.)
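
On the Pseudocode entry: the paper's Appendix D gives pseudocode for its Layer Beam Search. The snippet below is only a minimal, generic beam-search skeleton in Python, sketched to show the overall shape of such a decoder; the `expand` and `is_final` callables are hypothetical placeholders, and this is not the paper's exact Layer Beam Search over codec layers.

```python
# Minimal, generic beam-search skeleton (NOT the paper's Layer Beam Search;
# `expand` and `is_final` are hypothetical placeholders for illustration).
from typing import Callable, List, Tuple


def beam_search(
    init_state,
    expand: Callable[[object], List[Tuple[object, float]]],  # state -> [(next_state, log_prob)]
    is_final: Callable[[object], bool],
    beam_size: int = 5,
    max_steps: int = 100,
):
    # Each hypothesis is a (cumulative log-probability, state) pair.
    beams = [(0.0, init_state)]
    finished = []
    for _ in range(max_steps):
        candidates = []
        for score, state in beams:
            if is_final(state):
                finished.append((score, state))
                continue
            for next_state, logp in expand(state):
                candidates.append((score + logp, next_state))
        if not candidates:
            break
        # Keep only the top-`beam_size` partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    # Fall back to the best unfinished hypothesis if nothing terminated.
    finished.extend(beams)
    return max(finished, key=lambda c: c[0])
```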
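
On the Open Datasets entry: the quoted tuple formats can be made concrete with a small sketch. The class and field names below are illustrative assumptions; the paper only specifies that S2ST examples are quadruples (X, Ts, Tt, Y) and ASR examples are pairs (X, Ts).

```python
# Illustrative container types for the dataset tuples quoted above.
# Field names are assumptions; only the tuple contents come from the paper.
from dataclasses import dataclass

import numpy as np


@dataclass
class S2STExample:
    source_audio: np.ndarray   # X: source-language speech
    source_text: str           # Ts: source transcript
    target_text: str           # Tt: target-language translation
    target_audio: np.ndarray   # Y: target-language speech


@dataclass
class ASRExample:
    source_audio: np.ndarray   # X: speech
    source_text: str           # Ts: transcript
```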
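
On the Experiment Setup entry: the decoding and prompting hyperparameters quoted above are collected into a single configuration sketch. The field names and grouping are assumptions made for readability; only the numeric values are taken from the paper.

```python
# Hedged configuration sketch: field names are illustrative; values are those
# quoted from the paper's experiment setup.
from dataclasses import dataclass


@dataclass
class TransVIPEvalConfig:
    # Joint translation model decoding
    beam_size: int = 5                      # beam search width for decoding
    # Layer Beam Search over codec layers (beam size, sampling number, K)
    layer_beam_size: int = 10
    sampling_number: int = 20
    k: int = 3
    # Prompt lengths (seconds)
    max_prompt_sec_joint_translation: float = 10.0
    max_prompt_sec_nar_acoustic: float = 5.0
    # Acoustic encoder architecture
    acoustic_encoder_layers: int = 6
    acoustic_encoder_hidden_size: int = 1024


config = TransVIPEvalConfig()
```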