Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
Authors: Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Michael Zeng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model. We evaluate the performance of TransVIP for French-English mutual translation using a subset of the CVSS-T test set [8]. The results demonstrate that TransVIP outperforms the publicly available SOTA models such as a larger SeamlessExpressive model. |
| Researcher Affiliation | Collaboration | (1) Shanghai Jiao Tong University, China; (2) Microsoft, USA |
| Pseudocode | Yes | Appendix D Pseudo code for Layer Beam Search |
| Open Source Code | Yes | The training code and script are available at https://github.com/nethermanpro/transvip. |
| Open Datasets | Yes | We evaluate the performance of TransVIP for French-English mutual translation using a subset of the CVSS-T test set [8] fr-en test set containing 300 utterances. [...] We utilize these datasets during training as follows: S2ST Dataset comprises quadruple of (X, Ts, Tt, Y), where Ts is the source text and Tt is the target text. [...] ASR Dataset, which is highly accessible, consists of pairs (X, Ts). [...] Joint Translation model [...] trained using multiple datasets, including two S2ST datasets: CVSS-T [54] and SeamlessAlign, one internal ST dataset, and one ASR dataset: Common Voice [55] version 15 (English and French subsets). |
| Dataset Splits | No | The paper mentions using a 'validation loss curve' and filtering data during training, but it does not provide specific details about the training/validation/test dataset splits (e.g., percentages or sample counts) for their custom setup, beyond using existing test sets like CVSS-T for evaluation. |
| Hardware Specification | Yes | All three models within the system are trained using 32 NVIDIA V100 32G GPUs. |
| Software Dependencies | No | The paper mentions using the 'Fairseq2 library' and the 'PyTorch Lightning framework' but does not provide specific version numbers for these software dependencies (e.g., 'Fairseq2 vX.Y' or 'PyTorch Lightning vX.Y'). |
| Experiment Setup | Yes | For decoding we employed a beam search algorithm with a beam size set to 5. [...] The beam size is 10, the sampling number is 20 and K is 3. [...] We use at most, a 10-second prompt for the joint translation model and a 5-second prompt for the NAR acoustic model. [...] The acoustic encoder is a six-layer standard Transformer encoder with a hidden size of 1024. This 12-layer transformer model is trained from scratch and utilizes mainly two unsupervised corpora. |
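
The Experiment Setup row quotes a beam search decoder with a beam size of 5. As a rough illustration of what that setting controls, the following is a minimal sketch of generic textbook beam search. It is not the paper's Layer Beam Search (Appendix D) and not code from the TransVIP repository; `beam_search`, `step_fn`, and `toy_step` are hypothetical names, and the toy probabilities are made up for the example.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (token sequence, total log-probability)


def beam_search(
    step_fn: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_size: int = 5,
    max_len: int = 10,
    eos: str = "</s>",
) -> List[Hypothesis]:
    """Generic beam search; returns hypotheses sorted best first.

    step_fn is a hypothetical stand-in for a model: given a prefix, it
    returns a mapping from next token to log-probability.
    """
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the beam_size highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp, score in candidates[:beam_size]:
            if hyp[-1] == eos:
                finished.append((hyp, score))
            else:
                beams.append((hyp, score))
        if not beams:
            break
    # Hypotheses still unfinished at max_len are returned as well.
    finished.extend(beams)
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


def toy_step(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Made-up next-token distribution, for illustration only."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"</s>": math.log(0.55), "a": math.log(0.45)}
    return {"</s>": math.log(0.9), "a": math.log(0.1)}


best_tokens, best_logp = beam_search(toy_step, beam_size=5)[0]
greedy_tokens, _ = beam_search(toy_step, beam_size=1)[0]
print(best_tokens, round(math.exp(best_logp), 2))
print(greedy_tokens)
```

On this toy distribution, a beam size of 5 returns `('b', '</s>')` with probability 0.36, while a beam size of 1 (greedy decoding) commits to the locally best first token and returns `('a', '</s>')` with probability 0.33. A wider beam trades extra computation for a better chance of finding the higher-scoring sequence.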