TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
Authors: Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Michael Zeng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model. We evaluate the performance of TransVIP for French-English mutual translation using a subset of the CVSS-T test set [8]. The results demonstrate that TransVIP outperforms the publicly available SOTA models such as a larger SeamlessExpressive model. |
| Researcher Affiliation | Collaboration | ¹Shanghai Jiao Tong University, China; ²Microsoft, USA |
| Pseudocode | Yes | Appendix D Pseudo code for Layer Beam Search |
| Open Source Code | Yes | The training code and script are available at https://github.com/nethermanpro/transvip. |
| Open Datasets | Yes | We evaluate the performance of TransVIP for French-English mutual translation using a subset of the CVSS-T [8] fr-en test set containing 300 utterances. [...] We utilize these datasets during training as follows: S2ST Dataset comprises quadruples of (X, Ts, Tt, Y), where Ts is the source text and Tt is the target text. [...] ASR Dataset, which is highly accessible, consists of pairs (X, Ts). [...] Joint Translation model [...] trained using multiple datasets, including two S2ST datasets: CVSS-T [54] and SeamlessAlign, one internal ST dataset, and one ASR dataset: Common Voice [55] version 15 (English and French subsets). (The tuple structure is illustrated in the first sketch after this table.) |
| Dataset Splits | No | The paper mentions using a 'validation loss curve' and filtering data during training, but it does not provide specific details about the training/validation/test dataset splits (e.g., percentages or sample counts) for their custom setup, beyond using existing test sets like CVSS-T for evaluation. |
| Hardware Specification | Yes | All three models within the system are trained using 32 NVIDIA V100 32G GPUs. |
| Software Dependencies | No | The paper mentions using the 'Fairseq2 library' and the 'PyTorch Lightning framework' but does not provide specific version numbers for these software dependencies (e.g., 'Fairseq2 vX.Y' or 'PyTorch Lightning vX.Y'). |
| Experiment Setup | Yes | For decoding we employed a beam search algorithm with a beam size set to 5. [...] The beam size is 10, the sampling number is 20 and K is 3. [...] We use at most a 10-second prompt for the joint translation model and a 5-second prompt for the NAR acoustic model. [...] The acoustic encoder is a six-layer standard Transformer encoder with a hidden size of 1024. This 12-layer transformer model is trained from scratch and utilizes mainly two unsupervised corpora. (A configuration sketch of the acoustic encoder appears after this table.) |
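
The S2ST and ASR tuples quoted in the Open Datasets row can be made concrete with a short sketch. The following is a minimal Python illustration; the class and field names are hypothetical, and only the (X, Ts, Tt, Y) and (X, Ts) groupings come from the quoted text.

```python
from dataclasses import dataclass

# Hypothetical illustration of the training tuples described in the paper.
# Only the (X, Ts, Tt, Y) and (X, Ts) groupings come from the quoted text;
# all names and types here are assumptions.

@dataclass
class S2STExample:
    source_speech: list[float]   # X: source-language waveform samples
    source_text: str             # Ts: source-language transcript
    target_text: str             # Tt: target-language translation
    target_speech: list[float]   # Y: target-language waveform samples

@dataclass
class ASRExample:
    source_speech: list[float]   # X: source-language waveform samples
    source_text: str             # Ts: source-language transcript
```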
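
The "six-layer standard Transformer encoder with a hidden size of 1024" quoted in the Experiment Setup row can be approximated with stock PyTorch modules. This is only a sketch, not the authors' implementation: the number of attention heads, feed-forward width, and dropout are assumptions, since the quoted text specifies only the layer count and hidden size.

```python
import torch
import torch.nn as nn

# Sketch of a six-layer standard Transformer encoder with hidden size 1024.
# nhead, dim_feedforward, and dropout are assumed values; the quoted text
# only specifies the number of layers (6) and the hidden size (1024).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=1024,          # hidden size stated in the paper
    nhead=16,              # assumption: number of attention heads
    dim_feedforward=4096,  # assumption: feed-forward width
    dropout=0.1,           # assumption
    batch_first=True,
)
acoustic_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Example forward pass on a dummy feature sequence (batch, time, feature_dim).
features = torch.randn(2, 200, 1024)
hidden_states = acoustic_encoder(features)   # shape: (2, 200, 1024)
```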