TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
Authors: Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, Zhou Zhao
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on three language pairs demonstrate that BiP yields an improvement of 2.9 BLEU on average compared with a baseline textless S2ST model. Moreover, our parallel decoding shows a significant reduction in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique. |
| Researcher Affiliation | Collaboration | ¹Zhejiang University {rongjiehuang, jinglinliu, huadailiu, zhaozhou}@zju.edu.cn ²ByteDance ren.yi@bytedance.com |
| Pseudocode | No | The paper describes algorithms and architectures through text and diagrams, but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions "Audio samples are available at https://TranSpeech.github.io/", but this links to audio samples, not source code for the method. It also refers to publicly available third-party models but does not state that code for the authors' own method is released. |
| Open Datasets | Yes | For a fair comparison, we use the benchmark CVSS-C dataset (Jia et al., 2022), which is derived from the CoVoST 2 (Wang et al., 2020b) speech-to-text translation corpus by synthesizing the translation text into speech using a single-speaker TTS system. |
| Dataset Splits | No | The paper mentions using a "test set" for evaluation but does not explicitly provide details about training, validation, and test dataset splits (e.g., percentages or sample counts for each). |
| Hardware Specification | Yes | TranSpeech is trained until convergence for 200k steps using 1 Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions the "fairseq framework (Ott et al., 2019)" and "publicly-available pretrained multilingual HuBERT (mHuBERT) model and unit-based HiFi-GAN vocoder (Polyak et al., 2021; Kong et al., 2020)" but does not specify version numbers for these or other software components. |
| Experiment Setup | Yes | For bilateral perturbation, we finetune the publicly-available mHuBERT model for each language separately with CTC loss until 25k updates using the Adam optimizer (β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁸). Following the practice in textless S2ST (Lee et al., 2021b), we use the k-means algorithm to cluster the representations given by the well-tuned mHuBERT into a vocabulary of 1000 units. TranSpeech computes 80-dimensional mel-filterbank features every 10 ms for the source speech as input, and we set Nb to 6 in the encoding and decoding blocks. In training TranSpeech, we remove the auxiliary tasks for simplification and follow the unwritten-language scenario. TranSpeech is trained until convergence for 200k steps using 1 Tesla V100 GPU. A comprehensive table of hyperparameters is available in Appendix B. |
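
To make the quoted setup concrete, below is a minimal Python sketch of the speech-to-unit front end it describes: 80-dimensional mel-filterbank features at a 10-ms frame shift, k-means clustering of (finetuned) mHuBERT representations into a 1000-unit vocabulary, and an Adam optimizer with the stated β1/β2/ϵ. Since the authors' code is not released, the feature extractor, the placeholder model, and the random stand-in features here are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the speech-to-unit pipeline quoted above. The model and
# the stand-in features are hypothetical placeholders; the paper's own code
# is not released, so none of this is the authors' implementation.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans


def melfbank_80d(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """80-dimensional mel-filterbank features at a 10-ms frame shift.

    `waveform` has shape (channels, samples). The 25-ms window length is a
    common default and an assumption here; the paper only states the 10-ms shift.
    """
    return torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=80,
        frame_shift=10.0,
        frame_length=25.0,
        sample_frequency=sample_rate,
    )


# K-means over representations from the CTC-finetuned mHuBERT, clustered into
# a vocabulary of 1000 discrete units (one unit id per frame). Random features
# stand in for real mHuBERT outputs so the sketch runs end to end.
features = torch.randn(5000, 768).numpy()  # (frames, hidden_dim) placeholder
kmeans = MiniBatchKMeans(n_clusters=1000, batch_size=1024, random_state=0)
units = kmeans.fit_predict(features)

# Adam with the hyperparameters stated in the setup (β1=0.9, β2=0.98, ϵ=10⁻⁸).
model = torch.nn.Linear(80, 1000)  # placeholder module, not TranSpeech itself
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-8)
```

In the paper's actual pipeline, the discrete units produced this way serve as translation targets for the TranSpeech model, which is trained for 200k steps on a single Tesla V100 GPU as stated in the table.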