TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

Authors: Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, Zhou Zhao

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on three language pairs demonstrate that BiP yields an improvement of 2.9 BLEU on average compared with a baseline textless S2ST model. Moreover, our parallel decoding shows a significant reduction of inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
Researcher Affiliation | Collaboration | 1Zhejiang University {rongjiehuang, jinglinliu, huadailiu, zhaozhou}@zju.edu.cn 2ByteDance ren.yi@bytedance.com
Pseudocode | No | The paper describes algorithms and architectures through text and diagrams, but does not include formal pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions "Audio samples are available at https://TranSpeech.github.io/" but this is for audio samples, not the source code of the methodology. It also refers to publicly available third-party models but does not state that the code for their own method is released.
Open Datasets | Yes | For a fair comparison, we use the benchmark CVSS-C dataset (Jia et al., 2022), which is derived from the CoVoST 2 (Wang et al., 2020b) speech-to-text translation corpus by synthesizing the translation text into speech using a single-speaker TTS system.
Dataset Splits | No | The paper mentions using a "test set" for evaluation but does not explicitly provide details about training, validation, and test dataset splits (e.g., percentages or sample counts for each).
Hardware Specification | Yes | TranSpeech is trained until convergence for 200k steps using 1 Tesla V100 GPU.
Software Dependencies | No | The paper mentions the "fairseq framework (Ott et al., 2019)" and "publicly-available pretrained multilingual HuBERT (mHuBERT) model and unit-based HiFi-GAN vocoder (Polyak et al., 2021; Kong et al., 2020)" but does not specify version numbers for these or other software components.
Experiment Setup | Yes | For bilateral perturbation, we finetune the publicly-available mHuBERT model for each language separately with CTC loss until 25k updates using the Adam optimizer (β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁸). Following the practice in textless S2ST (Lee et al., 2021b), we use the k-means algorithm to cluster the representations given by the well-tuned mHuBERT into a vocabulary of 1000 units. TranSpeech computes 80-dimensional mel-filterbank features every 10 ms for the source speech as input, and we set Nb to 6 in the encoding and decoding blocks. In training TranSpeech, we remove the auxiliary tasks for simplification and follow the unwritten-language scenario. TranSpeech is trained until convergence for 200k steps using 1 Tesla V100 GPU. A comprehensive table of hyperparameters is available in Appendix B.
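
For concreteness, the discretization step quoted above (k-means over fine-tuned mHuBERT representations into a 1000-unit vocabulary, plus the 80-dim mel-filterbank front-end at a 10 ms hop) can be sketched roughly as below. This is a minimal illustration under stated assumptions: the 768-dim frame features are random stand-ins for real mHuBERT outputs, and the torchaudio/scikit-learn calls substitute for the paper's fairseq-based pipeline, which is not released.

    # Minimal sketch; assumptions: random stand-in features, 768-dim
    # mHuBERT output size, scikit-learn/torchaudio instead of fairseq.
    import numpy as np
    import torch
    import torchaudio
    from sklearn.cluster import MiniBatchKMeans

    # Front-end: 80-dim mel-filterbank features with a 10 ms hop at 16 kHz
    # (hop_length = 160 samples), matching the quoted setup.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
    )
    waveform = torch.randn(1, 16000)  # 1 s of dummy source audio
    source_feats = mel(waveform).clamp(min=1e-5).log().squeeze(0).T  # (frames, 80)

    # Stand-in for frame-level representations from the fine-tuned mHuBERT.
    hubert_feats = np.random.randn(5000, 768).astype(np.float32)

    # Cluster the representations into a vocabulary of 1000 discrete units.
    kmeans = MiniBatchKMeans(n_clusters=1000, batch_size=1024, n_init=3)
    kmeans.fit(hubert_feats)

    # Each frame maps to its nearest centroid id, giving the unit sequence.
    units = kmeans.predict(hubert_feats[:20])
    print(units)

    # The CTC fine-tuning itself would use Adam with the quoted settings,
    # e.g. torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-8).

In the paper, such unit sequences serve as the translation targets of the non-autoregressive TranSpeech model and are converted back to waveforms by the unit-based HiFi-GAN vocoder.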