PolyVoice: Language Models for Speech to Speech Translation

Authors: Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our system on Chinese→English and English→Spanish language pairs. Experimental results demonstrate that PolyVoice outperforms the state-of-the-art encoder-decoder model, producing voice-cloned speech with high translation and audio quality.
Researcher Affiliation | Industry | Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang. ByteDance. {dongqianqian, huangzhiying.92, tom.ko}@bytedance.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | Speech samples are available at https://polyvoice.github.io. (This link is for speech samples, not the source code of the methodology.) The paper also references several third-party open-source tools and models but does not provide its own source code.
Open Datasets | Yes | U-XLM is trained on cross-lingual unit data, which is extracted from the audio by HuBERT (Hsu et al., 2021) models. For Chinese audio, we utilize an open-source model based on WenetSpeech Chinese speech. For English and Spanish audio, we use an open-source multilingual model (English, Spanish and French)... The S2S data is sourced from WenetSpeech (Zhang et al., 2022) and GigaSpeech (Chen et al., 2021)... We utilize the Libri-Light (Kahn et al., 2020) and the in-house ASR datasets. (A sketch of this kind of unit-extraction step is given after the table.)
Dataset Splits | No | The paper mentions evaluation on benchmark datasets but does not provide explicit training/validation/test splits (e.g., percentages or counts) or a detailed splitting methodology beyond the use of established test sets.
Hardware Specification | Yes | U-XLM is trained on 8/32 NVIDIA TESLA A100 80GB GPUs... We train the models using 8 NVIDIA TESLA A100 80GB GPUs...
Software Dependencies | No | The paper mentions several software components, such as HuBERT, GPT-2, SoundStream, VALL-E X, sacrebleu, NISQA, fairseq, and huggingface/tokenizers, but it does not specify their version numbers for reproducibility.
Experiment Setup | Yes | U-XLM's model architecture is a unidirectional Transformer decoder consisting of 48 layers with hidden size 1600, feed-forward network (FFN) size 6400, and 25 attention heads; the total parameter count is 1.6B. U-XLM is trained on ... with a batch size of 3072 tokens per GPU for 500k steps. In the U2S back-end, the U-SLM consists of 12 Transformer layers; each layer comprises 16 attention heads, an attention dimension of 1024, and an FFN dimension of 4096... We train the models using ... with a batch size of 8 utterances per GPU for 800k steps. (A configuration sketch collecting these figures also follows the table.)
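
The "Open Datasets" row states that U-XLM's cross-lingual unit data is extracted from audio with HuBERT models. Below is a minimal sketch of that kind of unit-extraction pipeline (HuBERT hidden states quantized with k-means). The checkpoint name, layer index, and codebook size are illustrative assumptions, not values taken from the paper, and the paper's own multilingual HuBERT checkpoints are not reproduced here.

```python
# Illustrative sketch: discrete speech units via HuBERT features + k-means.
# All model names and hyperparameters below are assumptions for demonstration.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

MODEL_NAME = "facebook/hubert-base-ls960"  # placeholder English checkpoint (assumed)
LAYER = 6                                  # hidden layer used for clustering (assumed)
N_UNITS = 100                              # toy codebook size; unit LMs often use ~500-1000

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
hubert = HubertModel.from_pretrained(MODEL_NAME).eval()

def hubert_features(wav_path: str) -> torch.Tensor:
    """Return frame-level hidden states from one HuBERT layer for a single file."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = feature_extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = hubert(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER].squeeze(0)  # shape: (frames, hidden_dim)

# In practice the k-means codebook is fit on features pooled over a large corpus;
# fitting on a single utterance here is only to keep the demo self-contained
# (the number of frames must exceed N_UNITS for KMeans to run).
feats = hubert_features("example.wav").numpy()
kmeans = KMeans(n_clusters=N_UNITS, n_init=10).fit(feats)
units = kmeans.predict(feats).tolist()  # one discrete unit ID per 20 ms frame

# Collapsing consecutive duplicates is a common step when building unit sequences.
deduped = [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(deduped[:20])
```

For quick reference, the hyperparameters quoted in the "Experiment Setup" row can be collected into simple configuration objects. This is only a summary sketch; the `DecoderOnlyLMConfig` class and its field names are ours, not part of the authors' (unreleased) code.

```python
# Hypothetical configuration objects summarizing the reported hyperparameters.
from dataclasses import dataclass

@dataclass
class DecoderOnlyLMConfig:
    num_layers: int
    hidden_size: int       # attention / model dimension
    ffn_size: int
    num_heads: int
    batch_size: str        # per-GPU batch size, as reported
    train_steps: int

# U-XLM: unidirectional Transformer decoder, ~1.6B parameters.
U_XLM = DecoderOnlyLMConfig(
    num_layers=48,
    hidden_size=1600,
    ffn_size=6400,
    num_heads=25,
    batch_size="3072 tokens per GPU",
    train_steps=500_000,
)

# U-SLM (U2S back-end): 12-layer Transformer.
U_SLM = DecoderOnlyLMConfig(
    num_layers=12,
    hidden_size=1024,
    ffn_size=4096,
    num_heads=16,
    batch_size="8 utterances per GPU",
    train_steps=800_000,
)

if __name__ == "__main__":
    print(U_XLM)
    print(U_SLM)
```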