Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Authors: Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that Mega-TTS 2 could not only synthesize identity-preserving speech with a short prompt of an unseen speaker from arbitrary sources but also consistently outperform the fine-tuning method when the data volume ranges from 10 seconds to 5 minutes.
Researcher Affiliation | Collaboration | Zhejiang University & ByteDance. {ziyuejiang,zhaozhou}@zju.edu.cn, {liu.jinglin,ren.yi,yinxiang.stephen}@bytedance.com
Pseudocode | No | The paper provides architectural descriptions and procedural steps, but no formal pseudocode or algorithm blocks are included.
Open Source Code | No | Audio samples can be found at https://boostprompt.github.io/boostprompt/ (this links to samples, not code). No other code-release statement is found in the paper.
Open Datasets | Yes | We train Mega-TTS 2 and all baselines on Libri-Light (Kahn et al., 2020), which contains 60K hours of unlabelled speech derived from LibriVox audiobooks. (A duration-audit sketch for verifying such a corpus follows this table.)
Dataset Splits | Yes | We randomly choose 20 speakers from the LibriSpeech test-clean set and randomly choose 400 seconds of speech for each of them. We split the 400 seconds of speech into a 300-second prompt set and a 100-second target set. (A split sketch follows this table.)
Hardware Specification | Yes | In the first training stage, we train the first-stage model on 4 NVIDIA A100 GPUs, with a batch size of 48 sentences on each GPU. In the second stage, we train the P-LLM and duration model on 8 NVIDIA A100 GPUs, with a batch size of 4,000 tokens on each GPU. (A configuration stub restating this budget follows this table.)
Software Dependencies | No | The paper mentions using the Adam optimizer and HiFi-GAN V1 but does not provide version numbers for these or other software dependencies, nor does it specify the programming language or framework versions. (A version-recording snippet follows this table.)
Experiment Setup | Yes | We provide model configuration in Appendix A.4 and detailed hyperparameter settings in Table 5.
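
The Open Datasets row cites Libri-Light at roughly 60K hours. For a reproduction attempt, a quick duration audit is a natural sanity check that a local copy matches the corpus the paper describes. The sketch below is a minimal example of such an audit; it assumes the third-party soundfile package and a local directory of FLAC files, neither of which the paper specifies.

    from pathlib import Path

    import soundfile as sf  # assumed dependency, not named in the paper

    def total_hours(corpus_root):
        # Sum the duration of every FLAC file under corpus_root.
        # Per the paper, the full Libri-Light corpus totals ~60K hours.
        seconds = sum(sf.info(str(p)).duration
                      for p in Path(corpus_root).rglob("*.flac"))
        return seconds / 3600.0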
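
The Dataset Splits row describes the evaluation split procedurally, but no split code is released. Below is a minimal sketch of one way to reproduce it, assuming each speaker's data arrives as a list of (path, duration) pairs; the 300 s / 100 s budgets come from the paper, while the greedy selection and all names here are assumptions, since the paper states only the split sizes.

    import random

    def split_speaker_utterances(utterances, prompt_budget=300.0,
                                 target_budget=100.0, seed=0):
        # utterances: list of (audio_path, duration_in_seconds) tuples for
        # one speaker, totalling ~400 s as in the paper's protocol.
        rng = random.Random(seed)
        pool = list(utterances)
        rng.shuffle(pool)  # "randomly choose" per the paper; exact method unknown
        prompt_set, target_set = [], []
        prompt_time = target_time = 0.0
        for path, dur in pool:
            if prompt_time < prompt_budget:    # fill the 300 s prompt set first
                prompt_set.append(path)
                prompt_time += dur
            elif target_time < target_budget:  # then the 100 s target set
                target_set.append(path)
                target_time += dur
        return prompt_set, target_set

Applying this to each of the 20 speakers drawn from LibriSpeech test-clean yields the prompt/target partition the row describes.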
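
The Hardware Specification row is precise enough to restate as a configuration stub, which makes the two-stage compute budget easy to diff against a reproduction run. Only the numbers below come from the paper; the key names and schema are hypothetical.

    # Compute budget per training stage, as reported in the paper.
    TRAINING_STAGES = {
        "stage1_model": {
            "gpus": 4,               # NVIDIA A100
            "batch_per_gpu": 48,     # measured in sentences
            "batch_unit": "sentences",
        },
        "stage2_pllm_and_duration": {
            "gpus": 8,               # NVIDIA A100
            "batch_per_gpu": 4000,   # measured in tokens
            "batch_unit": "tokens",
        },
    }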
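
The Software Dependencies row flags the absence of version numbers. A reproduction would therefore have to pin its own environment; the standard-library snippet below shows one way to record whatever is actually installed. The package list is a plausible PyTorch TTS stack and is an assumption, not taken from the paper.

    from importlib.metadata import PackageNotFoundError, version

    # Candidate packages for a PyTorch-based TTS reproduction; this list is
    # an assumption, since the paper names no concrete dependencies.
    for pkg in ("torch", "torchaudio", "numpy", "librosa"):
        try:
            print(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            print(f"{pkg}: not installed")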