Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
Authors: Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that Mega-TTS 2 could not only synthesize identity-preserving speech with a short prompt of an unseen speaker from arbitrary sources but also consistently outperform the fine-tuning method when the volume of data ranges from 10 seconds to 5 minutes. |
| Researcher Affiliation | Collaboration | Zhejiang University & ByteDance {ziyuejiang,zhaozhou}@zju.edu.cn, {liu.jinglin,ren.yi,yinxiang.stephen}@bytedance.com |
| Pseudocode | No | The paper provides architectural descriptions and procedural steps, but no formal pseudocode or algorithm blocks are included. |
| Open Source Code | No | Audio samples can be found in https://boostprompt.github.io/boostprompt/. (This links to samples, not code). No other specific code release statement found. |
| Open Datasets | Yes | We train Mega-TTS 2 and all baselines on Libri-Light (Kahn et al., 2020), which contains 60K hours of unlabelled speech derived from LibriVox audiobooks. |
| Dataset Splits | Yes | We randomly choose 20 speakers from the LibriSpeech test-clean set and randomly choose 400 seconds of speech for each of them. We split the 400 seconds of speech into a 300-second prompt set and a 100-second target set. (A hedged sketch of this split procedure follows the table.) |
| Hardware Specification | Yes | In the first training stage, we train the first-stage model on 4 NVIDIA A100 GPUs, with a batch size of 48 sentences on each GPU. In the second stage, we train the P-LLM and duration model on 8 NVIDIA A100 GPUs, with a batch size of 4,000 tokens on each GPU. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and 'HiFi-GAN V1' but does not provide specific version numbers for these or other software dependencies, nor does it specify the programming language or framework versions. |
| Experiment Setup | Yes | We provide model configuration in Appendix A.4 and detailed hyperparameter settings in Table 5. |
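To make the Dataset Splits row above concrete, here is a minimal sketch of how 20 LibriSpeech test-clean speakers could be sampled and their utterances partitioned into a roughly 300-second prompt set and a roughly 100-second target set per speaker. The paper does not release code for this step, so the function name, the `utterances_by_speaker` structure, and the greedy assignment strategy are assumptions for illustration only.

```python
import random

def split_prompt_target(utterances_by_speaker, n_speakers=20,
                        total_sec=400, prompt_sec=300, seed=0):
    """Hypothetical sketch of the evaluation split described in the paper:
    pick 20 speakers, collect ~400 s of speech each, and divide it into a
    ~300 s prompt set and a ~100 s target set.

    utterances_by_speaker: dict mapping speaker_id -> list of
    (utterance_id, duration_in_seconds) pairs.
    """
    rng = random.Random(seed)
    speakers = rng.sample(sorted(utterances_by_speaker), n_speakers)
    splits = {}
    for spk in speakers:
        utts = list(utterances_by_speaker[spk])
        rng.shuffle(utts)

        # Collect utterances until roughly 400 s of speech is selected.
        selected, acc = [], 0.0
        for utt_id, dur in utts:
            if acc >= total_sec:
                break
            selected.append((utt_id, dur))
            acc += dur

        # Greedily assign utterances to the prompt set until ~300 s is
        # reached; the remainder forms the ~100 s target set.
        prompt, target, p_acc = [], [], 0.0
        for utt_id, dur in selected:
            if p_acc < prompt_sec:
                prompt.append(utt_id)
                p_acc += dur
            else:
                target.append(utt_id)
        splits[spk] = {"prompt": prompt, "target": target}
    return splits
```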