PromptTTS 2: Describing and Generating Voices with Text Prompt
Authors: Yichong Leng, Zhifang Guo, Kai Shen, Zeqian Ju, Xu Tan, Eric Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiangyang Li, Sheng Zhao, Tao Qin, Jiang Bian
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on a large-scale (44K hours) speech dataset demonstrate that compared to the previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation. |
| Researcher Affiliation | Collaboration | MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, University of Science and Technology of China Microsoft |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The demo page of PromptTTS 2 is available at https://speechresearch.github.io/prompttts2. This link leads to a demo page, not a source code repository for the methodology described in the paper. |
| Open Datasets | Yes | For the speech dataset, we employ the English subset of the Multilingual LibriSpeech (MLS) dataset (Pratap et al., 2020), which comprises 44K hours of transcribed speech data from LibriVox audiobooks. For the text prompt data, we utilize PromptSpeech (Guo et al., 2023), which contains 20K text prompts written by humans describing speech along four attributes: pitch, gender, volume, and speed. |
| Dataset Splits | No | The paper mentions a "test set" for the PromptSpeech dataset but does not state training/validation/test splits (by percentage or absolute count) for the main MLS dataset, nor how a validation set was used in the primary experiments. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments (e.g., GPU models, CPU models, memory). |
| Software Dependencies | No | The paper refers to various models and tools used (e.g., NaturalSpeech 2, a BERT-based model, a diffusion model, WavLM-TDNN, GPT-3.5-TURBO) but does not provide specific version numbers for the software stack or libraries used for implementation. |
| Experiment Setup | Yes | The number of layers in the reference speech encoder and variation network is 6 and 12, respectively, with a hidden size of 512. The query numbers M and N in the style module are both set to 8. For the TTS backbone and the text prompt encoder, we adhere to the settings in NaturalSpeech 2 (Shen et al., 2023) and PromptTTS (Guo et al., 2023), respectively. The training configuration is also derived from NaturalSpeech 2 (Shen et al., 2023). |
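The reported hyperparameters can be collected in a small configuration sketch. This is a minimal illustration, not the authors' code: the class and field names are assumptions, and only the numeric values (layer counts, hidden size, query numbers M and N) come from the paper.

```python
from dataclasses import dataclass


@dataclass
class StyleModuleConfig:
    """Hypothetical container for the hyperparameters reported in the paper."""
    ref_encoder_layers: int = 6     # layers in the reference speech encoder
    variation_net_layers: int = 12  # layers in the variation network
    hidden_size: int = 512          # shared hidden size
    num_queries_m: int = 8          # query number M in the style module
    num_queries_n: int = 8          # query number N in the style module


cfg = StyleModuleConfig()
print(cfg)
```

Collecting these values in one place makes it easier to check a reimplementation against the paper; the TTS backbone and text prompt encoder settings would additionally need to be copied from the NaturalSpeech 2 and PromptTTS configurations, which the paper references but does not restate.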