PromptTTS 2: Describing and Generating Voices with Text Prompt

Authors: Yichong Leng, Zhifang Guo, Kai Shen, Zeqian Ju, Xu Tan, Eric Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiangyang Li, Sheng Zhao, Tao Qin, Jiang Bian

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on a large-scale (44K hours) speech dataset demonstrate that, compared to previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation.
Researcher Affiliation | Collaboration | MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, University of Science and Technology of China; Microsoft
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper states that the demo page of PromptTTS 2 is available at https://speechresearch.github.io/prompttts2. This link leads to a demo page, not a source code repository for the methodology described in the paper.
Open Datasets | Yes | For the speech dataset, we employ the English subset of the Multilingual LibriSpeech (MLS) dataset (Pratap et al., 2020), which comprises 44K hours of transcribed speech data from LibriVox audiobooks. For the text prompt data, we utilize PromptSpeech (Guo et al., 2023), which contains 20K text prompts written by humans describing speech from four attributes: pitch, gender, volume, and speed.
Dataset Splits | No | The paper mentions a "test set" for the PromptSpeech dataset but does not explicitly state training/validation/test splits with percentages or absolute counts for the main MLS dataset, nor how a validation set was used for the primary experiments.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments (e.g., GPU models, CPU models, memory).
Software Dependencies | No | The paper refers to various models and tools (e.g., NaturalSpeech 2, a BERT-based model, a diffusion model, WavLM-TDNN, GPT-3.5-TURBO) but does not provide version numbers for the software stack or libraries used in the implementation.
Experiment Setup | Yes | The number of layers in the reference speech encoder and variation network is 6 and 12, respectively, with a hidden size of 512. The query numbers M and N in the style module are both set to 8. Concerning the TTS backbone and the text prompt encoder, we adhere to the settings in NaturalSpeech 2 (Shen et al., 2023) and PromptTTS (Guo et al., 2023), respectively. The training configuration is also derived from NaturalSpeech 2 (Shen et al., 2023).
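The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is a minimal illustration in Python; the field names are hypothetical (the paper publishes no config file), and only the numeric values are taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTTS2Config:
    """Hyperparameters quoted from the paper's experiment setup.

    Field names are illustrative, not from any released code.
    """
    ref_encoder_layers: int = 6      # reference speech encoder depth
    variation_net_layers: int = 12   # variation network depth
    hidden_size: int = 512           # hidden size for both modules
    style_queries_m: int = 8         # query number M in the style module
    style_queries_n: int = 8         # query number N in the style module


cfg = PromptTTS2Config()

# Rough per-layer parameter count for a Transformer layer at this width:
# self-attention projections (4 * h^2) plus a feed-forward block with a
# 4x expansion (8 * h^2). The 4x expansion is an assumption, not stated
# in the paper.
params_per_layer = 12 * cfg.hidden_size ** 2
print(params_per_layer)  # 3145728
```

This kind of back-of-the-envelope estimate is useful when judging whether a reported setup is reproducible on available hardware, which is relevant given that the review notes the paper specifies no hardware.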