P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting

Authors: Sungwon Kim, Kevin J. Shih, Rohan Badlani, João Felipe Santos, Evelina Bakhturina, Mikyas Desta, Rafael Valle, Sungroh Yoon, Bryan Catanzaro

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments: Training and Inference Settings: P-Flow is trained on a single NVIDIA A100 GPU for 800K iterations, using a batch size of 64. We utilize the AdamW optimizer [26] with a learning rate of 0.0001.
Researcher Affiliation | Collaboration | Sungwon Kim1,2, Kevin J. Shih1, Rohan Badlani1, João Felipe Santos1, Evelina Bakhturina1, Mikyas Desta1, Rafael Valle1, Sungroh Yoon2,3, Bryan Catanzaro1... 1Work done as a research intern at NVIDIA. Corresponding authors: Sungwon Kim: ksw0306@snu.ac.kr, Rafael Valle: rafaelvalle@nvidia.com, Sungroh Yoon: sryoon@snu.ac.kr... 2Department of Electrical and Computer Engineering, Seoul National University 3Interdisciplinary Program in Artificial Intelligence, Seoul National University
Pseudocode | No | The paper describes algorithms and processes in text and diagrams but does not include structured pseudocode or algorithm blocks.
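Since the paper provides no algorithm block, the following is only an illustrative sketch of the kind of conditional flow-matching training step that underlies P-Flow's flow-matching decoder. The OT-CFM objective shown is the standard formulation; the function name, the decoder signature, and the `sigma_min` default are our assumptions, and P-Flow's full objective additionally includes an encoder loss not shown here.

```python
import torch
import torch.nn.functional as F

def cfm_training_step(decoder, x1, cond, sigma_min=1e-4):
    """One optimal-transport conditional flow-matching step (hypothetical sketch).

    x1:   target mel-spectrogram batch, shape (B, n_mels, T)
    cond: conditioning from the speech-prompted text encoder
    """
    b = x1.size(0)
    t = torch.rand(b, 1, 1, device=x1.device)        # random flow time in [0, 1]
    x0 = torch.randn_like(x1)                        # Gaussian noise sample
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1     # OT interpolation path
    u = x1 - (1 - sigma_min) * x0                    # target velocity field
    v = decoder(xt, t.view(b), cond)                 # predicted velocity
    return F.mse_loss(v, u)

# usage with a stand-in decoder (placeholder, not the P-Flow architecture):
decoder = lambda xt, t, cond: xt
loss = cfm_training_step(decoder, torch.randn(2, 80, 100), cond=None)
```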
Open Source Code | Yes | We provide audio samples on our demo page... Demo: https://research.nvidia.com/labs/adlr/projects/pflow
Open Datasets | Yes | Data: We train P-Flow on LibriTTS [41]. The LibriTTS training set consists of 580 hours of data from 2,456 speakers. We specifically use data that is longer than 3 seconds for speech prompting, yielding a 256-hour subset.
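As a concrete illustration of the quoted 3-second filter, here is a minimal sketch using torchaudio; the helper name and the list-of-paths interface are our assumptions, while the threshold comes from the paper.

```python
import torchaudio

MIN_SECONDS = 3.0  # per the paper: keep utterances longer than 3 s for prompting

def filter_long_utterances(wav_paths, min_seconds=MIN_SECONDS):
    """Return only the audio files whose duration exceeds min_seconds."""
    kept = []
    for path in wav_paths:
        info = torchaudio.info(path)                     # reads header metadata only
        duration = info.num_frames / info.sample_rate    # duration in seconds
        if duration > min_seconds:
            kept.append(path)
    return kept
```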
Dataset Splits | No | The paper mentions training on LibriTTS and evaluating on LibriSpeech test-clean but does not explicitly provide details for a validation dataset split.
Hardware Specification | Yes | P-Flow is trained on a single NVIDIA A100 GPU for 800K iterations, using a batch size of 64.
Software Dependencies | No | The paper mentions various software components such as the PyTorch transformer, AdamW optimizer, G2P model, HiFi-GAN, HuBERT ASR model, and WavLM-TDNN, but does not provide specific version numbers for any of them.
Experiment Setup | Yes | Training and Inference Settings: P-Flow is trained on a single NVIDIA A100 GPU for 800K iterations, using a batch size of 64. We utilize the AdamW optimizer [26] with a learning rate of 0.0001. ... Table 9: Hyperparameters of P-Flow
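The quoted settings translate directly into a short PyTorch setup. Only the batch size, iteration count, learning rate, and optimizer choice come from the paper; the stand-in model and all unlisted AdamW arguments (betas, weight decay) are assumptions left at PyTorch defaults.

```python
import torch
from torch import nn

# Quoted from the paper: single A100, 800K iterations, batch size 64, AdamW, lr 1e-4.
BATCH_SIZE = 64
NUM_ITERATIONS = 800_000
LEARNING_RATE = 1e-4

model = nn.Linear(80, 80)  # hypothetical placeholder; see the paper's Table 9 for the real hyperparameters
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
```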