P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
Authors: Sungwon Kim, Kevin Shih, Rohan Badlani, João Felipe Santos, Evelina Bakhturina, Mikyas Desta, Rafael Valle, Sungroh Yoon, Bryan Catanzaro
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 experiments. Training and Inference Settings: P-Flow is trained on a single NVIDIA A100 GPU for 800K iterations, using a batch size of 64. We utilize the AdamW optimizer [26] with a learning rate of 0.0001. |
| Researcher Affiliation | Collaboration | Sungwon Kim (1,2), Kevin J. Shih (1), Rohan Badlani (1), João Felipe Santos (1), Evelina Bakhturina (1), Mikyas Desta (1), Rafael Valle (1), Sungroh Yoon (2,3), Bryan Catanzaro (1)... 1: Work done as a research intern at NVIDIA. Corresponding authors: Sungwon Kim: ksw0306@snu.ac.kr, Rafael Valle: rafaelvalle@nvidia.com, Sungroh Yoon: sryoon@snu.ac.kr... 2: Department of Electrical and Computer Engineering, Seoul National University. 3: Interdisciplinary Program in Artificial Intelligence, Seoul National University |
| Pseudocode | No | The paper describes algorithms and processes in text and diagrams but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide audio samples on our demo page: https://research.nvidia.com/labs/adlr/projects/pflow |
| Open Datasets | Yes | Data: We train P-Flow on LibriTTS [41]. The LibriTTS training set consists of 580 hours of data from 2,456 speakers. We specifically use data that is longer than 3 seconds for speech prompting, yielding a 256-hour subset. (A duration-filter sketch follows the table.) |
| Dataset Splits | No | The paper mentions training on LibriTTS and evaluating on LibriSpeech test-clean but does not explicitly provide details for a validation dataset split. |
| Hardware Specification | Yes | P-Flow is trained on a single NVIDIA A100 GPU for 800K iterations, using a batch size of 64. |
| Software Dependencies | No | The paper mentions various software components such as a PyTorch transformer, the AdamW optimizer, a G2P model, HiFi-GAN, a HuBERT ASR model, and WavLM-TDNN, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | Training and Inference Settings: P-Flow is trained on a single NVIDIA A100 GPU for 800K iterations, using a batch size of 64. We utilize the AdamW optimizer [26] with a learning rate of 0.0001. ... Table 9: Hyperparameters of P-Flow. (See the training-loop sketch below.) |
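
For concreteness, here is a minimal PyTorch training-loop sketch using only the settings the paper reports (AdamW, learning rate 0.0001, batch size 64, 800K iterations, a single GPU). The model, dataset, and loss below are stand-in placeholders, not the paper's architecture or flow-matching objective, since no runnable code is released.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data: the real P-Flow network and LibriTTS pipeline are
# not public, so a trivial module and random tensors act as placeholders.
model = nn.Linear(80, 80)  # placeholder for the P-Flow network (paper: 1x NVIDIA A100)
dataset = TensorDataset(torch.randn(1024, 80), torch.randn(1024, 80))
loader = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=True)  # batch size 64 as reported

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # AdamW, lr 0.0001 as reported
criterion = nn.MSELoss()  # placeholder objective, not the paper's flow-matching loss

step, max_steps = 0, 800_000  # 800K iterations as reported
while step < max_steps:
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        step += 1
        if step >= max_steps:
            break
```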
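
The 3-second duration filter applied to LibriTTS can be illustrated with a short script. The directory path and file layout below are assumptions for illustration; the paper does not describe its preprocessing code.

```python
import soundfile as sf
from pathlib import Path

# Hypothetical filtering script reproducing the stated criterion: keep only
# utterances longer than 3 seconds for speech prompting.
MIN_SECONDS = 3.0
root = Path("LibriTTS/train-clean-360")  # assumed local LibriTTS location

kept, total_hours = [], 0.0
for wav in root.rglob("*.wav"):
    info = sf.info(wav)
    duration = info.frames / info.samplerate  # duration in seconds
    if duration > MIN_SECONDS:
        kept.append(wav)
        total_hours += duration / 3600

print(f"kept {len(kept)} utterances, {total_hours:.1f} hours")
```

Applied to the full 580-hour training set, this criterion reportedly yields the 256-hour subset used in the paper.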