PortaSpeech: Portable and High-Quality Generative Text-to-Speech
Authors: Yi Ren, Jinglin Liu, Zhou Zhao
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that PortaSpeech outperforms other TTS models in both voice quality and prosody modeling in terms of subjective and objective evaluation metrics, and shows only a slight performance degradation when reducing the model parameters to 6.7M (about 4x model size and 3x runtime memory compression ratio compared with FastSpeech 2). Our extensive ablation studies demonstrate that each design in PortaSpeech is effective. |
| Researcher Affiliation | Academia | Yi Ren, Zhejiang University, rayeren@zju.edu.cn; Jinglin Liu, Zhejiang University, jinglinliu@zju.edu.cn; Zhou Zhao, Zhejiang University, zhaozhou@zju.edu.cn |
| Pseudocode | No | The paper describes the model architecture and procedures using text and diagrams, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Audio samples are available at https://portaspeech.github.io/. - This link is for audio samples only, not the source code for the methodology. There is no explicit statement about releasing the code for the paper's method. |
| Open Datasets | Yes | We evaluate PortaSpeech on LJSpeech dataset [7], which contains 13100 English audio clips and corresponding text transcripts. [7] Keith Ito. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/, 2017. |
| Dataset Splits | Yes | Following FastSpeech 2 [24], we split LJSpeech dataset into three subsets: 12229 samples for training, 348 samples (with document title LJ003) for validation and 523 samples (with document title LJ001 and LJ002) for testing. |
| Hardware Specification | Yes | We train PortaSpeech on 1 NVIDIA 2080Ti GPU, with batch size of 64 sentences on each GPU. |
| Software Dependencies | No | We convert the text sequence to the phoneme sequence [2, 25, 29, 30, 35] with an open-source grapheme-to-phoneme tool. The output mel-spectrograms of our model are transformed into audio samples using HiFi-GAN [11] trained in advance. - The paper mentions specific tools and their uses, and links to their GitHub repositories (g2p, HiFi-GAN, pytorch_memlab), but does not provide version numbers for any of these tools or other key software dependencies (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We train PortaSpeech on 1 NVIDIA 2080Ti GPU, with batch size of 64 sentences on each GPU. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10⁻⁹ and follow the same learning rate schedule in [34]. It takes 320k steps for training until convergence. |
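The dataset split quoted above is keyed on LJSpeech document titles (LJ003 for validation, LJ001/LJ002 for testing, everything else for training). A minimal sketch of that partition, using a few illustrative sample IDs rather than the real 13100-clip corpus:

```python
# Sketch of the LJSpeech split described in the paper: validation = clips
# whose IDs start with "LJ003", test = clips starting with "LJ001" or
# "LJ002", train = everything else. Sample IDs below are illustrative.

def split_ljspeech(sample_ids):
    """Partition LJSpeech sample IDs into train/valid/test by document title."""
    valid = [s for s in sample_ids if s.startswith("LJ003")]
    test = [s for s in sample_ids if s.startswith(("LJ001", "LJ002"))]
    train = [s for s in sample_ids if s not in valid and s not in test]
    return train, valid, test

ids = ["LJ001-0001", "LJ002-0005", "LJ003-0002", "LJ010-0042", "LJ050-0278"]
train, valid, test = split_ljspeech(ids)
print(train)  # ['LJ010-0042', 'LJ050-0278']
```

On the full corpus this yields the quoted 12229/348/523 split.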
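The quoted training setup follows the learning rate schedule of [34], i.e. the Transformer "Noam" schedule, with Adam (β1 = 0.9, β2 = 0.98, ε = 10⁻⁹). A hedged sketch of that schedule; `d_model` and `warmup_steps` are assumptions not stated in the table:

```python
# Transformer ("Noam") learning rate schedule from Vaswani et al. [34]:
# lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
# d_model=256 and warmup_steps=4000 are illustrative assumptions.

def noam_lr(step, d_model=256, warmup_steps=4000):
    """Learning rate at a given training step (linear warmup, then decay)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# In PyTorch this would plug in roughly as (illustrative, requires torch;
# note Adam is created with lr=1.0 so the lambda supplies the absolute rate):
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
#                              betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam_lr)

print(noam_lr(4000))  # peak of the schedule, at the end of warmup
```

The rate rises linearly over the warmup steps, peaks, then decays as the inverse square root of the step count through the quoted 320k training steps.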