PortaSpeech: Portable and High-Quality Generative Text-to-Speech
Authors: Yi Ren, Jinglin Liu, Zhou Zhao
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that PortaSpeech outperforms other TTS models in both voice quality and prosody modeling in terms of subjective and objective evaluation metrics, and shows only a slight performance degradation when reducing the model parameters to 6.7M (about 4x model size and 3x runtime memory compression ratio compared with FastSpeech 2). Our extensive ablation studies demonstrate that each design in PortaSpeech is effective. |
| Researcher Affiliation | Academia | Yi Ren, Zhejiang University, rayeren@zju.edu.cn; Jinglin Liu, Zhejiang University, jinglinliu@zju.edu.cn; Zhou Zhao, Zhejiang University, zhaozhou@zju.edu.cn |
| Pseudocode | No | The paper describes the model architecture and procedures using text and diagrams, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Audio samples are available at https://portaspeech.github.io/. - This link is for audio samples only, not the source code for the methodology. There is no explicit statement about releasing the code for the paper's method. |
| Open Datasets | Yes | We evaluate PortaSpeech on LJSpeech dataset [7], which contains 13100 English audio clips and corresponding text transcripts. [7] Keith Ito. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/, 2017. |
| Dataset Splits | Yes | Following FastSpeech 2 [24], we split LJSpeech dataset into three subsets: 12229 samples for training, 348 samples (with document title LJ003) for validation and 523 samples (with document title LJ001 and LJ002) for testing. |
| Hardware Specification | Yes | We train PortaSpeech on 1 NVIDIA 2080Ti GPU, with batch size of 64 sentences on each GPU. |
| Software Dependencies | No | We convert the text sequence to the phoneme sequence [2, 25, 29, 30, 35] with an open-source grapheme-to-phoneme tool. The output mel-spectrograms of our model are transformed into audio samples using HiFi-GAN [11] trained in advance. - The paper mentions specific tools and their uses, and links to their GitHub repositories (g2p, HiFi-GAN, pytorch_memlab), but does not provide version numbers for any of these tools or other key software dependencies (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We train PortaSpeech on 1 NVIDIA 2080Ti GPU, with batch size of 64 sentences on each GPU. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10⁻⁹ and follow the same learning rate schedule in [34]. It takes 320k steps for training until convergence. |
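The dataset split quoted above is keyed on LJSpeech document titles (LJ003 for validation, LJ001/LJ002 for testing, everything else for training). A minimal sketch of that partition, using a few illustrative sample IDs rather than the real 13100-clip corpus:

```python
# Sketch of the LJSpeech split described in the paper: validation = clips
# whose IDs start with "LJ003", test = clips starting with "LJ001" or
# "LJ002", train = everything else. Sample IDs below are illustrative.

def split_ljspeech(sample_ids):
    """Partition LJSpeech sample IDs into train/valid/test by document title."""
    valid = [s for s in sample_ids if s.startswith("LJ003")]
    test = [s for s in sample_ids if s.startswith(("LJ001", "LJ002"))]
    train = [s for s in sample_ids if s not in valid and s not in test]
    return train, valid, test

ids = ["LJ001-0001", "LJ002-0005", "LJ003-0002", "LJ010-0042", "LJ050-0278"]
train, valid, test = split_ljspeech(ids)
print(train)  # ['LJ010-0042', 'LJ050-0278']
```

On the full corpus this yields the quoted 12229/348/523 split.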
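The quoted training setup follows the learning rate schedule of [34], i.e. the Transformer "Noam" schedule, with Adam (β1 = 0.9, β2 = 0.98, ε = 10⁻⁹). A hedged sketch of that schedule; `d_model` and `warmup_steps` are assumptions not stated in the table:

```python
# Transformer ("Noam") learning rate schedule from Vaswani et al. [34]:
# lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
# d_model=256 and warmup_steps=4000 are illustrative assumptions.

def noam_lr(step, d_model=256, warmup_steps=4000):
    """Learning rate at a given training step (linear warmup, then decay)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# In PyTorch this would plug in roughly as (illustrative, requires torch;
# note Adam is created with lr=1.0 so the lambda supplies the absolute rate):
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
#                              betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam_lr)

print(noam_lr(4000))  # peak of the schedule, at the end of warmup
```

The rate rises linearly over the warmup steps, peaks, then decays as the inverse square root of the step count through the quoted 320k training steps.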