FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Authors: Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that 1) FastSpeech 2 achieves a 3x training speed-up over FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech 2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even surpass autoregressive models. Audio samples are available at https://speechresearch.github.io/fastspeech2/.
Researcher Affiliation | Collaboration | 1 Zhejiang University {rayeren,chenxuhu,zhaozhou}@zju.edu.cn; 2 Microsoft Research Asia {xuta,taoqin,tyliu}@microsoft.com; 3 Microsoft Azure Speech Sheng.Zhao@microsoft.com
Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks, only architectural diagrams.
Open Source Code | No | The paper provides audio samples at a URL and refers to third-party tools (g2p, Parallel WaveGAN, the PyWorld vocoder), but it gives no link to, or explicit statement about, the open-sourcing of the authors' own FastSpeech 2/2s code. (A hedged PyWorld usage sketch appears after this table.)
Open Datasets | Yes | We evaluate FastSpeech 2 and 2s on the LJSpeech dataset (Ito, 2017). LJSpeech contains 13,100 English audio clips (about 24 hours) and corresponding text transcripts.
Dataset Splits | Yes | We split the dataset into three sets: 12,228 samples for training, 349 samples (with document title LJ003) for validation, and 523 samples (with document titles LJ001 and LJ002) for testing. (A split sketch appears after this table.)
Hardware Specification | Yes | The training and inference latency tests are conducted on a server with 36 Intel Xeon CPUs, 256 GB of memory, and 1 NVIDIA V100 GPU, with a batch size of 48 for training and 1 for inference.
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, the PyWorld vocoder, the Montreal Forced Aligner (MFA), Parallel WaveGAN, and a g2p tool, but it specifies no version numbers for these, nor for the general programming environment (e.g., Python or PyTorch).
Experiment Setup | Yes | Model Configuration: Our FastSpeech 2 consists of 4 feed-forward Transformer (FFT) blocks... The output linear layer in the decoder converts the hidden states into 80-dimensional mel-spectrograms, and our model is optimized with mean absolute error (MAE). More detailed configurations of FastSpeech 2 and 2s are given in Appendix A, training and inference details in Appendix B, and the hyperparameters of Transformer TTS, FastSpeech, and FastSpeech 2/2s in Table 7. (A minimal output-layer/MAE sketch appears below.)
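
The split quoted in the Dataset Splits row can be reproduced mechanically from LJSpeech's metadata.csv, whose clip IDs (e.g. LJ001-0001) encode the document title. Below is a minimal Python sketch, assuming the standard LJSpeech-1.1 layout; the file path and function name are illustrative, not from the paper.

```python
# Illustrative sketch: recreate the paper's LJSpeech split from metadata.csv.
# LJ001/LJ002 -> test, LJ003 -> validation, everything else -> training.
from pathlib import Path

def split_ljspeech(metadata_path="LJSpeech-1.1/metadata.csv"):
    train, val, test = [], [], []
    for line in Path(metadata_path).read_text(encoding="utf-8").splitlines():
        clip_id = line.split("|", 1)[0]      # e.g. "LJ001-0001"
        document = clip_id.split("-")[0]     # e.g. "LJ001"
        if document in ("LJ001", "LJ002"):
            test.append(line)
        elif document == "LJ003":
            val.append(line)
        else:
            train.append(line)
    return train, val, test

train, val, test = split_ljspeech()
print(len(train), len(val), len(test))  # expected 12228, 349, 523 per the paper
```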
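
The PyWorld toolkit named in the Open Source Code and Software Dependencies rows is commonly used to extract the pitch (F0) contour that FastSpeech 2 conditions on. A hedged sketch of such extraction follows, using PyWorld's dio/stonemask pipeline; the wav file name and the 256-sample hop size are assumptions, not values stated in this report.

```python
# Hedged sketch: F0 extraction with PyWorld (not the authors' code).
import numpy as np
import pyworld as pw
import soundfile as sf

wav, fs = sf.read("LJ001-0001.wav")                # assumed LJSpeech clip
wav = np.ascontiguousarray(wav, dtype=np.float64)  # PyWorld requires float64
frame_period_ms = 256 / fs * 1000                  # align F0 frames to an assumed 256-sample hop

f0, timeaxis = pw.dio(wav, fs, frame_period=frame_period_ms)  # coarse F0 estimate
f0 = pw.stonemask(wav, f0, timeaxis, fs)                      # refined F0 estimate
print(f0.shape, float(f0[f0 > 0].mean()))          # frame count, mean voiced F0 in Hz
```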
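
To make the quoted model configuration concrete, here is a minimal PyTorch sketch of the decoder's output linear layer projecting hidden states to 80-dimensional mel-spectrogram frames, trained with MAE (L1) loss. The hidden size, batch size, and frame count are illustrative assumptions, not the paper's hyperparameters.

```python
# Minimal sketch (not the authors' code): decoder output projection + MAE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, n_mels = 256, 80                   # hidden_dim is an assumed value
mel_linear = nn.Linear(hidden_dim, n_mels)     # hidden states -> 80-dim mel frames

decoder_out = torch.randn(8, 200, hidden_dim)  # (batch, frames, hidden), dummy data
mel_target = torch.randn(8, 200, n_mels)       # dummy ground-truth mel frames

mel_pred = mel_linear(decoder_out)
loss = F.l1_loss(mel_pred, mel_target)         # mean absolute error (MAE)
loss.backward()
print(float(loss))
```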