FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Authors: Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that 1) FastSpeech 2 achieves a 3x training speed-up over FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech 2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even surpass autoregressive models. Audio samples are available at https://speechresearch.github.io/fastspeech2/.
Researcher Affiliation | Collaboration | 1 Zhejiang University {rayeren,chenxuhu,zhaozhou}@zju.edu.cn; 2 Microsoft Research Asia {xuta,taoqin,tyliu}@microsoft.com; 3 Microsoft Azure Speech Sheng.Zhao@microsoft.com
Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks, only architectural diagrams.
Open Source Code | No | The paper provides audio samples at a URL and refers to third-party tools (g2p, Parallel WaveGAN, the PyWorld vocoder), but it gives no link to, or explicit statement about, the open-sourcing of the authors' own FastSpeech 2/2s code. (A hedged PyWorld usage sketch appears after this table.)
Open Datasets | Yes | We evaluate FastSpeech 2 and 2s on the LJSpeech dataset (Ito, 2017). LJSpeech contains 13,100 English audio clips (about 24 hours) and corresponding text transcripts.
Dataset Splits | Yes | We split the dataset into three sets: 12,228 samples for training, 349 samples (with document title LJ003) for validation, and 523 samples (with document titles LJ001 and LJ002) for testing. (A split sketch appears after this table.)
Hardware Specification | Yes | The training and inference latency tests are conducted on a server with 36 Intel Xeon CPUs, 256 GB of memory, and 1 NVIDIA V100 GPU, with a batch size of 48 for training and 1 for inference.
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, the PyWorld vocoder, the Montreal Forced Aligner (MFA), Parallel WaveGAN, and a g2p tool, but it specifies no version numbers for these, nor for the general programming environment (e.g., Python or PyTorch).
Experiment Setup | Yes | Model Configuration: Our FastSpeech 2 consists of 4 feed-forward Transformer (FFT) blocks... The output linear layer in the decoder converts the hidden states into 80-dimensional mel-spectrograms, and our model is optimized with mean absolute error (MAE). More detailed configurations of FastSpeech 2 and 2s are given in Appendix A, training and inference details in Appendix B, and the hyperparameters of Transformer TTS, FastSpeech, and FastSpeech 2/2s in Table 7. (A minimal output-layer/MAE sketch appears below.)
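
The split quoted in the Dataset Splits row can be reproduced mechanically from LJSpeech's metadata.csv, whose clip IDs (e.g. LJ001-0001) encode the document title. Below is a minimal Python sketch, assuming the standard LJSpeech-1.1 layout; the file path and function name are illustrative, not from the paper.

```python
# Illustrative sketch: recreate the paper's LJSpeech split from metadata.csv.
# LJ001/LJ002 -> test, LJ003 -> validation, everything else -> training.
from pathlib import Path

def split_ljspeech(metadata_path="LJSpeech-1.1/metadata.csv"):
    train, val, test = [], [], []
    for line in Path(metadata_path).read_text(encoding="utf-8").splitlines():
        clip_id = line.split("|", 1)[0]      # e.g. "LJ001-0001"
        document = clip_id.split("-")[0]     # e.g. "LJ001"
        if document in ("LJ001", "LJ002"):
            test.append(line)
        elif document == "LJ003":
            val.append(line)
        else:
            train.append(line)
    return train, val, test

train, val, test = split_ljspeech()
print(len(train), len(val), len(test))  # expected 12228, 349, 523 per the paper
```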
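
The PyWorld toolkit named in the Open Source Code and Software Dependencies rows is commonly used to extract the pitch (F0) contour that FastSpeech 2 conditions on. A hedged sketch of such extraction follows, using PyWorld's dio/stonemask pipeline; the wav file name and the 256-sample hop size are assumptions, not values stated in this report.

```python
# Hedged sketch: F0 extraction with PyWorld (not the authors' code).
import numpy as np
import pyworld as pw
import soundfile as sf

wav, fs = sf.read("LJ001-0001.wav")                # assumed LJSpeech clip
wav = np.ascontiguousarray(wav, dtype=np.float64)  # PyWorld requires float64
frame_period_ms = 256 / fs * 1000                  # align F0 frames to an assumed 256-sample hop

f0, timeaxis = pw.dio(wav, fs, frame_period=frame_period_ms)  # coarse F0 estimate
f0 = pw.stonemask(wav, f0, timeaxis, fs)                      # refined F0 estimate
print(f0.shape, float(f0[f0 > 0].mean()))          # frame count, mean voiced F0 in Hz
```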
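
To make the quoted model configuration concrete, here is a minimal PyTorch sketch of the decoder's output linear layer projecting hidden states to 80-dimensional mel-spectrogram frames, trained with MAE (L1) loss. The hidden size, batch size, and frame count are illustrative assumptions, not the paper's hyperparameters.

```python
# Minimal sketch (not the authors' code): decoder output projection + MAE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, n_mels = 256, 80                   # hidden_dim is an assumed value
mel_linear = nn.Linear(hidden_dim, n_mels)     # hidden states -> 80-dim mel frames

decoder_out = torch.randn(8, 200, hidden_dim)  # (batch, frames, hidden), dummy data
mel_target = torch.randn(8, 200, n_mels)       # dummy ground-truth mel frames

mel_pred = mel_linear(decoder_out)
loss = F.l1_loss(mel_pred, mel_target)         # mean absolute error (MAE)
loss.backward()
print(float(loss))
```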