FastSpeech: Fast, Robust and Controllable Text to Speech
Authors: Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the LJSpeech dataset to test FastSpeech. The results show that, in terms of speech quality, FastSpeech nearly matches the autoregressive Transformer model. Furthermore, FastSpeech achieves a 270x speedup on mel-spectrogram generation and a 38x speedup on final speech synthesis compared with the autoregressive Transformer TTS model, almost eliminates the problem of word skipping and repeating, and can adjust voice speed smoothly. We conduct the MOS (mean opinion score) evaluation on the test set to measure the audio quality. We conduct ablation studies to verify the effectiveness of several components in FastSpeech, including 1D convolution and sequence-level knowledge distillation. |
| Researcher Affiliation | Collaboration | Yi Ren (Zhejiang University, rayeren@zju.edu.cn); Yangjun Ruan (Zhejiang University, ruanyj3107@zju.edu.cn); Xu Tan (Microsoft Research, xuta@microsoft.com); Tao Qin (Microsoft Research, taoqin@microsoft.com); Sheng Zhao (Microsoft STC Asia, Sheng.Zhao@microsoft.com); Zhou Zhao (Zhejiang University, zhaozhou@zju.edu.cn); Tie-Yan Liu (Microsoft Research, tyliu@microsoft.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. It only links to synthesized speech samples and a third-party tool (WaveGlow). |
| Open Datasets | Yes | We conduct experiments on the LJSpeech dataset [10], which contains 13,100 English audio clips and the corresponding text transcripts, with a total audio length of approximately 24 hours. We randomly split the dataset into 3 sets: 12500 samples for training, 300 samples for validation and 300 samples for testing. In order to alleviate the mispronunciation problem, we convert the text sequence into the phoneme sequence with our internal grapheme-to-phoneme conversion tool [23], following [1, 22, 27]. For the speech data, we convert the raw waveform into mel-spectrograms following [22]. Our frame size and hop size are set to 1024 and 256, respectively. [10] Keith Ito. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/, 2017. (A hedged preprocessing and split sketch using these parameters follows the table.) |
| Dataset Splits | Yes | We randomly split the dataset into 3 sets: 12500 samples for training, 300 samples for validation and 300 samples for testing. |
| Hardware Specification | Yes | The evaluation is conducted on a server with 12 Intel Xeon CPUs, 256 GB of memory, 1 NVIDIA V100 GPU, and a batch size of 1. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. It mentions the Adam optimizer and WaveGlow, but without version information. |
| Experiment Setup | Yes | We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^-9 and follow the same learning rate schedule as in [25]. It takes 80k steps of training until convergence. We train the FastSpeech model together with the duration predictor. The optimizer and other hyper-parameters for FastSpeech are the same as for the autoregressive Transformer TTS model. FastSpeech model training takes about 80k steps on 4 NVIDIA V100 GPUs. The FastSpeech model consists of 6 FFT blocks on both the phoneme side and the mel-spectrogram side. The size of the phoneme vocabulary is 51, including punctuation. The dimension of the phoneme embeddings, the hidden size of the self-attention and the 1D convolution in the FFT block are all set to 384. The number of attention heads is set to 2. The kernel sizes of the 1D convolutions in the 2-layer convolutional network are both set to 3, with input/output sizes of 384/1536 for the first layer and 1536/384 for the second layer. The output linear layer converts the 384-dimensional hidden states into the 80-dimensional mel-spectrogram. In the duration predictor, the kernel sizes of the 1D convolutions are set to 3, with input/output sizes of 384/384 for both layers. (Hedged sketches of the optimizer schedule and of these convolutional modules follow the table.) |
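
As a concrete illustration of the dataset preprocessing and split quoted above, the following is a minimal sketch using librosa rather than the authors' internal pipeline (which follows the Tacotron 2-style preprocessing of [22]). The 22050 Hz sample rate and 80 mel bins are assumptions taken from the public LJSpeech release and the model's output dimension; only the frame size (1024), hop size (256), and the 12500/300/300 split are stated in the paper.

```python
import random
import librosa
import numpy as np

# Assumed values: LJSpeech ships at 22050 Hz and the model outputs 80 mel bins;
# the paper itself only states frame size 1024 and hop size 256.
SAMPLE_RATE = 22050
N_FFT = 1024        # frame size
HOP_LENGTH = 256    # hop size
N_MELS = 80

def wav_to_mel(path: str) -> np.ndarray:
    """Convert a raw waveform to a (n_mels, frames) log-mel-spectrogram."""
    wav, _ = librosa.load(path, sr=SAMPLE_RATE)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SAMPLE_RATE, n_fft=N_FFT,
        hop_length=HOP_LENGTH, n_mels=N_MELS)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))

def split_ljspeech(clip_ids: list[str], seed: int = 1234):
    """Random 12500/300/300 train/validation/test split, as reported in the paper."""
    ids = clip_ids.copy()
    random.Random(seed).shuffle(ids)  # the paper's actual random seed is not given
    return ids[:12500], ids[12500:12800], ids[12800:13100]
```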
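
The optimizer settings quoted in the setup row (β1 = 0.9, β2 = 0.98, ε = 10^-9) and the Transformer learning-rate schedule of [25] can be wired up in PyTorch as below. This is a sketch under stated assumptions, not the authors' code: the warmup step count is an assumption, since the paper only says it follows the schedule of [25], where lr = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5).

```python
import torch

D_MODEL = 384          # hidden size of the FFT blocks (from the paper)
WARMUP_STEPS = 4000    # assumed; the paper only references the schedule of [25]

def build_optimizer(model: torch.nn.Module):
    # Base lr is set to 1.0 so the LambdaLR multiplier is the actual learning rate.
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

    def noam_lr(step: int) -> float:
        step = max(step, 1)
        return D_MODEL ** -0.5 * min(step ** -0.5, step * WARMUP_STEPS ** -1.5)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called after each `optimizer.step()` so the learning rate follows the warmup-then-decay curve of [25].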
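
The FFT block's position-wise convolutional network and the duration predictor are described dimension-by-dimension in the setup row. The PyTorch sketch below mirrors those sizes (384/1536/384 and 384/384, kernel size 3); dropout rates and the exact placement of layer normalization are assumptions not stated in the quoted text.

```python
import torch
from torch import nn

class FFTConvNet(nn.Module):
    """2-layer 1D conv network inside an FFT block: 384 -> 1536 -> 384, kernel 3."""
    def __init__(self, d_model=384, d_inner=1536, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_inner, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(d_inner, d_model, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, time, d_model)
        residual = x
        y = torch.relu(self.conv1(x.transpose(1, 2)))  # convolve over the time axis
        y = self.conv2(y).transpose(1, 2)
        return self.norm(residual + self.dropout(y))

class DurationPredictor(nn.Module):
    """2-layer 1D conv network, 384 -> 384 per layer, kernel 3, scalar output per phoneme."""
    def __init__(self, d_model=384, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(d_model, 1)      # predicted (log-)duration per phoneme

    def forward(self, x):                      # x: (batch, time, d_model)
        y = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm1(y))
        y = torch.relu(self.conv2(y.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm2(y))
        return self.proj(y).squeeze(-1)        # (batch, time)
```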