FastSpeech: Fast, Robust and Controllable Text to Speech
Authors: Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the LJSpeech dataset to test FastSpeech. The results show that, in terms of speech quality, FastSpeech nearly matches the autoregressive Transformer model. Furthermore, FastSpeech achieves a 270x speedup on mel-spectrogram generation and a 38x speedup on final speech synthesis compared with the autoregressive Transformer TTS model, almost eliminates the problem of word skipping and repeating, and can adjust voice speed smoothly. We conduct the MOS (mean opinion score) evaluation on the test set to measure the audio quality. We conduct ablation studies to verify the effectiveness of several components in FastSpeech, including 1D convolution and sequence-level knowledge distillation. |
| Researcher Affiliation | Collaboration | Yi Ren (Zhejiang University, rayeren@zju.edu.cn); Yangjun Ruan (Zhejiang University, ruanyj3107@zju.edu.cn); Xu Tan (Microsoft Research, xuta@microsoft.com); Tao Qin (Microsoft Research, taoqin@microsoft.com); Sheng Zhao (Microsoft STC Asia, Sheng.Zhao@microsoft.com); Zhou Zhao (Zhejiang University, zhaozhou@zju.edu.cn); Tie-Yan Liu (Microsoft Research, tyliu@microsoft.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. It only links to synthesized speech samples and a third-party tool (WaveGlow). |
| Open Datasets | Yes | We conduct experiments on the LJSpeech dataset [10], which contains 13,100 English audio clips and the corresponding text transcripts, with a total audio length of approximately 24 hours. We randomly split the dataset into 3 sets: 12500 samples for training, 300 samples for validation and 300 samples for testing. In order to alleviate the mispronunciation problem, we convert the text sequence into the phoneme sequence with our internal grapheme-to-phoneme conversion tool [23], following [1, 22, 27]. For the speech data, we convert the raw waveform into mel-spectrograms following [22]. Our frame size and hop size are set to 1024 and 256, respectively. [10] Keith Ito. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/, 2017. (A hedged preprocessing and split sketch using these parameters follows the table.) |
| Dataset Splits | Yes | We randomly split the dataset into 3 sets: 12500 samples for training, 300 samples for validation and 300 samples for testing. |
| Hardware Specification | Yes | The evaluation is conducted on a server with 12 Intel Xeon CPUs, 256 GB of memory, 1 NVIDIA V100 GPU, and a batch size of 1. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. It mentions the Adam optimizer and WaveGlow, but without version information. |
| Experiment Setup | Yes | We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^-9 and follow the same learning rate schedule as in [25]. It takes 80k steps of training until convergence. We train the FastSpeech model together with the duration predictor. The optimizer and other hyper-parameters for FastSpeech are the same as for the autoregressive Transformer TTS model. FastSpeech model training takes about 80k steps on 4 NVIDIA V100 GPUs. The FastSpeech model consists of 6 FFT blocks on both the phoneme side and the mel-spectrogram side. The size of the phoneme vocabulary is 51, including punctuation. The dimension of the phoneme embeddings, the hidden size of the self-attention and the 1D convolution in the FFT block are all set to 384. The number of attention heads is set to 2. The kernel sizes of the 1D convolutions in the 2-layer convolutional network are both set to 3, with input/output sizes of 384/1536 for the first layer and 1536/384 for the second layer. The output linear layer converts the 384-dimensional hidden states into the 80-dimensional mel-spectrogram. In the duration predictor, the kernel sizes of the 1D convolutions are set to 3, with input/output sizes of 384/384 for both layers. (Hedged sketches of the optimizer schedule and of these convolutional modules follow the table.) |
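
As a concrete illustration of the dataset preprocessing and split quoted above, the following is a minimal sketch using librosa rather than the authors' internal pipeline (which follows the Tacotron 2-style preprocessing of [22]). The 22050 Hz sample rate and 80 mel bins are assumptions taken from the public LJSpeech release and the model's output dimension; only the frame size (1024), hop size (256), and the 12500/300/300 split are stated in the paper.

```python
import random
import librosa
import numpy as np

# Assumed values: LJSpeech ships at 22050 Hz and the model outputs 80 mel bins;
# the paper itself only states frame size 1024 and hop size 256.
SAMPLE_RATE = 22050
N_FFT = 1024        # frame size
HOP_LENGTH = 256    # hop size
N_MELS = 80

def wav_to_mel(path: str) -> np.ndarray:
    """Convert a raw waveform to a (n_mels, frames) log-mel-spectrogram."""
    wav, _ = librosa.load(path, sr=SAMPLE_RATE)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SAMPLE_RATE, n_fft=N_FFT,
        hop_length=HOP_LENGTH, n_mels=N_MELS)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))

def split_ljspeech(clip_ids: list[str], seed: int = 1234):
    """Random 12500/300/300 train/validation/test split, as reported in the paper."""
    ids = clip_ids.copy()
    random.Random(seed).shuffle(ids)  # the paper's actual random seed is not given
    return ids[:12500], ids[12500:12800], ids[12800:13100]
```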
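
The optimizer settings quoted in the setup row (β1 = 0.9, β2 = 0.98, ε = 10^-9) and the Transformer learning-rate schedule of [25] can be wired up in PyTorch as below. This is a sketch under stated assumptions, not the authors' code: the warmup step count is an assumption, since the paper only says it follows the schedule of [25], where lr = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5).

```python
import torch

D_MODEL = 384          # hidden size of the FFT blocks (from the paper)
WARMUP_STEPS = 4000    # assumed; the paper only references the schedule of [25]

def build_optimizer(model: torch.nn.Module):
    # Base lr is set to 1.0 so the LambdaLR multiplier is the actual learning rate.
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

    def noam_lr(step: int) -> float:
        step = max(step, 1)
        return D_MODEL ** -0.5 * min(step ** -0.5, step * WARMUP_STEPS ** -1.5)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called after each `optimizer.step()` so the learning rate follows the warmup-then-decay curve of [25].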
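
The FFT block's position-wise convolutional network and the duration predictor are described dimension-by-dimension in the setup row. The PyTorch sketch below mirrors those sizes (384/1536/384 and 384/384, kernel size 3); dropout rates and the exact placement of layer normalization are assumptions not stated in the quoted text.

```python
import torch
from torch import nn

class FFTConvNet(nn.Module):
    """2-layer 1D conv network inside an FFT block: 384 -> 1536 -> 384, kernel 3."""
    def __init__(self, d_model=384, d_inner=1536, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_inner, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(d_inner, d_model, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, time, d_model)
        residual = x
        y = torch.relu(self.conv1(x.transpose(1, 2)))  # convolve over the time axis
        y = self.conv2(y).transpose(1, 2)
        return self.norm(residual + self.dropout(y))

class DurationPredictor(nn.Module):
    """2-layer 1D conv network, 384 -> 384 per layer, kernel 3, scalar output per phoneme."""
    def __init__(self, d_model=384, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(d_model, 1)      # predicted (log-)duration per phoneme

    def forward(self, x):                      # x: (batch, time, d_model)
        y = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm1(y))
        y = torch.relu(self.conv2(y.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm2(y))
        return self.proj(y).squeeze(-1)        # (batch, time)
```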