AdaSpeech: Adaptive Text to Speech for Custom Voice
Authors: Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, Tie-Yan Liu
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. We conduct experiments to train the TTS model on the LibriTTS dataset and adapt the model on VCTK and LJSpeech datasets with different adaptation settings. |
| Researcher Affiliation | Industry | Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, Tie-Yan Liu. Microsoft Research Asia, Microsoft Azure Speech. {xuta,taoqin,szhao,tyliu}@microsoft.com |
| Pseudocode | Yes | Algorithm 1: Pre-training, fine-tuning and inference of AdaSpeech |
| Open Source Code | No | The audio samples are available at https://speechresearch.github.io/adaspeech/. The paper links to audio samples but provides no link to, or statement about, source code for the method itself. |
| Open Datasets | Yes | We train the AdaSpeech source model on the LibriTTS (Zen et al., 2019) dataset, which is a multi-speaker corpus (2456 speakers) derived from LibriSpeech (Panayotov et al., 2015) and contains 586 hours of speech data. In order to evaluate AdaSpeech in the custom voice scenario, we adapt the source model to the voices in other datasets including VCTK (Veaux et al., 2016) (a multi-speaker dataset with 108 speakers and 44 hours of speech data) and LJSpeech (Ito, 2017) (a single-speaker high-quality dataset with 24 hours of speech data). |
| Dataset Splits | No | We use all the speakers in the training set of LibriTTS (excluding those chosen for adaptation) to train the source AdaSpeech model, and use the original test sets in these datasets corresponding to the adaptation speakers to evaluate the adaptation voice quality. The paper mentions training and test sets but does not explicitly specify a validation split or its details. |
| Hardware Specification | Yes | We train AdaSpeech on 4 NVIDIA P40 GPUs and each GPU has a batch size of about 12,500 speech frames. |
| Software Dependencies | No | Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. We use MFA (McAuliffe et al., 2017) to extract the alignment between the phoneme and mel-spectrogram sequence... We use MelGAN (Kumar et al., 2019) as the vocoder to synthesize waveform from the generated mel-spectrogram. The paper names the Adam optimizer and specific tools/models (MFA, MelGAN) but does not provide version numbers for any software dependencies (a minimal optimizer sketch follows the table). |
| Experiment Setup | Yes | The hidden dimension (including the phoneme embedding, speaker embedding, the hidden in self-attention, and the input and output hidden of feed-forward network) is set to 256. The number of attention heads, the feed-forward filter size and kernel size are set to 2, 1024 and 9 respectively. The output linear layer converts the 256-dimensional hidden into 80-dimensional mel-spectrogram... We first train AdaSpeech for 60,000 steps... Then we train AdaSpeech and the phoneme-level acoustic predictor jointly for the remaining 40,000 steps... Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. In the adaptation process, we fine-tune AdaSpeech on 1 NVIDIA P40 GPU for 2000 steps, where only the parameters of speaker embedding and conditional layer-normalization are optimized (see the adaptation sketch after this table). |
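As a concrete reading of the reported optimizer settings, here is a minimal PyTorch-style sketch. Only the Adam β1, β2 and ϵ values are quoted above; the stand-in model and learning rate are assumptions for illustration.

```python
import torch

# Stand-in module; AdaSpeech's actual FastSpeech 2-style architecture is not
# reproduced here.
model = torch.nn.Linear(256, 80)

# Adam hyperparameters as reported: beta1 = 0.9, beta2 = 0.98, eps = 1e-9.
# The learning rate and schedule are not quoted in this table, so lr=1e-3 is
# an assumption.
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-9
)
```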
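The adaptation setting quoted in the Experiment Setup row updates only the speaker embedding and conditional layer-normalization parameters. The sketch below illustrates what that selective fine-tuning could look like in PyTorch; since no official code is released, the class, layer shapes, and parameter-name substrings (`speaker_embedding`, `cond_layer_norm`) are hypothetical.

```python
import torch
import torch.nn as nn


class ConditionalLayerNorm(nn.Module):
    """Layer norm whose scale and bias are predicted from a speaker embedding,
    illustrating the conditional layer normalization idea described in the
    paper. Shapes and naming are assumptions, not the authors' code."""

    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden); spk_emb: (batch, speaker_dim)
        scale = self.to_scale(spk_emb).unsqueeze(1)  # (batch, 1, hidden)
        bias = self.to_bias(spk_emb).unsqueeze(1)    # (batch, 1, hidden)
        x = nn.functional.layer_norm(x, x.shape[-1:])
        return scale * x + bias


def freeze_for_adaptation(model: nn.Module) -> None:
    """Freeze all parameters except the speaker embedding and conditional
    layer-normalization ones, mirroring the adaptation setting quoted above.
    The name substrings are hypothetical module names."""
    for name, param in model.named_parameters():
        param.requires_grad = (
            "speaker_embedding" in name or "cond_layer_norm" in name
        )
```

After such freezing, only the small speaker-specific parameter set (about 5K parameters per speaker, per the Research Type row) would remain trainable during the 2,000 adaptation steps.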