AdaSpeech: Adaptive Text to Speech for Custom Voice
Authors: Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, Tie-Yan Liu
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. We conduct experiments to train the TTS model on the LibriTTS dataset and adapt the model on VCTK and LJSpeech datasets with different adaptation settings. |
| Researcher Affiliation | Industry | Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, Tie-Yan Liu. Microsoft Research Asia, Microsoft Azure Speech. {xuta,taoqin,szhao,tyliu}@microsoft.com |
| Pseudocode | Yes | Algorithm 1: Pre-training, fine-tuning and inference of AdaSpeech |
| Open Source Code | No | The audio samples are available at https://speechresearch.github.io/adaspeech/. The paper links to audio samples but provides no link to, or statement about, source code for the method itself. |
| Open Datasets | Yes | We train the AdaSpeech source model on the LibriTTS (Zen et al., 2019) dataset, which is a multi-speaker corpus (2456 speakers) derived from LibriSpeech (Panayotov et al., 2015) and contains 586 hours of speech data. In order to evaluate AdaSpeech in the custom voice scenario, we adapt the source model to the voices in other datasets including VCTK (Veaux et al., 2016) (a multi-speaker dataset with 108 speakers and 44 hours of speech data) and LJSpeech (Ito, 2017) (a single-speaker high-quality dataset with 24 hours of speech data). |
| Dataset Splits | No | We use all the speakers in the training set of LibriTTS (excluding those chosen for adaptation) to train the source AdaSpeech model, and use the original test sets in these datasets corresponding to the adaptation speakers to evaluate the adaptation voice quality. The paper mentions training and test sets but does not explicitly specify a validation split or its details. |
| Hardware Specification | Yes | We train AdaSpeech on 4 NVIDIA P40 GPUs and each GPU has a batch size of about 12,500 speech frames. |
| Software Dependencies | No | Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. We use MFA (McAuliffe et al., 2017) to extract the alignment between the phoneme and mel-spectrogram sequence... We use MelGAN (Kumar et al., 2019) as the vocoder to synthesize waveform from the generated mel-spectrogram. The paper names the Adam optimizer and specific tools/models (MFA, MelGAN) but does not provide version numbers for any software dependencies (a minimal optimizer sketch follows the table). |
| Experiment Setup | Yes | The hidden dimension (including the phoneme embedding, speaker embedding, the hidden in self-attention, and the input and output hidden of feed-forward network) is set to 256. The number of attention heads, the feed-forward filter size and kernel size are set to 2, 1024 and 9 respectively. The output linear layer converts the 256-dimensional hidden into 80-dimensional mel-spectrogram... We first train AdaSpeech for 60,000 steps... Then we train AdaSpeech and the phoneme-level acoustic predictor jointly for the remaining 40,000 steps... Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. In the adaptation process, we fine-tune AdaSpeech on 1 NVIDIA P40 GPU for 2000 steps, where only the parameters of speaker embedding and conditional layer-normalization are optimized (see the adaptation sketch after this table). |
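As a concrete reading of the reported optimizer settings, here is a minimal PyTorch-style sketch. Only the Adam β1, β2 and ϵ values are quoted above; the stand-in model and learning rate are assumptions for illustration.

```python
import torch

# Stand-in module; AdaSpeech's actual FastSpeech 2-style architecture is not
# reproduced here.
model = torch.nn.Linear(256, 80)

# Adam hyperparameters as reported: beta1 = 0.9, beta2 = 0.98, eps = 1e-9.
# The learning rate and schedule are not quoted in this table, so lr=1e-3 is
# an assumption.
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-9
)
```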
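The adaptation setting quoted in the Experiment Setup row updates only the speaker embedding and conditional layer-normalization parameters. The sketch below illustrates what that selective fine-tuning could look like in PyTorch; since no official code is released, the class, layer shapes, and parameter-name substrings (`speaker_embedding`, `cond_layer_norm`) are hypothetical.

```python
import torch
import torch.nn as nn


class ConditionalLayerNorm(nn.Module):
    """Layer norm whose scale and bias are predicted from a speaker embedding,
    illustrating the conditional layer normalization idea described in the
    paper. Shapes and naming are assumptions, not the authors' code."""

    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden); spk_emb: (batch, speaker_dim)
        scale = self.to_scale(spk_emb).unsqueeze(1)  # (batch, 1, hidden)
        bias = self.to_bias(spk_emb).unsqueeze(1)    # (batch, 1, hidden)
        x = nn.functional.layer_norm(x, x.shape[-1:])
        return scale * x + bias


def freeze_for_adaptation(model: nn.Module) -> None:
    """Freeze all parameters except the speaker embedding and conditional
    layer-normalization ones, mirroring the adaptation setting quoted above.
    The name substrings are hypothetical module names."""
    for name, param in model.named_parameters():
        param.requires_grad = (
            "speaker_embedding" in name or "cond_layer_norm" in name
        )
```

After such freezing, only the small speaker-specific parameter set (about 5K parameters per speaker, per the Research Type row) would remain trainable during the 2,000 adaptation steps.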