Bag of Tricks for Unsupervised Text-to-Speech

Authors: Yi Ren, Chen Zhang, Shuicheng Yan

ICLR 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In this section, we conduct experiments to evaluate the effectiveness of our proposed method for unsupervised TTS. We first describe the experiment settings, show the results of our method, and conduct some analyses of our method. |
| Researcher Affiliation | Collaboration | Yi Ren¹, Chen Zhang²,¹, Shuicheng Yan¹ (¹SEA AI Lab, ²Zhejiang University); rayeren613@gmail.com, zc99@zju.edu.cn, yansc@sea.com |
| Pseudocode | Yes | The detailed unsupervised training algorithm is shown in Algorithm 1 (captioned "Unsupervised TTS Training"). |
| Open Source Code | No | The paper provides a link (https://unsupertts-tricks.github.io) to generated samples, but this is a project demonstration page and does not explicitly provide concrete access to the source code for the methodology described in the paper. |
| Open Datasets | Yes | We choose the speech and text data from the Common Voice dataset (Ardila et al., 2019) for training, and English and Indonesian as the target low-resource languages. [...] We use LJSpeech (Ito, 2017) as the Sref to provide the speaker timbre and suppress the background noise for the voice conversion model. |
| Dataset Splits | Yes | We split the target language data into two halves. We take unpaired speech data from the first half and text data from the second, so as to guarantee the speech and text data are disjoint. We randomly select 100 utterances in English and Indonesian for validation and another 100 utterances in Indonesian for testing. (See the split sketch below the table.) |
| Hardware Specification | Yes | We train our VC, TTS, and ASR models on 1 NVIDIA A100 GPU with batch size 128. |
| Software Dependencies | No | The paper does not explicitly provide specific version numbers for the software dependencies or libraries used in the implementation of its models and experiments. |
| Experiment Setup | Yes | We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10⁻⁹ and learning rate 2e-4. The training takes nearly 3 days. The output mel-spectrograms are converted to waveform using a HiFi-GAN (Kong et al., 2020) pre-trained on LJSpeech (Ito, 2017). The focus rate F_thres, N_steps, p_cat and p_aux in back-translation are set to 0.2, 20k, 0.2 and 0.2. (See the optimizer sketch below the table.) |
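
The Dataset Splits row above describes a disjoint speech/text split plus small held-out sets. The following is a minimal sketch of one way to implement such a split, assuming the data is available as a flat list of utterance records; the function and variable names (make_unpaired_splits, speech_pool, text_pool) are illustrative only, and drawing the validation/test utterances from the speech half is an assumption, not the paper's released code.

```python
import random

def make_unpaired_splits(utterances, n_valid=100, n_test=100, seed=0):
    """Sketch of a disjoint speech/text split: halve the data, use only
    audio from the first half and only transcripts from the second, and
    hold out small validation/test sets before training."""
    rng = random.Random(seed)
    utterances = list(utterances)
    rng.shuffle(utterances)

    half = len(utterances) // 2
    speech_pool = utterances[:half]   # only the audio of these is used
    text_pool = utterances[half:]     # only the transcripts of these are used

    # Hold out utterances for validation and testing (assumed to come
    # from the speech half here; the paper only states the counts).
    valid = speech_pool[:n_valid]
    test = speech_pool[n_valid:n_valid + n_test]
    train_speech = speech_pool[n_valid + n_test:]
    train_text = text_pool

    return train_speech, train_text, valid, test
```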
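
The Experiment Setup row quotes concrete Adam hyperparameters. Below is a minimal PyTorch sketch of that optimizer configuration, assuming PyTorch is the framework (the paper does not state which one is used); the placeholder model is hypothetical and stands in for the VC, TTS, and ASR networks.

```python
import torch

# Hypothetical stand-in for the trained model (VC / TTS / ASR in the paper).
model = torch.nn.Linear(80, 80)

# Adam settings quoted in the Experiment Setup row:
# beta1 = 0.9, beta2 = 0.98, eps = 1e-9, learning rate 2e-4.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.98),
    eps=1e-9,
)
```

β2 = 0.98 with a small ε is a common Transformer-style Adam configuration, so the quoted values are consistent with standard sequence-model training recipes.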