Bag of Tricks for Unsupervised Text-to-Speech

Authors: Yi Ren, Chen Zhang, Shuicheng Yan

ICLR 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In this section, we conduct experiments to evaluate the effectiveness of our proposed method for unsupervised TTS. We first describe the experiment settings, show the results of our method, and conduct some analyses of our method. |
| Researcher Affiliation | Collaboration | Yi Ren¹, Chen Zhang²,¹, Shuicheng Yan¹ (¹SEA AI Lab, ²Zhejiang University); rayeren613@gmail.com, zc99@zju.edu.cn, yansc@sea.com |
| Pseudocode | Yes | The detailed unsupervised training algorithm is shown in Algorithm 1 (captioned "Unsupervised TTS Training"). |
| Open Source Code | No | The paper provides a link (https://unsupertts-tricks.github.io) to generated samples, but this is a project demonstration page and does not explicitly provide concrete access to the source code for the methodology described in the paper. |
| Open Datasets | Yes | We choose the speech and text data from the Common Voice dataset (Ardila et al., 2019) for training, and English and Indonesian as the target low-resource languages. [...] We use LJSpeech (Ito, 2017) as the Sref to provide the speaker timbre and suppress the background noise for the voice conversion model. |
| Dataset Splits | Yes | We split the target language data into two halves. We take unpaired speech data from the first half and text data from the second, so as to guarantee the speech and text data are disjoint. We randomly select 100 utterances in English and Indonesian for validation and another 100 utterances in Indonesian for testing. (See the split sketch below the table.) |
| Hardware Specification | Yes | We train our VC, TTS, and ASR models on 1 NVIDIA A100 GPU with batch size 128. |
| Software Dependencies | No | The paper does not explicitly provide specific version numbers for the software dependencies or libraries used in the implementation of its models and experiments. |
| Experiment Setup | Yes | We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10⁻⁹ and learning rate 2e-4. The training takes nearly 3 days. The output mel-spectrograms are converted to waveform using a HiFi-GAN (Kong et al., 2020) pre-trained on LJSpeech (Ito, 2017). The focus rate F_thres, N_steps, p_cat and p_aux in back-translation are set to 0.2, 20k, 0.2 and 0.2. (See the optimizer sketch below the table.) |
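
The Dataset Splits row above describes a disjoint speech/text split plus small held-out sets. The following is a minimal sketch of one way to implement such a split, assuming the data is available as a flat list of utterance records; the function and variable names (make_unpaired_splits, speech_pool, text_pool) are illustrative only, and drawing the validation/test utterances from the speech half is an assumption, not the paper's released code.

```python
import random

def make_unpaired_splits(utterances, n_valid=100, n_test=100, seed=0):
    """Sketch of a disjoint speech/text split: halve the data, use only
    audio from the first half and only transcripts from the second, and
    hold out small validation/test sets before training."""
    rng = random.Random(seed)
    utterances = list(utterances)
    rng.shuffle(utterances)

    half = len(utterances) // 2
    speech_pool = utterances[:half]   # only the audio of these is used
    text_pool = utterances[half:]     # only the transcripts of these are used

    # Hold out utterances for validation and testing (assumed to come
    # from the speech half here; the paper only states the counts).
    valid = speech_pool[:n_valid]
    test = speech_pool[n_valid:n_valid + n_test]
    train_speech = speech_pool[n_valid + n_test:]
    train_text = text_pool

    return train_speech, train_text, valid, test
```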
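
The Experiment Setup row quotes concrete Adam hyperparameters. Below is a minimal PyTorch sketch of that optimizer configuration, assuming PyTorch is the framework (the paper does not state which one is used); the placeholder model is hypothetical and stands in for the VC, TTS, and ASR networks.

```python
import torch

# Hypothetical stand-in for the trained model (VC / TTS / ASR in the paper).
model = torch.nn.Linear(80, 80)

# Adam settings quoted in the Experiment Setup row:
# beta1 = 0.9, beta2 = 0.98, eps = 1e-9, learning rate 2e-4.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.98),
    eps=1e-9,
)
```

β2 = 0.98 with a small ε is a common Transformer-style Adam configuration, so the quoted values are consistent with standard sequence-model training recipes.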