YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone

Authors: Edresson Casanova, Julian Weber, Christopher D Shulby, Arnaldo Candido Junior, Eren Gölge, Moacir A Ponti

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity with reasonable quality. This is important to allow synthesis for speakers whose voice or recording characteristics differ greatly from those seen during training.
Researcher Affiliation Collaboration 1 Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Brazil; 2 Coqui, Germany; 3 Sopra Banking Software, France; 4 Defined.ai, United States of America; 5 Federal University of Technology - Paraná, Brazil; 6 Mercado Livre, Brazil.
Pseudocode No The paper includes diagrams (Figure 1) but does not provide structured pseudocode or algorithm blocks.
Open Source Code Yes For reproducibility, our source code is available in the Coqui TTS repository (https://github.com/coqui-ai/TTS), as well as the model checkpoints of all experiments (https://github.com/Edresson/YourTTS).
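To illustrate how the released checkpoints can be used, below is a minimal, hypothetical usage sketch against the Coqui TTS Python API (PyPI package TTS); the model name, language code, and file paths are assumptions based on the public Coqui TTS model zoo, not quotes from the paper.

```python
# Hypothetical sketch: zero-shot multi-speaker synthesis with the released YourTTS
# checkpoint via the Coqui TTS Python API. Model name, language code, and paths are
# assumed from the public model zoo and may differ between TTS releases.
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False)

# Clone the voice in `reference.wav` (any short clip of the target speaker).
tts.tts_to_file(
    text="This sentence is synthesized in the voice of the reference speaker.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav",
)
```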
Open Datasets Yes English: VCTK (Veaux et al., 2016) dataset... Furthermore, in some experiments we used the subsets train-clean-100 and train-clean-360 of the LibriTTS dataset (Zen et al., 2019)... Portuguese: TTS-Portuguese Corpus (Casanova et al., 2022)... French: fr_FR set of the M-AILABS dataset (Munich Artificial Intelligence Laboratories GmbH, 2017)... Finally, for speaker adaptation experiments, to mimic a more realistic setting, we used 4 speakers from the Common Voice dataset (Ardila et al., 2020).
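As a convenience, here is a small sketch for fetching two of the public corpora named above with torchaudio's dataset helpers; the local path and download flags are assumptions (download support depends on the torchaudio version), and the remaining corpora (TTS-Portuguese, M-AILABS, Common Voice) are obtained from their own project pages.

```python
# Sketch: downloading VCTK and the LibriTTS subsets named above with torchaudio.
# The root path is illustrative; download support depends on the installed torchaudio.
import torchaudio

root = "./data"
vctk = torchaudio.datasets.VCTK_092(root, download=True)
libritts_100 = torchaudio.datasets.LIBRITTS(root, url="train-clean-100", download=True)
libritts_360 = torchaudio.datasets.LIBRITTS(root, url="train-clean-360", download=True)

waveform, sample_rate, *rest = vctk[0]  # inspect the first utterance
print(waveform.shape, sample_rate)
```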
Dataset Splits Yes We divided the VCTK dataset into: train, development (containing the same speakers as the train set) and test. ... For Portuguese we randomly selected 500 samples and the rest of the dataset was used for training.
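One plausible way to implement these splits is sketched below; the dev fraction, random seed, and function names are illustrative assumptions, not taken from the released code.

```python
# Sketch: VCTK split where the development set reuses training speakers, plus a Portuguese
# evaluation set of 500 randomly selected samples. All names and parameters are illustrative.
import random

def split_vctk(utterances, test_speakers, dev_fraction=0.02, seed=42):
    """utterances: list of (speaker_id, wav_path); test_speakers: speakers reserved for test."""
    rng = random.Random(seed)
    test = [u for u in utterances if u[0] in test_speakers]
    rest = [u for u in utterances if u[0] not in test_speakers]
    rng.shuffle(rest)
    n_dev = int(len(rest) * dev_fraction)
    dev, train = rest[:n_dev], rest[n_dev:]  # dev shares speakers with train by construction
    return train, dev, test

def split_portuguese(utterances, n_eval=500, seed=42):
    rng = random.Random(seed)
    pool = list(utterances)
    rng.shuffle(pool)
    return pool[n_eval:], pool[:n_eval]  # (train, eval): 500 random samples held out
```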
Hardware Specification Yes The models were trained using an NVIDIA TESLA V100 32GB with a batch size of 64.
Software Dependencies No The paper mentions software like the 'AdamW optimizer' and 'HiFi-GAN' but does not specify their version numbers. It also mentions the 'Webrtcvad toolkit' and 'ffmpeg-normalize' but without versions.
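Since versions are not pinned in the paper, one practical step is to record the versions installed in your own reproduction environment; the PyPI package names below are assumptions for the tools mentioned.

```python
# Sketch: print the installed versions of the tools mentioned above (package names assumed).
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "TTS", "webrtcvad", "ffmpeg-normalize"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed in this environment")
```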
Experiment Setup Yes In system 1, we start from a model trained 1M steps on LJSpeech (Ito et al., 2017) and continue the training for 200K steps with the VCTK dataset. ... For systems 2 and 3, training is done by continuing from the previous experiment for approximately 140k steps, learning one language at a time. In addition, for each experiment a fine-tuning was performed for 50k steps using the Speaker Consistency Loss (SCL), described in Section 2, with α = 9. ... For the TTS model training and for the discriminator of the HiFi-GAN vocoder we use the AdamW optimizer (Loshchilov & Hutter, 2017) with betas 0.8 and 0.99, weight decay 0.01, and an initial learning rate of 0.0002 decaying exponentially by a gamma of 0.999875 (Paszke et al., 2019).
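For concreteness, here is a minimal PyTorch sketch of the optimizer and learning-rate schedule quoted above; the placeholder model and training loop stand in for the actual Coqui TTS trainer and are not the authors' code.

```python
# Sketch: AdamW with betas (0.8, 0.99), weight decay 0.01, initial LR 0.0002, and an
# exponential decay with gamma 0.999875, as reported above. Model and loop are placeholders.
import torch

model = torch.nn.Linear(80, 80)  # stands in for the TTS network / HiFi-GAN discriminator

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    betas=(0.8, 0.99),
    weight_decay=0.01,
)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999875)

for step in range(100):  # stand-in training loop
    optimizer.zero_grad()
    loss = model(torch.randn(64, 80)).pow(2).mean()  # batch size 64, per the hardware entry
    loss.backward()
    optimizer.step()
    scheduler.step()  # whether decay is applied per step or per epoch is an assumption
```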