YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone
Authors: Edresson Casanova, Julian Weber, Christopher D Shulby, Arnaldo Candido Junior, Eren Gölge, Moacir A Ponti
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method builds upon the VITS model and adds several novel modifications for zero-shot multispeaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multispeaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multispeaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training. |
| Researcher Affiliation | Collaboration | 1 Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Brazil; 2 Coqui, Germany; 3 Sopra Banking Software, France; 4 Defined.ai, United States of America; 5 Federal University of Technology Paraná, Brazil; 6 Mercado Livre, Brazil. |
| Pseudocode | No | The paper includes diagrams (Figure 1) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | For reproducibility, our source code is available in Coqui TTS (https://github.com/coqui-ai/TTS), as well as the model checkpoints of all experiments (https://github.com/Edresson/YourTTS). |
| Open Datasets | Yes | English: VCTK (Veaux et al., 2016) dataset... Furthermore, in some experiments we used the subsets train-clean-100 and train-clean-360 of the LibriTTS dataset (Zen et al., 2019)... Portuguese: TTS-Portuguese Corpus (Casanova et al., 2022)... French: fr_FR set of the M-AILABS dataset (Munich Artificial Intelligence Laboratories GmbH, 2017)... Finally, for speaker adaptation experiments, to mimic a more realistic setting, we used 4 speakers from the Common Voice dataset (Ardila et al., 2020). |
| Dataset Splits | Yes | We divided the VCTK dataset into: train, development (containing the same speakers as the train set) and test. ... For Portuguese we randomly selected 500 samples and the rest of the dataset was used for training. |
| Hardware Specification | Yes | The models were trained using an NVIDIA TESLA V100 32GB with a batch size of 64. |
| Software Dependencies | No | The paper mentions software like the 'AdamW optimizer' and 'HiFi-GAN' but does not specify their version numbers. It also mentions the 'Webrtcvad toolkit' and 'ffmpeg-normalize' but without versions. |
| Experiment Setup | Yes | In system 1, we start from a model trained 1M steps on LJSpeech (Ito et al., 2017) and continue the training for 200K steps with the VCTK dataset. ... For systems 2 and 3, training is done by continuing from the previous experiment for approximately 140k steps, learning one language at a time. In addition, for each experiment a fine-tuning was performed for 50k steps using the Speaker Consistency Loss (SCL), described in section 2, with α = 9. ... For the TTS model training and for the vocoder HiFi-GAN discriminator we use the AdamW optimizer (Loshchilov & Hutter, 2017) with betas 0.8 and 0.99, weight decay 0.01 and an initial learning rate of 0.0002 decaying exponentially by a gamma of 0.999875 (Paszke et al., 2019). A minimal sketch of this optimizer and schedule configuration is given below the table. |
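
The excerpt above fully specifies the optimizer hyperparameters (AdamW, betas 0.8/0.99, weight decay 0.01, initial learning rate 0.0002, exponential decay with gamma 0.999875). The PyTorch sketch below shows one way to reproduce that configuration; the `model` module, the dummy loss, and the assumption that the scheduler is stepped once per training step (rather than per epoch) are illustrative guesses, not details confirmed by the paper.

```python
import torch

# Stand-in module for illustration only; in the paper this would be the
# YourTTS/VITS generator or the HiFi-GAN discriminator parameters.
model = torch.nn.Linear(10, 10)

# AdamW with the hyperparameters quoted in the Experiment Setup row.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,             # initial learning rate 0.0002
    betas=(0.8, 0.99),   # betas reported in the paper
    weight_decay=0.01,   # weight decay reported in the paper
)

# Exponential learning-rate decay with gamma = 0.999875.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999875)

for step in range(3):  # placeholder training loop
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).pow(2).mean()  # dummy loss for the sketch
    loss.backward()
    optimizer.step()
    scheduler.step()  # assumed per-step decay; the paper does not state the interval
```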