Semi-Supervised Generative Modeling for Controllable Speech Synthesis

Authors: Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate the efficacy of semi-supervised latent variable models for controllable TTS we trained the model described in section 2 on the above data-sets at varying levels of supervision as well as for varying settings of the hyperparameters: α, which controls the supervision loss, and γ, which over-emphasizes supervised training points." (See the objective sketch after this table.)
Researcher Affiliation | Collaboration | Raza Habib (University College London, UCL); Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, and Tom Bagby (Google Research).
Pseudocode | No | The paper describes the model architecture and training procedure in text and figures but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper links to audio samples and a demo page, but provides no link to, or statement about releasing, source code for the described method.
Open Datasets | Yes | "To verify the reproducibility of our results on a public dataset, we trained models to control speaking rate and F0 variation on the clean subset of the LibriTTS dataset (Zen et al., 2019)." (See the label-extraction sketch after this table.)
Dataset Splits | Yes | "The training set consists of 72,405 utterances with durations of at most 5 seconds (45 hours). The validation and test sets each contain 745 utterances or roughly 30 minutes of data." (See the duration-filter sketch after this table.)
Hardware Specification | Yes | "All models were trained using the ADAM optimizer with learning rate of 10^-3 and run for 300,000 training steps with a batch size of 256, distributed across 32 Google Cloud TPU chips."
Software Dependencies | No | "All models were implemented using tensorflow 1 (Abadi et al., 2016)." The paper names TensorFlow 1 but gives no specific version number for TensorFlow or any other software library.
Experiment Setup | Yes | "All models were trained using the ADAM optimizer with learning rate of 10^-3 and run for 300,000 training steps with a batch size of 256, distributed across 32 Google Cloud TPU chips." (See the training-loop sketch below.)
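
The Research Type row quotes two hyperparameters, α (the weight on the supervision loss) and γ (extra emphasis on supervised training points), without showing how they enter the objective. The following is only a minimal NumPy sketch of one plausible way to combine an unsupervised ELBO term with an α-weighted supervised term while over-weighting labelled examples by γ; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def semi_supervised_loss(elbo_per_example, sup_loss_per_example,
                         is_supervised, alpha=1.0, gamma=1.0):
    """Hypothetical combination of the unsupervised and supervised terms.

    elbo_per_example:     negative ELBO for each utterance in the batch, shape [B]
    sup_loss_per_example: regression loss on the observed latent attributes, shape [B]
                          (only meaningful where is_supervised is True)
    is_supervised:        boolean mask, shape [B]
    alpha:                weight on the supervision loss
    gamma:                extra weight on supervised training points
    """
    mask = is_supervised.astype(np.float32)
    # Over-emphasize supervised examples by gamma; unsupervised ones keep weight 1.
    example_weights = 1.0 + (gamma - 1.0) * mask
    unsupervised_term = example_weights * elbo_per_example
    supervised_term = alpha * mask * example_weights * sup_loss_per_example
    return np.mean(unsupervised_term + supervised_term)

# Toy usage with random per-example losses for a batch of 4 utterances.
rng = np.random.default_rng(0)
loss = semi_supervised_loss(
    elbo_per_example=rng.uniform(size=4).astype(np.float32),
    sup_loss_per_example=rng.uniform(size=4).astype(np.float32),
    is_supervised=np.array([True, False, False, True]),
    alpha=10.0,
    gamma=4.0,
)
print(f"toy semi-supervised loss: {loss:.3f}")
```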
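
The Open Datasets row says the public-data experiments control speaking rate and F0 variation on the LibriTTS clean subset, but the quoted excerpt does not specify how those supervision signals were computed. The snippet below is an assumed recipe only, using librosa's pyin pitch tracker for F0 variation and a crude words-per-second proxy for speaking rate; the authors' actual feature extraction may differ.

```python
import librosa
import numpy as np

def f0_variation(wav_path, fmin=50.0, fmax=500.0):
    """Assumed F0-variation label: std. dev. of voiced log-F0 from pyin."""
    audio, sr = librosa.load(wav_path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(audio, fmin=fmin, fmax=fmax, sr=sr)
    voiced_f0 = f0[voiced_flag & ~np.isnan(f0)]
    return float(np.std(np.log(voiced_f0))) if voiced_f0.size else 0.0

def speaking_rate(transcript, wav_path):
    """Assumed speaking-rate label: words per second (a rough proxy for phone rate)."""
    audio, sr = librosa.load(wav_path, sr=None)
    return len(transcript.split()) / (len(audio) / sr)

# Example usage (paths are placeholders):
# print(f0_variation("LibriTTS/train-clean-100/.../utt.wav"))
# print(speaking_rate("the quick brown fox", "LibriTTS/train-clean-100/.../utt.wav"))
```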
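
The Dataset Splits row states that training utterances were capped at 5 seconds, but no file lists are published. The sketch below shows one way such a duration filter could be applied to LibriTTS-style PCM WAV files using only the Python standard library; the resulting split is not guaranteed to match the authors' 72,405/745/745 partition.

```python
import wave
from pathlib import Path

MAX_SECONDS = 5.0  # duration cap quoted in the paper

def wav_duration(path):
    """Duration in seconds of a PCM WAV file (LibriTTS ships 24 kHz PCM WAVs)."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()

def short_utterances(root):
    """Yield WAV paths under `root` whose audio is at most MAX_SECONDS long."""
    for path in sorted(Path(root).rglob("*.wav")):
        if wav_duration(path) <= MAX_SECONDS:
            yield path

# Example usage (placeholder path):
# kept = list(short_utterances("LibriTTS/train-clean-100"))
# print(f"{len(kept)} utterances of at most {MAX_SECONDS} s")
```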
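
The Hardware Specification and Experiment Setup rows quote the same recipe: Adam with learning rate 10^-3, 300,000 steps, and batch size 256 distributed over 32 Cloud TPU chips. Since the paper only names TensorFlow 1 without further detail, the skeleton below reproduces those optimizer settings on a single device and leaves the TPU distribution and the model itself (`build_model_loss`) as placeholders.

```python
import tensorflow as tf  # TensorFlow 1.x, as cited in the paper (Abadi et al., 2016)

LEARNING_RATE = 1e-3
TRAIN_STEPS = 300_000
BATCH_SIZE = 256  # in the paper this batch is spread across 32 Google Cloud TPU chips

def train(build_model_loss):
    """Single-device sketch of the quoted optimization settings."""
    loss = build_model_loss(batch_size=BATCH_SIZE)  # placeholder model-building fn
    global_step = tf.train.get_or_create_global_step()
    optimizer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE)
    train_op = optimizer.minimize(loss, global_step=global_step)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(TRAIN_STEPS):
            loss_value, _ = sess.run([loss, train_op])
            if step % 1000 == 0:
                print(f"step {step}: loss {loss_value:.4f}")
```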