Hierarchical Generative Modeling for Controllable Speech Synthesis
Authors: Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation demonstrates its ability to control the aforementioned attributes. In particular, we train a high-quality controllable TTS model on real found data, which is capable of inferring speaker and style attributes from a noisy utterance and use it to synthesize clean speech with controllable speaking style. The proposed model is extensively evaluated on four datasets with subjective and objective quantitative metrics, as well as comprehensive qualitative studies. |
| Researcher Affiliation | Collaboration | Wei-Ning Hsu¹ Yu Zhang² Ron J. Weiss² Heiga Zen² Yonghui Wu² Yuxuan Wang² Yuan Cao² Ye Jia² Zhifeng Chen² Jonathan Shen² Patrick Nguyen² Ruoming Pang² — ¹Massachusetts Institute of Technology, ²Google Inc. wnhsu@csail.mit.edu, {ngyuzh,ronw}@google.com |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | No | The paper provides a link to a demo page for audio samples (https://google.github.io/tacotron/publications/gmvae_controllable_tts) but no explicit statement or link for the open-source code of the methodology described in the paper. It mentions 'This dataset will be open-sourced soon.' (for a dataset, not code) but not the code for the method itself. |
| Open Datasets | Yes | We used an audiobook dataset derived from the same subset of LibriVox audiobooks used for the LibriSpeech corpus (Panayotov et al., 2015), but sampled at 24 kHz and segmented differently, making it appropriate for TTS instead of speech recognition. |
| Dataset Splits | No | The paper mentions 'train-clean-{100,360} partitions' and 'evaluation set' but does not provide specific percentages or counts for training, validation, and test splits for all datasets, nor does it refer to a citation for a standard split that includes all three for full reproducibility. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided. The paper mentions the 'Google Brain' team but no explicit hardware specifications. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma & Ba, 2015)' and refers to a 'WaveRNN-based neural vocoder (Kalchbrenner et al., 2018)', but it does not specify software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch versions, or other libraries). |
| Experiment Setup | Yes | The network is trained using the Adam optimizer (Kingma & Ba, 2015), configured with an initial learning rate of 10⁻³, and an exponential decay that halved the learning rate every 12.5k steps, beginning after 50k steps. Parameters of the network are initialized using Xavier initialization (Glorot & Bengio, 2010). A batch size of 256 is used for all experiments. A list of detailed hyperparameter settings can be found in Appendix C. Table 7: Prior hyperparameters for each dataset used in Section 4. |
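
The learning-rate schedule quoted above (initial rate 10⁻³, halved every 12.5k steps, decay beginning after 50k steps) can be sketched as a small helper. This is a hedged reconstruction, not the authors' code: the paper does not state whether the decay is continuous or staircase, so a continuous exponential form is assumed here, and the function name is illustrative.

```python
def learning_rate(step, init_lr=1e-3, decay_start=50_000, half_life=12_500):
    """Exponential learning-rate decay as described in the paper's setup.

    Assumed continuous form: the rate stays at `init_lr` until
    `decay_start` steps, then halves every `half_life` steps.
    """
    if step <= decay_start:
        return init_lr
    return init_lr * 0.5 ** ((step - decay_start) / half_life)


# Example values under this assumed schedule:
# step 50,000 -> 1e-3 (decay has not started)
# step 62,500 -> 5e-4 (one half-life after decay starts)
# step 75,000 -> 2.5e-4 (two half-lives)
```

A staircase variant (flooring the exponent) would match equally well given the paper's wording; the choice does not affect the halving milestones listed above.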