Hierarchical Generative Modeling for Controllable Speech Synthesis
Authors: Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation demonstrates its ability to control the aforementioned attributes. In particular, we train a high-quality controllable TTS model on real found data, which is capable of inferring speaker and style attributes from a noisy utterance and use it to synthesize clean speech with controllable speaking style. The proposed model is extensively evaluated on four datasets with subjective and objective quantitative metrics, as well as comprehensive qualitative studies. |
| Researcher Affiliation | Collaboration | Wei-Ning Hsu¹ Yu Zhang² Ron J. Weiss² Heiga Zen² Yonghui Wu² Yuxuan Wang² Yuan Cao² Ye Jia² Zhifeng Chen² Jonathan Shen² Patrick Nguyen² Ruoming Pang² — ¹Massachusetts Institute of Technology, ²Google Inc. wnhsu@csail.mit.edu, {ngyuzh,ronw}@google.com |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | No | The paper provides a link to a demo page for audio samples (https://google.github.io/tacotron/publications/gmvae_controllable_tts) but no explicit statement or link for the open-source code of the methodology described in the paper. It mentions 'This dataset will be open-sourced soon.' (for a dataset, not code) but not the code for the method itself. |
| Open Datasets | Yes | We used an audiobook dataset derived from the same subset of LibriVox audiobooks used for the LibriSpeech corpus (Panayotov et al., 2015), but sampled at 24 kHz and segmented differently, making it appropriate for TTS instead of speech recognition. |
| Dataset Splits | No | The paper mentions 'train-clean-{100,360} partitions' and 'evaluation set' but does not provide specific percentages or counts for training, validation, and test splits for all datasets, nor does it refer to a citation for a standard split that includes all three for full reproducibility. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided. The paper mentions the 'Google Brain' team but no explicit hardware specifications. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma & Ba, 2015)' and refers to a 'WaveRNN-based neural vocoder (Kalchbrenner et al., 2018)', but it does not specify software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch versions, or other libraries). |
| Experiment Setup | Yes | The network is trained using the Adam optimizer (Kingma & Ba, 2015), configured with an initial learning rate of 10⁻³, and an exponential decay that halved the learning rate every 12.5k steps, beginning after 50k steps. Parameters of the network are initialized using Xavier initialization (Glorot & Bengio, 2010). A batch size of 256 is used for all experiments. A list of detailed hyperparameter settings can be found in Appendix C. Table 7: Prior hyperparameters for each dataset used in Section 4. |
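
The learning-rate schedule quoted above (initial rate 10⁻³, halved every 12.5k steps, decay beginning after 50k steps) can be sketched as a small helper. This is a hedged reconstruction, not the authors' code: the paper does not state whether the decay is continuous or staircase, so a continuous exponential form is assumed here, and the function name is illustrative.

```python
def learning_rate(step, init_lr=1e-3, decay_start=50_000, half_life=12_500):
    """Exponential learning-rate decay as described in the paper's setup.

    Assumed continuous form: the rate stays at `init_lr` until
    `decay_start` steps, then halves every `half_life` steps.
    """
    if step <= decay_start:
        return init_lr
    return init_lr * 0.5 ** ((step - decay_start) / half_life)


# Example values under this assumed schedule:
# step 50,000 -> 1e-3 (decay has not started)
# step 62,500 -> 5e-4 (one half-life after decay starts)
# step 75,000 -> 2.5e-4 (two half-lives)
```

A staircase variant (flooring the exponent) would match equally well given the paper's wording; the choice does not affect the halving milestones listed above.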