GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Authors: Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting.
Researcher Affiliation | Collaboration | Rongjie Huang (Zhejiang University, rongjiehuang@zju.edu.cn); Yi Ren (Sea AI Lab, renyi@sea.com)
Pseudocode | Yes | See Algorithm 1 in Appendix B for the PyTorch-like pseudo-code.
Open Source Code | No | The checklist under 'If you ran experiments...' explicitly states '[No]' for 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?'.
Open Datasets | Yes | In the pre-training stage, we adopt the multi-emotion dataset IEMOCAP [35]... and the multi-speaker dataset VoxCeleb1 [32]... In the training stages, we utilize the LibriTTS [33] dataset... Additionally, we use part of the ESD database [54]
Dataset Splits | No | The paper does not explicitly provide percentages or absolute counts for the training, validation, and test splits in the main text. It mentions an 'OOD testing set' but no specific validation split.
Hardware Specification | Yes | After the 100,000 pre-training steps, we train GenerSpeech for 200,000 steps using 1 NVIDIA 2080Ti GPU with a batch size of 64 sentences.
Software Dependencies | No | The paper mentions using 'HiFi-GAN [22] (V1) as the vocoder' but does not specify versions for other key software components or libraries required for reproduction.
Experiment Setup | Yes | After the 100,000 pre-training steps, we train GenerSpeech for 200,000 steps using 1 NVIDIA 2080Ti GPU with a batch size of 64 sentences. Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. GenerSpeech consists of 4 feed-forward Transformer blocks for the phoneme encoder and mel-spectrogram decoder. The default size of the codebook in the vector quantization layer is set to 128.
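
The quoted setup maps onto a standard PyTorch configuration. Below is a minimal sketch, assuming the stock torch.optim.Adam API; the model object is a hypothetical stand-in, and only the Adam betas/epsilon, batch size, step counts, and codebook size come from the quoted text.

    # Minimal sketch of the reported training configuration (not the authors' code).
    # The model below is a hypothetical placeholder for the GenerSpeech network.
    import torch

    model = torch.nn.Linear(80, 80)  # stand-in module so the optimizer has parameters

    optimizer = torch.optim.Adam(
        model.parameters(),
        betas=(0.9, 0.98),  # beta_1 = 0.9, beta_2 = 0.98 as reported
        eps=1e-9,           # epsilon = 10^-9 as reported
    )

    PRETRAIN_STEPS = 100_000  # pre-training steps
    TRAIN_STEPS = 200_000     # subsequent GenerSpeech training steps
    BATCH_SIZE = 64           # sentences per batch (1 NVIDIA 2080Ti)
    CODEBOOK_SIZE = 128       # default vector-quantization codebook size

Note that the learning rate and any warm-up schedule are not stated in the quoted excerpt, so exact reproduction of training would still require those values.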