Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech
Authors: Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. |
| Researcher Affiliation | Collaboration | Rongjie Huang, Zhejiang University, EMAIL; Yi Ren, Sea AI Lab, EMAIL |
| Pseudocode | Yes | See Algorithm 1 in Appendix B for the PyTorch-like pseudo-code. |
| Open Source Code | No | The checklist under 'If you ran experiments...' explicitly states '[No]' for 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?'. |
| Open Datasets | Yes | In the pre-training stage, we adopt the multi-emotion dataset IEMOCAP [35]... and the multi-speaker dataset VoxCeleb1 [32]... In the training stages, we utilize the LibriTTS [33] dataset... Additionally, we use part of the ESD database [54] |
| Dataset Splits | No | The paper does not explicitly provide percentages or absolute counts for training, validation, and test splits in the main text. It mentions 'OOD testing set' but no specific validation split. |
| Hardware Specification | Yes | After the 100,000 pre-training steps, we train GenerSpeech for 200,000 steps using 1 NVIDIA 2080Ti GPU with a batch size of 64 sentences. |
| Software Dependencies | No | The paper mentions using 'HiFi-GAN [22] (V1) as the vocoder' but does not specify versions for other key software components or libraries required for reproduction. |
| Experiment Setup | Yes | After the 100,000 pre-training steps, we train GenerSpeech for 200,000 steps using 1 NVIDIA 2080Ti GPU with a batch size of 64 sentences. Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. GenerSpeech consists of 4 feed-forward Transformer blocks for the phoneme encoder and mel-spectrogram decoder. The default size of the codebook in the vector quantization layer is set to 128. |
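
For readers who want the Experiment Setup row in machine-readable form, the following is a minimal sketch that collects the quoted hyperparameters into a plain Python dataclass. The class name and every field name are hypothetical and do not come from the paper or its code. The Adam epsilon is assumed to be 10⁻⁹, the usual Transformer setting, and should be confirmed against the paper. The sketch only records values; it does not reproduce the model or the training loop.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""

    pretrain_steps: int = 100_000   # pre-training steps
    train_steps: int = 200_000      # training steps after pre-training
    batch_size: int = 64            # sentences per batch
    num_gpus: int = 1               # one NVIDIA 2080Ti
    adam_beta1: float = 0.9
    adam_beta2: float = 0.98
    adam_eps: float = 1e-9          # assumption: confirm the exponent against the paper
    fft_blocks: int = 4             # feed-forward Transformer blocks, as quoted
    vq_codebook_size: int = 128     # default vector-quantization codebook size

    @property
    def total_steps(self) -> int:
        # Pre-training and training steps combined.
        return self.pretrain_steps + self.train_steps


setup = TrainingSetup()
print(asdict(setup))
print("total steps:", setup.total_steps)
```

In a PyTorch setting, the three Adam fields would correspond to the `betas` and `eps` arguments of `torch.optim.Adam`; the learning rate is not stated in the row above, so it is deliberately left out.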