Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Authors: Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting.
Researcher Affiliation | Collaboration | Rongjie Huang, Zhejiang University, EMAIL; Yi Ren, Sea AI Lab, EMAIL
Pseudocode | Yes | See Algorithm 1 in Appendix B for the PyTorch-like pseudo-code.
Open Source Code | No | The checklist under 'If you ran experiments...' explicitly states '[No]' for 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?'.
Open Datasets | Yes | In the pre-training stage, we adopt the multi-emotion dataset IEMOCAP [35]... and the multi-speaker dataset VoxCeleb1 [32]... In the training stages, we utilize the LibriTTS [33] dataset... Additionally, we use part of the ESD database [54]
Dataset Splits | No | The paper does not explicitly provide percentages or absolute counts for training, validation, and test splits in the main text. It mentions an 'OOD testing set' but no specific validation split.
Hardware Specification | Yes | After the 100,000 pre-training steps, we train GenerSpeech for 200,000 steps using 1 NVIDIA 2080Ti GPU with a batch size of 64 sentences.
Software Dependencies | No | The paper mentions using 'HiFi-GAN [22] (V1) as the vocoder' but does not specify versions for other key software components or libraries required for reproduction.
Experiment Setup | Yes | After the 100,000 pre-training steps, we train GenerSpeech for 200,000 steps using 1 NVIDIA 2080Ti GPU with a batch size of 64 sentences. Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. GenerSpeech consists of 4 feed-forward Transformer blocks for the phoneme encoder and mel-spectrogram decoder. The default size of the codebook in the vector quantization layer is set to 128.
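The Experiment Setup row above reports two concrete configuration details: the Adam hyperparameters (β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹) and a vector-quantization codebook of size 128. As a minimal sketch of what that VQ layer's core lookup does, the snippet below maps a latent vector to its nearest codeword by Euclidean distance. All names here (`quantize`, `EMBED_DIM`, the random codebook) are illustrative assumptions, not the authors' code; the embedding dimension in particular is not stated in this excerpt.

```python
import math
import random

# Hyperparameters as reported in the paper's setup (sketch only).
ADAM_BETAS = (0.9, 0.98)
ADAM_EPS = 1e-9
CODEBOOK_SIZE = 128   # default VQ codebook size per the paper
EMBED_DIM = 16        # illustrative assumption; not given in the excerpt


def quantize(vector, codebook):
    """Return the index of the nearest codeword (Euclidean distance),
    i.e. the core lookup performed by a vector-quantization layer."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda i: dist(vector, codebook[i]))


# Build a random codebook and quantize one latent vector.
random.seed(0)
codebook = [[random.gauss(0, 1) for _ in range(EMBED_DIM)]
            for _ in range(CODEBOOK_SIZE)]
latent = [random.gauss(0, 1) for _ in range(EMBED_DIM)]
idx = quantize(latent, codebook)
assert 0 <= idx < CODEBOOK_SIZE
```

In a trained model the codebook entries are learned (e.g. via a straight-through estimator) rather than random; this sketch only shows the nearest-neighbour assignment step.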