GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Authors: Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting.
Researcher Affiliation | Collaboration | Rongjie Huang (Zhejiang University, rongjiehuang@zju.edu.cn); Yi Ren (Sea AI Lab, renyi@sea.com)
Pseudocode | Yes | See Algorithm 1 in Appendix B for the PyTorch-like pseudo-code.
Open Source Code | No | The checklist under 'If you ran experiments...' explicitly states '[No]' for 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?'.
Open Datasets | Yes | In the pre-training stage, we adopt the multi-emotion dataset IEMOCAP [35]... and the multi-speaker dataset VoxCeleb1 [32]... In the training stages, we utilize the LibriTTS [33] dataset... Additionally, we use part of the ESD database [54]
Dataset Splits | No | The paper does not explicitly provide percentages or absolute counts for the training, validation, and test splits in the main text. It mentions an 'OOD testing set' but no specific validation split.
Hardware Specification | Yes | After the 100,000 pre-training steps, we train GenerSpeech for 200,000 steps using 1 NVIDIA 2080Ti GPU with a batch size of 64 sentences.
Software Dependencies | No | The paper mentions using 'HiFi-GAN [22] (V1) as the vocoder' but does not specify versions for other key software components or libraries required for reproduction.
Experiment Setup | Yes | After the 100,000 pre-training steps, we train GenerSpeech for 200,000 steps using 1 NVIDIA 2080Ti GPU with a batch size of 64 sentences. Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. GenerSpeech consists of 4 feed-forward Transformer blocks for the phoneme encoder and mel-spectrogram decoder. The default size of the codebook in the vector quantization layer is set to 128.
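
The quoted setup maps onto a standard PyTorch configuration. Below is a minimal sketch, assuming the stock torch.optim.Adam API; the model object is a hypothetical stand-in, and only the Adam betas/epsilon, batch size, step counts, and codebook size come from the quoted text.

    # Minimal sketch of the reported training configuration (not the authors' code).
    # The model below is a hypothetical placeholder for the GenerSpeech network.
    import torch

    model = torch.nn.Linear(80, 80)  # stand-in module so the optimizer has parameters

    optimizer = torch.optim.Adam(
        model.parameters(),
        betas=(0.9, 0.98),  # beta_1 = 0.9, beta_2 = 0.98 as reported
        eps=1e-9,           # epsilon = 10^-9 as reported
    )

    PRETRAIN_STEPS = 100_000  # pre-training steps
    TRAIN_STEPS = 200_000     # subsequent GenerSpeech training steps
    BATCH_SIZE = 64           # sentences per batch (1 NVIDIA 2080Ti)
    CODEBOOK_SIZE = 128       # default vector-quantization codebook size

Note that the learning rate and any warm-up schedule are not stated in the quoted excerpt, so exact reproduction of training would still require those values.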