GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech
Authors: Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. |
| Researcher Affiliation | Collaboration | Rongjie Huang (Zhejiang University, rongjiehuang@zju.edu.cn); Yi Ren (Sea AI Lab, renyi@sea.com) |
| Pseudocode | Yes | See Algorithm 1 in Appendix B for the PyTorch-like pseudo-code. |
| Open Source Code | No | The checklist under 'If you ran experiments...' explicitly states '[No]' for 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?'. |
| Open Datasets | Yes | In the pre-training stage, we adopt the multi-emotion dataset IEMOCAP [35]... and the multi-speaker dataset VoxCeleb1 [32]... In the training stages, we utilize the LibriTTS [33] dataset... Additionally, we use part of the ESD database [54] |
| Dataset Splits | No | The paper does not explicitly provide percentages or absolute counts for training, validation, and test splits in the main text. It mentions an 'OOD testing set' but gives no specific validation split. |
| Hardware Specification | Yes | After the 100,000 pre-training steps, we train GenerSpeech for 200,000 steps using 1 NVIDIA 2080Ti GPU with a batch size of 64 sentences. |
| Software Dependencies | No | The paper mentions using 'HiFi-GAN [22] (V1) as the vocoder' but does not specify versions for other key software components or libraries required for reproduction. |
| Experiment Setup | Yes | After the 100,000 pre-training steps, we train GenerSpeech for 200,000 steps using 1 NVIDIA 2080Ti GPU with a batch size of 64 sentences. Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. GenerSpeech consists of 4 feed-forward Transformer blocks for the phoneme encoder and mel-spectrogram decoder. The default size of the codebook in the vector quantization layer is set to 128. |
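
For illustration, the hyperparameters quoted under Experiment Setup map onto a training step as sketched below. Since the authors released no code, this is a minimal PyTorch sketch and not their implementation: the `ToyGenerSpeech` module, the L1 mel loss, and the dummy tensors are assumptions; only the optimizer settings (β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹), the 4 Transformer blocks per side, the codebook size of 128, and the batch size of 64 come from the paper.

```python
import torch
from torch import nn
import torch.nn.functional as F


class ToyGenerSpeech(nn.Module):
    """Toy stand-in built from the quoted structural hyperparameters.
    The real GenerSpeech architecture was not released; this is NOT it."""

    def __init__(self, d_model=256, n_blocks=4, codebook_size=128, n_mels=80):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # 4 feed-forward Transformer blocks for the phoneme encoder and
        # 4 for the mel-spectrogram decoder, per the quoted setup.
        self.encoder = nn.TransformerEncoder(block, num_layers=n_blocks)
        self.decoder = nn.TransformerEncoder(block, num_layers=n_blocks)
        # Vector-quantization codebook with the default size of 128 entries.
        self.codebook = nn.Embedding(codebook_size, d_model)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_emb):
        h = self.encoder(phoneme_emb)
        # Nearest-neighbour quantization with a straight-through estimator
        # (a generic VQ sketch, not necessarily the paper's exact layer).
        dists = torch.cdist(h, self.codebook.weight)
        q = self.codebook(dists.argmin(dim=-1))
        h = h + (q - h).detach()
        return self.to_mel(self.decoder(h))


model = ToyGenerSpeech()
# Adam with beta1 = 0.9, beta2 = 0.98, eps = 1e-9, as quoted from the paper.
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-9)

# One dummy step with a batch of 64 "sentences" (random tensors here); the
# paper trains for 200,000 such steps after 100,000 pre-training steps.
phonemes = torch.randn(64, 20, 256)  # (batch, phoneme length, d_model)
target = torch.randn(64, 20, 80)     # (batch, frames, mel bins), dummy target
loss = F.l1_loss(model(phonemes), target)  # loss choice is an assumption
optimizer.zero_grad()
loss.backward()
optimizer.step()
```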