StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis
Authors: Yu Zhang, Rongjie Huang, Ruiqi Li, Jinzheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in zero-shot style transfer show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. We train our model for 20000 steps using 1 NVIDIA 2080Ti GPU. Adam optimizer is used with β1 = 0.9, β2 = 0.98. It takes about 24 hours for training on 1 NVIDIA 2080Ti GPU. In our experimental analysis, we employ both objective and subjective evaluation metrics to assess the synthesis quality and style similarity of the test set. |
| Researcher Affiliation | Collaboration | Yu Zhang1, Rongjie Huang1, Ruiqi Li1, Jinzheng He1, Yan Xia1, Feiyang Chen2, Xinyu Duan2, Baoxing Huai2, Zhou Zhao1* | 1 Zhejiang University, 2 Huawei Cloud |
| Pseudocode | Yes | For the pseudo-code of the algorithm, please refer to Algorithm 1 provided in Appendix B. |
| Open Source Code | No | The paper states: "Access to singing voice samples can be found at https://stylesinger.github.io/". This link is for samples, not code. Furthermore, the ethics statement mentions: "Therefore, we will impose restrictions on our code and models to prevent unauthorized usage." This explicitly indicates that the code is not open-source. |
| Open Datasets | Yes | Additionally, to include more acoustic variation, we incorporate the M4Singer dataset (Zhang et al. 2022a) (including 20 singers and 30 hours), which is used under license CC BY-NC-SA 4.0. |
| Dataset Splits | No | The paper mentions collecting and annotating a Chinese song corpus and incorporating the M4Singer dataset. It states that "20 sentences with unseen styles" were used to construct the OOD testing set. However, it provides neither explicit training/validation/test split percentages nor counts for the full dataset used in training, and it does not describe how validation was performed beyond the general evaluation metrics. |
| Hardware Specification | Yes | We train our model for 20000 steps using 1 NVIDIA 2080Ti GPU. It takes about 24 hours for training on 1 NVIDIA 2080Ti GPU. |
| Software Dependencies | No | The paper mentions using "pypinyin to convert Chinese lyrics into phonemes" and "parselmouth (Jadoul, Thompson, and De Boer 2018) to extract f0 information", but it does not specify version numbers for these components or for any other key libraries or frameworks. (A usage sketch of both tools appears after the table.) |
| Experiment Setup | Yes | We train our model for 20000 steps using 1 NVIDIA 2080Ti GPU. Adam optimizer is used with β1 = 0.9, β2 = 0.98. It takes about 24 hours for training on 1 NVIDIA 2080Ti GPU. We utilize pypinyin to convert Chinese lyrics into phonemes. We extract mel-spectrograms from raw waveforms and set the sample rate to 48000Hz, the window size to 1024, the hop size to 256, and the number of mel bins to 80. The default size of the codebook in the RQ is set to 128, and the depth of the RQ is 4. (Illustrative sketches of the feature extraction and the RQ configuration appear after the table.) |
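The preprocessing tools named in the Software Dependencies row have small, stable APIs, so the two steps are easy to illustrate. The sketch below is a hedged example: the lyric string, the input file `sample.wav`, the tone-numbered pinyin style, and the pitch time step are all illustrative assumptions, not choices the paper reports.

```python
# Sketch of the preprocessing the paper describes: pypinyin for
# lyric-to-phoneme conversion, parselmouth (Praat) for f0 extraction.
import parselmouth
from pypinyin import Style, lazy_pinyin

# Convert Chinese lyrics to tone-numbered pinyin (one phoneme-like
# representation; the paper does not specify the exact pinyin style).
lyrics = "小小的天有大大的梦想"  # illustrative lyric
phonemes = lazy_pinyin(lyrics, style=Style.TONE3)
print(phonemes)  # e.g. ['xiao3', 'xiao3', 'de', 'tian1', ...]

# Extract an f0 contour with Praat via parselmouth.
snd = parselmouth.Sound("sample.wav")        # hypothetical input file
pitch = snd.to_pitch(time_step=256 / 48000)  # assumed: match the 256-sample hop
f0 = pitch.selected_array["frequency"]       # Hz; 0.0 in unvoiced frames
```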
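The feature-extraction hyperparameters in the Experiment Setup row are concrete enough to sketch. The example below uses librosa, which is an assumption on our part; the paper does not name its feature-extraction library, and the log compression is a common convention rather than a reported detail.

```python
# Minimal mel-spectrogram extraction with the reported hyperparameters:
# 48000 Hz sample rate, 1024-sample window, 256-sample hop, 80 mel bins.
import librosa
import numpy as np

wav, sr = librosa.load("sample.wav", sr=48000)  # resample to 48 kHz
mel = librosa.feature.melspectrogram(
    y=wav,
    sr=sr,
    n_fft=1024,      # window size from the paper
    hop_length=256,  # hop size from the paper
    n_mels=80,       # number of mel bins from the paper
)
log_mel = np.log(np.clip(mel, 1e-5, None))  # assumed log compression
```

The optimizer settings likewise reduce to a single PyTorch line, `torch.optim.Adam(model.parameters(), betas=(0.9, 0.98))`; the learning rate is not reported in the quoted excerpts.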
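The RQ configuration (codebook size 128, depth 4) can also be illustrated. The sketch below is a toy residual quantizer, not the authors' implementation: the codebooks are random instead of learned, and the latent dimension is an invented value.

```python
# Toy residual quantization (RQ): each of DEPTH codebooks quantizes the
# residual left by the previous levels, so reconstruction improves with depth.
import torch

NUM_CODES, DEPTH, DIM = 128, 4, 256  # 128 and 4 from the paper; DIM assumed

codebooks = [torch.randn(NUM_CODES, DIM) for _ in range(DEPTH)]  # random, not learned

def rq_encode(z: torch.Tensor) -> tuple[list[torch.Tensor], torch.Tensor]:
    """Quantize z of shape (batch, DIM); return per-level indices and z_hat."""
    residual = z
    quantized = torch.zeros_like(z)
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)  # (batch, NUM_CODES) L2 distances
        idx = dists.argmin(dim=-1)         # nearest codeword per sample
        chosen = cb[idx]                   # (batch, DIM)
        quantized = quantized + chosen     # accumulate the approximation
        residual = residual - chosen       # pass the residual to the next level
        indices.append(idx)
    return indices, quantized

codes, z_hat = rq_encode(torch.randn(8, DIM))  # 8 random latents as a demo
```

In the reported configuration, each latent vector would thus be represented by 4 indices, one per level, into 128-entry codebooks.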