EditSinger: Zero-Shot Text-Based Singing Voice Editing System with Diverse Prosody Modeling
Authors: Lichao Zhang, Zhou Zhao, Yi Ren, Liqun Deng
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments conducted on OpenSinger prove that EditSinger can synthesize high-quality edited singing voices with natural prosody according to the corresponding operations. We conduct experiments on the singing dataset OpenSinger [Huang et al., 2021], which consists of 50 hours of Chinese singing voices recorded in a professional recording studio, and split OpenSinger randomly by singer into the test set (songs of 3 females and 3 males) and the training set (all the songs of the remaining singers). |
| Researcher Affiliation | Collaboration | Lichao Zhang (1), Zhou Zhao (1), Yi Ren (1), Liqun Deng (2); (1) Zhejiang University, (2) Huawei Noah's Ark Lab |
| Pseudocode | No | The paper provides architectural diagrams (Figure 1, Figure 2) but does not include pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper states: "Audio samples can be listened in https://editsinger.github.io/." This link is for audio samples, not the source code for the methodology. There is no explicit statement about code release. |
| Open Datasets | Yes | We conduct experiments on the singing dataset OpenSinger [Huang et al., 2021], which consists of 50 hours of Chinese singing voices recorded in a professional recording studio, and split OpenSinger randomly by singer into the test set (songs of 3 females and 3 males) and the training set (all the songs of the remaining singers). |
| Dataset Splits | No | The paper splits the data into a 'test set' and a 'training set' but does not specify a separate validation set or exact split percentages for training, validation, and test (a sketch of the described singer-level split follows this table). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions tools and models such as "FastSpeech 2", "Parallel WaveGAN (PWG)", "Pypinyin", "Montreal Forced Aligner (MFA)", "Parselmouth", and "Resemblyzer", but it does not specify version numbers for any of these software components (see the preprocessing sketch after this table). |
| Experiment Setup | Yes | We stack 4 feed-forward Transformer blocks in both the encoder and decoder of the acoustic model and set the hidden size to 256, and the same configuration is also used in the MVA and F0 predictor of FPIP. The V/UV predictor consists of a 4-layer 1D-convolutional network. We minimize the MAE and SSIM [Wang et al., 2004] loss between the output mel-spectrograms and the ground truth mel-spectrograms to optimize the phoneme encoder and mel decoder. We randomly mask out 15% of the words in the lyrics... λ is a hyperparameter that weighs the importance of the three terms, which are all set to 1 in our experiments. (A configuration sketch collecting these stated values follows the table.) |
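
The split quoted under *Open Datasets* is by singer rather than by utterance, which matters for reproduction: held-out singers are never seen during training. Below is a minimal sketch of such a singer-level split, assuming the data is available as `(singer_id, wav_path)` pairs with explicit female/male ID lists; the helper name, data layout, and seed are our assumptions, not the paper's.

```python
import random

def split_by_singer(utterances, female_ids, male_ids, seed=0):
    """Singer-level split as described in the paper: the songs of 3 female
    and 3 male singers form the test set; all remaining singers form the
    training set. `utterances` is assumed to be (singer_id, wav_path) pairs."""
    rng = random.Random(seed)
    test_singers = set(rng.sample(female_ids, 3) + rng.sample(male_ids, 3))
    train = [u for u in utterances if u[0] not in test_singers]
    test = [u for u in utterances if u[0] in test_singers]
    return train, test
```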
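The tools listed under *Software Dependencies* are Python packages with stable public APIs, even though the paper pins no versions. Below is a minimal sketch of the two preprocessing calls most relevant to reproduction, assuming default parameters (the paper states neither a frame shift nor a pitch range for Praat's tracker).

```python
import parselmouth                # Praat wrapper named in the paper (version unspecified)
from pypinyin import lazy_pinyin  # lyric-to-pinyin conversion named in the paper

def extract_f0(wav_path):
    """Frame-level F0 from Praat's pitch tracker via parselmouth.
    Unvoiced frames are reported as 0 Hz, which also yields V/UV labels."""
    pitch = parselmouth.Sound(wav_path).to_pitch()
    f0 = pitch.selected_array['frequency']
    return f0, f0 > 0             # (F0 contour in Hz, voiced-frame mask)

def lyrics_to_pinyin(lyrics):
    """Convert Chinese lyrics to pinyin syllables before phonemization."""
    return lazy_pinyin(lyrics)
```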
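Finally, the values quoted under *Experiment Setup* can be collected into a single configuration block, which makes the stated hyperparameters easy to audit. A minimal sketch, assuming PyTorch; the key names, the `ssim_fn` placeholder, and the `1 - SSIM` convention for turning a similarity into a loss are our assumptions.

```python
import torch.nn.functional as F

# Values quoted from the paper; the key names are our shorthand.
CONFIG = dict(
    encoder_fft_blocks=4,          # feed-forward Transformer blocks, encoder
    decoder_fft_blocks=4,          # and decoder
    hidden_size=256,               # also used in the MVA and F0 predictor of FPIP
    vuv_conv_layers=4,             # 1D-convolutional V/UV predictor
    word_mask_ratio=0.15,          # fraction of lyric words masked in training
    loss_weights=(1.0, 1.0, 1.0),  # the three lambda-weighted terms, all set to 1
)

def mel_loss(mel_pred, mel_gt, ssim_fn, lam=1.0):
    """MAE + SSIM loss on mel-spectrograms, per the quoted setup.
    `ssim_fn` is a placeholder for any SSIM implementation; since SSIM is a
    similarity in [0, 1], we assume it enters the loss as 1 - SSIM."""
    return F.l1_loss(mel_pred, mel_gt) + lam * (1.0 - ssim_fn(mel_pred, mel_gt))
```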