DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Authors: Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Zhou Zhao
AAAI 2022, pp. 11020–11028
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work. Extensional experiments also prove the generalization of our methods on text-to-speech task (DiffSpeech). |
| Researcher Affiliation | Academia | Jinglin Liu*, Chengxi Li*, Yi Ren*, Feiyang Chen, Zhou Zhao Zhejiang University {jinglinliu,chengxili,rayeren,zhaozhou}@zju.edu.cn, chenfeiyangai@gmail.com |
| Pseudocode | Yes | Algorithm 1: Training procedure of DiffSinger. Algorithm 2: Inference procedure of DiffSinger. |
| Open Source Code | Yes | Codes: https://github.com/MoonInTheRiver/DiffSinger. |
| Open Datasets | Yes | Since there is no publicly available high-quality unaccompanied singing dataset, we collect and annotate a Chinese Mandarin pop songs dataset: PopCS, to evaluate our methods. [...] The codes accompanied with the access to PopCS are in https://github.com/MoonInTheRiver/DiffSinger. [...] We conduct the extensional experiments on LJSpeech dataset (Ito and Johnson 2017), which contains 13,100 English audio clips (total 24 hours) with corresponding transcripts. |
| Dataset Splits | Yes | We randomly choose 2 songs for validation and testing. [...] We follow the train-val-test dataset splits, the pre-processing of mel-spectrograms, and the grapheme-to-phoneme tool in FastSpeech 2. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It only mentions general setups like using a vocoder. |
| Software Dependencies | No | The paper mentions software components and tools such as pypinyin, Parselmouth, and the Parallel WaveGAN (PWG) vocoder, but it does not specify version numbers for these dependencies. |
| Experiment Setup | Yes | The channel size C mentioned before is set to 256. In the denoiser, the number of convolution layers N is 20 with the kernel size 3, and we set the dilation to 1 (without dilation) at each layer. We set T to 100 and β to constants increasing linearly from β1 = 10⁻⁴ to βT = 0.06. The auxiliary decoder has the same setting as the mel-spectrogram decoder in FastSpeech 2. In the boundary predictor, the number of convolutional layers is 5, and the threshold is set to 0.4 empirically. |
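The quoted experiment setup fully specifies the diffusion noise schedule (T = 100 steps, betas increasing linearly from 1e-4 to 0.06), which is enough to reconstruct the forward-process constants. A minimal sketch, assuming the standard DDPM formulation; variable names here are ours, not taken from the authors' code:

```python
import numpy as np

# Linear beta schedule as quoted in the paper's setup:
# T = 100 steps, beta_1 = 1e-4, beta_T = 0.06.
T = 100
betas = np.linspace(1e-4, 0.06, T)

# Standard DDPM derived quantities (an assumption about the exact
# formulation, since the quote only gives the schedule itself):
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # used in q(x_t | x_0) for the forward process

# With this schedule, alpha_bar decays close to zero by step T,
# i.e. x_T is nearly pure Gaussian noise.
print(f"beta range: [{betas[0]:.1e}, {betas[-1]:.2f}], "
      f"alpha_bar_T = {alpha_bar[-1]:.4f}")
```

The small T (100 vs. the 1000 common in image diffusion) is consistent with the paper's shallow diffusion mechanism, which starts denoising from an intermediate step rather than from pure noise.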