Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Authors: Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Zhou Zhao | Pages: 11020-11028
AAAI 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work. Extensional experiments also prove the generalization of our methods on text-to-speech task (DiffSpeech). |
| Researcher Affiliation | Academia | Jinglin Liu*, Chengxi Li*, Yi Ren*, Feiyang Chen, Zhou Zhao Zhejiang University EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Training procedure of DiffSinger. Algorithm 2: Inference procedure of DiffSinger. |
| Open Source Code | Yes | Codes: https://github.com/MoonInTheRiver/DiffSinger. |
| Open Datasets | Yes | Since there is no publicly available high-quality unaccompanied singing dataset, we collect and annotate a Chinese Mandarin pop songs dataset: PopCS, to evaluate our methods. [...] The codes accompanied with the access to PopCS are in https://github.com/MoonInTheRiver/DiffSinger. [...] We conduct the extensional experiments on LJSpeech dataset (Ito and Johnson 2017), which contains 13,100 English audio clips (total 24 hours) with corresponding transcripts. |
| Dataset Splits | Yes | We randomly choose 2 songs for validation and testing. [...] We follow the train-val-test dataset splits, the pre-processing of mel-spectrograms, and the grapheme-to-phoneme tool in FastSpeech 2. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It only mentions general setups like using a vocoder. |
| Software Dependencies | No | The paper mentions software components and tools such as pypinyin, Parselmouth, and the Parallel WaveGAN (PWG) vocoder, but it does not specify version numbers for these dependencies. |
| Experiment Setup | Yes | The channel size C mentioned before is set to 256. In the denoiser, the number of convolution layers N is 20 with the kernel size 3, and we set the dilation to 1 (without dilation) at each layer. We set T to 100 and β to constants increasing linearly from β1 = 10⁻⁴ to βT = 0.06. The auxiliary decoder has the same setting as the mel-spectrogram decoder in FastSpeech 2. In the boundary predictor, the number of convolutional layers is 5, and the threshold is set to 0.4 empirically. |
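
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration, with the linear noise schedule computed from them. The following is only a minimal sketch, not the authors' code: the class name `DenoiserConfig`, the field names, and both helper functions are hypothetical, and the cumulative product of (1 - β) follows the usual DDPM convention, which is assumed here and not quoted in the row above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenoiserConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    channels: int = 256         # channel size C
    conv_layers: int = 20       # number of convolution layers N
    kernel_size: int = 3
    dilation: int = 1           # 1 means no dilation
    diffusion_steps: int = 100  # T
    beta_start: float = 1e-4    # beta_1
    beta_end: float = 0.06      # beta_T


def linear_betas(cfg: DenoiserConfig) -> list[float]:
    """Return T values increasing linearly from beta_start to beta_end."""
    steps = cfg.diffusion_steps
    delta = (cfg.beta_end - cfg.beta_start) / (steps - 1)
    return [cfg.beta_start + i * delta for i in range(steps)]


def cumulative_alphas(betas: list[float]) -> list[float]:
    """Running product of (1 - beta_t), assuming the standard DDPM convention."""
    products = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products


cfg = DenoiserConfig()
betas = linear_betas(cfg)
alpha_bars = cumulative_alphas(betas)
```

Here `alpha_bars[t]` is the fraction of signal variance remaining after `t + 1` forward diffusion steps under that assumed convention. A reproduction attempt could compare such a schedule against the one in the released code before training.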