DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

Authors: Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Zhou Zhao (pp. 11020-11028)

AAAI 2022

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "The evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work. Extensional experiments also prove the generalization of our methods on text-to-speech task (DiffSpeech)."
Researcher Affiliation | Academia | "Jinglin Liu*, Chengxi Li*, Yi Ren*, Feiyang Chen, Zhou Zhao. Zhejiang University. {jinglinliu,chengxili,rayeren,zhaozhou}@zju.edu.cn, chenfeiyangai@gmail.com"
Pseudocode | Yes | "Algorithm 1: Training procedure of DiffSinger. Algorithm 2: Inference procedure of DiffSinger."
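The row above cites the paper's Algorithm 1 (training) and Algorithm 2 (inference) without reproducing them. As an illustration only, a generic DDPM-style noise-regression training step, the family DiffSinger's denoiser is trained in, can be sketched as follows. This is not the paper's exact algorithm: `toy_denoiser` is a hypothetical stand-in for the conditional denoiser network, and the use of NumPy is an assumption (the paper does not mandate a framework).

```python
import numpy as np

# Noise schedule reported in the paper: T = 100, betas linear from 1e-4 to 0.06.
T = 100
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.06, T))

def toy_denoiser(x_t, t):
    """Hypothetical stand-in for the paper's conditional denoiser.

    A trained model would predict the Gaussian noise added at step t;
    this placeholder just returns zeros so the sketch is runnable.
    """
    return np.zeros_like(x_t)

def ddpm_training_step(x0, rng):
    """One generic DDPM training step: corrupt x0 to a random step t,
    then regress the injected noise with a squared-error loss."""
    t = int(rng.integers(0, T))
    eps = rng.standard_normal(x0.shape)
    # Closed-form forward diffusion: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - toy_denoiser(x_t, t)) ** 2)

rng = np.random.default_rng(0)
loss = ddpm_training_step(np.zeros((80, 20)), rng)  # dummy 80-bin x 20-frame "mel"
```

In the actual paper the denoiser is additionally conditioned on the music-score encoding, and the shallow diffusion mechanism starts inference from an intermediate step predicted by the boundary predictor rather than from pure noise; neither detail is modeled in this sketch.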
Open Source Code | Yes | "Codes: https://github.com/MoonInTheRiver/DiffSinger."
Open Datasets | Yes | "Since there is no publicly available high-quality unaccompanied singing dataset, we collect and annotate a Chinese Mandarin pop songs dataset: PopCS, to evaluate our methods. [...] The codes accompanied with the access to PopCS are in https://github.com/MoonInTheRiver/DiffSinger. [...] We conduct the extensional experiments on LJSpeech dataset (Ito and Johnson 2017), which contains 13,100 English audio clips (total 24 hours) with corresponding transcripts."
Dataset Splits | Yes | "We randomly choose 2 songs for validation and testing. [...] We follow the train-val-test dataset splits, the pre-processing of mel-spectrograms, and the grapheme-to-phoneme tool in FastSpeech 2."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) for its experiments; it only mentions general components such as the vocoder.
Software Dependencies | No | The paper mentions software components and tools such as pypinyin, Parselmouth, and the Parallel WaveGAN (PWG) vocoder, but it does not specify version numbers for these dependencies.
Experiment Setup | Yes | "The channel size C mentioned before is set to 256. In the denoiser, the number of convolution layers N is 20 with the kernel size 3, and we set the dilation to 1 (without dilation) at each layer. We set T to 100 and β to constants increasing linearly from β1 = 10^-4 to βT = 0.06. The auxiliary decoder has the same setting as the mel-spectrogram decoder in FastSpeech 2. In the boundary predictor, the number of convolutional layers is 5, and the threshold is set to 0.4 empirically."
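The noise schedule quoted in this row is fully specified (T = 100, β linear from 10^-4 to 0.06), so it can be reproduced directly. A minimal sketch, assuming NumPy (the paper does not name a framework):

```python
import numpy as np

T = 100                              # number of diffusion steps
betas = np.linspace(1e-4, 0.06, T)   # linearly increasing beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative product, used to corrupt x0 at step t

# As t grows, alpha_bar_t shrinks, i.e. the corrupted sample retains less signal.
```

Because βT = 0.06 is comparatively large and T is only 100, the schedule drives `alpha_bars` close to zero by the final step, which is the usual precondition for sampling to start from near-pure noise.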