UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

Authors: Yi Lei, Shan Yang, Xinsheng Wang, Qicong Xie, Jixun Yao, Lei Xie, Dan Su

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voice without corresponding training data. The proposed approach outperforms the state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn.
Researcher Affiliation | Collaboration | 1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China; 2 Tencent AI Lab, China
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | Audio samples are available at https://leiyi420.github.io/UniSyn (this link provides audio samples, not source code for the methodology).
Open Datasets | Yes | 1) Opencpop (Wang et al. 2022), an open-source singing corpus... (https://wenet.org.cn/opencpop/); 2) an open-source Mandarin TTS dataset recorded from a female speaker... (https://www.data-baker.com/open_source.html)
Dataset Splits | Yes | To balance the amount of data for speech and singing, we randomly select about 2 hours of audio from each speaker for training. For validation and evaluation, 100 utterances from the remaining data and two held-out songs from each singing corpus are used (see the data-split sketch below the table).
Hardware Specification | Yes | All the above models are trained with 4 NVIDIA V100 GPUs for fair comparison.
Software Dependencies | No | The paper mentions tools such as WORLD, NANSY, Praat, and an HMM-based forced alignment model, but does not provide version numbers for them or for any deep learning frameworks (e.g., PyTorch, TensorFlow) and their dependencies (a hedged WORLD F0-extraction sketch follows the table).
Experiment Setup | Yes | We set α = 60, β = 12, γ = 1.5, λ = 10, µ = 0.02, η = 2, θ = 2, and ϕ = 1.5 in our model empirically. We downsample all speech and singing audio to 24 kHz, and set the frame size and hop size to 1200 and 300, respectively, when extracting optional auxiliary acoustic features such as pitch and spectrogram (see the feature-extraction sketch below the table).
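
The dataset-split row above describes the selection only in prose. Below is a minimal sketch of one way to reproduce such a split: it draws roughly 2 hours of training audio per speaker and holds out 100 utterances for validation/evaluation. The `(path, duration)` metadata format, the `split_speaker` helper, and the random seed are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical data split: ~2 hours of training audio per speaker/singer,
# 100 held-out utterances for validation; the rest is kept for evaluation.
import random

def split_speaker(utterances, train_hours=2.0, n_valid=100, seed=42):
    """utterances: list of (path, duration_in_seconds) for one speaker or singer."""
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)

    train, rest = [], []
    budget = train_hours * 3600.0          # remaining training budget in seconds
    for path, dur in shuffled:
        if dur <= budget:
            train.append(path)
            budget -= dur
        else:
            rest.append(path)

    valid = rest[:n_valid]                 # 100 utterances from the remaining data
    test = rest[n_valid:]                  # leftover material for evaluation
    return train, valid, test
```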
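The paper names WORLD among its signal-processing tools without specifying a library or version. The sketch below shows one common way to extract F0 with WORLD through the `pyworld` Python binding; the binding choice, the input file, and the 12.5 ms frame period (hop size 300 at 24 kHz) are assumptions rather than details confirmed by the paper.

```python
# Illustrative WORLD-based F0 extraction via the pyworld binding (assumed,
# not specified in the paper). Expects a mono waveform.
import numpy as np
import soundfile as sf
import pyworld as pw

wav, sr = sf.read("sample_24k.wav")        # hypothetical 24 kHz mono file
wav = wav.astype(np.float64)               # pyworld operates on float64 arrays

f0, timeaxis = pw.dio(wav, sr, frame_period=12.5)   # coarse F0 track, 12.5 ms hop
f0 = pw.stonemask(wav, f0, timeaxis, sr)            # refine the DIO estimate
```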
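The numeric settings in the experiment-setup row can be summarized in a small front-end sketch: resample to 24 kHz and compute a spectrogram with frame size 1200 and hop size 300. Only those numbers come from the paper; the use of `librosa`, the file path, and the config-dict layout are assumptions for illustration.

```python
# Audio front end matching the reported settings: 24 kHz sampling rate,
# frame (window) size 1200 samples, hop size 300 samples.
import numpy as np
import librosa

SR, FRAME_SIZE, HOP_SIZE = 24000, 1200, 300

wav, _ = librosa.load("utterance.wav", sr=SR)        # load and resample to 24 kHz
spec = np.abs(librosa.stft(wav,
                           n_fft=FRAME_SIZE,
                           hop_length=HOP_SIZE,
                           win_length=FRAME_SIZE))   # linear-magnitude spectrogram

# Loss/term weights reported in the paper, gathered into one place for clarity.
loss_weights = dict(alpha=60, beta=12, gamma=1.5, lam=10,
                    mu=0.02, eta=2, theta=2, phi=1.5)
```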