UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis
Authors: Yi Lei, Shan Yang, Xinsheng Wang, Qicong Xie, Jixun Yao, Lei Xie, Dan Su
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voice without corresponding training data. The proposed approach outperforms the state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn. |
| Researcher Affiliation | Collaboration | (1) Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China; (2) Tencent AI Lab, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Audio samples are available at: https://leiyi420.github.io/UniSyn (This link is for audio samples, not source code for the methodology.) |
| Open Datasets | Yes | 1) the Opencpop (Wang et al. 2022), an open-source singing corpus... https://wenet.org.cn/opencpop/ ; 2) an open-source Mandarin TTS dataset recorded from a female speaker... https://www.data-baker.com/open_source.html |
| Dataset Splits | Yes | To balance the amount of data for speech and singing, we randomly select about 2 hours of audio from each speaker for training. For validation and evaluation, 100 utterances from the remaining data and two preserved songs from each singing corpus are involved. (A minimal split sketch is included after the table.) |
| Hardware Specification | Yes | All the above models are trained with 4 NVIDIA V100 GPUs for fair comparison. |
| Software Dependencies | No | The paper mentions tools like WORLD, NANSY, Praat, and an HMM-based forced-alignment model, but does not provide version numbers for these or for any other software dependencies, such as deep learning frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | We set α = 60, β = 12, γ = 1.5, λ = 10, µ = 0.02, η = 2, θ = 2, and ϕ = 1.5 in our model empirically. We down-sample all the speech and singing audio to 24 kHz, and set the frame size and hop size to 1200 and 300, respectively, when extracting optional auxiliary acoustic features like pitch and spectrogram. (A hedged preprocessing sketch illustrating these settings follows the table.) |
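
The split described in the Dataset Splits row (about 2 hours of training audio per speaker, with 100 held-out utterances for validation/evaluation) is reported in prose only; no script is released. Below is a minimal sketch of one way such a split could be reproduced. The function name, the `(utterance_id, duration)` input format, and the fixed random seed are illustrative assumptions, not details taken from the paper.

```python
import random

def split_speaker_corpus(utterances, train_hours=2.0, num_val=100, seed=0):
    """Illustrative split for one speaker/singer: draw roughly `train_hours` of
    audio for training, then hold out `num_val` utterances from the remainder
    for validation/evaluation. `utterances` is a list of (utt_id, duration_sec)."""
    rng = random.Random(seed)  # assumed seed; the paper does not report one
    pool = list(utterances)
    rng.shuffle(pool)

    # Accumulate utterances until the ~2-hour training budget is reached.
    train, used = [], 0.0
    for utt_id, dur in pool:
        if used >= train_hours * 3600:
            break
        train.append(utt_id)
        used += dur

    # Hold out 100 utterances from the remaining data for validation/evaluation.
    train_set = set(train)
    remainder = [utt_id for utt_id, _ in pool if utt_id not in train_set]
    val = remainder[:num_val]
    return train, val
```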
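
Similarly, the feature-extraction settings quoted in the Experiment Setup row (24 kHz sample rate, frame size 1200, hop size 300) can be illustrated with a short preprocessing sketch. The paper itself mentions WORLD and NANSY for analysis; the use of librosa here, and the pYIN pitch tracker with a C2-C7 range, are assumptions made purely for illustration.

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=24000, frame_size=1200, hop_size=300):
    """Illustrative extraction of a spectrogram and F0 contour using the
    frame/hop sizes reported in the paper (1200 / 300 samples at 24 kHz)."""
    # Load and resample to 24 kHz, as described in the experiment setup.
    y, _ = librosa.load(wav_path, sr=sr)

    # Linear magnitude spectrogram with a 1200-sample window and 300-sample hop.
    spec = np.abs(librosa.stft(y, n_fft=frame_size, hop_length=hop_size))

    # F0 on the same frame grid; pYIN and the C2-C7 bounds are illustrative
    # choices (the paper uses WORLD-style analysis tools instead).
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
        frame_length=frame_size,
        hop_length=hop_size,
    )
    return spec, f0, voiced_flag
```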