NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiangyang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility. |
| Researcher Affiliation | Collaboration | (1) University of Science and Technology of China; (2) Microsoft Research & Microsoft Azure; (3) The Chinese University of Hong Kong, Shenzhen; (4) Zhejiang University; (5) The University of Tokyo; (6) Peking University. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the code and pre-trained checkpoint of FACodec at https://huggingface.co/spaces/amphion/naturalspeech3_facodec. (See the download sketch below the table.) |
| Open Datasets | Yes | We use Librilight (Kahn et al., 2020), which contains 60K hours of 16KHz unlabeled speech data and around 7000 distinct speakers from LibriVox audiobooks, as the training set. |
| Dataset Splits | No | The paper does not explicitly provide validation dataset splits. It mentions a 'training set' and 'test-clean' but gives no specific validation split or details. |
| Hardware Specification | Yes | We use 8 A100 80GB GPUs with a batch size of 10K frames of latent vectors per GPU for 1M steps. |
| Software Dependencies | No | The paper mentions software components and models such as 'WavLM-TDCNN' and 'HuBERT' but does not provide specific version numbers for these or other general software dependencies (e.g., Python, PyTorch). |
| Experiment Setup | Yes | We use 8 A100 80GB GPUs with a batch size of 10K frames of latent vectors per GPU for 1M steps. We use the AdamW optimizer with a learning rate of 1e-4, β1 = 0.9, and β2 = 0.98, and 5K warmup steps following the inverse square root learning schedule. (A sketch of this optimization setup follows the table.) |
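
The FACodec release is cited only by its Hugging Face Space URL, so the following is a minimal sketch of fetching that release with `huggingface_hub`. The assumption that the code and checkpoint files live directly in the Space repository (rather than a separate model repository) is mine; no specific file names are referenced.

```python
# Minimal sketch: download the FACodec release from the Hugging Face Space
# cited in the paper. Assumption: the code and checkpoint files are stored
# in the Space repo itself; exact file names are not given in the paper,
# so the whole repository is snapshotted.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="amphion/naturalspeech3_facodec",  # Space cited in the paper
    repo_type="space",
)
print(f"FACodec release downloaded to: {local_dir}")
```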
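
Below is a minimal PyTorch sketch of the reported optimization setup (AdamW, learning rate 1e-4, β1 = 0.9, β2 = 0.98, 5K warmup steps with an inverse square root schedule). The placeholder model, the exact warmup/decay formula, and the hypothetical `compute_loss`/`next_batch` helpers are assumptions; only the quoted hyperparameters come from the paper.

```python
# A minimal sketch of the reported optimization setup (AdamW, lr 1e-4,
# beta1 = 0.9, beta2 = 0.98, 5K warmup steps, inverse square root decay).
# The model and training-loop helpers are placeholders; only the optimizer
# hyperparameters and schedule shape mirror the quoted setup.
import math
import torch

model = torch.nn.Linear(256, 256)  # placeholder for the actual NaturalSpeech 3 model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.98),
)

warmup_steps = 5_000

def inverse_sqrt_with_warmup(step: int) -> float:
    """Linear warmup for `warmup_steps`, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return math.sqrt(warmup_steps / step)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt_with_warmup)

# Training-loop skeleton: 1M steps as reported, one batch of latent-vector frames per GPU.
# for step in range(1_000_000):
#     loss = compute_loss(model, next_batch())  # hypothetical helpers
#     loss.backward()
#     optimizer.step()
#     scheduler.step()
#     optimizer.zero_grad()
```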