NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiangyang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
Researcher Affiliation | Collaboration | 1 University of Science and Technology of China; 2 Microsoft Research & Microsoft Azure; 3 The Chinese University of Hong Kong, Shenzhen; 4 Zhejiang University; 5 The University of Tokyo; 6 Peking University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release the code and pre-trained checkpoint of FACodec at https://huggingface.co/spaces/amphion/naturalspeech3_facodec. (A download sketch follows the table.)
Open Datasets | Yes | We use Librilight (Kahn et al., 2020), which contains 60K hours of 16kHz unlabeled speech data and around 7000 distinct speakers from LibriVox audiobooks, as the training set.
Dataset Splits | No | The paper does not explicitly describe a validation split. It mentions a 'training set' and 'test-clean' but gives no details of a validation portion.
Hardware Specification | Yes | We use 8 A100 80GB GPUs with a batch size of 10K frames of latent vectors per GPU for 1M steps.
Software Dependencies | No | The paper mentions software components and models such as WavLM-TDCNN and HuBERT but does not provide specific version numbers for these or for general software dependencies (e.g., Python, PyTorch).
Experiment Setup | Yes | We use 8 A100 80GB GPUs with a batch size of 10K frames of latent vectors per GPU for 1M steps. We use the AdamW optimizer with a learning rate of 1e-4, β1 = 0.9, and β2 = 0.98, with 5K warmup steps following the inverse square root learning schedule. (An optimizer sketch follows the table.)
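
To obtain the released FACodec checkpoint locally, the Space referenced in the Open Source Code row can be mirrored with the huggingface_hub client. This is a minimal sketch, not the authors' own instructions: the repo id is taken from the URL above, and the file layout inside the Space is an assumption, so the whole repository is downloaded.

```python
# Minimal sketch: fetch the FACodec release from the Hugging Face Space.
# The file layout inside the Space is not documented in the paper, so we
# mirror the whole repository rather than pick individual files.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="amphion/naturalspeech3_facodec",
    repo_type="space",  # the release lives in a HF Space, not a model repo
)
print("FACodec files downloaded to:", local_dir)
```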
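For readers reimplementing the reported training configuration, the following is a minimal PyTorch sketch of one plausible reading of the Experiment Setup row (AdamW, peak learning rate 1e-4, β1 = 0.9, β2 = 0.98, 5K warmup steps, inverse square root decay, 1M steps). The model and training loop are placeholders, not the authors' code.

```python
# Hedged sketch of the reported optimizer and learning-rate schedule.
import torch

model = torch.nn.Linear(256, 256)  # placeholder for the actual model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # peak learning rate from the paper
    betas=(0.9, 0.98),  # β1, β2 from the paper
)

WARMUP_STEPS = 5_000

def inv_sqrt_schedule(step: int) -> float:
    """Linear warmup to the peak rate, then decay proportional to 1/sqrt(step)."""
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    return (WARMUP_STEPS / (step + 1)) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inv_sqrt_schedule)

for step in range(1_000_000):  # 1M steps, as reported
    # ... forward pass and loss.backward() on a batch of ~10K latent frames ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```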