NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiangyang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
Researcher Affiliation | Collaboration | 1 University of Science and Technology of China; 2 Microsoft Research & Microsoft Azure; 3 The Chinese University of Hong Kong, Shenzhen; 4 Zhejiang University; 5 The University of Tokyo; 6 Peking University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release the code and pre-trained checkpoint of FACodec at https://huggingface.co/spaces/amphion/naturalspeech3_facodec. (A download sketch follows the table.)
Open Datasets | Yes | We use Librilight (Kahn et al., 2020), which contains 60K hours of 16kHz unlabeled speech data and around 7000 distinct speakers from LibriVox audiobooks, as the training set.
Dataset Splits | No | The paper does not explicitly describe a validation split. It mentions a 'training set' and 'test-clean' but gives no details of a validation portion.
Hardware Specification | Yes | We use 8 A100 80GB GPUs with a batch size of 10K frames of latent vectors per GPU for 1M steps.
Software Dependencies | No | The paper mentions software components and models such as WavLM-TDCNN and HuBERT but does not provide specific version numbers for these or for general software dependencies (e.g., Python, PyTorch).
Experiment Setup | Yes | We use 8 A100 80GB GPUs with a batch size of 10K frames of latent vectors per GPU for 1M steps. We use the AdamW optimizer with a learning rate of 1e-4, β1 = 0.9, and β2 = 0.98, with 5K warmup steps following the inverse square root learning schedule. (An optimizer sketch follows the table.)
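
To obtain the released FACodec checkpoint locally, the Space referenced in the Open Source Code row can be mirrored with the huggingface_hub client. This is a minimal sketch, not the authors' own instructions: the repo id is taken from the URL above, and the file layout inside the Space is an assumption, so the whole repository is downloaded.

```python
# Minimal sketch: fetch the FACodec release from the Hugging Face Space.
# The file layout inside the Space is not documented in the paper, so we
# mirror the whole repository rather than pick individual files.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="amphion/naturalspeech3_facodec",
    repo_type="space",  # the release lives in a HF Space, not a model repo
)
print("FACodec files downloaded to:", local_dir)
```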
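For readers reimplementing the reported training configuration, the following is a minimal PyTorch sketch of one plausible reading of the Experiment Setup row (AdamW, peak learning rate 1e-4, β1 = 0.9, β2 = 0.98, 5K warmup steps, inverse square root decay, 1M steps). The model and training loop are placeholders, not the authors' code.

```python
# Hedged sketch of the reported optimizer and learning-rate schedule.
import torch

model = torch.nn.Linear(256, 256)  # placeholder for the actual model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # peak learning rate from the paper
    betas=(0.9, 0.98),  # β1, β2 from the paper
)

WARMUP_STEPS = 5_000

def inv_sqrt_schedule(step: int) -> float:
    """Linear warmup to the peak rate, then decay proportional to 1/sqrt(step)."""
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    return (WARMUP_STEPS / (step + 1)) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inv_sqrt_schedule)

for step in range(1_000_000):  # 1M steps, as reported
    # ... forward pass and loss.backward() on a batch of ~10K latent frames ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```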