CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

Authors: Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding naturalness, intelligibility, speaker similarity, and inference speed.
Researcher Affiliation | Industry | Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho, KRAFTON, {jay.310,keonlee,s.j.chung,jwcho}@krafton.com
Pseudocode | No | No explicit pseudocode or algorithm blocks were found.
Open Source Code | No | If our potential legal concerns can be addressed, we are prepared to progressively disclose, for research purposes, the inference code, pre-trained weights, and ultimately, the full training implementation.
Open Datasets | Yes | We employ a 100K-hour speech-transcript dataset of over 12K distinct speakers spanning 11 languages: English, Korean, Chinese, Japanese, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. We provide details of the dataset for each language in Appendix B.1, and data pre-processing in Appendix B.2 and B.3. ... In Appendix B.1, datasets such as MLS (Pratap et al., 2020), GigaSpeech (Chen et al., 2021), LibriTTS-R (Koizumi et al., 2023), VCTK (Veaux et al., 2016), and LJSpeech (Ito & Johnson, 2017) are cited.
Dataset Splits | No | We employ a subset of the LibriSpeech test-clean dataset. ... z is sampled with a temperature (Kingma & Dhariwal, 2018) of 2.6, which matches the empirical standard deviation in our validation dataset. A "validation dataset" is mentioned, but no explicit split percentages or sizes for train/validation/test are provided to reproduce the data partitioning. (A sketch of how such a temperature can be derived from validation statistics follows the table.)
Hardware Specification | Yes | (1) Mel-VAE: We train the model on 4 NVIDIA A100 40GB GPUs for around 2M steps. ... (2) Text-to-code: ... The model is trained on 4 NVIDIA A100 40GB GPUs for around 4M steps...
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) were provided for the overall experimental setup.
Experiment Setup | Yes | Training: (1) Mel-VAE: ... We use the Adam optimizer (Kingma & Ba, 2015) with a constant learning rate of 0.0002 throughout training. ... (2) Text-to-code: ... We use the AdamW optimizer (Loshchilov & Hutter, 2019), and the learning rate is fixed to 0.0002 throughout training. Throughout all our experiments, during model inference, we sample k using top-p sampling (Holtzman et al., 2020) with p = 0.5, and z is sampled with a temperature (Kingma & Dhariwal, 2018) of 2.6... (A top-p sampling sketch follows the table.)
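The temperature of 2.6 quoted above is described as matching the empirical standard deviation of the latents on the validation set. Below is a minimal sketch of how such a value could be estimated, assuming hypothetical `mel_vae.encode` and `val_loader` interfaces; it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): estimate a sampling temperature for the
# latent z from the empirical standard deviation of Mel-VAE latents on a validation set.
# `mel_vae.encode` and `val_loader` are hypothetical placeholders.
import torch

@torch.no_grad()
def estimate_latent_temperature(mel_vae, val_loader, device="cuda"):
    latents = []
    for mel in val_loader:                      # mel: (batch, n_mels, frames)
        z = mel_vae.encode(mel.to(device))      # continuous latent before quantization
        latents.append(z.flatten())
    # One global standard deviation over all latent dimensions and frames.
    return torch.cat(latents).std().item()

# At inference, z would then be drawn from a scaled standard Gaussian:
# z = temperature * torch.randn(latent_shape)
```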
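For the inference-time sampling of k, the paper cites top-p (nucleus) sampling (Holtzman et al., 2020) with p = 0.5. The following is a generic sketch of nucleus sampling over a vector of logits; the function name and tensor shapes are illustrative and not taken from the paper's code.

```python
# Generic top-p (nucleus) sampling sketch, with p = 0.5 as quoted above.
import torch

def sample_top_p(logits: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Sample one token index from `logits` (shape: [vocab]) via nucleus sampling."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose cumulative probability exceeds p;
    # tokens outside that nucleus get probability zero.
    outside_nucleus = cumulative - sorted_probs > p
    sorted_probs[outside_nucleus] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice]
```

In an autoregressive decoder, `sample_top_p(logits, p=0.5)` would be called once per decoding step to pick the next code index.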