CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
Authors: Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding naturalness, intelligibility, speaker similarity, and inference speed. |
| Researcher Affiliation | Industry | Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho KRAFTON {jay.310,keonlee,s.j.chung,jwcho}@krafton.com |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. |
| Open Source Code | No | If our potential legal concerns can be addressed, we are prepared to progressively disclose, for research purposes, the inference code, pre-trained weights, and ultimately, the full training implementation. |
| Open Datasets | Yes | We employ a 100K-hour speech-transcript dataset of over 12K distinct speakers spanning 11 languages: English, Korean, Chinese, Japanese, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. We provide details of the dataset for each language in Appendix B.1, and data pre-processing in Appendix B.2 and B.3. ... In Appendix B.1, datasets such as MLS (Pratap et al., 2020), GigaSpeech (Chen et al., 2021), LibriTTS-R (Koizumi et al., 2023), VCTK (Veaux et al., 2016), and LJSpeech (Ito & Johnson, 2017) are cited. |
| Dataset Splits | No | We employ a subset of the LibriSpeech test-clean dataset. ... z is sampled with temperature (Kingma & Dhariwal, 2018) of 2.6, which matches the empirical standard deviation in our validation dataset. There is a mention of a "validation dataset" but no explicit split percentages or sizes for train/validation/test are provided to reproduce the data partitioning. |
| Hardware Specification | Yes | (1) Mel-VAE: We train the model on 4 NVIDIA A100 40GB GPUs for around 2M steps. ... (2) Text-to-code: ... The model is trained on 4 NVIDIA A100 40GB GPUs for around 4M steps... |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) were provided for the overall experimental setup. |
| Experiment Setup | Yes | Training (1) Mel-VAE: ... We use the Adam optimizer (Kingma & Ba, 2015) with a constant learning rate of 0.0002 throughout the training. ... (2) Text-to-code: ... We use the AdamW optimizer (Loshchilov & Hutter, 2019), and the learning rate is fixed to 0.0002 throughout the training. Throughout all our experiments, during the model inference, we sample k using top-p sampling (Holtzman et al., 2020) with p = 0.5 and z is sampled with temperature (Kingma & Dhariwal, 2018) of 2.6... (see the sketch after this table). |
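
The optimizer and inference-time sampling settings quoted above can be summarized in a short sketch. This is a minimal illustration assuming PyTorch; the modules `mel_vae` and `text_to_code` and the helper `top_p_filter` are hypothetical placeholders, and only the hyperparameters (constant learning rate 0.0002, top-p of 0.5, temperature of 2.6) come from the paper. Following the convention of Kingma & Dhariwal (2018), temperature is interpreted here as a scale on the standard deviation of the Gaussian latent, consistent with the paper's note that 2.6 matches the empirical standard deviation of its validation data.

```python
# Minimal sketch of the reported training/inference settings, assuming PyTorch.
# `mel_vae` and `text_to_code` are placeholder modules, not the paper's architecture.
import torch


def top_p_filter(logits: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability exceeds p, and mask the rest with -inf."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = probs.cumsum(dim=-1)
    # Drop tokens whose preceding cumulative mass already exceeds p
    # (the highest-probability token is always kept).
    mask = (cum_probs - probs) > p
    filtered = sorted_logits.masked_fill(mask, float("-inf"))
    # Undo the sort so the filtered logits match the original vocabulary order.
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, filtered)


# Placeholder modules standing in for the Mel-VAE and the text-to-code LM.
mel_vae = torch.nn.Linear(80, 80)
text_to_code = torch.nn.Linear(512, 1024)

# (1) Mel-VAE: Adam with a constant learning rate of 0.0002.
opt_vae = torch.optim.Adam(mel_vae.parameters(), lr=2e-4)
# (2) Text-to-code: AdamW, learning rate fixed at 0.0002.
opt_lm = torch.optim.AdamW(text_to_code.parameters(), lr=2e-4)

# Inference: sample the code k with top-p (p = 0.5) over the LM's logits, and
# sample the latent z with temperature 2.6, treated here as scaling the
# standard deviation of a standard-normal draw.
logits = text_to_code(torch.randn(1, 512))  # placeholder logits
k = torch.multinomial(torch.softmax(top_p_filter(logits, p=0.5), dim=-1), num_samples=1)
z = 2.6 * torch.randn(1, 80)                # temperature-scaled latent sample
```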