Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
Authors: Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding naturalness, intelligibility, speaker similarity, and inference speed. |
| Researcher Affiliation | Industry | Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho (KRAFTON) |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. |
| Open Source Code | No | If our potential legal concerns can be addressed, we are prepared to progressively disclose, for research purposes, the inference code, pre-trained weights, and ultimately, the full training implementation. |
| Open Datasets | Yes | We employ 100K hours of over 12K distinct speakers speech-transcript dataset spanning 11 languages: English, Korean, Chinese, Japanese, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. We provide details of dataset for each language in Appendix B.1, and data pre-processing in Appendix B.2 and B.3. ... In Appendix B.1, datasets like MLS (Pratap et al., 2020), GigaSpeech (Chen et al., 2021), LibriTTS-R (Koizumi et al., 2023), VCTK (Veaux et al., 2016), and LJSpeech (Ito & Johnson, 2017) are cited. |
| Dataset Splits | No | We employ a subset of the LibriSpeech test-clean dataset. ... z is sampled with temperature (Kingma & Dhariwal, 2018) of 2.6, which matches the empirical standard deviation in our validation dataset. There is a mention of a "validation dataset" but no explicit split percentages or sizes for train/validation/test are provided to reproduce the data partitioning. |
| Hardware Specification | Yes | (1) Mel-VAE: We train the model on 4 NVIDIA A100 40GB GPUs for around 2M steps. ... (2) Text-to-code: ... The model is trained on 4 NVIDIA A100 40GB GPUs for around 4M steps... |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) were provided for the overall experimental setup. |
| Experiment Setup | Yes | Training (1) Mel-VAE: ... We use Adam optimizer (Kingma & Ba, 2015) with a constant learning rate of 0.0002 throughout the training. ... (2) Text-to-code: ... We use AdamW optimizer (Loshchilov & Hutter, 2019), and the learning rate is fixed to 0.0002 throughout the training. Throughout all our experiments, during the model inference, we sample k using top-p sampling (Holtzman et al., 2020) with 0.5 and z is sampled with temperature (Kingma & Dhariwal, 2018) of 2.6... |
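
The Experiment Setup row quotes top-p sampling with a threshold of 0.5 for the discrete variable k. Since no code has been released, the following is only a generic sketch of top-p (nucleus) sampling over a plain probability list, not the authors' implementation. The function names `top_p_filter` and `sample_top_p` and the toy distribution are assumptions made for illustration.

```python
import random


def top_p_filter(probs, p=0.5):
    """Keep the smallest set of most-probable indices whose cumulative
    probability reaches at least p, then renormalize over that set."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def sample_top_p(probs, p=0.5, rng=random):
    """Draw one index from the top-p filtered distribution."""
    filtered = top_p_filter(probs, p)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


# Hypothetical toy distribution over four codes (not from the paper).
toy_probs = [0.1, 0.4, 0.3, 0.2]
nucleus = top_p_filter(toy_probs, p=0.5)
index = sample_top_p(toy_probs, p=0.5)
```

With the toy distribution and p = 0.5, only the two most probable indices (1 and 2) survive the filter, so every sample is one of those two. A lower threshold narrows the candidate set and makes sampling more conservative.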