Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling
Authors: Yuancheng Wang, Dekun Chen, Xueyao Zhang, Junan Zhang, Jiaqi Li, Zhizheng Wu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first describe the implementation details and datasets (Section 4.1). We then present the speech reconstruction results of TaDiCodec in Section 4.2, including the main results (Section 4.2.1, Table 1), multilingual performance (Table 2), subjective evaluation results (Table 3), and ablation studies on tokenizer design (Section 4.2.2, Table 4). Section 4.3 reports the zero-shot TTS results of models built upon TaDiCodec (Table 5), along with results on model size scaling and training and inference efficiency (Table 6), and an analysis of the reconstruction-generation gap (Figure 3). |
| Researcher Affiliation | Academia | Yuancheng Wang, Dekun Chen, Xueyao Zhang, Junan Zhang, Jiaqi Li, Zhizheng Wu The Chinese University of Hong Kong, Shenzhen EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology and algorithms in detailed text format, for example, in Section 3.1 'Speech Tokenization with Diffusion Transformer Autoencoder' and Appendix B 'Flow Matching', but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will open source our code and model checkpoints. Audio samples are available at https://tadicodec.github.io/. We release code and model checkpoints at https://github.com/AmphionTeam/TaDiCodec. |
| Open Datasets | Yes | We use the Emilia [60] dataset to train all of our models. Emilia is a multilingual and diverse in-the-wild speech dataset designed for large-scale speech generation. It contains 46.8K hours of English, 49.9K hours of Chinese, 1.6K hours of German, 1.4K hours of French, 1.7K hours of Japanese, and 0.2K hours of Korean. ... Seed TTS test-en We adopt a test set introduced in Seed-TTS [2], consisting of 1,000 samples drawn from English public corpora, including the Common Voice dataset [85]. |
| Dataset Splits | Yes | We report the main results of our models and baselines on eight test sets in Table 5. Our models exhibit significant improvements in intelligibility while maintaining speaker similarity comparable to state-of-the-art zero-shot TTS systems. In addition, we report performance on more challenging test sets, proposed in [68], covering articulatory scenarios (such as repeated words and tongue twisters), code-switching, and cross-lingual settings. ... Seed TTS test-en ... 1,000 samples ... Seed TTS test-zh ... 2,000 samples ... Articulatory en, Articulatory zh ... Each set contains 400 samples. ... Code-switching en, Code-switching zh ... Each set contains 500 samples. ... Cross-lingual zh2en, Cross-lingual en2zh ... each comprising 500 samples. ... Multilingual test sets We additionally construct four multilingual test sets ... For each language, we randomly sample 300 utterances from Common Voice [85]. |
| Hardware Specification | Yes | All models are trained on 8 80GB NVIDIA A100 GPUs using dynamic batching with 200 seconds of speech per batch. |
| Software Dependencies | No | The paper mentions several tools and models, such as 'Llama-style Transformer blocks [61]', 'RoPE positional embedding [62]', 'RMSNorm [63]', 'AdamW [66]', 'whisper-large-v3 [36]', and 'paraformer-zh [37]'. However, it does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA, which are crucial for full reproducibility. |
| Experiment Setup | Yes | The base configuration employs an 8-layer encoder and a 16-layer decoder, each with hidden size 1024, intermediate size 4096, and 16 attention heads. ... For vector quantization, we use BSQ [53] with a latent size of 14, yielding a codebook size of 2^14 = 16384. ... We train the tokenizer for 800K steps using AdamW [66] with a learning rate of 7.5e-5 and 32K warmup steps. TTS models are trained for 300K steps with a learning rate of 3e-4 unless otherwise specified. AR models extend the vocabulary of pretrained textual LLMs [3, 5] and are trained with 0.2B, 0.5B, 3.0B, and 4.0B parameters; see Section 4.3 for analysis. |
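
The tokenizer hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which makes the reported codebook size easy to check. This is only an illustrative sketch: the class and field names are hypothetical and do not come from the paper or its released code. The only values used are those quoted in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerConfig:
    """Hypothetical container for the base tokenizer settings quoted above."""

    encoder_layers: int = 8
    decoder_layers: int = 16
    hidden_size: int = 1024
    intermediate_size: int = 4096
    attention_heads: int = 16
    bsq_latent_size: int = 14      # BSQ latent dimensionality
    train_steps: int = 800_000     # "800K steps"
    learning_rate: float = 7.5e-5
    warmup_steps: int = 32_000     # "32K warmup steps"

    @property
    def codebook_size(self) -> int:
        # Binary quantization gives two choices per latent dimension,
        # so the implicit codebook has 2 ** latent_size entries.
        return 2 ** self.bsq_latent_size

    @property
    def head_dim(self) -> int:
        # Per-head width implied by the hidden size and head count.
        return self.hidden_size // self.attention_heads


cfg = TokenizerConfig()
print(cfg.codebook_size)  # matches the reported 2^14 = 16384
print(cfg.head_dim)
```

The derived `codebook_size` agrees with the 16384 entries stated in the excerpt. The per-head width is a simple consequence of the quoted hidden size and head count and is not itself reported in the table.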