Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
Authors: Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis, showing its potential for broader applications in general-purpose speech language models. Demos and code are available at https://github.com/ictnlp/SLED-TTS. |
| Researcher Affiliation | Collaboration | Zhengrui Ma 1,2,3, Yang Feng 1,2 *, Chenze Shao 3, Fandong Meng 3, Jie Zhou 3, Min Zhang 4 1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences 2 University of Chinese Academy of Sciences 3 Pattern Recognition Center, WeChat AI, Tencent Inc 4 School of Future Science and Engineering, Soochow University |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Demos and code are available at https://github.com/ictnlp/SLED-TTS. |
| Open Datasets | Yes | We train SLED on the large-scale Libriheavy [29] dataset. Libriheavy contains approximately 50,000 hours of speech from 6,736 speakers, deriving from audiobooks from the LibriVox project. A BPE tokenizer [58] with a vocabulary size of 16,384 is applied for text. We evaluate zero-shot speech synthesis performance using LibriSpeech test-clean set. |
| Dataset Splits | Yes | We evaluate zero-shot speech synthesis performance using LibriSpeech test-clean set, ensuring that none of the test speakers are included in the training data. Following [72, 46], we use samples with durations between 4 and 10 seconds, resulting in a 2.2-hour subset comprising 1,234 samples and 40 unique speakers. For each speaker, the i-th speech sample is synthesized using the (i−1)-th sample as the prompt, while the first sample is synthesized using the last sample from the same speaker as the prompt. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, memory amounts) used for running its experiments beyond general mentions of compute and GPUs (e.g., "GPU" in Appendix C, but no specific model). |
| Software Dependencies | No | The paper mentions software components such as "AdamW" [41] and "BF16" (presumably a data type used by a framework like PyTorch) but does not specify version numbers for Python, PyTorch, CUDA, or other key libraries. |
| Experiment Setup | Yes | We train the model with a batch size of 512 for 300,000 steps using BF16. We optimize the model with AdamW [41], configured with a learning rate of 5e-4, weight decay of 0.01, β1 = 0.9, β2 = 0.999 and ϵ = 1 × 10⁻⁸. The learning rate follows a linear decay, warming up to its peak value during the first 32,000 steps. A maximum gradient norm clip of 1.0 is applied. |
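
The prompt selection rule quoted under Dataset Splits can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name `pair_prompts` and the `(speaker_id, utterance_id)` input format are hypothetical, and the sketch assumes each speaker's samples arrive in the order used for evaluation.

```python
from collections import defaultdict


def pair_prompts(samples):
    """Pair each target utterance with a prompt from the same speaker.

    `samples` is a list of (speaker_id, utterance_id) tuples (hypothetical
    format). Returns a list of (target, prompt) utterance-id pairs.
    """
    # Group utterances by speaker, preserving their original order.
    by_speaker = defaultdict(list)
    for speaker_id, utterance_id in samples:
        by_speaker[speaker_id].append(utterance_id)

    pairs = []
    for utterances in by_speaker.values():
        for i, target in enumerate(utterances):
            # The previous sample is the prompt; for i == 0 the index -1
            # wraps around to the speaker's last sample.
            prompt = utterances[i - 1]
            pairs.append((target, prompt))
    return pairs
```

Note that under this rule a speaker with a single sample would be prompted with that same sample; the quoted text does not say how such a case is handled.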
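
The learning-rate schedule quoted under Experiment Setup can be sketched in plain Python. Only the peak rate (5e-4), the 32,000 warmup steps, and the 300,000 total steps come from the quoted text; the function name `learning_rate`, a warmup that starts from zero, and a decay that reaches zero at the final step are assumptions made for illustration.

```python
PEAK_LR = 5e-4          # from the quoted setup
WARMUP_STEPS = 32_000   # from the quoted setup
TOTAL_STEPS = 300_000   # from the quoted setup


def learning_rate(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
                  total_steps=TOTAL_STEPS):
    """Linear warmup to the peak rate, then linear decay.

    Assumption: the rate starts at 0 and decays back to 0 at `total_steps`;
    the quoted text does not state the start or end values.
    """
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from the peak down to 0.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

For example, `learning_rate(16_000)` is half the peak rate during warmup, and `learning_rate(166_000)` is half the peak rate during decay, since 166,000 is the midpoint between steps 32,000 and 300,000.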