Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models
Authors: Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS |
| Researcher Affiliation | Academia | Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu; School of Computer Science, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University; EMAIL EMAIL |
| Pseudocode | No | The paper describes its methods and models in text and figures, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/. |
| Open Datasets | Yes | Datasets: For SpeechTokenizer training, we use LibriSpeech (Panayotov et al., 2015) dataset. ... For zero-shot TTS, we train AR and NAR models on the English subset of Multilingual LibriSpeech (Pratap et al., 2020) dataset. |
| Dataset Splits | Yes | We train the downstream model on LibriSpeech train-clean-100 subset and use dev-clean subset for estimating mutual information. |
| Hardware Specification | Yes | For SpeechTokenizer, the model is trained on 2 A800 GPUs. |
| Software Dependencies | No | The paper describes model architectures and training setups but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For downstream model training, we configure the training setup with a batch size of 32, a learning rate of 1e-4, and a total of 200k global steps. For SpeechTokenizer, the model is trained on 2 A800 GPUs for 20 epochs with a maximum learning rate of 4e-4 and a batch size of 20 per GPU. For the Unified Speech Language Model, both AR and NAR models are trained on 8 A800 GPUs for 500k steps with a maximum learning rate of 5e-4. The AR model is trained with a batch size of 7500 tokens per GPU, and the NAR model is trained with a batch size of 5000 tokens per GPU. |
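
The Experiment Setup row packs four training configurations into one sentence group. As a reading aid, the sketch below restates those reported values as structured data and derives a global batch size for each. It is not taken from the paper or its repository: the class name `TrainingSetup`, the setup labels, and the field names are hypothetical, and the global batch figures assume plain data parallelism with no gradient accumulation, which the quoted text does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingSetup:
    """One training configuration as quoted in the Experiment Setup row."""
    name: str                 # hypothetical label, not an identifier from the paper
    batch_size: int           # value as reported
    batch_unit: str           # "tokens" where stated; "examples" is an assumption
    per_gpu: bool             # True when the text says "per GPU"
    num_gpus: Optional[int]   # None when the text gives no GPU count
    learning_rate: float      # maximum learning rate where the text says "maximum"
    duration: str             # training length as reported

    def global_batch(self) -> Optional[int]:
        # Assumption: data parallelism without gradient accumulation,
        # so global batch = per-GPU batch * number of GPUs.
        if not self.per_gpu:
            return self.batch_size
        if self.num_gpus is None:
            return None
        return self.batch_size * self.num_gpus


REPORTED_SETUPS = [
    # The downstream batch size of 32 is reported without a GPU count;
    # treating it as a total batch size is an assumption.
    TrainingSetup("downstream", 32, "examples", False, None, 1e-4, "200k steps"),
    TrainingSetup("speechtokenizer", 20, "examples", True, 2, 4e-4, "20 epochs"),
    TrainingSetup("uslm_ar", 7500, "tokens", True, 8, 5e-4, "500k steps"),
    TrainingSetup("uslm_nar", 5000, "tokens", True, 8, 5e-4, "500k steps"),
]

if __name__ == "__main__":
    for setup in REPORTED_SETUPS:
        print(setup.name, setup.global_batch(), setup.batch_unit, setup.duration)
```

Under these assumptions the per-GPU figures scale linearly with GPU count; if the authors used gradient accumulation or a different parallelism scheme, the derived totals would differ.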