SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models

Authors: Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "4 EXPERIMENTS"
Researcher Affiliation | Academia | "Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu. School of Computer Science, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University. {xin_zhang22,dongzhang22}@m.fudan.edu.cn, {smli20,zhouyaqian,xpqiu}@fudan.edu.cn"
Pseudocode | No | The paper describes its methods and models in text and figures, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/." (See the usage sketch below the table.)
Open Datasets | Yes | "Datasets: For SpeechTokenizer training, we use the LibriSpeech (Panayotov et al., 2015) dataset. ... For zero-shot TTS, we train AR and NAR models on the English subset of the Multilingual LibriSpeech (Pratap et al., 2020) dataset."
Dataset Splits | Yes | "We train the downstream model on the LibriSpeech train-clean-100 subset and use the dev-clean subset for estimating mutual information."
Hardware Specification | Yes | "For SpeechTokenizer, the model is trained on 2 A800 GPUs."
Software Dependencies | No | The paper describes model architectures and training setups but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | "For downstream model training, we configure the training setup with a batch size of 32, a learning rate of 1e-4, and a total of 200k global steps. For SpeechTokenizer, the model is trained on 2 A800 GPUs for 20 epochs with a maximum learning rate of 4e-4 and a batch size of 20 per GPU. For the Unified Speech Language Model, both AR and NAR models are trained on 8 A800 GPUs for 500k steps with a maximum learning rate of 5e-4. The AR model is trained with a batch size of 7500 tokens per GPU, and the NAR model is trained with a batch size of 5000 tokens per GPU." (See the config sketch below the table.)
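
Usage sketch. The "Open Source Code" row points to the authors' repository, which releases pretrained checkpoints. The sketch below shows one plausible way to load a checkpoint and split the first RVQ layer (semantic tokens) from the remaining layers (acoustic tokens), as the paper describes. The import path, the `load_from_checkpoint`, `encode`, and `sample_rate` names, and the file paths follow my reading of the repository README; treat them as assumptions to verify against the current code, not a confirmed interface.

```python
# Minimal sketch of loading a released SpeechTokenizer checkpoint and
# extracting RVQ tokens. Names follow the repository README as I recall it;
# the checkpoint paths are placeholders.
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

config_path = "speechtokenizer/config.json"        # placeholder path
ckpt_path = "speechtokenizer/SpeechTokenizer.pt"   # placeholder path

model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path)
model.eval()

# Load a waveform, keep one channel, and resample to the tokenizer's rate.
wav, sr = torchaudio.load("example.wav")
wav = wav[:1, :]
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)

with torch.no_grad():
    codes = model.encode(wav.unsqueeze(0))  # shape: (n_q, batch, time)

semantic_tokens = codes[:1]   # first RVQ layer: distilled semantic content
acoustic_tokens = codes[1:]   # remaining layers: residual acoustic detail
```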
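
Config sketch. The "Experiment Setup" row quotes the reported hyperparameters; the dictionary below merely restates them for quick reference. The field names and grouping are illustrative and are not taken from the authors' code.

```python
# Hypothetical config restating the hyperparameters quoted in the
# "Experiment Setup" row; key names are illustrative only.
TRAINING_SETUPS = {
    "downstream_model": {
        "batch_size": 32,
        "learning_rate": 1e-4,
        "global_steps": 200_000,
    },
    "speech_tokenizer": {
        "gpus": "2x A800",
        "epochs": 20,
        "max_learning_rate": 4e-4,
        "batch_size_per_gpu": 20,
    },
    "uslm_ar_model": {
        "gpus": "8x A800",
        "steps": 500_000,
        "max_learning_rate": 5e-4,
        "batch_size_tokens_per_gpu": 7500,
    },
    "uslm_nar_model": {
        "gpus": "8x A800",
        "steps": 500_000,
        "max_learning_rate": 5e-4,
        "batch_size_tokens_per_gpu": 5000,
    },
}
```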