SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models

Authors: Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "4 EXPERIMENTS"
Researcher Affiliation | Academia | "Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu. School of Computer Science, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University. {xin_zhang22,dongzhang22}@m.fudan.edu.cn, {smli20,zhouyaqian,xpqiu}@fudan.edu.cn"
Pseudocode | No | The paper describes its methods and models in text and figures, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/." (See the usage sketch below the table.)
Open Datasets | Yes | "Datasets: For SpeechTokenizer training, we use the LibriSpeech (Panayotov et al., 2015) dataset. ... For zero-shot TTS, we train AR and NAR models on the English subset of the Multilingual LibriSpeech (Pratap et al., 2020) dataset."
Dataset Splits | Yes | "We train the downstream model on the LibriSpeech train-clean-100 subset and use the dev-clean subset for estimating mutual information."
Hardware Specification | Yes | "For SpeechTokenizer, the model is trained on 2 A800 GPUs."
Software Dependencies | No | The paper describes model architectures and training setups but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | "For downstream model training, we configure the training setup with a batch size of 32, a learning rate of 1e-4, and a total of 200k global steps. For SpeechTokenizer, the model is trained on 2 A800 GPUs for 20 epochs with a maximum learning rate of 4e-4 and a batch size of 20 per GPU. For the Unified Speech Language Model, both AR and NAR models are trained on 8 A800 GPUs for 500k steps with a maximum learning rate of 5e-4. The AR model is trained with a batch size of 7500 tokens per GPU, and the NAR model is trained with a batch size of 5000 tokens per GPU." (See the config sketch below the table.)
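
Usage sketch. The "Open Source Code" row points to the authors' repository, which releases pretrained checkpoints. The sketch below shows one plausible way to load a checkpoint and split the first RVQ layer (semantic tokens) from the remaining layers (acoustic tokens), as the paper describes. The import path, the `load_from_checkpoint`, `encode`, and `sample_rate` names, and the file paths follow my reading of the repository README; treat them as assumptions to verify against the current code, not a confirmed interface.

```python
# Minimal sketch of loading a released SpeechTokenizer checkpoint and
# extracting RVQ tokens. Names follow the repository README as I recall it;
# the checkpoint paths are placeholders.
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

config_path = "speechtokenizer/config.json"        # placeholder path
ckpt_path = "speechtokenizer/SpeechTokenizer.pt"   # placeholder path

model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path)
model.eval()

# Load a waveform, keep one channel, and resample to the tokenizer's rate.
wav, sr = torchaudio.load("example.wav")
wav = wav[:1, :]
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)

with torch.no_grad():
    codes = model.encode(wav.unsqueeze(0))  # shape: (n_q, batch, time)

semantic_tokens = codes[:1]   # first RVQ layer: distilled semantic content
acoustic_tokens = codes[1:]   # remaining layers: residual acoustic detail
```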
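
Config sketch. The "Experiment Setup" row quotes the reported hyperparameters; the dictionary below merely restates them for quick reference. The field names and grouping are illustrative and are not taken from the authors' code.

```python
# Hypothetical config restating the hyperparameters quoted in the
# "Experiment Setup" row; key names are illustrative only.
TRAINING_SETUPS = {
    "downstream_model": {
        "batch_size": 32,
        "learning_rate": 1e-4,
        "global_steps": 200_000,
    },
    "speech_tokenizer": {
        "gpus": "2x A800",
        "epochs": 20,
        "max_learning_rate": 4e-4,
        "batch_size_per_gpu": 20,
    },
    "uslm_ar_model": {
        "gpus": "8x A800",
        "steps": 500_000,
        "max_learning_rate": 5e-4,
        "batch_size_tokens_per_gpu": 7500,
    },
    "uslm_nar_model": {
        "gpus": "8x A800",
        "steps": 500_000,
        "max_learning_rate": 5e-4,
        "batch_size_tokens_per_gpu": 5000,
    },
}
```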