SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models
Authors: Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper reports empirical results in a dedicated experiments section ("4 EXPERIMENTS"). |
| Researcher Affiliation | Academia | Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu; School of Computer Science, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University; {xin_zhang22,dongzhang22}@m.fudan.edu.cn; {smli20,zhouyaqian,xpqiu}@fudan.edu.cn |
| Pseudocode | No | The paper describes its methods and models in text and figures, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/. (A hedged usage sketch follows the table.) |
| Open Datasets | Yes | For SpeechTokenizer training, we use the LibriSpeech (Panayotov et al., 2015) dataset. ... For zero-shot TTS, we train AR and NAR models on the English subset of the Multilingual LibriSpeech (Pratap et al., 2020) dataset |
| Dataset Splits | Yes | We train the downstream model on the LibriSpeech train-clean-100 subset and use the dev-clean subset for estimating mutual information. (A dataset-loading sketch follows the table.) |
| Hardware Specification | Yes | For SpeechTokenizer, the model is trained on 2 A800 GPUs |
| Software Dependencies | No | The paper describes model architectures and training setups but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For downstream model training, we configure the training setup with a batch size of 32, a learning rate of 1e-4, and a total of 200k global steps. For SpeechTokenizer, the model is trained on 2 A800 GPUs for 20 epochs with a maximum learning rate of 4e-4 and a batch size of 20 per GPU. For the Unified Speech Language Model, both AR and NAR models are trained on 8 A800 GPUs for 500k steps with a maximum learning rate of 5e-4. The AR model is trained with a batch size of 7500 tokens per GPU, and the NAR model is trained with a batch size of 5000 tokens per GPU. (These values are collected in the configuration sketch after the table.) |
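
Since the code and checkpoints are released at the linked repository, a minimal usage sketch is given below. It assumes the `SpeechTokenizer.load_from_checkpoint` and `model.encode` interface described in the repository README; the file paths are placeholders, and the exact API and expected tensor shapes should be verified against the repo before use.

```python
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer  # package from the linked repository

# Placeholder paths; the config and checkpoint come from the repository's release.
config_path = "config.json"
ckpt_path = "SpeechTokenizer.pt"

model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path)
model.eval()

wav, sr = torchaudio.load("example.wav")          # load a speech clip
if sr != model.sample_rate:                       # resample to the model's rate if needed
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)
wav = wav[:1].unsqueeze(0)                        # mono, shape (batch, channel, time)

with torch.no_grad():
    codes = model.encode(wav)                     # RVQ token streams, shape (n_q, batch, frames)

semantic_tokens = codes[:1]                       # first quantizer: content information
acoustic_tokens = codes[1:]                       # remaining quantizers: paralinguistic detail
```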
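
Both training corpora are publicly available. As a concrete illustration of the splits named in the table (LibriSpeech train-clean-100 for downstream training, dev-clean for mutual-information estimation), the sketch below loads them with `torchaudio.datasets.LIBRISPEECH`; the root directory is a placeholder, and this is not the authors' own data pipeline.

```python
import torchaudio

# Root directory is a placeholder; download=True fetches the official OpenSLR archives.
train_set = torchaudio.datasets.LIBRISPEECH(root="data", url="train-clean-100", download=True)
dev_set = torchaudio.datasets.LIBRISPEECH(root="data", url="dev-clean", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = train_set[0]
print(waveform.shape, sample_rate, transcript)
```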
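
For quick reference, the reported hyperparameters can be gathered into a single configuration object. The dictionary below only summarizes the values quoted in the Experiment Setup row; the key names are hypothetical and do not correspond to any config file in the paper or repository.

```python
# Values are taken from the paper's reported setup; key names are illustrative only.
TRAINING_SETUPS = {
    "speech_tokenizer": {
        "gpus": "2x A800",
        "epochs": 20,
        "max_learning_rate": 4e-4,
        "batch_size_per_gpu": 20,            # utterances per GPU
    },
    "downstream_probe": {
        "batch_size": 32,
        "learning_rate": 1e-4,
        "global_steps": 200_000,
    },
    "uslm_ar": {                             # autoregressive model of the unified speech LM
        "gpus": "8x A800",
        "steps": 500_000,
        "max_learning_rate": 5e-4,
        "batch_size_per_gpu": 7500,          # tokens per GPU
    },
    "uslm_nar": {                            # non-autoregressive model
        "gpus": "8x A800",
        "steps": 500_000,
        "max_learning_rate": 5e-4,
        "batch_size_per_gpu": 5000,          # tokens per GPU
    },
}
```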