Textually Pretrained Speech Language Models
Authors: Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show using both automatic and human evaluations that TWIST outperforms a cold-start Speech LM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing Speech LMs. |
| Researcher Affiliation | Collaboration | FAIR Team, Meta; OpenAI; The Hebrew University of Jerusalem. michael.hassid@mail.huji.ac.il |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make speech samples, code and models publicly available. https://pages.cs.huji.ac.il/adiyoss-lab/twist/ |
| Open Datasets | Yes | All Speech LMs are optimized using a collection of publicly available academic speech datasets: LibriSpeech (LS) [Panayotov et al., 2015], Libri-Light (LL) [Kahn et al., 2020], Spotify podcasts [Clifton et al., 2020], People dataset [Galvez et al., 2021], and VoxPopuli [Wang et al., 2021a]. |
| Dataset Splits | Yes | In cases where no pre-defined validation and test sets are available, we randomly sample 2% of the data serving as the validation set and an additional 2% for the test set. (A minimal split sketch appears after the table.) |
| Hardware Specification | No | The paper mentions using '8 GPUs for training, except for the 1.3B models, which use 32 GPUs' but does not specify the exact GPU models (e.g., NVIDIA A100, Tesla V100) or any other specific hardware details like CPU models or memory. |
| Software Dependencies | No | The paper mentions specific models like 'Whisper small' and 'LLaMA-7B' and libraries like 'textless-lib', but it does not provide a reproducible description of ancillary software, such as programming language versions (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions with specific version numbers. |
| Experiment Setup | Yes | All LM models are trained with a batch size of 64, where each sample is bounded to 25 seconds and 704 tokens. The models are trained for 400k steps (~1.2 epochs), using an inverse-sqrt scheduler, 100 warmup steps, and AdamW as the optimization algorithm. We also tune the learning rate per scenario, i.e., using/not using a pretrained LM, ending up with a maximal learning rate of 4e-4/8e-5 and a final learning rate of 8e-5/2.5e-5, respectively. As for the LLaMA-7B/13B models, we use the same configuration except for the following: a cosine learning rate schedule, 500 warmup steps, a maximum learning rate of 1e-4, a final rate of 1e-5, and a batch size of 1024 over 32 GPUs for 75k steps (~4 epochs). (A hedged PyTorch sketch of the inverse-sqrt schedule appears after the table.) |
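
The 2%/2% validation/test hold-out quoted in the Dataset Splits row is straightforward to reproduce. Below is a minimal sketch of such a random split; the function name, the seeded `random.Random`, and the list-of-file-paths input are illustrative assumptions, not the authors' actual data pipeline.

```python
import random

def split_dataset(utterance_paths, val_frac=0.02, test_frac=0.02, seed=0):
    """Randomly hold out 2% of the items for validation and another 2% for
    testing, keeping the rest for training (hypothetical helper)."""
    paths = list(utterance_paths)
    random.Random(seed).shuffle(paths)          # fixed seed for repeatability
    n_val = int(len(paths) * val_frac)
    n_test = int(len(paths) * test_frac)
    val = paths[:n_val]
    test = paths[n_val:n_val + n_test]
    train = paths[n_val + n_test:]
    return train, val, test

# Example: 100 dummy utterances -> 96 train / 2 validation / 2 test.
train, val, test = split_dataset([f"utt_{i:03d}.wav" for i in range(100)])
print(len(train), len(val), len(test))
```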
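
The learning-rate schedule described in the Experiment Setup row can be sketched with standard PyTorch components. The snippet below is only an illustration under stated assumptions: the `nn.Linear` stand-in for the Speech LM, the choice of `AdamW`, and the exact warmup/decay/clamping shape of the inverse-sqrt schedule are assumptions; only the maximal rate (4e-4), final rate (8e-5), 100 warmup steps, and 400k total steps come from the quoted text (pretrained-LM scenario).

```python
import torch
from torch import nn

# Values quoted in the Experiment Setup row (pretrained-LM scenario).
MAX_LR = 4e-4
FINAL_LR = 8e-5
WARMUP_STEPS = 100
TOTAL_STEPS = 400_000

model = nn.Linear(704, 704)                      # placeholder for the Speech LM
optimizer = torch.optim.AdamW(model.parameters(), lr=MAX_LR)

def inverse_sqrt_with_warmup(step: int) -> float:
    """Multiplier on MAX_LR: linear warmup for WARMUP_STEPS, then 1/sqrt(step)
    decay, clamped so the rate never falls below FINAL_LR (the clamp is an
    assumption; the paper only reports the maximal and final rates)."""
    step = max(step, 1)
    if step < WARMUP_STEPS:
        scale = step / WARMUP_STEPS
    else:
        scale = (WARMUP_STEPS / step) ** 0.5
    return max(scale, FINAL_LR / MAX_LR)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt_with_warmup)

# Inspect the schedule at a few representative steps (no real training here).
for step in (1, 50, 100, 10_000, 100_000, TOTAL_STEPS):
    print(step, MAX_LR * inverse_sqrt_with_warmup(step))
```

With these numbers the clamped schedule reaches the final rate of 8e-5 well before the 400k-step mark; the authors' actual decay curve may differ in shape.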