Headless Language Models: Learning without Predicting with Contrastive Weight Tying
Authors: Nathan Godey, Éric Villemonte de la Clergerie, Benoît Sagot
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that our approach outperforms usual language modeling counterparts in several aspects and by substantial margins. Moreover, given the same amount of training tokens, headless language models (HLMs) significantly outperform their classical counterparts on downstream tasks, as shown by a 2.7 gain in LAMBADA accuracy for our headless generative model. Finally, given similar compute budgets, HLMs bring substantial gains for NLU tasks, with our BERT reproduction scoring 1.6 points above its classical counterpart on the GLUE benchmark. |
| Researcher Affiliation | Academia | Nathan Godey1,2 Éric de la Clergerie1 Benoît Sagot1 1Inria, Paris, France 2Sorbonne Université, Paris, France |
| Pseudocode | Yes | Figure 9: PyTorch implementation of the Contrastive Weight Tying loss. Figure 10: PyTorch implementation of the computation of the training loss for headless causal LMs. (A hedged sketch of this loss is given below the table.) |
| Open Source Code | Yes | Our pretraining and fine-tuning code is published in https://github.com/NathanGodey/headless-lm |
| Open Datasets | Yes | We pretrain BERT-base architectures (110M parameters) for English on the OpenWebText2 dataset extracted from The Pile (Gao et al., 2020). We pretrain small multilingual MLMs... on the multilingual Wikipedia dataset. |
| Dataset Splits | Yes | We evaluate on the GLUE benchmark, where we exclude the RTE dataset due to high standard deviations in the obtained scores. We fine-tune our models for 10 epochs on every dataset, and compute validation metrics once every fine-tuning epoch. In Table 1, we compare our headless MLM with the classical MLM on the GLUE benchmark. We display evaluations at similar amounts of tokens seen during pre-training, and at similar training durations on the same hardware. Table 1: Results of Masked Language Models (MLMs) on the dev sets of the GLUE benchmark. |
| Hardware Specification | Yes | In Figure 2, we provide a preliminary empirical analysis of the speed and memory improvements when training a BERT-base model on a single RTX 8000 GPU. We pretrain all models using 8 A100 GPUs, with a budget of roughly 1,000 hours each. We train on 143,000 batches of 1,024 sequences of length 2,048 split over 16 V100 GPUs. |
| Software Dependencies | No | The paper mentions software such as Hugging Face's implementation of the Transformer blocks, xFormers (Lefaudeux et al., 2022), and PyTorch (implied by the PyTorch implementation figures), but does not specify their version numbers. |
| Experiment Setup | Yes | We mostly use hyperparameters from BERT (Devlin et al., 2019), although we remove the NSP objective as in RoBERTa (Liu et al., 2019). For the sake of simplicity, we use a sequence length of 128 for the whole training. We give a detailed overview of the hyperparameters in Appendix D.1. For the vanilla MLM, we set a micro-batch size of 32 for each A100 GPU, then accumulate to the original 256 batch size at optimization level, and train on 1 million batches. For our headless approach, we observed that we could remain within compute budget when using a micro-batch size of 64. Hence, we use an effective batch size of 512 for the headless MLM (HMLM). |
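
For reference, below is a minimal sketch of what the Contrastive Weight Tying loss quoted in the Pseudocode row may look like. It assumes an InfoNCE-style objective that contrasts output hidden states against the tied input embeddings of the in-batch target tokens; the function name `cwt_loss`, the `temperature` parameter, and the treatment of duplicate targets as negatives are illustrative assumptions, not the paper's exact implementation (see Figure 9 of the paper and the released repository for that).

```python
import torch
import torch.nn.functional as F


def cwt_loss(hidden_states: torch.Tensor,
             embedding_weight: torch.Tensor,
             target_ids: torch.Tensor,
             temperature: float = 1.0) -> torch.Tensor:
    """Illustrative Contrastive Weight Tying (CWT) style loss.

    hidden_states:    (N, d) output representations at the positions to predict
    embedding_weight: (V, d) tied input embedding matrix
    target_ids:       (N,)   gold token ids at those positions
    """
    # Positive candidates are the input embeddings of the gold tokens.
    positives = embedding_weight[target_ids]                # (N, d)
    # Score every position against every in-batch positive;
    # the other rows act as in-batch negatives.
    logits = hidden_states @ positives.T / temperature      # (N, N)
    # The i-th hidden state should match the i-th target embedding.
    labels = torch.arange(hidden_states.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)


# Example with hypothetical shapes: 8 masked positions, hidden size 768, vocab 32k.
h = torch.randn(8, 768)
E = torch.randn(32000, 768)
y = torch.randint(0, 32000, (8,))
loss = cwt_loss(h, E, y)
```

Note that if the same token id appears more than once in the batch, this sketch still treats the duplicates as negatives; a faithful reproduction should follow the paper's Figure 9.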
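
The Experiment Setup row also quotes micro-batch and effective batch sizes. A quick sanity check of that arithmetic, assuming the quoted 8 A100 GPUs with data parallelism and no extra gradient accumulation (an inference consistent with the quoted numbers, not stated explicitly for the headless run):

```python
# Hypothetical helper; accumulation_steps = 1 is an assumption consistent
# with the quoted setup (micro-batch x 8 GPUs = effective batch).
def effective_batch_size(micro_batch: int, num_gpus: int, accumulation_steps: int = 1) -> int:
    return micro_batch * num_gpus * accumulation_steps


assert effective_batch_size(32, 8) == 256   # vanilla MLM, as quoted
assert effective_batch_size(64, 8) == 512   # headless MLM (HMLM), as quoted
```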