Headless Language Models: Learning without Predicting with Contrastive Weight Tying

Authors: Nathan Godey, Éric Villemonte de la Clergerie, Benoît Sagot

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that our approach outperforms usual language modeling counterparts in several aspects and by substantial margins. Moreover, given the same amount of training tokens, headless language models (HLMs) significantly outperform their classical counterparts on downstream tasks, as shown by a 2.7-point gain in LAMBADA accuracy for our headless generative model. Finally, given similar compute budgets, HLMs bring substantial gains for NLU tasks, with our BERT reproduction scoring 1.6 points above its classical counterpart on the GLUE benchmark.
Researcher Affiliation | Academia | Nathan Godey (1,2), Éric de la Clergerie (1), Benoît Sagot (1); affiliations: (1) Inria, Paris, France; (2) Sorbonne Université, Paris, France
Pseudocode | Yes | Figure 9: PyTorch implementation of the Contrastive Weight Tying loss. Figure 10: PyTorch implementation of the computation of the training loss for headless causal LMs. (A hedged sketch of a loss in this style is given after the table.)
Open Source Code | Yes | Our pretraining and fine-tuning code is published at https://github.com/NathanGodey/headless-lm
Open Datasets | Yes | We pretrain BERT-base architectures (110M parameters) for English on the OpenWebText2 dataset extracted from The Pile (Gao et al., 2020). We pretrain small multilingual MLMs... on the multilingual Wikipedia dataset.
Dataset Splits | Yes | We evaluate on the GLUE benchmark, where we exclude the RTE dataset due to high standard deviations in the obtained scores. We fine-tune our models for 10 epochs on every dataset, and compute validation metrics once every fine-tuning epoch. In Table 1, we compare our headless MLM with the classical MLM on the GLUE benchmark. We display evaluations at similar amounts of tokens seen during pre-training, and at similar training durations on the same hardware. Table 1: Results of Masked Language Models (MLMs) on the dev sets of the GLUE benchmark.
Hardware Specification | Yes | In Figure 2, we provide a preliminary empirical analysis of the speed and memory improvements when training a BERT-base model on a single RTX 8000 GPU. We pretrain all models using 8 A100 GPUs, with a budget of roughly 1,000 hours each. We train on 143,000 batches of 1,024 sequences of length 2,048 split over 16 V100 GPUs.
Software Dependencies | No | The paper mentions software such as Hugging Face's implementation of the Transformer blocks, xFormers (Lefaudeux et al., 2022), and PyTorch (implied by the PyTorch implementation figures), but does not specify their version numbers.
Experiment Setup | Yes | We mostly use hyperparameters from BERT (Devlin et al., 2019), although we remove the NSP objective as in RoBERTa (Liu et al., 2019). For the sake of simplicity, we use a sequence length of 128 for the whole training. We give a detailed overview of the hyperparameters in Appendix D.1. For the vanilla MLM, we set a micro-batch size of 32 for each A100 GPU, then accumulate to the original 256 batch size at optimization level, and train on 1 million batches. For our headless approach, we observed that we could remain within compute budget when using a micro-batch size of 64. Hence, we use an effective batch size of 512 for the headless MLM (HMLM). (A sketch of this accumulation pattern also follows the table.)
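To make the Pseudocode row more concrete, the snippet below is a minimal sketch of a Contrastive Weight Tying-style loss with in-batch negatives: each output representation is scored against the input embeddings of the gold tokens in the batch, and cross-entropy is applied over those similarity scores. The function name, the `temperature` argument, and the exact negative-sampling scheme are illustrative assumptions; the authors' own PyTorch implementations are given in Figures 9 and 10 of the paper and in the released repository.

```python
import torch
import torch.nn.functional as F

def contrastive_weight_tying_loss(output_states: torch.Tensor,
                                   gold_input_embeddings: torch.Tensor,
                                   temperature: float = 1.0) -> torch.Tensor:
    """Sketch of a Contrastive Weight Tying (CWT) style objective.

    output_states:         (N, d) contextual representations at the positions to
                           be predicted (masked positions for an MLM, shifted
                           positions for a causal LM).
    gold_input_embeddings: (N, d) input-embedding vectors of the gold tokens at
                           those same positions.
    """
    # Similarity of every output state with every gold input embedding in the batch.
    logits = output_states @ gold_input_embeddings.T / temperature  # (N, N)
    # The i-th output state should select the i-th gold embedding
    # (the other rows act as in-batch negatives).
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

The point of this construction is that no output projection head over the vocabulary is needed: the contrastive objective ties the model's output representations directly to its input embeddings, which is what makes the "headless" training cheaper per step.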
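The Experiment Setup row describes accumulating micro-batches of 32 per GPU up to an effective batch size of 256 at the optimizer level. The fragment below is a generic, hedged illustration of that accumulation pattern, not the authors' training code; the `.loss` attribute follows the Hugging Face-style model-output convention and is an assumption here.

```python
import torch

def train_with_accumulation(model, optimizer, dataloader,
                            micro_batch_size: int = 32,
                            effective_batch_size: int = 256) -> None:
    """Accumulate gradients over micro-batches until the effective batch size is reached."""
    accum_steps = effective_batch_size // micro_batch_size  # 8 steps in the setup above
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        # Scale the loss so the accumulated gradients average over the effective batch.
        loss = model(**batch).loss / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

For the headless MLM, the same loop with `micro_batch_size=64` and `effective_batch_size=512` would match the settings reported above.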