Are Sixteen Heads Really Better than One?

Authors: Paul Michel, Omer Levy, Graham Neubig

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. We perform a series of experiments in which we remove one or more attention heads from a given architecture at test time, and measure the performance difference."
Researcher Affiliation | Collaboration | Paul Michel, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA (pmichel1@cs.cmu.edu); Omer Levy, Facebook Artificial Intelligence Research, Seattle, WA (omerlevy@fb.com); Graham Neubig, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA (gneubig@cs.cmu.edu)
Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | "Code to replicate our experiments is provided at https://github.com/pmichel31415/are-16-heads-really-better-than-1"
Open Datasets | Yes | WMT: "This is the original large transformer architecture from Vaswani et al. (2017) with 6 layers and 16 heads per layer, trained on the WMT2014 English to French corpus." BERT: "We use the pre-trained base-uncased model of Devlin et al. (2018) with 12 layers and 12 attention heads, which we fine-tune and evaluate on MultiNLI (Williams et al., 2018)."
Dataset Splits | Yes | "For this purpose, we select the best head for each layer on a validation set (newstest2013 for WMT and a 5,000-sized randomly selected subset of the training set of MNLI for BERT) and evaluate the model's performance on a test set (newstest2014 for WMT and the MNLI-matched validation set for BERT)."
Hardware Specification | Yes | "Experiments were conducted on two different machines, both equipped with GeForce GTX 1080 Ti GPUs."
Software Dependencies | No | The paper mentions software such as Moses and compare-mt and refers to PyTorch/fairseq usage, but it does not provide specific version numbers for these dependencies, which are required for reproducibility.
Experiment Setup | No | The paper gives architectural details (e.g., number of layers and heads) and notes the use of pre-trained models, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, optimizer settings) or detailed training configurations in the main text.
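The head-ablation procedure quoted under Research Type (removing one or more attention heads at test time and measuring the performance difference) can be illustrated with a short sketch. The snippet below is a minimal PyTorch illustration, not the authors' released code: the MaskableMultiHeadAttention module and its head_mask argument are hypothetical stand-ins showing how per-head outputs of a multi-head attention layer can be zeroed out at inference time without retraining.

```python
# Minimal sketch of test-time head ablation (assumption: not the authors' released code).
# A multi-head attention layer whose per-head outputs can be zeroed via `head_mask`,
# mimicking "remove one or more attention heads at test time" from the paper.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskableMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 16):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, head_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (batch, seq_len, d_model); head_mask: (n_heads,) with 1 = keep, 0 = ablate.
        bsz, seq_len, _ = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
            return t.view(bsz, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = map(split_heads, (self.q_proj(x), self.k_proj(x), self.v_proj(x)))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v  # (batch, n_heads, seq_len, d_head)

        if head_mask is not None:
            # Zero out the context vectors of ablated heads before the output projection.
            ctx = ctx * head_mask.view(1, self.n_heads, 1, 1)

        ctx = ctx.transpose(1, 2).reshape(bsz, seq_len, -1)
        return self.out_proj(ctx)


if __name__ == "__main__":
    layer = MaskableMultiHeadAttention(d_model=512, n_heads=16)
    x = torch.randn(2, 10, 512)
    mask = torch.ones(16)
    mask[3] = 0.0  # ablate head 3 at test time
    print(layer(x, head_mask=mask).shape)  # torch.Size([2, 10, 512])
```

In the paper's single-head setting, such a mask would keep only the best head in each layer, selected on the validation data listed under Dataset Splits (newstest2013 for WMT, an MNLI training subset for BERT) before scoring on the corresponding test set.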