Are Sixteen Heads Really Better than One?
Authors: Paul Michel, Omer Levy, Graham Neubig
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. We perform a series of experiments in which we remove one or more attention heads from a given architecture at test time, and measure the performance difference. (A minimal head-ablation sketch appears after this table.) |
| Researcher Affiliation | Collaboration | Paul Michel, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA (pmichel1@cs.cmu.edu); Omer Levy, Facebook Artificial Intelligence Research, Seattle, WA (omerlevy@fb.com); Graham Neubig, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA (gneubig@cs.cmu.edu) |
| Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code to replicate our experiments is provided at https://github.com/pmichel31415/are-16-heads-really-better-than-1 |
| Open Datasets | Yes | WMT: This is the original large transformer architecture from Vaswani et al. (2017) with 6 layers and 16 heads per layer, trained on the WMT2014 English to French corpus. BERT: We use the pre-trained base-uncased model of Devlin et al. (2018) with 12 layers and 12 attention heads, which we fine-tune and evaluate on MultiNLI (Williams et al., 2018). (See the dataset-loading sketch after this table.) |
| Dataset Splits | Yes | For this purpose, we select the best head for each layer on a validation set (newstest2013 for WMT and a 5,000-example random subset of the MNLI training set for BERT) and evaluate the model's performance on a test set (newstest2014 for WMT and the MNLI-matched validation set for BERT). (See the head-selection sketch after this table.) |
| Hardware Specification | Yes | Experiments were conducted on two different machines, both equipped with GeForce GTX 1080Ti GPUs. |
| Software Dependencies | No | The paper mentions software such as Moses and compare-mt and refers to PyTorch/fairseq usage, but it does not provide the specific version numbers for these dependencies that reproducibility requires. |
| Experiment Setup | No | The paper mentions architectural details (e.g., number of layers and heads) and the use of pre-trained models, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, optimizer settings) or detailed training configurations in the main text. |
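
The test-time head ablation summarized in the Research Type row can be illustrated with a short sketch. This is not the authors' released fairseq/PyTorch code; it assumes the HuggingFace `transformers` library, a BERT classifier already fine-tuned on MNLI, and a user-supplied iterable of mini-batches. The checkpoint path and the `batches` argument are placeholders.

```python
# Minimal sketch of single-head ablation at test time (assumption: HuggingFace
# `transformers`; `model_name` and `batches` are placeholders supplied by the
# reader: a classifier fine-tuned on MNLI and an iterable of
# ((premises, hypotheses), labels) mini-batches).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def evaluate(model, tokenizer, batches, head_mask):
    """Accuracy under a (num_layers, num_heads) head mask: 1 = keep, 0 = ablate."""
    correct, total = 0, 0
    with torch.no_grad():
        for (premises, hypotheses), labels in batches:
            enc = tokenizer(premises, hypotheses, padding=True,
                            truncation=True, return_tensors="pt")
            logits = model(**enc, head_mask=head_mask).logits
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.numel()
    return correct / total

def ablate_each_head(model_name, batches):
    """Remove one attention head at a time and report the accuracy change."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
    L, H = model.config.num_hidden_layers, model.config.num_attention_heads
    baseline = evaluate(model, tokenizer, batches, torch.ones(L, H))
    for layer in range(L):
        for head in range(H):
            mask = torch.ones(L, H)
            mask[layer, head] = 0.0            # mask this single head at test time
            delta = evaluate(model, tokenizer, batches, mask) - baseline
            print(f"layer {layer:2d} head {head:2d}: delta acc = {delta:+.4f}")

# Usage (placeholder checkpoint and data):
# ablate_each_head("path/to/bert-finetuned-on-mnli", mnli_dev_batches)
```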
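The splits described in the Open Datasets and Dataset Splits rows can be approximated with the HuggingFace `datasets` library. The hub identifiers `"multi_nli"` and `"wmt14"`/`"fr-en"` and the random seed are this report's assumptions; the paper does not prescribe a particular loader.

```python
# Sketch of the data setup: a 5,000-example validation subset drawn from MNLI
# training, MNLI-matched as the test set, and newstest2014 as the WMT test set.
# Dataset IDs and the seed are assumptions, not taken from the paper.
from datasets import load_dataset

mnli = load_dataset("multi_nli")

# Validation set for head selection: 5,000 randomly selected training examples.
val_split = mnli["train"].shuffle(seed=42).select(range(5_000))
# Test set reported in the paper: the MNLI-matched validation split.
test_split = mnli["validation_matched"]

# For the WMT experiments the paper uses newstest2013 (validation) and
# newstest2014 (test) for English-French; the test pairs are available as:
wmt_test = load_dataset("wmt14", "fr-en", split="test")  # newstest2014

print(len(val_split), len(test_split), len(wmt_test))
```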
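The "best head per layer" evaluation quoted in the Dataset Splits row can be sketched by reusing `evaluate` from the head-ablation sketch above: for each layer, ablate all heads except one, pick the head that scores best on the validation split, and report the test-set score of that single-layer-pruned model. This is one plausible reading of the procedure, not the authors' exact code; `model`, `tokenizer`, `val_batches`, and `test_batches` are assumed to be prepared as in the previous sketches.

```python
# Sketch of per-layer best-head selection on validation, scored on test.
# Reuses `evaluate` from the head-ablation sketch above; inputs are placeholders.
import torch

def best_head_per_layer(model, tokenizer, val_batches, test_batches):
    """For each layer, keep only its best head (chosen on validation) and
    report the test performance of that single-layer-pruned model."""
    L, H = model.config.num_hidden_layers, model.config.num_attention_heads
    results = []
    for layer in range(L):
        def mask_keeping(head):
            m = torch.ones(L, H)
            m[layer] = 0.0
            m[layer, head] = 1.0       # keep only this head in `layer`
            return m
        # Select the best head for this layer on the validation set...
        best = max(range(H),
                   key=lambda h: evaluate(model, tokenizer, val_batches,
                                          mask_keeping(h)))
        # ...then measure the test score with the layer's other heads ablated.
        test_score = evaluate(model, tokenizer, test_batches, mask_keeping(best))
        results.append((layer, best, test_score))
    return results
```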