Are Sixteen Heads Really Better than One?

Authors: Paul Michel, Omer Levy, Graham Neubig

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. We perform a series of experiments in which we remove one or more attention heads from a given architecture at test time, and measure the performance difference."
Researcher Affiliation | Collaboration | Paul Michel, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA (pmichel1@cs.cmu.edu); Omer Levy, Facebook Artificial Intelligence Research, Seattle, WA (omerlevy@fb.com); Graham Neubig, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA (gneubig@cs.cmu.edu)
Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | "Code to replicate our experiments is provided at https://github.com/pmichel31415/are-16-heads-really-better-than-1"
Open Datasets | Yes | WMT: "This is the original large transformer architecture from Vaswani et al. (2017) with 6 layers and 16 heads per layer, trained on the WMT2014 English to French corpus." BERT: "We use the pre-trained base-uncased model of Devlin et al. (2018) with 12 layers and 12 attention heads, which we fine-tune and evaluate on MultiNLI (Williams et al., 2018)."
Dataset Splits | Yes | "For this purpose, we select the best head for each layer on a validation set (newstest2013 for WMT and a 5,000-sized randomly selected subset of the training set of MNLI for BERT) and evaluate the model's performance on a test set (newstest2014 for WMT and the MNLI-matched validation set for BERT)."
Hardware Specification | Yes | "Experiments were conducted on two different machines, both equipped with GeForce GTX 1080 Ti GPUs."
Software Dependencies | No | The paper mentions software such as Moses and compare-mt and refers to PyTorch/fairseq usage, but it does not provide specific version numbers for these dependencies, which are required for reproducibility.
Experiment Setup | No | The paper gives architectural details (e.g., number of layers and heads) and notes the use of pre-trained models, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, optimizer settings) or detailed training configurations in the main text.
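The head-ablation procedure quoted under Research Type (removing one or more attention heads at test time and measuring the performance difference) can be illustrated with a short sketch. The snippet below is a minimal PyTorch illustration, not the authors' released code: the MaskableMultiHeadAttention module and its head_mask argument are hypothetical stand-ins showing how per-head outputs of a multi-head attention layer can be zeroed out at inference time without retraining.

```python
# Minimal sketch of test-time head ablation (assumption: not the authors' released code).
# A multi-head attention layer whose per-head outputs can be zeroed via `head_mask`,
# mimicking "remove one or more attention heads at test time" from the paper.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskableMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 16):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, head_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (batch, seq_len, d_model); head_mask: (n_heads,) with 1 = keep, 0 = ablate.
        bsz, seq_len, _ = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
            return t.view(bsz, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = map(split_heads, (self.q_proj(x), self.k_proj(x), self.v_proj(x)))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v  # (batch, n_heads, seq_len, d_head)

        if head_mask is not None:
            # Zero out the context vectors of ablated heads before the output projection.
            ctx = ctx * head_mask.view(1, self.n_heads, 1, 1)

        ctx = ctx.transpose(1, 2).reshape(bsz, seq_len, -1)
        return self.out_proj(ctx)


if __name__ == "__main__":
    layer = MaskableMultiHeadAttention(d_model=512, n_heads=16)
    x = torch.randn(2, 10, 512)
    mask = torch.ones(16)
    mask[3] = 0.0  # ablate head 3 at test time
    print(layer(x, head_mask=mask).shape)  # torch.Size([2, 10, 512])
```

In the paper's single-head setting, such a mask would keep only the best head in each layer, selected on the validation data listed under Dataset Splits (newstest2013 for WMT, an MNLI training subset for BERT) before scoring on the corresponding test set.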