Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Authors: Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining and sometimes even improving the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.
Researcher Affiliation | Industry | Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort; Qualcomm AI Research, Amsterdam, The Netherlands; {ybond, markusn, tijmen}@qti.qualcomm.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as "Pseudocode" or "Algorithm".
Open Source Code | Yes | Our source code is available at https://github.com/qualcomm-ai-research/outlier-free-transformers.
Open Datasets | Yes | We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers. ... We experiment with BERT-base-uncased ... fine-tune it on the MNLI dataset from the well-known GLUE benchmark ... We experiment with a 125M-sized variant of OPT ... We use a pre-trained checkpoint following our experimental setup from Section 5 for Vision Transformer [15] trained on ImageNet [52].
Dataset Splits | Yes | To identify the outlier dimensions, we pass the MNLI-m validation set through the network ... We evaluate on the Wikipedia validation set and report the MLM perplexity. ... We evaluate on the Wikipedia validation set and report the CLM perplexity. ... We report top-1 accuracy on the validation set of ImageNet.
Hardware Specification | Yes | Due to compute constraints, we train the model on the same dataset that was used for BERT pre-training (BookCorpus + Wikipedia) with a maximum sequence length of 512 and batch size of 192, so that we can perform pre-training on a single A100 80GB GPU.
Software Dependencies | No | The paper states: "We implement our methods in PyTorch [48] and use training and evaluation pipelines from Hugging Face libraries [20, 34, 65]" and "We also use FP16 mixed-precision from Hugging Face Accelerate library [20]". However, it does not provide specific version numbers for PyTorch, the Hugging Face libraries, or Accelerate, which are necessary for full reproducibility.
Experiment Setup | Yes | All detailed hyperparameters of our experiments are in Appendix C. ... We fine-tune for 3 epochs using Adam [29] with a batch size of 16 and no weight decay. The learning rate is initially set to its maximum value of 2×10^-5 and is linearly decayed to zero by the end of fine-tuning. ... We train with a batch size of 256 sequences for 10^6 steps, using the AdamW optimizer [39] with a maximum learning rate of 10^-4, learning rate warm-up over the first 10^4 steps, followed by a linear decay to zero by the end of training. We use L2 weight decay of 0.01, L2 gradient norm clipping of 1.0, and dropout probability of 0.1 on all layers.
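Below is a minimal sketch of the BERT-base MNLI fine-tuning recipe quoted in the Experiment Setup row (Adam, batch size 16, no weight decay, peak learning rate 2×10^-5 linearly decayed to zero over 3 epochs). It uses the Hugging Face Transformers and Datasets APIs mentioned in the Software Dependencies row; tokenization details such as the maximum sequence length are assumptions, since the paper defers full hyperparameters to its Appendix C.

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # MNLI has 3 classes

# MNLI from the GLUE benchmark; "validation_matched" is the MNLI-m split
# referenced in the Dataset Splits row.
mnli = load_dataset("glue", "mnli")

def tokenize(batch):
    # max_length is an assumption, not stated in the quoted setup
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

train_set = mnli["train"].map(tokenize, batched=True)

batch_size, epochs = 16, 3
num_training_steps = (len(train_set) // batch_size) * epochs

# Adam with no weight decay; LR starts at its maximum of 2e-5 and is
# linearly decayed to zero by the end of fine-tuning (no warm-up stated).
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=0.0)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
```

The pre-training recipe quoted in the same row could be expressed analogously with `torch.optim.AdamW` at a peak learning rate of 1e-4, `num_warmup_steps=10_000`, 10^6 total steps, weight decay 0.01, and gradient-norm clipping at 1.0 via `torch.nn.utils.clip_grad_norm_`.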