Low-Rank Bottleneck in Multi-head Attention Models

Authors: Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper we identify one of the important factors contributing to the large embedding size requirement. In particular, our analysis highlights that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads, causing this limitation. We further validate this in our experiments. As a solution we propose to set the head size of an attention unit to input sequence length, and independent of the number of heads, resulting in multi-head attention layers with provably more expressive power. We empirically show that this allows us to train models with a relatively smaller embedding dimension and with better performance scaling." and Section 4 ("Experiments"). (A minimal code sketch of the proposed fixed-head-size attention appears after this table.)
Researcher Affiliation | Collaboration | "1 Google Research, New York; 2 Massachusetts Institute of Technology. Correspondence to: Srinadh Bhojanapalli <bsrinadh@google.com>, Chulhee Yun <chulheey@mit.edu>."
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | "We follow the same experimental setup for both pre-training and fine-tuning as BERT (Devlin et al., 2018), and use their codebase (https://github.com/google-research/bert)." The paper points to BERT's codebase, not to code released by the authors for the method described in the paper.
Open Datasets | Yes | "For the language modeling task we use the one billion word benchmark dataset (LM1B) (Chelba et al., 2013)."; "Multi-Genre Natural Language Inference (MNLI) is a sentence level entailment task... (Williams et al., 2018)."; "Stanford Question Answering Dataset (SQuAD) is a question answering dataset... (Rajpurkar et al., 2016)."; "For pre-training we use English Wikipedia and Books Corpus dataset (Zhu et al., 2015)."
Dataset Splits | Yes | "All results in this section are reported on the Dev set, which has not been used in any experimental choices in this paper."
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or TPU versions) are mentioned for the experiments in the paper.
Software Dependencies | No | "We train a 6 layer Transformer model with the ADAM optimizer using the tensor2tensor library (Vaswani et al., 2018)." No version number is given for the tensor2tensor library or any other software dependency.
Experiment Setup | Yes | "We use a sub-word tokenizer with 32k vocab and cap the input to 256 sequence length."; "Our proposed modification introduces head size d_p as a new model hyper-parameter. We choose head size to be 128 for our BERT experiments, as most of the pre-training is done with 128 sequence length data."; "We train the fixed head size models with a fixed embedding size of 256 and a head size of 32, with an increasing number of heads from 4 to 70, while matching the number of parameters."; "We compare it with the fixed head size model, with an embedding size of 512 and a head size of 128, with an increasing number of heads from 8 to 32." (A configuration summary appears after this table.)
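
The architectural change behind the "Research Type" evidence is easy to see in code: the per-head projection size becomes a hyper-parameter of its own instead of being tied to d_model / num_heads. The NumPy sketch below is a minimal illustration under that assumption (random weights and illustrative function names; it is not the authors' implementation). Setting head_size to the input sequence length, as the paper proposes, removes the rank limit that the standard coupling places on each head's attention matrix.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax over attention scores."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multihead_attention(X, num_heads, head_size, seed=0):
    """Single-layer multi-head self-attention with a free head size.

    X: (seq_len, d_model) input embeddings.
    head_size is an independent hyper-parameter; it is NOT forced to be
    d_model // num_heads, which is the coupling the paper identifies as
    the source of the low-rank bottleneck.
    """
    rng = np.random.default_rng(seed)
    seq_len, d_model = X.shape
    heads = []
    for _ in range(num_heads):
        # Per-head query/key/value projections: d_model -> head_size.
        Wq = rng.normal(size=(d_model, head_size)) / np.sqrt(d_model)
        Wk = rng.normal(size=(d_model, head_size)) / np.sqrt(d_model)
        Wv = rng.normal(size=(d_model, head_size)) / np.sqrt(d_model)
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        attn = softmax(Q @ K.T / np.sqrt(head_size))   # (seq_len, seq_len)
        heads.append(attn @ V)                         # (seq_len, head_size)
    # Concatenate heads and project back to the embedding dimension.
    Wo = rng.normal(size=(num_heads * head_size, d_model)) / np.sqrt(num_heads * head_size)
    return np.concatenate(heads, axis=-1) @ Wo         # (seq_len, d_model)

# Example: head_size set to the sequence length, independent of num_heads.
X = np.full((128, 256), 0.01)                          # seq_len 128, embedding size 256
out = multihead_attention(X, num_heads=8, head_size=128)
print(out.shape)                                       # (128, 256)
```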
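
The quoted "Experiment Setup" values can also be collected into a small configuration summary. The sketch below is hypothetical (variable names and the dictionary layout are assumptions); it only restates the numbers given in the paper, with each head-count sweep recorded by its stated endpoints.

```python
# Hypothetical summary of the settings quoted in the "Experiment Setup" row.
# Variable and field names are illustrative, not from the authors' code;
# only the numeric values come from the paper.

tokenizer = dict(vocab_size=32_000, max_seq_len=256)  # 32k sub-word vocab, inputs capped at 256

bert_head_size = 128  # chosen because most BERT pre-training uses length-128 sequences

# Fixed-head-size sweeps: heads increase while parameter counts are matched.
sweep_a = dict(embedding_size=256, head_size=32, num_heads=(4, 70))    # heads from 4 to 70
sweep_b = dict(embedding_size=512, head_size=128, num_heads=(8, 32))   # heads from 8 to 32

for name, cfg in (("sweep_a", sweep_a), ("sweep_b", sweep_b)):
    lo, hi = cfg["num_heads"]
    print(f"{name}: embedding {cfg['embedding_size']}, head size {cfg['head_size']}, "
          f"heads {lo}-{hi}")
```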