On Identifiability in Transformers

Authors: Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we delve deep in the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens. We show that, for sequences longer than the attention head dimension, attention weights are not identifiable. We propose effective attention as a complementary tool for improving explanatory interpretations based on attention. Furthermore, we show that input tokens retain to a large degree their identity across the model. We also find evidence suggesting that identity information is mainly encoded in the angle of the embeddings and gradually decreases with depth. Finally, we demonstrate strong mixing of input information in the generation of contextual embeddings by means of a novel quantification method based on gradient attribution.
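The non-identifiability claim and the proposed effective attention lend themselves to a short illustration. Below is a minimal NumPy sketch (function and variable names are my own, not the authors' code) of the core idea: when the sequence length exceeds the head dimension, the value matrix V has a non-trivial left null space, so many different attention matrices A produce the same head output A·V; effective attention removes the null-space component of A.

```python
# Minimal sketch of "effective attention": the part of the attention matrix A
# that actually influences the head output T = A @ V. Illustrative only.
import numpy as np

def effective_attention(A: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Remove from A the component lying in the left null space of V
    (the directions that V maps to zero)."""
    # Left-singular vectors of V with (near-)zero singular values span
    # the left null space of V.
    U, S, _ = np.linalg.svd(V, full_matrices=True)
    rank = int(np.sum(S > 1e-10))
    null_basis = U[:, rank:]                    # (seq_len, seq_len - rank)
    # Row-wise projection of A onto the null space, then subtract it.
    return A - A @ null_basis @ null_basis.T

# Toy example: sequence longer than the head dimension, so the null space is non-trivial.
seq_len, d_v = 12, 8
rng = np.random.default_rng(0)
A = rng.random((seq_len, seq_len))
A /= A.sum(axis=1, keepdims=True)               # row-stochastic, like softmax attention
V = rng.standard_normal((seq_len, d_v))
A_eff = effective_attention(A, V)
# Both attention matrices yield the same head output, so A is not identifiable from A @ V.
assert np.allclose(A @ V, A_eff @ V)
```

The final assertion makes the point concrete: the raw weights and their effective counterpart produce identical head outputs, so the raw weights cannot be uniquely recovered from the output alone.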
Researcher Affiliation | Collaboration | Gino Brunner1, Yang Liu2, Damian Pascual1, Oliver Richter1, Massimiliano Ciaramita3, Roger Wattenhofer1; Departments of 1Electrical Engineering and Information Technology, 2Computer Science, ETH Zurich, Switzerland; 3Google Research, Zurich, Switzerland; 1{brunnegi,dpascual,richtero,wattenhofer}@ethz.ch, 2liu.yang@alumni.ethz.ch, 3massi@google.com
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper refers to the pre-trained uncased BERT-Base model (Devlin et al., 2019) from https://github.com/google-research/bert and mentions using code from Clark et al. (2019) for some figures, but does not provide specific open-source code for the methodologies proposed in *this* paper.
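For readers who want to reproduce the starting point, here is a hedged sketch of loading the same pre-trained uncased BERT-Base checkpoint through the Hugging Face transformers port of the google-research/bert weights (the paper itself uses the original TensorFlow release) and extracting the two quantities the analyses operate on:

```python
# Hedged sketch: load uncased BERT-Base and expose attention weights and
# hidden states; not the tooling used by the authors.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained(
    "bert-base-uncased", output_attentions=True, output_hidden_states=True
)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

attentions = outputs.attentions        # tuple of 12 tensors, shape (1, heads, seq, seq)
hidden_states = outputs.hidden_states  # tuple of 13 tensors, shape (1, seq, 768)
```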
Open Datasets | Yes | For the experiments in this and subsequent sections we use the development dataset from the Microsoft Research Paraphrase Corpus (MRPC) dataset (Dolan & Brockett, 2005), while in Appendix D we provide results on two additional datasets. ... The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2018), and the matched Multi-Genre Natural Language Inference corpus (MNLI-matched) (Williams et al., 2018).
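One way to obtain the quoted evaluation sets (MRPC development split, plus CoLA and MNLI-matched for Appendix D) is through the Hugging Face `datasets` library; the paper does not prescribe this tooling, so the snippet below is only a convenience sketch.

```python
# Hedged sketch: fetch the GLUE development sets referenced in the paper.
from datasets import load_dataset

mrpc_dev = load_dataset("glue", "mrpc", split="validation")            # main experiments
cola_dev = load_dataset("glue", "cola", split="validation")            # Appendix D
mnli_dev = load_dataset("glue", "mnli", split="validation_matched")    # Appendix D

print(len(mrpc_dev), mrpc_dev[0]["sentence1"])
```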
Dataset Splits | Yes | We use 10-fold cross validation with 70/15/15 train/validation/test splits per fold and ensure that tokens from the same sentence are not split across sets. The validation set is used for early stopping. See Appendix B.1 for details.
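The quoted split protocol (10 folds, 70/15/15, no sentence shared across sets) can be sketched as a sentence-level partition of the token set; the function below is illustrative and not the authors' released code.

```python
# Minimal sketch of sentence-level 70/15/15 splits, repeated for 10 folds,
# so tokens from the same sentence never land in different sets.
import numpy as np

def sentence_level_folds(sentence_ids, n_folds=10, seed=0):
    """Yield (train, val, test) boolean masks over tokens for each fold."""
    rng = np.random.default_rng(seed)
    sentence_ids = np.asarray(sentence_ids)
    unique_sents = np.unique(sentence_ids)
    for _ in range(n_folds):
        perm = rng.permutation(unique_sents)
        n_train = int(0.70 * len(perm))
        n_val = int(0.15 * len(perm))
        train_s = perm[:n_train]
        val_s = perm[n_train:n_train + n_val]
        train = np.isin(sentence_ids, train_s)
        val = np.isin(sentence_ids, val_s)
        yield train, val, ~(train | val)

# Example: 10 sentences with 3 tokens each -> 21 / 3 / 6 tokens per fold.
ids = np.repeat(np.arange(10), 3)
for train, val, test in sentence_level_folds(ids, n_folds=1):
    print(train.sum(), val.sum(), test.sum())
```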
Hardware Specification | Yes | This research was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Software Dependencies | No | The paper mentions using the ADAM optimizer (Kingma & Ba, 2015), Glorot Uniform initializer (Glorot & Bengio, 2010), and gelu activation function (Hendrycks & Gimpel, 2016), but does not provide specific version numbers for software dependencies like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | The linear perceptron and MLP are both trained by either minimizing the L2 or cosine distance loss using the ADAM optimizer (Kingma & Ba, 2015) with a learning rate of α = 0.0001, β1 = 0.9 and β2 = 0.999. We use a batch size of 256. We monitor performance on the validation set and stop training if there is no improvement for 20 epochs. The input and output dimension of the models is d = 768; the dimension of the contextual word embeddings. For both models we performed a learning rate search over the values α ∈ [0.003, 0.001, 0.0003, 0.0001, 0.00003, 0.00001, 0.000003]. The weights are initialized with the Glorot Uniform initializer (Glorot & Bengio, 2010). The MLP has one hidden layer with 1000 neurons and uses the gelu activation function (Hendrycks & Gimpel, 2016).
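The quoted probing setup translates directly into code. Below is a hedged PyTorch sketch of the two probes (a linear map and a 768 → 1000 → 768 MLP with GELU), the Glorot/Xavier-uniform initialization, the Adam hyperparameters, and the two loss variants; the authors worked with TensorFlow and TPUs, so this is only an illustrative reconstruction, not their implementation.

```python
# Hedged sketch of the linear and MLP probes with the quoted hyperparameters.
import torch
import torch.nn as nn

D, HIDDEN = 768, 1000   # contextual embedding dim, MLP hidden width

def glorot_init(module):
    """Glorot (Xavier) uniform init for all linear layers."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

linear_probe = nn.Linear(D, D)
mlp_probe = nn.Sequential(nn.Linear(D, HIDDEN), nn.GELU(), nn.Linear(HIDDEN, D))
for model in (linear_probe, mlp_probe):
    model.apply(glorot_init)

def loss_fn(pred, target, kind="l2"):
    """Either squared-L2 distance or cosine distance, as in the quoted setup."""
    if kind == "l2":
        return ((pred - target) ** 2).sum(dim=-1).mean()
    return (1.0 - nn.functional.cosine_similarity(pred, target, dim=-1)).mean()

optimizer = torch.optim.Adam(mlp_probe.parameters(), lr=1e-4, betas=(0.9, 0.999))

# One illustrative training step on random stand-in data (batch size 256).
x, y = torch.randn(256, D), torch.randn(256, D)
loss = loss_fn(mlp_probe(x), y, kind="l2")
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Early stopping (not shown): halt when validation loss fails to improve for 20 epochs.
```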