On Identifiability in Transformers

Authors: Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we delve deep in the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens. We show that, for sequences longer than the attention head dimension, attention weights are not identifiable. We propose effective attention as a complementary tool for improving explanatory interpretations based on attention. Furthermore, we show that input tokens retain to a large degree their identity across the model. We also find evidence suggesting that identity information is mainly encoded in the angle of the embeddings and gradually decreases with depth. Finally, we demonstrate strong mixing of input information in the generation of contextual embeddings by means of a novel quantification method based on gradient attribution.
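The non-identifiability claim and the proposed effective attention lend themselves to a short illustration. Below is a minimal NumPy sketch (function and variable names are my own, not the authors' code) of the core idea: when the sequence length exceeds the head dimension, the value matrix V has a non-trivial left null space, so many different attention matrices A produce the same head output A·V; effective attention removes the null-space component of A.

```python
# Minimal sketch of "effective attention": the part of the attention matrix A
# that actually influences the head output T = A @ V. Illustrative only.
import numpy as np

def effective_attention(A: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Remove from A the component lying in the left null space of V
    (the directions that V maps to zero)."""
    # Left-singular vectors of V with (near-)zero singular values span
    # the left null space of V.
    U, S, _ = np.linalg.svd(V, full_matrices=True)
    rank = int(np.sum(S > 1e-10))
    null_basis = U[:, rank:]                    # (seq_len, seq_len - rank)
    # Row-wise projection of A onto the null space, then subtract it.
    return A - A @ null_basis @ null_basis.T

# Toy example: sequence longer than the head dimension, so the null space is non-trivial.
seq_len, d_v = 12, 8
rng = np.random.default_rng(0)
A = rng.random((seq_len, seq_len))
A /= A.sum(axis=1, keepdims=True)               # row-stochastic, like softmax attention
V = rng.standard_normal((seq_len, d_v))
A_eff = effective_attention(A, V)
# Both attention matrices yield the same head output, so A is not identifiable from A @ V.
assert np.allclose(A @ V, A_eff @ V)
```

The final assertion makes the point concrete: the raw weights and their effective counterpart produce identical head outputs, so the raw weights cannot be uniquely recovered from the output alone.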
Researcher Affiliation | Collaboration | Gino Brunner1, Yang Liu2, Damian Pascual1, Oliver Richter1, Massimiliano Ciaramita3, Roger Wattenhofer1; Departments of 1Electrical Engineering and Information Technology, 2Computer Science, ETH Zurich, Switzerland; 3Google Research, Zurich, Switzerland; 1{brunnegi,dpascual,richtero,wattenhofer}@ethz.ch, 2liu.yang@alumni.ethz.ch, 3massi@google.com
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper refers to the pre-trained uncased BERT-Base model (Devlin et al., 2019) from https://github.com/google-research/bert and mentions using code from Clark et al. (2019) for some figures, but does not provide specific open-source code for the methodologies proposed in *this* paper.
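For readers who want to reproduce the starting point, here is a hedged sketch of loading the same pre-trained uncased BERT-Base checkpoint through the Hugging Face transformers port of the google-research/bert weights (the paper itself uses the original TensorFlow release) and extracting the two quantities the analyses operate on:

```python
# Hedged sketch: load uncased BERT-Base and expose attention weights and
# hidden states; not the tooling used by the authors.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained(
    "bert-base-uncased", output_attentions=True, output_hidden_states=True
)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

attentions = outputs.attentions        # tuple of 12 tensors, shape (1, heads, seq, seq)
hidden_states = outputs.hidden_states  # tuple of 13 tensors, shape (1, seq, 768)
```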
Open Datasets | Yes | For the experiments in this and subsequent sections we use the development dataset from the Microsoft Research Paraphrase Corpus (MRPC) dataset (Dolan & Brockett, 2005), while in Appendix D we provide results on two additional datasets. ... The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2018), and the matched Multi-Genre Natural Language Inference corpus (MNLI-matched) (Williams et al., 2018).
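One way to obtain the quoted evaluation sets (MRPC development split, plus CoLA and MNLI-matched for Appendix D) is through the Hugging Face `datasets` library; the paper does not prescribe this tooling, so the snippet below is only a convenience sketch.

```python
# Hedged sketch: fetch the GLUE development sets referenced in the paper.
from datasets import load_dataset

mrpc_dev = load_dataset("glue", "mrpc", split="validation")            # main experiments
cola_dev = load_dataset("glue", "cola", split="validation")            # Appendix D
mnli_dev = load_dataset("glue", "mnli", split="validation_matched")    # Appendix D

print(len(mrpc_dev), mrpc_dev[0]["sentence1"])
```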
Dataset Splits | Yes | We use 10-fold cross validation with 70/15/15 train/validation/test splits per fold and ensure that tokens from the same sentence are not split across sets. The validation set is used for early stopping. See Appendix B.1 for details.
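The quoted split protocol (10 folds, 70/15/15, no sentence shared across sets) can be sketched as a sentence-level partition of the token set; the function below is illustrative and not the authors' released code.

```python
# Minimal sketch of sentence-level 70/15/15 splits, repeated for 10 folds,
# so tokens from the same sentence never land in different sets.
import numpy as np

def sentence_level_folds(sentence_ids, n_folds=10, seed=0):
    """Yield (train, val, test) boolean masks over tokens for each fold."""
    rng = np.random.default_rng(seed)
    sentence_ids = np.asarray(sentence_ids)
    unique_sents = np.unique(sentence_ids)
    for _ in range(n_folds):
        perm = rng.permutation(unique_sents)
        n_train = int(0.70 * len(perm))
        n_val = int(0.15 * len(perm))
        train_s = perm[:n_train]
        val_s = perm[n_train:n_train + n_val]
        train = np.isin(sentence_ids, train_s)
        val = np.isin(sentence_ids, val_s)
        yield train, val, ~(train | val)

# Example: 10 sentences with 3 tokens each -> 21 / 3 / 6 tokens per fold.
ids = np.repeat(np.arange(10), 3)
for train, val, test in sentence_level_folds(ids, n_folds=1):
    print(train.sum(), val.sum(), test.sum())
```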
Hardware Specification | Yes | This research was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Software Dependencies | No | The paper mentions using the ADAM optimizer (Kingma & Ba, 2015), Glorot Uniform initializer (Glorot & Bengio, 2010), and gelu activation function (Hendrycks & Gimpel, 2016), but does not provide specific version numbers for software dependencies like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | The linear perceptron and MLP are both trained by either minimizing the L2 or cosine distance loss using the ADAM optimizer (Kingma & Ba, 2015) with a learning rate of α = 0.0001, β1 = 0.9 and β2 = 0.999. We use a batch size of 256. We monitor performance on the validation set and stop training if there is no improvement for 20 epochs. The input and output dimension of the models is d = 768; the dimension of the contextual word embeddings. For both models we performed a learning rate search over the values α ∈ [0.003, 0.001, 0.0003, 0.0001, 0.00003, 0.00001, 0.000003]. The weights are initialized with the Glorot Uniform initializer (Glorot & Bengio, 2010). The MLP has one hidden layer with 1000 neurons and uses the gelu activation function (Hendrycks & Gimpel, 2016).
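The quoted probing setup translates directly into code. Below is a hedged PyTorch sketch of the two probes (a linear map and a 768 → 1000 → 768 MLP with GELU), the Glorot/Xavier-uniform initialization, the Adam hyperparameters, and the two loss variants; the authors worked with TensorFlow and TPUs, so this is only an illustrative reconstruction, not their implementation.

```python
# Hedged sketch of the linear and MLP probes with the quoted hyperparameters.
import torch
import torch.nn as nn

D, HIDDEN = 768, 1000   # contextual embedding dim, MLP hidden width

def glorot_init(module):
    """Glorot (Xavier) uniform init for all linear layers."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

linear_probe = nn.Linear(D, D)
mlp_probe = nn.Sequential(nn.Linear(D, HIDDEN), nn.GELU(), nn.Linear(HIDDEN, D))
for model in (linear_probe, mlp_probe):
    model.apply(glorot_init)

def loss_fn(pred, target, kind="l2"):
    """Either squared-L2 distance or cosine distance, as in the quoted setup."""
    if kind == "l2":
        return ((pred - target) ** 2).sum(dim=-1).mean()
    return (1.0 - nn.functional.cosine_similarity(pred, target, dim=-1)).mean()

optimizer = torch.optim.Adam(mlp_probe.parameters(), lr=1e-4, betas=(0.9, 0.999))

# One illustrative training step on random stand-in data (batch size 256).
x, y = torch.randn(256, D), torch.randn(256, D)
loss = loss_fn(mlp_probe(x), y, kind="l2")
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Early stopping (not shown): halt when validation loss fails to improve for 20 epochs.
```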