Self-Attention Attribution: Interpreting Information Interactions Inside Transformer

Authors: Yaru Hao, Li Dong, Furu Wei, Ke Xu

AAAI 2021, pp. 12963-12971 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We take BERT as an example to conduct extensive studies. For example, on the MNLI dataset, adding one adversarial pattern into the premise can drop the accuracy of entailment from 82.87% to 0.8%.
Researcher Affiliation | Collaboration | 1 Beihang University, 2 Microsoft Research; {haoyaru@,kexu@nlsde.}buaa.edu.cn, {lidong1,fuwei}@microsoft.com
Pseudocode | Yes | Algorithm 1: Attribution Tree Construction (the attribution score it consumes is sketched below).
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the proposed ATTATTR method, nor does it provide a link to a code repository.
Open Datasets | Yes | We perform BERT fine-tuning and conduct experiments on four classification datasets. MNLI (Williams, Nangia, and Bowman 2018)... RTE (Dagan, Glickman, and Magnini 2006; Bar-Haim et al. 2006; Giampiccolo et al. 2007; Bentivogli et al. 2009)... SST-2 (Socher et al. 2013)... MRPC (Dolan and Brockett 2005)...
Dataset Splits | Yes | We use the same data split as in (Wang et al. 2019). We calculate Ih on 200 examples sampled from the held-out dataset. (See the head-importance sketch below.)
Hardware Specification | Yes | For a sequence of 128 tokens, the attribution time of the BERT-base model takes about one second on an Nvidia-v100 GPU card.
Software Dependencies | No | The paper mentions using 'BERT-base-cased' and the fine-tuning settings suggested in Devlin et al. (2019), but does not provide specific software version numbers for libraries or environments such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | When fine-tuning BERT, we follow the settings and the hyper-parameters suggested in (Devlin et al. 2019). In our experiments, we set m to 20, which performs well in practice. We set τ = 0.4 for layers l < 12. ... we set τ to 0 for the last layer. (The roles of m and τ are illustrated in the sketches below.)
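
To make the quoted setup concrete: the score that Algorithm 1 consumes is the paper's attention attribution, Attr_h(A) = A_h ⊙ (1/m) Σ_{k=1}^{m} ∂F((k/m)·A)/∂A_h, an m-step Riemann approximation of integrated gradients over a head's attention map (m = 20 in the quoted setup). Below is a minimal, self-contained sketch of that computation. ToyAttention and its attn_override argument are illustrative stand-ins invented here, since no official code is released; the paper applies the same computation to each attention head of a fine-tuned BERT.

```python
import torch
import torch.nn as nn

class ToyAttention(nn.Module):
    """Single-head self-attention followed by a mean-pool linear classifier."""
    def __init__(self, dim=16, num_classes=2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, x, attn_override=None):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        if attn_override is not None:
            attn = attn_override          # run the model on a scaled copy of A
        out = attn @ v
        return self.cls(out.mean(dim=1)), attn

def attention_attribution(model, x, target, m=20):
    """Attr(A) ~= A * (1/m) * sum_k dF((k/m) * A) / dA  (Riemann sum, m steps)."""
    with torch.no_grad():
        _, attn = model(x)                # the head's actual attention map A
    total_grad = torch.zeros_like(attn)
    for k in range(1, m + 1):
        scaled = ((k / m) * attn).requires_grad_(True)   # interpolation point
        logits, _ = model(x, attn_override=scaled)
        (grad,) = torch.autograd.grad(logits[0, target], scaled)
        total_grad += grad
    return attn * total_grad / m          # elementwise product with A

x = torch.randn(1, 8, 16)                 # one 8-token sequence, hidden dim 16
attr = attention_attribution(ToyAttention(), x, target=0, m=20)
print(attr.shape)                          # torch.Size([1, 8, 8]) token pairs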
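
The Ih quoted in the Dataset Splits row is the paper's head-importance statistic, the expected maximum attribution score of a head over examples, I_h = E_x[max Attr_h(A)], estimated on 200 held-out examples. A hedged sketch follows, reusing attention_attribution from the block above; the held_out list of (inputs, label) pairs is a hypothetical stand-in for the sampled held-out data.

```python
def head_importance(model, held_out, num_samples=200, m=20):
    """Estimate I_h = E_x[max Attr_h(A)] over sampled held-out examples."""
    maxima = []
    for x, label in held_out[:num_samples]:
        attr = attention_attribution(model, x, target=label, m=m)
        maxima.append(attr.max())          # largest interaction in this head
    return torch.stack(maxima).mean()      # average over the sample

# Usage with the toy model above; random data stands in for the held-out set.
model = ToyAttention()
held_out = [(torch.randn(1, 8, 16), 0) for _ in range(5)]
print(float(head_importance(model, held_out, num_samples=5)))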
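
Lastly, the τ values in the Experiment Setup row act as per-layer cut-offs when the attribution tree decides which token-to-token interaction edges to keep. The sketch below shows only that filtering step: the per-layer max normalization is my assumption rather than the paper's exact procedure, and attr_per_layer is a hypothetical list of per-layer attribution matrices such as those produced by the first sketch.

```python
import torch

def filter_edges(attr_per_layer, tau_default=0.4):
    """Keep edge (i, j) in layer l only if its normalized attribution exceeds
    tau: tau = 0.4 for layers below the last, tau = 0 for the last layer,
    matching the quoted setup."""
    edges = []
    last = len(attr_per_layer) - 1
    for l, attr in enumerate(attr_per_layer):
        tau = 0.0 if l == last else tau_default
        # Assumed normalization: scale each layer's scores into [-1, 1].
        a = attr / attr.abs().max().clamp_min(1e-12)
        for i, j in (a.squeeze(0) > tau).nonzero().tolist():
            edges.append((l, i, j))   # (layer, receiving token, sending token)
    return edges

# Usage: per-layer attributions for one example (random here for brevity).
attrs = [torch.randn(1, 8, 8) for _ in range(12)]
print(len(filter_edges(attrs)))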