Supervising Model Attention with Human Explanations for Robust Natural Language Inference
Authors: Joe Stacey, Yonatan Belinkov, Marek Rei (pp. 11349-11357)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the in-distribution improvements of this method are also accompanied by out-of-distribution improvements, with the supervised models learning from features that generalise better to other NLI datasets. The experiments show that supervising the attention patterns of BERT based on human explanations simultaneously improves both in-distribution and out-of-distribution NLI performance (Table 1). |
| Researcher Affiliation | Academia | Imperial College London; Technion - Israel Institute of Technology |
| Pseudocode | No | The paper includes mathematical equations for the loss calculation and attention weights (e.g., Loss_Total = Loss_NLI + λ·Loss_Attention) and diagrams illustrating attention supervision (Figure 2), but no formal pseudocode or algorithm blocks are provided (see the sketch after the table). |
| Open Source Code | Yes | https://github.com/joestacey/NLI_with_a_human_touch |
| Open Datasets | Yes | Using natural language explanations, we supervise the model's attention weights to encourage more attention to be paid to the words present in the explanations, significantly improving model performance. Our experiments show that the in-distribution improvements of this method are also accompanied by out-of-distribution improvements, with the supervised models learning from features that generalise better to other NLI datasets. (mentions using the e-SNLI dataset (Camburu et al. 2018)) |
| Dataset Splits | Yes | λ was chosen based on performance on the validation set, trying values in the range [0.2, 1.8] at increments of 0.2. The robustness of the model is assessed by significance testing on the MultiNLI matched and mismatched validation sets (Williams, Nangia, and Bowman 2018). |
| Hardware Specification | No | The paper discusses the models used (BERT, DeBERTa) and datasets, but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) on which the experiments were run. |
| Software Dependencies | No | The paper mentions various models and tools such as BERT, DeBERTa, GPT2, RoBERTa, and SpaCy (for POS tagging), but it does not specify any version numbers for these software components or any other libraries like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | λ was chosen based on performance on the validation set, trying values in the range [0.2, 1.8] at increments of 0.2. For our BERT model the best performing λ is 1.0, equally weighting the two loss terms, whereas for DeBERTa this value was 0.8. |
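
A minimal sketch of the joint objective referenced in the Pseudocode row, Loss_Total = Loss_NLI + λ·Loss_Attention, is given below. The exact form of the attention supervision term is not reproduced in this report, so the uniform target distribution over explanation tokens, the KL-divergence formulation, and the helper name `combined_loss` are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(nli_logits, labels, attn_weights, explanation_mask, lam=1.0):
    """Sketch of the joint objective Loss_Total = Loss_NLI + lambda * Loss_Attention.

    nli_logits:       (batch, 3) classifier outputs for entailment/neutral/contradiction
    labels:           (batch,)   gold NLI labels
    attn_weights:     (batch, seq_len) model attention over input tokens,
                      assumed to be a probability distribution per example
    explanation_mask: (batch, seq_len) 1.0 for tokens appearing in the human
                      explanation, 0.0 otherwise
    lam:              weighting between the two loss terms (tuned on the
                      validation set over [0.2, 1.8] in steps of 0.2)
    """
    # Standard NLI classification loss
    loss_nli = F.cross_entropy(nli_logits, labels)

    # Assumed target: uniform attention over explanation tokens
    target = explanation_mask / explanation_mask.sum(dim=-1, keepdim=True).clamp(min=1.0)

    # Assumed attention loss: KL divergence from the target distribution
    loss_attn = F.kl_div(attn_weights.clamp(min=1e-12).log(), target,
                         reduction="batchmean")

    return loss_nli + lam * loss_attn
```

The `lam` argument corresponds to the λ described in the Experiment Setup row: the paper reports 1.0 as the best value for BERT (equally weighting both terms) and 0.8 for DeBERTa.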