Supervising Model Attention with Human Explanations for Robust Natural Language Inference

Authors: Joe Stacey, Yonatan Belinkov, Marek Rei (pp. 11349-11357)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that the in-distribution improvements of this method are also accompanied by out-of-distribution improvements, with the supervised models learning from features that generalise better to other NLI datasets. The experiments show that supervising the attention patterns of BERT based on human explanations simultaneously improves both in-distribution and out-of-distribution NLI performance (Table 1).
Researcher Affiliation | Academia | (1) Imperial College London, (2) Technion - Israel Institute of Technology
Pseudocode | No | The paper includes mathematical equations for the loss calculation and attention weights (e.g., Loss_Total = Loss_NLI + λ · Loss_Attention) and diagrams illustrating attention supervision (Figure 2), but no formal pseudocode or algorithm blocks are provided. A hedged implementation sketch of this combined loss is given below the table.
Open Source Code | Yes | https://github.com/joestacey/NLI_with_a_human_touch
Open Datasets | Yes | Using natural language explanations, we supervise the model's attention weights to encourage more attention to be paid to the words present in the explanations, significantly improving model performance. Our experiments show that the in-distribution improvements of this method are also accompanied by out-of-distribution improvements, with the supervised models learning from features that generalise better to other NLI datasets. (The paper uses the e-SNLI dataset (Camburu et al. 2018).)
Dataset Splits | Yes | λ was chosen based on performance on the validation set, trying values in the range [0.2, 1.8] at increments of 0.2. The robustness of the model is assessed by significance testing on the MultiNLI matched and mismatched validation sets (Williams, Nangia, and Bowman 2018).
Hardware Specification | No | The paper discusses the models used (BERT, DeBERTa) and the datasets, but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) on which the experiments were run.
Software Dependencies | No | The paper mentions various models and tools such as BERT, DeBERTa, GPT-2, RoBERTa, and spaCy (for POS tagging), but it does not specify version numbers for these software components or for any other libraries such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | λ was chosen based on performance on the validation set, trying values in the range [0.2, 1.8] at increments of 0.2. For our BERT model the best performing λ is 1.0, equally weighting the two loss terms, whereas for DeBERTa this value was 0.8. (A sketch of this validation-based λ search follows the loss example below.)
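
Since the paper provides only equations and diagrams for the attention supervision, the following is a minimal PyTorch sketch of how the combined objective Loss_Total = Loss_NLI + λ · Loss_Attention might look. The function name `combined_loss`, the (batch, seq_len) attention layout, and the particular penalty form (discouraging attention mass on non-explanation tokens) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(nli_logits, labels, attention, explanation_mask, lam=1.0):
    """Sketch of Loss_Total = Loss_NLI + lam * Loss_Attention (assumed form).

    nli_logits:       (batch, 3) scores for entailment/neutral/contradiction
    labels:           (batch,) gold NLI labels
    attention:        (batch, seq_len) attention weights over input tokens,
                      assumed to sum to 1 per example
    explanation_mask: (batch, seq_len) 1.0 where a token appears in the human
                      explanation, 0.0 otherwise (hypothetical encoding)
    """
    # Standard NLI classification loss.
    loss_nli = F.cross_entropy(nli_logits, labels)

    # Auxiliary supervision: penalise attention assigned to tokens absent
    # from the explanation, pushing attention mass toward explanation words.
    loss_attention = (attention * (1.0 - explanation_mask)).sum(dim=-1).mean()

    return loss_nli + lam * loss_attention
```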
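
The λ selection described in the Experiment Setup row amounts to a one-dimensional grid search scored on the validation set. In the sketch below, `train_and_evaluate` is a hypothetical stand-in for a full training run that returns validation accuracy; only the candidate grid [0.2, 1.8] at increments of 0.2 comes from the paper.

```python
def train_and_evaluate(lam: float) -> float:
    """Hypothetical helper: train the attention-supervised model with loss
    weight `lam` and return accuracy on the validation set."""
    raise NotImplementedError  # replace with an actual training run

# Candidate values 0.2, 0.4, ..., 1.8, as described in the paper.
candidate_lambdas = [round(0.2 * k, 1) for k in range(1, 10)]

best_lam, best_acc = None, float("-inf")
for lam in candidate_lambdas:
    acc = train_and_evaluate(lam)
    if acc > best_acc:
        best_lam, best_acc = lam, acc
```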