LIREx: Augmenting Language Inference with Relevant Explanations
Authors: Xinyan Zhao, V.G. Vinod Vydiswaran
AAAI 2021, pp. 14532-14539
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated on the standardized SNLI data set, LIREx achieved an accuracy of 91.87%, an improvement of 0.32 percentage points over the baseline, matching the best-reported performance on the data set. It also achieves significantly better performance than previous studies when transferred to the out-of-domain MultiNLI data set. Qualitative analysis shows that LIREx generates flexible, faithful, and relevant NLEs that allow the model to be more robust to spurious explanations. |
| Researcher Affiliation | Academia | Xinyan Zhao (1), V.G. Vinod Vydiswaran (2,1); (1) School of Information, (2) Department of Learning Health Sciences, University of Michigan, Ann Arbor, Michigan 48109 USA. {zhaoxy, vgvinodv}@umich.edu |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor are there any structured steps formatted like code. |
| Open Source Code | Yes | The code is available at https://github.com/zhaoxy92/LIREx. |
| Open Datasets | Yes | The proposed framework is evaluated on two widely used corpora for language inference: SNLI (Bowman et al. 2015) and MultiNLI (Williams, Nangia, and Bowman 2017). SNLI is a balanced collection of P-H annotated pairs with labels from {entailment, neutral, contradiction}. It consists of about 550K, 10K, and 10K examples for the train, development, and test sets, respectively. Camburu et al. (2018) recently expanded this data set to e-SNLI, in which each data instance is also annotated with explanations. |
| Dataset Splits | Yes | SNLI is a balanced collection of P-H annotated pairs with labels from {entailment, neutral, contradiction}. It consists of about 550K, 10K, and 10K examples for the train, development, and test sets, respectively. The MultiNLI data set differs from the SNLI data set in that it covers a range of genres of spoken and written text. It contains 433K P-H pairs annotated the same way as SNLI. The evaluation set is divided into a Dev-match set (10K) and a Dev-mismatch set (10K). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments, such as specific GPU models, CPU types, or cloud computing instances with their specifications. |
| Software Dependencies | No | The paper mentions using a 'RoBERTa-base model' and 'GPT2' and cites their respective papers, but it does not provide specific version numbers for these software components or any other ancillary libraries/frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The overall workflow of LIREx is shown in Figure 2. Given a premise-hypothesis (P-H) pair, a label-aware rationalizer predicts rationales by taking as input a triplet $(P, H, x)$, $x \in \{entail, neutral, contradict\}$, and outputs a rationalized P-H pair, $(P, H_x)$. Next, the NLE generator generates explanations ($E_x$) for each rationalized P-H pair. Then, the explanations are combined with the original P-H pair as input to the instance selector and inference model to predict the final label. Each component is described below. ... We model NLE generation as a text generation task, in which we leverage GPT2 (Radford et al. 2019)... We choose GPT2-medium... To inform the generator about the rationales in the hypothesis, we highlight rationale tokens by surrounding them with square brackets []. The generator is fine-tuned by modeling the text input as a whole. ... We initialize the selector $S(\cdot)$ with a RoBERTa-base model and use the representation of the first token, $h_0$, as the sequence representation. On top of this, an output layer of linear transformation and activation, $\tanh(U_1 h_0) U_2$, is applied for prediction. ... we use a probability-oriented training objective, soft cross-entropy loss, to improve the model's robustness to noisy input: $\mathrm{CE}_{soft}(p, \hat{p}) = -\sum_{l \in \{e,n,c\}} \hat{p}_l \log p_l$ (Eq. 3) |
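
To make the bracket-highlighting step quoted above concrete, here is a minimal Python sketch. The helper name `highlight_rationales`, the token-list input, and the binary mask format are illustrative assumptions; the paper only states that rationale tokens in the hypothesis are surrounded with square brackets before the sequence is used to fine-tune GPT2-medium.

```python
# Minimal sketch of the rationale-highlighting step described in the paper.
# The function name and the mask representation are assumptions; the paper
# only specifies wrapping rationale tokens in square brackets.

def highlight_rationales(hypothesis_tokens, rationale_mask):
    """Surround each rationale token with square brackets.

    Example: ['a', 'dog', 'runs'] with mask [0, 1, 0] -> 'a [dog] runs'
    """
    return " ".join(
        f"[{tok}]" if flagged else tok
        for tok, flagged in zip(hypothesis_tokens, rationale_mask)
    )

# The highlighted hypothesis is then combined with the premise into the
# text sequence on which the GPT2 generator is fine-tuned as a whole.
print(highlight_rationales(["a", "dog", "runs"], [0, 1, 0]))  # a [dog] runs
```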
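
The selector's output layer and the soft cross-entropy objective (Eq. 3) can likewise be sketched in PyTorch. This is an illustrative reading, not the authors' released code: the class name `SelectorHead`, the hidden size of 768 (standard for RoBERTa-base), and the bias terms implied by `nn.Linear` (which the paper's $\tanh(U_1 h_0) U_2$ formulation does not show) are assumptions.

```python
import torch
import torch.nn as nn

class SelectorHead(nn.Module):
    """Sketch of the output layer Tanh(U1 h0) U2 applied to the
    first-token representation h0 from RoBERTa-base."""
    def __init__(self, hidden_size=768, num_labels=3):
        super().__init__()
        self.U1 = nn.Linear(hidden_size, hidden_size)
        self.U2 = nn.Linear(hidden_size, num_labels)

    def forward(self, h0):
        # h0: (batch, hidden_size) -> logits: (batch, num_labels)
        return self.U2(torch.tanh(self.U1(h0)))

def soft_cross_entropy(logits, soft_targets):
    """Eq. 3: CE_soft(p, p_hat) = -sum_l p_hat_l * log p_l,
    averaged over the batch."""
    log_p = torch.log_softmax(logits, dim=-1)
    return -(soft_targets * log_p).sum(dim=-1).mean()
```

Note that the targets here are soft probability distributions over {entail, neutral, contradict} rather than one-hot labels, which is what makes the objective "probability-oriented" and more tolerant of noisy input.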