Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI
Authors: Suzanna Sia, Anton Belyy, Amjad Almahairi, Madian Khabsa, Luke Zettlemoyer, Lambert Mathias
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the method through three sets of experiments. i) Evaluating the quality of generated counterfactual hypotheses from a few-shot generator H_model (Section 3.2). ii) Evaluating the proposed metric (FTC) on gold-model agreement of human-generated x_cf, and comparing this to existing faithfulness metrics in the literature (Section 3.3). iii) Studying the sensitivity of our proposed approach compared to other metrics given pathological explanations (Section 3.4). |
| Researcher Affiliation | Collaboration | 1 Johns Hopkins University 2 Meta AI Research |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states only that "code is available on request." |
| Open Datasets | Yes | e-SNLI (Camburu et al. 2018) and e-SNLI-VE (Do et al. 2020) which are the only explainable logical entailment datasets available at point of writing (Wiegreffe and Marasović 2021). |
| Dataset Splits | Yes | We randomly sample 300 examples from the validation set (100 each for E, C, N) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running its experiments. |
| Software Dependencies | No | The paper mentions models like CLIP, GPT2, GPT-Neo but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For the task model f, we adopt a state-of-the-art multimodal model, CLIP (Radford et al. 2021), and fine-tune a 2-layer MLP to train a predictor f(u, x) → y. (...) For the counterfactual hypothesis generator H_model, we adopt a pretrained GPT2-XL and GPT-Neo 1.3B and 2.7B (Black et al. 2021) without further fine-tuning, and apply only handwritten prompts. Prompt examples were randomly sampled from the training set and we used 20 prompts for each label. |
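
The Experiment Setup row describes a task model built from a CLIP encoder with a 2-layer MLP head predicting the three NLI labels. The following is a minimal sketch of such a predictor, assuming the Hugging Face `transformers` CLIP implementation; the specific checkpoint (`openai/clip-vit-base-patch32`), the frozen-encoder choice, the concatenation of image and hypothesis features, and the hidden size are assumptions, not details taken from the paper.

```python
# Sketch of a CLIP + 2-layer MLP task model f(u, x) -> y over {E, C, N}.
# Checkpoint, feature combination, and hidden size are assumptions.
import torch
import torch.nn as nn
from transformers import CLIPModel


class CLIPNLIHead(nn.Module):
    def __init__(self, clip_name="openai/clip-vit-base-patch32",
                 hidden=512, num_labels=3):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():  # keep CLIP frozen; only the MLP head is trained
            p.requires_grad = False
        feat_dim = self.clip.config.projection_dim * 2  # image features + hypothesis features
        self.mlp = nn.Sequential(         # the 2-layer MLP classifier
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)
        # Concatenate the two modalities and classify into E / C / N logits.
        return self.mlp(torch.cat([img, txt], dim=-1))
```

Inputs would be prepared with the matching `CLIPProcessor` (images to `pixel_values`, hypothesis text to `input_ids`/`attention_mask`); the paper does not state how the premise image and hypothesis representations are actually combined, so the concatenation here is only one plausible choice.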
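
Likewise, a minimal sketch of the few-shot counterfactual hypothesis generator H_model, assuming GPT-Neo 1.3B loaded via `transformers` and used purely with in-context prompting (no fine-tuning). The prompt template, decoding parameters, and the `generate_counterfactual` helper are hypothetical; the paper states only that handwritten prompts with 20 randomly sampled training examples per label were used.

```python
# Sketch of few-shot counterfactual hypothesis generation with a pretrained GPT-Neo.
# Prompt format and sampling settings are assumptions, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
lm = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")


def generate_counterfactual(prompt_examples, premise, target_label,
                            max_new_tokens=30):
    """prompt_examples: list of (premise, hypothesis) pairs for the target label."""
    # Build the in-context demonstrations followed by the query premise.
    demo = "\n".join(
        f"Premise: {p}\n{target_label} hypothesis: {h}" for p, h in prompt_examples
    )
    prompt = f"{demo}\nPremise: {premise}\n{target_label} hypothesis:"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=max_new_tokens, do_sample=True,
                      top_p=0.9, pad_token_id=tok.eos_token_id)
    # Return only the newly generated hypothesis, not the echoed prompt.
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()
```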