Debiasing NLU Models via Causal Intervention and Counterfactual Reasoning
Authors: Bing Tian, Yixin Cao, Yong Zhang, Chunxiao Xing
AAAI 2022, pp. 11376-11384 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on large-scale natural language inference and fact verification benchmarks, evaluating on bias-sensitive datasets that are specifically designed to assess the robustness of models against known biases in the training data. Experimental results show that our proposed debiasing framework outperforms previous state-of-the-art debiasing methods while maintaining the original in-distribution performance. |
| Researcher Affiliation | Academia | 1DCST, BNRist, RIIT, Institute of Internet Industry, Tsinghua University, Beijing, China 2Singapore Management University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the method in prose, mathematical equations, and diagrams, but does not present it as an algorithm block. |
| Open Source Code | No | The paper does not include an unambiguous statement from the authors that they are releasing their source code, nor does it provide a direct link to a code repository for their methodology. Footnote 4 links to a dataset repository, not the authors' code. |
| Open Datasets | Yes | For natural language inference, we train models on the SNLI dataset (Bowman et al. 2015)... For fact verification, we use the training dataset provided by the FEVER challenge (Thorne et al. 2018)... Schuster et al. (2019) introduced a new evaluation set fever-symmetric dataset... |
| Dataset Splits | Yes | For natural language inference, we train models on the SNLI dataset (Bowman et al. 2015), which is known to contain significant annotation artifacts. The dataset consists of pairs of premise and hypothesis sentences along with their inference labels. We evaluate the models on SNLI-hard (Gururangan et al. 2018), a subset of the SNLI test set where a hypothesis-only model cannot correctly predict the labels. For fact verification, we use the training dataset provided by the FEVER challenge (Thorne et al. 2018). ... Schuster et al. (2019) introduced a new evaluation set, the fever-symmetric dataset ... We evaluate the models on both versions (version 1 and 2) of their test sets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU/CPU models, memory, or specific cloud instance types. |
| Software Dependencies | No | The paper mentions using "off-the-shelf uncased BERT (Devlin et al. 2019) implementation of (Wolf et al. 2019)", referring to BERT and Hugging Face's Transformers library, but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | We fine-tune all models using BERT for 3 epochs and use the default parameters and default learning rate of 1e-5... The premise/evidence-only model predicts the labels using only premises/evidences as input, which is a shallow nonlinear classifier with 768, 384 and 192 hidden units with Tanh nonlinearity... (a minimal sketch of this setup follows the table) |
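
The sketch below illustrates the setup quoted in the "Experiment Setup" row: a hypothesis-only (bias-only) classifier built as a shallow MLP with 768, 384 and 192 hidden units and Tanh nonlinearities on top of an uncased BERT encoder, fine-tuned with the reported default learning rate of 1e-5. This is not the authors' released code (none is available); the class name, the exact stacking of the hidden layers, and the use of BERT's pooled [CLS] output are assumptions made for illustration.

```python
# Minimal sketch (assumptions noted above) of the bias-only classifier
# described in the paper's experiment setup.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class HypothesisOnlyClassifier(nn.Module):
    """Predicts NLI labels from the hypothesis alone (bias-only branch)."""

    def __init__(self, num_labels: int = 3, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(bert_name)
        hidden = self.encoder.config.hidden_size  # 768 for bert-base
        # Shallow nonlinear head with 768, 384 and 192 hidden units (Tanh),
        # as quoted from the paper; the exact layer layout is an assumption.
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 768), nn.Tanh(),
            nn.Linear(768, 384), nn.Tanh(),
            nn.Linear(384, 192), nn.Tanh(),
            nn.Linear(192, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        # Encode the hypothesis only and classify its pooled representation.
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(outputs.pooler_output)


if __name__ == "__main__":
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = HypothesisOnlyClassifier()
    batch = tokenizer(["A man is sleeping."], return_tensors="pt",
                      padding=True, truncation=True)
    logits = model(batch["input_ids"], batch["attention_mask"])
    print(logits.shape)  # torch.Size([1, 3])
    # Fine-tuning would follow the reported defaults: 3 epochs, lr 1e-5.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```
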