Feature-Level Debiased Natural Language Understanding

Authors: Yougang Lyu, Piji Li, Yechang Yang, Maarten de Rijke, Pengjie Ren, Yukun Zhao, Dawei Yin, Zhaochun Ren

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We conduct experiments on three NLU benchmark datasets. Experimental results show that DCT outperforms state-of-the-art baselines on out-of-distribution datasets while maintaining in-distribution performance." |
| Researcher Affiliation | Collaboration | School of Computer Science and Technology, Shandong University, Qingdao, China; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China; University of Amsterdam, Amsterdam, The Netherlands; Baidu Inc., Beijing, China |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | "The code is available at https://github.com/youganglyu/DCT" |
| Open Datasets | Yes | MNLI (Williams, Nangia, and Bowman 2018); SNLI (Bowman et al. 2015); FEVER (Thorne et al. 2018). A loading sketch follows the table. |
| Dataset Splits | Yes | "We evaluate the in-distribution and out-of-distribution performance of models on the development set and the corresponding challenge set of each dataset." |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU, memory) used for running the experiments are provided. |
| Software Dependencies | No | The paper names software such as BERT-base and the AdamW optimizer but provides no version numbers for dependencies such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | "For the MNLI, SNLI and FEVER datasets, we train all models for 5 epochs; all models converge. [...] we adopt the AdamW (Loshchilov and Hutter 2019) optimizer with initial learning rate 3e-5. Meanwhile, the temperature parameter τ, threshold λ, momentum coefficient m, and scalar weighting hyperparameter α are set to 0.04, 0.6, 0.999, and 0.1. The sizes of the least similar positive samples S_p and the most similar negative samples S_dn are set to 150 and 1." A configuration sketch follows the table. |
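
All three benchmark datasets are publicly available. The sketch below shows one way to pull them with the Hugging Face `datasets` library; the loader identifiers are an assumption, since the paper and its repository may obtain the data differently.

```python
from datasets import load_dataset

# Assumed Hugging Face dataset identifiers; the DCT repository may fetch its own copies.
mnli = load_dataset("multi_nli")       # Williams, Nangia, and Bowman 2018
snli = load_dataset("snli")            # Bowman et al. 2015
fever = load_dataset("fever", "v1.0")  # Thorne et al. 2018

# In-distribution evaluation uses each dataset's development split,
# e.g. MNLI's matched validation set.
print(mnli["validation_matched"][0]["premise"])
```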
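
The Experiment Setup row maps directly onto a training configuration. The following is a minimal, hypothetical PyTorch sketch of how those hyperparameters could be wired up; the `bert-base-uncased` checkpoint and the MoCo-style momentum update are assumptions, as the paper only names BERT-base and a momentum coefficient m.

```python
import torch
from transformers import BertModel

# Hyperparameters as reported in the paper.
EPOCHS = 5            # all models converge within 5 epochs
LR = 3e-5             # initial learning rate for AdamW
TAU = 0.04            # temperature parameter τ
THRESHOLD = 0.6       # threshold λ
MOMENTUM_M = 0.999    # momentum coefficient m
ALPHA = 0.1           # scalar weighting hyperparameter α
S_P, S_DN = 150, 1    # least similar positives / most similar negatives

# BERT-base encoder (checkpoint name is an assumption) and AdamW, as in the paper.
encoder = BertModel.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=LR)

# A momentum coefficient of 0.999 suggests a MoCo-style key encoder updated by
# exponential moving average; this helper is an assumption about DCT's design.
@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=MOMENTUM_M):
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1.0 - m)
```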