When Does Group Invariant Learning Survive Spurious Correlations?
Authors: Yimeng Chen, Ruibin Xiong, Zhi-Ming Ma, Yanyan Lan
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of our proposed strategy, we conduct experiments on both synthetic and real data benchmarks on spurious correlation shifts in image classification and natural language inference (NLI). Specifically, we adopt two different invariant learning objectives, IRM (IRMv1) [4] and REx (V-REx) [15], to show the consistency of SCILL. To show the availability of SCILL, we also experiment with PGI [1] and cMMD [16; 1], which are feature invariance targets used with EIIL [8] in [1]. The experimental results show that SCILL with all four invariance objectives consistently outperforms the existing state-of-the-art method EIIL in generalizing to spurious correlation shifts. An ablation study further shows the effectiveness of each component in SCILL. |
| Researcher Affiliation | Collaboration | Yimeng Chen1,2 , Ruibin Xiong3, Zhiming Ma1,2, Yanyan Lan4,5 1Academy of Mathematics and Systems Science, Chinese Academy of Sciences 2University of Chinese Academy of Sciences 3Baidu Inc. 4Institute for AI Industry Research, Tsinghua University 5Beijing Academy of Artificial Intelligence, Beijing, China |
| Pseudocode | No | The paper describes algorithms in text (e.g., "statistical-split algorithm"), but it does not provide formal pseudocode blocks or clearly labeled algorithm boxes. |
| Open Source Code | Yes | 1Code is available at https://github.com/Beastlyprime/group-invariant-learning. |
| Open Datasets | Yes | We conduct experiments on both synthetic and real-world datasets. The synthetic dataset, Patched-Colored MNIST (PC-MNIST), is constructed as a realization of the conditions in the Proposition 4.5 to verify the proposed criteria. It is derived from MNIST, by assigning two conditionally independent spurious features given label, namely the color and patch bias to each image. The design of the patch bias is inspired by [5]. MNLI-HANS is a benchmark widely used in many previous works on combating spurious correlations, such as [7; 34]. In our experiments, we follow the practice to utilize MNLI [39] as the training data and HANS [25] as the test data. |
| Dataset Splits | Yes | For MNLI, we use a BERT-based classifier with the standard setup for sentence pair classification [10]. The reference model is the same as the biased classifier proposed in [34], which is trained on top of some hand-crafted syntactic features. For each task, all implementations of SCILL and EIIL adopt the same model configurations and pretrained reference models. Since models are tested with OOD data, it is important to specify the model selection strategy, as has been revealed by Gulrajani and Lopez-Paz [12] for the case of domain generalization. In our experiments, we report results with 3 different model selection strategies, including ID, Oracle, and TEV. ID refers to the strategy based on model performance on the in-distribution validation set as used in [34]. Oracle refers to the selection based on data from the test data distribution, as used in [8; 12]. TEV is a new strategy adapted from the training-domain validation method in [12] to the inferred groups, which alleviates the dependence on the test data. Details can be found in the appendix. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU specifications, or memory; no statement of the form "All experiments ran on..." appears in the provided text. |
| Software Dependencies | No | The paper mentions software like "BERT-based classifier" and "MLP", but does not provide specific version numbers for these or other libraries/frameworks (e.g., PyTorch, TensorFlow, Python version) that would be needed for reproducibility. |
| Experiment Setup | Yes | The training configurations are presented as follows. For PC-MNIST, we adopt the classifier proposed in [4] for Colored MNIST, which is an MLP with two hidden layers of 390 neurons. The reference model is an MLP with the same structure trained with ERM on the training set, following the setting in EIIL on Colored MNIST. For MNLI, we use a BERT-based classifier with the standard setup for sentence pair classification [10]. The reference model is the same as the biased classifier proposed in [34], which is trained on top of some hand-crafted syntactic features. For each task, all implementations of SCILL and EIIL adopt the same model configurations and pretrained reference models. |
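The Open Datasets entry above describes PC-MNIST as assigning two spurious features, a color and a patch bias, that are conditionally independent given the label. A minimal sketch of that kind of construction is shown below; the function name and the correlation strengths (0.9 and 0.8) are illustrative assumptions, not values taken from the paper.

```python
import random

def assign_spurious_features(labels, p_color=0.9, p_patch=0.8, seed=0):
    """Attach two spurious binary attributes to binary labels.

    Each attribute agrees with the label with its own probability, and the
    two are drawn with independent coin flips, so they are conditionally
    independent given the label -- the structure the paper describes for
    the color and patch biases in PC-MNIST. The probabilities here are
    illustrative, not the paper's settings.
    """
    rng = random.Random(seed)
    color = [y if rng.random() < p_color else 1 - y for y in labels]
    patch = [y if rng.random() < p_patch else 1 - y for y in labels]
    return color, patch

# Toy binary labels standing in for MNIST digit classes.
rng = random.Random(1)
labels = [rng.randint(0, 1) for _ in range(10_000)]
color, patch = assign_spurious_features(labels)
```

Because each attribute is sampled with its own coin flip conditioned only on the label, a classifier can exploit either bias independently, which is what makes such a dataset useful for testing group-inference methods like EIIL and SCILL.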