Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Improved Group Robustness via Classifier Retraining on Independent Splits

Authors: Thien Hang Nguyen, Hongyang R. Zhang, Huy Nguyen

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental When evaluated on benchmark image and text classification tasks, our approach consistently performs favorably to group DRO, JTT, and other strong baselines when either group labels are available during training or are only given in validation sets. Importantly, our method only relies on a single hyperparameter, which adjusts the fraction of labels used for training feature extractors vs. training classification layers.
Researcher Affiliation Academia Thien Hang Nguyen, Northeastern University, Boston, MA
Pseudocode Yes Algorithm 1: Classifier Retraining on Independent Splits (CROIS)
Input: training data D_L with group labels; training data D_U without group labels; classifier retraining algorithm R (default: group DRO); optional splitting parameter p (default: 1).
1: Obtain validation sets by partitioning D_L into D'_L and D_L^(val), and D_U into D'_U and D_U^(val).
2: (Optional) Add more unlabeled data via split proportion p: partition D'_L into two parts D1 and D2 such that |D1| = (1 - p)|D'_L| and |D2| = p|D'_L|; set D'_L ← D2 and D'_U ← D'_U ∪ D1.
3: Obtain the initial model f by running empirical risk minimization on D'_U and selecting the best model in terms of average accuracy on D_L^(val) ∪ D_U^(val).
4: Perform classifier retraining R with feature extractor f on D'_L, then select the best model in terms of worst-group accuracy on D_L^(val) as the final output.
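The data-partitioning steps of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation (which is at https://github.com/timmytonga/crois); the function name, the `val_frac` parameter, and the use of plain Python lists are my own assumptions.

```python
# Sketch of the CROIS splitting logic (steps 1-2 of Algorithm 1), assuming
# datasets are plain Python lists. Helper names are illustrative only.
import random

def crois_split(DL, DU, p=1.0, val_frac=0.2, seed=0):
    """Partition labeled data DL and unlabeled data DU into independent
    feature-learning, classifier-retraining, and validation splits."""
    rng = random.Random(seed)
    DL, DU = DL[:], DU[:]
    rng.shuffle(DL)
    rng.shuffle(DU)
    # Step 1: hold out validation sets from both pools.
    nL, nU = int(len(DL) * val_frac), int(len(DU) * val_frac)
    DL_val, DL_train = DL[:nL], DL[nL:]
    DU_val, DU_train = DU[:nU], DU[nU:]
    # Step 2 (optional): move a (1 - p) fraction of the labeled training
    # pool (D1) into the unlabeled pool used for feature-extractor training,
    # keeping the remaining fraction p (D2) for classifier retraining.
    k = int((1 - p) * len(DL_train))
    D1, D2 = DL_train[:k], DL_train[k:]
    return {
        "feature_train": DU_train + D1,  # ERM training in step 3
        "retrain": D2,                   # classifier retraining R in step 4
        "val_labeled": DL_val,           # worst-group model selection
        "val_unlabeled": DU_val,         # average-accuracy model selection
    }
```

Steps 3 and 4 (ERM training and classifier retraining with R) would then consume these splits; they are omitted here because they depend on the model and retraining algorithm chosen.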
Open Source Code Yes Our implementation in PyTorch can be found at https://github.com/timmytonga/crois.
Open Datasets Yes We experiment on four datasets: Waterbird (Sagawa et al., 2020a). Combining the bird images from the CUB dataset (Welinder et al., 2010) with water or land backgrounds from the PLACES dataset (Zhou et al., 2017), the task is to classify whether an image contains a landbird or a waterbird without confounding with the background.
Dataset Splits Yes We use the original train-val-test split in all the datasets and report the test results.
Hardware Specification Yes We performed our experiments on 2 PCs with one NVIDIA RTX3070 and one NVIDIA RTX3090.
Software Dependencies No The paper mentions software like PyTorch, BERT, and Weights and Biases, but does not provide specific version numbers for these software components. For example, it states: "Our implementation in PyTorch can be found at https://github.com/timmytonga/crois." and "We use pretrained BERT (Devlin et al., 2018) for MultiNLI and Civil Comments." and "Experimental data is collected with the help of Weights and Biases (Biewald, 2020)." No version numbers are given.
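The missing information flagged in this row is easy to capture mechanically. The sketch below shows one way a codebase could record the installed versions of its dependencies for a reproducibility appendix; the package list is illustrative, and the paper itself reports no versions.

```python
# Sketch: record exact dependency versions for a reproducibility appendix.
# Package names are illustrative examples, not taken from the paper's repo.
import sys
from importlib import metadata

def report_versions(packages):
    """Return a {package: version} dict, marking packages that are absent."""
    versions = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

print(report_versions(["pip", "torch", "transformers", "wandb"]))
```

Running this at experiment time and logging the output (e.g. to Weights and Biases) would answer the version question this row raises.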
Experiment Setup Yes Table 7 contains the hyperparameters used in our experiments in Sections 4.2 and 4.1. Note that these are the standard parameters for obtaining an ERM model for these datasets as in previous works (Sagawa et al., 2020a; Liu et al., 2021). The only difference is that we train Waterbird and CelebA for slightly fewer epochs, due to finding no further increase in validation accuracies after those epochs. Table 7: Hyperparameters used in the experiments. The slash indicates the parameters used in the first phase (feature extractor) versus the second phase (classifier retraining). Waterbird: Learning Rate 10^-4/10^-4, L2 Regularization 10^-4/10^-4, Number of Epochs 250/250.
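The phase-1/phase-2 slash convention from Table 7 can be made explicit as a small config, shown here for the Waterbird column only (a sketch; the other datasets' values are in the paper, and the key names are my own):

```python
# Two-phase hyperparameters for Waterbird, transcribed from Table 7.
# Each (phase1, phase2) pair is (feature-extractor training, classifier retraining).
WATERBIRD_HPARAMS = {
    "learning_rate": (1e-4, 1e-4),
    "l2_regularization": (1e-4, 1e-4),
    "num_epochs": (250, 250),
}
```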