Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Improved Group Robustness via Classifier Retraining on Independent Splits

Authors: Thien Hang Nguyen, Hongyang R. Zhang, Huy Nguyen

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental When evaluated on benchmark image and text classification tasks, our approach consistently performs favorably to group DRO, JTT, and other strong baselines when either group labels are available during training or are only given in validation sets. Importantly, our method only relies on a single hyperparameter, which adjusts the fraction of labels used for training feature extractors vs. training classification layers.
Researcher Affiliation Academia Thien Hang Nguyen, Northeastern University, Boston, MA
Pseudocode Yes Algorithm 1: Classifier Retraining on Independent Splits (CROIS)
Input: training data D_L with group labels; training data D_U without group labels; classifier retraining algorithm R (default: group DRO); optional splitting parameter p (default: 1).
1: Obtain validation sets by partitioning D_L into D'_L and D_L^(val), and D_U into D'_U and D_U^(val).
2: (Optional) Add more unlabeled data via split proportion p: partition D'_L into two parts D1 and D2 such that |D1| = (1 - p)|D'_L| and |D2| = p|D'_L|; set D'_L ← D2 and D'_U ← D'_U ∪ D1.
3: Obtain the initial model f by running empirical risk minimization on D'_U and selecting the best model in terms of average accuracy on D_L^(val) ∪ D_U^(val).
4: Perform classifier retraining R with feature extractor f on D'_L, then select the best model in terms of worst-group accuracy on D_L^(val) as the final output.
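The data-partitioning steps of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation (which is at https://github.com/timmytonga/crois); the function name, the `val_frac` parameter, and the use of plain Python lists are my own assumptions.

```python
# Sketch of the CROIS splitting logic (steps 1-2 of Algorithm 1), assuming
# datasets are plain Python lists. Helper names are illustrative only.
import random

def crois_split(DL, DU, p=1.0, val_frac=0.2, seed=0):
    """Partition labeled data DL and unlabeled data DU into independent
    feature-learning, classifier-retraining, and validation splits."""
    rng = random.Random(seed)
    DL, DU = DL[:], DU[:]
    rng.shuffle(DL)
    rng.shuffle(DU)
    # Step 1: hold out validation sets from both pools.
    nL, nU = int(len(DL) * val_frac), int(len(DU) * val_frac)
    DL_val, DL_train = DL[:nL], DL[nL:]
    DU_val, DU_train = DU[:nU], DU[nU:]
    # Step 2 (optional): move a (1 - p) fraction of the labeled training
    # pool (D1) into the unlabeled pool used for feature-extractor training,
    # keeping the remaining fraction p (D2) for classifier retraining.
    k = int((1 - p) * len(DL_train))
    D1, D2 = DL_train[:k], DL_train[k:]
    return {
        "feature_train": DU_train + D1,  # ERM training in step 3
        "retrain": D2,                   # classifier retraining R in step 4
        "val_labeled": DL_val,           # worst-group model selection
        "val_unlabeled": DU_val,         # average-accuracy model selection
    }
```

Steps 3 and 4 (ERM training and classifier retraining with R) would then consume these splits; they are omitted here because they depend on the model and retraining algorithm chosen.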
Open Source Code Yes Our implementation in PyTorch can be found at https://github.com/timmytonga/crois.
Open Datasets Yes We experiment on four datasets: Waterbird (Sagawa et al., 2020a). Combining the bird images from the CUB dataset (Welinder et al., 2010) with water or land backgrounds from the PLACES dataset (Zhou et al., 2017), the task is to classify whether an image contains a landbird or a waterbird without confounding with the background.
Dataset Splits Yes We use the original train-val-test split in all the datasets and report the test results.
Hardware Specification Yes We performed our experiments on 2 PCs with one NVIDIA RTX3070 and one NVIDIA RTX3090.
Software Dependencies No The paper mentions software like PyTorch, BERT, and Weights and Biases, but does not provide specific version numbers for these software components. For example, it states: "Our implementation in PyTorch can be found at https://github.com/timmytonga/crois." and "We use pretrained BERT (Devlin et al., 2018) for MultiNLI and Civil Comments." and "Experimental data is collected with the help of Weights and Biases (Biewald, 2020)." No version numbers are given.
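The missing information flagged in this row is easy to capture mechanically. The sketch below shows one way a codebase could record the installed versions of its dependencies for a reproducibility appendix; the package list is illustrative, and the paper itself reports no versions.

```python
# Sketch: record exact dependency versions for a reproducibility appendix.
# Package names are illustrative examples, not taken from the paper's repo.
import sys
from importlib import metadata

def report_versions(packages):
    """Return a {package: version} dict, marking packages that are absent."""
    versions = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

print(report_versions(["pip", "torch", "transformers", "wandb"]))
```

Running this at experiment time and logging the output (e.g. to Weights and Biases) would answer the version question this row raises.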
Experiment Setup Yes Table 7 contains the hyperparameters used in our experiments in Sections 4.2 and 4.1. Note that these are the standard parameters for obtaining an ERM model for these datasets as in previous works (Sagawa et al., 2020a; Liu et al., 2021). The only difference is that we train Waterbird and CelebA for slightly fewer epochs, due to finding no further increase in validation accuracies after those epochs. Table 7: Hyperparameters used in the experiments. The slash indicates the parameters used in the first phase (feature extractor) versus the second phase (classifier retraining). Waterbird: Learning Rate 10^-4/10^-4, L2 Regularization 10^-4/10^-4, Number of Epochs 250/250.
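The phase-1/phase-2 slash convention from Table 7 can be made explicit as a small config, shown here for the Waterbird column only (a sketch; the other datasets' values are in the paper, and the key names are my own):

```python
# Two-phase hyperparameters for Waterbird, transcribed from Table 7.
# Each (phase1, phase2) pair is (feature-extractor training, classifier retraining).
WATERBIRD_HPARAMS = {
    "learning_rate": (1e-4, 1e-4),
    "l2_regularization": (1e-4, 1e-4),
    "num_epochs": (250, 250),
}
```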