Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Bias Amplification Enhances Minority Group Performance
Authors: Gaotang Li, Jiarui Liu, Wei Hu
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, Bam achieves competitive performance compared with existing methods evaluated on spurious correlation benchmarks in computer vision and natural language processing. Moreover, we find a simple stopping criterion based on minimum class accuracy difference that can remove the need for group annotations, with little or no loss in worst-group accuracy. We perform extensive analyses and ablations to verify the effectiveness and robustness of our algorithm in varying class and group imbalance ratios. |
| Researcher Affiliation | Academia | Gaotang Li, University of Michigan, Ann Arbor, MI; Jiarui Liu, Carnegie Mellon University, Pittsburgh, PA; Wei Hu, University of Michigan, Ann Arbor, MI |
| Pseudocode | Yes | Algorithm 1 Bam Input: Training dataset D, number of epochs T in Stage 1, auxiliary coefficient λ, and upweight factor µ |
| Open Source Code | Yes | Our code is available at https://github.com/motivationss/BAM |
| Open Datasets | Yes | We conduct our experiments on four popular benchmark datasets containing spurious correlations. Two of them are image datasets: Waterbirds (Wah et al., 2011; Sagawa et al., 2019) and CelebA (Liu et al., 2015; Sagawa et al., 2019), and the other two are NLP datasets: MultiNLI (Williams et al., 2018; Sagawa et al., 2019) and CivilComments-WILDS (Borkan et al., 2019; Koh et al., 2021). |
| Dataset Splits | Yes | The train/validation/test splits follow Sagawa et al. (2019). We shuffle the original data and generate dataset splits with train/validation/test sizes = 0.7/0.15/0.15. We regenerate the dataset splits as train/validation/test sizes = 0.7/0.15/0.15. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions using pre-trained models (ResNet-50, BERT) but no specific GPU/CPU models or other hardware details. |
| Software Dependencies | No | The paper mentions using "PyTorch implementation for ResNet50 and the Hugging Face implementation for BERT" but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Table 6: Hyperparameters tuned over 4 datasets, listed as auxiliary coefficient (λ), #epochs in Stage 1 (T), and upweight factor (µ). Waterbirds: λ ∈ {0.5, 5, 50}, T ∈ {10, 15, 20}, µ ∈ {50, 100, 140}. CelebA: λ ∈ {0.5, 5, 50}, T ∈ {1, 2}, µ ∈ {50, 70, 100}. MultiNLI: λ ∈ {0.5, 5, 50}, T ∈ {1, 2}, µ ∈ {4, 5, 6}. CivilComments: λ ∈ {0.5, 5, 50}, T ∈ {1, 2}, µ ∈ {4, 5, 6}. In general, our setting follows closely from Liu et al. (2021), with some minor discrepancies. For the major hyperparameters, we tuned over λ = {0.5, 5, 50}, T = {1, 2, 10, 15, 60} and µ = {4, 5, 6, 50, 70, 100, 140} for Bam. |
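
The per-dataset search spaces quoted from Table 6 in the Experiment Setup row can be enumerated as a small grid. The sketch below is only an illustration of how those value sets combine; the names `SEARCH_SPACE`, `grid`, and the keys `lam`, `T`, `mu` are hypothetical, and whether the authors ran an exhaustive grid search over these values is an assumption, not something stated above.

```python
from itertools import product

# Values transcribed from the Table 6 excerpt above.
# Key names (lam, T, mu) are illustrative, not taken from the authors' code.
SEARCH_SPACE = {
    "Waterbirds": {"lam": [0.5, 5, 50], "T": [10, 15, 20], "mu": [50, 100, 140]},
    "CelebA": {"lam": [0.5, 5, 50], "T": [1, 2], "mu": [50, 70, 100]},
    "MultiNLI": {"lam": [0.5, 5, 50], "T": [1, 2], "mu": [4, 5, 6]},
    "CivilComments": {"lam": [0.5, 5, 50], "T": [1, 2], "mu": [4, 5, 6]},
}


def grid(dataset):
    """Return every (lam, T, mu) combination for one dataset as a list of dicts."""
    space = SEARCH_SPACE[dataset]
    names = list(space)
    combos = product(*(space[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


if __name__ == "__main__":
    for name in SEARCH_SPACE:
        print(name, len(grid(name)), "configurations")
```

Under the exhaustive-search assumption, each dataset's budget is simply the product of the three set sizes, so Waterbirds has one more Stage 1 epoch option than the other datasets and therefore a larger grid.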