Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Spread Spurious Attribute: Improving Worst-group Accuracy with Spurious Attribute Estimation

Authors: Junhyun Nam, Jaehyung Kim, Jaeho Lee, Jinwoo Shin

ICLR 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments on various benchmark datasets show that our algorithm consistently outperforms the baseline methods using the same number of group-labeled samples.
Researcher Affiliation	Academia	Junhyun Nam¹, Jaehyung Kim¹, Jaeho Lee², Jinwoo Shin¹; ¹KAIST, ²POSTECH
Pseudocode	Yes	Algorithm 1 Spread Spurious Attribute
Open Source Code	Yes	Also, we provide our source code as a part of the open-to-public supplementary materials.
Open Datasets	Yes	Waterbirds (Sagawa et al., 2020)... Caltech-UCSD Birds dataset (Wah et al., 2011) with landscapes from Places (Zhou et al., 2017), CelebA (Liu et al., 2015), MultiNLI (Williams et al., 2018), CivilComments-WILDS (Borkan et al., 2019; Koh et al., 2021), CIFAR-10 (Krizhevsky et al., 2009).
Dataset Splits	Yes	For all datasets, we use the validation split of the dataset as the group-labeled set. We use D_L, D_U to train the spurious attribute predictor that make prediction on D_U, and validate the model with D_L.
Hardware Specification	Yes	In Table 10, we provide the time required for the pseudo-labeling phase and the robust training phase on a single Nvidia Titan XP for each dataset.
Software Dependencies	No	The paper mentions software like torchvision and huggingface implementations, and optimizers like SGD and AdamW, but does not provide specific version numbers for these software components.
Experiment Setup	Yes	For Waterbirds and CelebA, we tuned the learning rate over {1e-3, 1e-4, 1e-5} and ℓ2 regularization over {1e-1, 1e-4}. We used SGD optimizer with momentum 0.9 and batch size 64. In pseudo-labeling phase, we train the spurious attribute predictor 1k iterations for Waterbirds and 45k iterations for CelebA.
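
The search space quoted in the Experiment Setup row can be written out as a small configuration grid. The sketch below is illustrative only and is not taken from the authors' code: the names `build_grid`, `FIXED`, and `PSEUDO_LABEL_ITERS` are hypothetical, the first learning-rate value is assumed to be 1e-3, and grouping the pseudo-labeling iteration counts into the same configuration is a presentation choice rather than something the paper states.

```python
from itertools import product

# Values quoted in the Experiment Setup row. The first learning rate is
# ASSUMED to be 1e-3 (a learning rate of 1e3 would be implausible).
LEARNING_RATES = [1e-3, 1e-4, 1e-5]
L2_REGULARIZATION = [1e-1, 1e-4]

# Fixed optimizer settings quoted in the same row.
FIXED = {"optimizer": "SGD", "momentum": 0.9, "batch_size": 64}

# Pseudo-labeling iterations per dataset, as quoted (1k and 45k).
PSEUDO_LABEL_ITERS = {"Waterbirds": 1_000, "CelebA": 45_000}


def build_grid(dataset):
    """Return one config dict per (learning rate, l2) combination.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    configs = []
    for lr, l2 in product(LEARNING_RATES, L2_REGULARIZATION):
        config = dict(FIXED)  # copy so each config is independent
        config.update(
            dataset=dataset,
            lr=lr,
            l2=l2,
            pseudo_label_iters=PSEUDO_LABEL_ITERS[dataset],
        )
        configs.append(config)
    return configs


if __name__ == "__main__":
    for cfg in build_grid("Waterbirds"):
        print(cfg)
```

Under these assumptions, each of the two datasets has six candidate configurations (three learning rates times two regularization strengths).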