Selective Classification Can Magnify Disparities Across Groups
Authors: Erik Jones, Shiori Sagawa, Pang Wei Koh, Ananya Kumar, Percy Liang
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We observe this behavior consistently across five vision and NLP datasets." and "We consider five datasets (Table 1)" |
| Researcher Affiliation | Academia | Department of Computer Science, Stanford University {erjones,ssagawa,pangwei,ananya,pliang}@cs.stanford.edu |
| Pseudocode | Yes | "Algorithm 1: Group-agnostic reference for (ŷ, ĉ) at threshold τ" and "Algorithm 2: Robin Hood reference at threshold τ" (see the thresholding sketch after this table) |
| Open Source Code | Yes | All code, data, and experiments are available on CodaLab at https://worksheets.codalab.org/worksheets/0x7ceb817d53b94b0c8294a7a22643bf5e. The code is also available on GitHub at https://github.com/ejones313/worst-group-sc. |
| Open Datasets | Yes | We consider five datasets (Table 1) on which prior work has shown that models latch onto spurious correlations... CelebA. ... dataset (Liu et al., 2015). Waterbirds. ... dataset (Sagawa et al., 2020), constructed using images of birds from the Caltech-UCSD Birds dataset (Wah et al., 2011) placed on backgrounds from the Places dataset (Zhou et al., 2017). CheXpert-device. ... CheXpert dataset (Irvin et al., 2019)... CivilComments. ... dataset (Borkan et al., 2019). MultiNLI. ... MultiNLI dataset (Williams et al., 2018). |
| Dataset Splits | Yes | "We use the official train-val split of the dataset." and "we first create a new 80/10/10 train/val/test split of examples from the publicly available CheXpert train and validation sets" (see the split sketch after this table) |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, specific processors, or memory amounts) were mentioned for running experiments. The paper only discusses training parameters. |
| Software Dependencies | No | No specific version numbers for software dependencies were provided. The paper mentions using 'bert-base-uncased using the implementation from Wolf et al. (2019)' but does not specify a version for the Hugging Face Transformers library or other software. |
| Experiment Setup | Yes | For ERM we optimize with learning rate 1e-4, weight decay 1e-4, batch size 128, and train for 50 epochs. (See the ERM training sketch after this table.) |
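
The pseudocode row above refers to selective classification at a confidence threshold τ: the model emits a prediction ŷ together with a confidence ĉ and abstains whenever ĉ < τ. The sketch below is a minimal, hedged illustration of that group-agnostic thresholding; the function names and the use of the max-softmax probability as ĉ are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def selective_predict(probs: np.ndarray, tau: float):
    """Group-agnostic selective classification at threshold tau.

    probs: (n_examples, n_classes) softmax probabilities.
    Returns hard predictions y_hat and a boolean mask of examples the
    classifier accepts (confidence >= tau); the rest are abstentions.
    """
    y_hat = probs.argmax(axis=1)   # predicted label for every example
    c_hat = probs.max(axis=1)      # confidence score (assumed: max softmax probability)
    accept = c_hat >= tau          # predict only when confidence clears the threshold
    return y_hat, accept

def selective_accuracy(y_hat, accept, y_true):
    """Accuracy over the accepted (non-abstained) examples only."""
    if accept.sum() == 0:
        return float("nan")        # zero coverage at this threshold
    return float((y_hat[accept] == y_true[accept]).mean())
```

Evaluating `selective_accuracy` separately for each group, with the same `accept` mask, is the kind of per-group comparison behind the paper's finding that selective classification can magnify disparities across groups.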
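For the CheXpert-device setup quoted in the dataset-splits row, the authors create a new 80/10/10 train/val/test split of the publicly available CheXpert train and validation examples. The sketch below shows one generic way to produce such a split; the seed, the index-level shuffling, and the use of NumPy are assumptions, and the paper's actual split may group examples differently (e.g. by patient).

```python
import numpy as np

def split_indices(n_examples: int, seed: int = 0):
    """Shuffle example indices and carve out an 80/10/10 train/val/test split.

    Illustrative only: the paper's split of the public CheXpert
    train + validation examples may use different seeding or grouping.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_train = int(0.8 * n_examples)
    n_val = int(0.1 * n_examples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test
```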
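The experiment-setup row quotes the ERM hyperparameters: learning rate 1e-4, weight decay 1e-4, batch size 128, and 50 training epochs. A minimal PyTorch training loop wiring in those values is sketched below; the optimizer family (SGD with momentum), the model, and the dataset objects are placeholders not specified in the quoted text.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_erm(model: nn.Module, train_set, device=None):
    """Standard ERM training with the hyperparameters quoted above.

    Only the learning rate, weight decay, batch size, and epoch count
    come from the quoted setup; the optimizer choice is an assumption.
    """
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    optimizer = torch.optim.SGD(
        model.parameters(), lr=1e-4, momentum=0.9, weight_decay=1e-4
    )
    criterion = nn.CrossEntropyLoss()
    model.to(device).train()
    for epoch in range(50):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # average cross-entropy over the batch
            loss.backward()
            optimizer.step()
    return model
```

Any classifier that outputs class logits (e.g. the bert-base-uncased model mentioned in the software-dependencies row, for the text datasets) can be passed in as `model`.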