Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Robustness to Spurious Correlations via Human Annotations

Authors: Megha Srivastava, Tatsunori Hashimoto, Percy Liang

ICML 2020 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we show improvements of 5–10% on a digit recognition task confounded by rotation, and 1.5–5% on the task of analyzing NYPD Police Stops confounded by location.
Researcher Affiliation	Academia	Computer Science Department, Stanford University. Correspondence to: Megha Srivastava <EMAIL>.
Pseudocode	No	The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code	Yes	Reproducibility: We provide all source code, data, and experiments as part of a worksheet on the CodaLab platform: https://bit.ly/uvdro-codalab.
Open Datasets	Yes	We evaluate the efficacy of UV-DRO on synthetic domain shifts on the MNIST digit classification task. [...] We consider the task of trying to detect false positives or police stops that do not result in arrests by training classifiers on data from police stops spanning 2003–2014 in New York City (NYCLU, 2019).
Dataset Splits	Yes	We tuned hyperparameters such as the learning rate, regularization, and DRO parameters using a held-out validation set, which we describe in the appendix.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies	No	The paper mentions the use of the "FastText Sent2Vec library" but does not specify its version number. It also refers to optimization methods such as "batch gradient descent with AdaGrad" without giving corresponding software package versions.
Experiment Setup	No	The paper states: "We tuned hyperparameters such as the learning rate, regularization, and DRO parameters using a held-out validation set, which we describe in the appendix." While it indicates that hyperparameters were tuned and described elsewhere, it does not provide their specific values or detailed configuration settings in the main text.