Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Debiasing Evaluations That Are Biased by Evaluations

Authors: Jingyan Wang, Ivan Stelmakh, Yuting Wei, Nihar Shah

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We provide theoretical guarantees on the performance of our algorithm, as well as experimental evaluations.
Researcher Affiliation	Academia	Jingyan Wang EMAIL Toyota Technological Institute at Chicago Chicago, IL 60637, USA; Ivan Stelmakh EMAIL New Economic School Moscow, Russia; Yuting Wei EMAIL Department of Statistics and Data Science University of Pennsylvania Philadelphia, PA 19104, USA; Nihar Shah EMAIL School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA
Pseudocode	Yes	Algorithm 1: Cross-validation. Inputs: observations Y, partial ordering O, and set Λ.
Open Source Code	Yes	The code to reproduce our results is available at https://github.com/jingyanw/outcome-induced-debiasing.
Open Datasets	Yes	We use the grading data from Indiana University Bloomington (Indiana University Bloomington, 2020), where the possible grades that students receive are A+ through D-, and F. ... https://gradedistribution.registrar.indiana.edu/index.php [Online; accessed 30-Sep-2020]. We now move to a real-world data (Kerzendorf et al., 2020) collected for proposal peer review at the European Southern Observatory (ESO).
Dataset Splits	Yes	In the data-splitting step, our algorithm splits the observations {yij}, i ∈ [d], j ∈ [n], into a training set Ωt ⊆ [d] × [n] and a validation set Ωv ⊆ [d] × [n]. ... For each consecutive pair of elements in this sub-ordering, we assign one element in this pair to the training set and the other element to the validation set uniformly at random (Lines 5-7).
Hardware Specification	No	The paper does not explicitly describe the hardware used for its experiments. It discusses experimental evaluations but lacks details on specific GPU/CPU models or computing resources.
Software Dependencies	No	The paper mentions using 'CVXPY package' but does not specify a version number. No other software dependencies are listed with version numbers.
Experiment Setup	Yes	Throughout the experiments, we use Λ = {2^i : −9 ≤ i ≤ 5, i ∈ ℤ} ∪ {0, ∞}. We also plot the error incurred by the best fixed choice of λ ∈ Λ, where for each point in the plots, we pick the value of λ ∈ Λ which minimizes the empirical ℓ2 error over all fixed choices in Λ. ... Throughout the experiments we set x = 0 without loss of generality, because, as explained in Proposition 18 in Appendix C.2.1, the results remain the same for any value of x. ... We set η = 1 σ, and consider different choices of σ.
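
As a rough illustration of the Dataset Splits and Experiment Setup rows, the sketch below builds a candidate set for λ and performs a pairwise random split. It is not taken from the authors' repository. It assumes the candidate set is {2^i : −9 ≤ i ≤ 5, i ∈ ℤ} ∪ {0, ∞}, that "consecutive pairs" means disjoint pairs taken two at a time, and that a leftover element goes to the training set. The function names lambda_grid and split_consecutive_pairs are hypothetical.

```python
import math
import random


def lambda_grid(lo=-9, hi=5):
    """Candidate values {2^i : lo <= i <= hi} plus 0 and infinity, ascending.

    The bounds -9 and 5 are an assumption about the set described in the table.
    """
    return [0.0] + [2.0 ** i for i in range(lo, hi + 1)] + [math.inf]


def split_consecutive_pairs(ordered, rng):
    """Split an ordered list into training and validation halves.

    Items are taken two at a time (disjoint pairs, an assumption). Within each
    pair, a fair coin decides which item goes to training; the other goes to
    validation. A leftover item from an odd-length list goes to training,
    which is also an assumption.
    """
    train, val = [], []
    for k in range(0, len(ordered) - 1, 2):
        a, b = ordered[k], ordered[k + 1]
        if rng.random() < 0.5:
            a, b = b, a
        train.append(a)
        val.append(b)
    if len(ordered) % 2 == 1:
        train.append(ordered[-1])
    return train, val


if __name__ == "__main__":
    grid = lambda_grid()
    print(len(grid), grid[1], grid[-2])
    train, val = split_consecutive_pairs(list(range(7)), random.Random(0))
    print(train, val)
```

A cross-validation loop would then fit the estimator on the training entries for each value in the grid and keep the value with the smallest validation error; that step depends on the estimator itself and is omitted here.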