Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Debiasing Evaluations That Are Biased by Evaluations
Authors: Jingyan Wang, Ivan Stelmakh, Yuting Wei, Nihar Shah
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical guarantees on the performance of our algorithm, as well as experimental evaluations. |
| Researcher Affiliation | Academia | Jingyan Wang, Toyota Technological Institute at Chicago, Chicago, IL 60637, USA; Ivan Stelmakh, New Economic School, Moscow, Russia; Yuting Wei, Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA; Nihar Shah, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA |
| Pseudocode | Yes | Algorithm 1: Cross-validation. Inputs: observations Y, partial ordering O, and set Λ. |
| Open Source Code | Yes | The code to reproduce our results is available at https://github.com/jingyanw/outcome-induced-debiasing. |
| Open Datasets | Yes | We use the grading data from Indiana University Bloomington (Indiana University Bloomington, 2020), where the possible grades that students receive are A+ through D-, and F. ... https://gradedistribution.registrar.indiana.edu/index.php [Online; accessed 30-Sep-2020]. We now move to a real-world data (Kerzendorf et al., 2020) collected for proposal peer review at the European Southern Observatory (ESO). |
| Dataset Splits | Yes | In the data-splitting step, our algorithm splits the observations $\{y_{ij}\}_{i \in [d],\, j \in [n]}$ into a training set $\Omega^t \subseteq [d] \times [n]$ and a validation set $\Omega^v \subseteq [d] \times [n]$. ... For each consecutive pair of elements in this sub-ordering, we assign one element in this pair to the training set and the other element to the validation set uniformly at random (Lines 5-7). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. It discusses experimental evaluations but lacks details on specific GPU/CPU models or computing resources. |
| Software Dependencies | No | The paper mentions using 'CVXPY package' but does not specify a version number. No other software dependencies are listed with version numbers. |
| Experiment Setup | Yes | Throughout the experiments, we use $\Lambda = \{2^i : -9 \le i \le 5,\ i \in \mathbb{Z}\} \cup \{0, \infty\}$. We also plot the error incurred by the best fixed choice of $\lambda \in \Lambda$, where for each point in the plots, we pick the value of $\lambda \in \Lambda$ which minimizes the empirical $\ell_2$ error over all fixed choices in $\Lambda$. ... Throughout the experiments we set $x^* = 0$ without loss of generality, because, as explained in Proposition 18 in Appendix C.2.1, the results remain the same for any value of $x^*$. ... We set $\eta = 1 - \sigma$, and consider different choices of $\sigma$. |
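
The Dataset Splits and Experiment Setup rows describe two mechanical steps that a short sketch may make easier to picture: building the candidate set Λ and splitting a sub-ordering into training and validation halves. The Python below is an illustrative sketch only, not the authors' released code. It assumes Λ consists of the powers of two 2^i for integers i from -9 to 5, together with 0 and ∞, and that "consecutive pairs" means the non-overlapping pairs (1st, 2nd), (3rd, 4th), and so on. The function names, the `l2_error` callable, and the handling of an unpaired trailing element are hypothetical choices made for this example.

```python
import math
import random


def build_lambda_grid(low_exp=-9, high_exp=5):
    """Candidate set: 2**i for low_exp <= i <= high_exp, plus 0 and infinity."""
    powers = [2.0 ** i for i in range(low_exp, high_exp + 1)]
    return [0.0] + powers + [math.inf]


def split_consecutive_pairs(sub_ordering, rng):
    """Send one element of each consecutive pair to train, the other to validation.

    Assumption of this sketch: pairs are non-overlapping, i.e. (0, 1), (2, 3), ...
    """
    train, val = [], []
    for k in range(0, len(sub_ordering) - 1, 2):
        first, second = sub_ordering[k], sub_ordering[k + 1]
        # Fair coin flip decides which member of the pair goes to training.
        if rng.random() < 0.5:
            first, second = second, first
        train.append(first)
        val.append(second)
    # Sketch-only choice: an unpaired trailing element is returned separately
    # rather than assigned to either set.
    leftover = list(sub_ordering[-1:]) if len(sub_ordering) % 2 == 1 else []
    return train, val, leftover


def best_fixed_lambda(grid, l2_error):
    """Pick the grid value minimizing a caller-supplied (hypothetical) error function."""
    return min(grid, key=l2_error)


if __name__ == "__main__":
    grid = build_lambda_grid()
    rng = random.Random(0)
    train, val, leftover = split_consecutive_pairs(list(range(7)), rng)
    print(len(grid), train, val, leftover)
```

Returning the leftover element separately keeps the sketch neutral about how an odd-length sub-ordering should be treated; the paper's Algorithm 1 is the authority on that detail.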