Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Bias in Evaluation Processes: An Optimization-Based Model

Authors: L. Elisa Celis, Amit Kumar, Anay Mehrotra, Nisheeth K. Vishnoi

NeurIPS 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. Empirically, we evaluate our model's ability to emulate biases present in real-world evaluation processes using two real-world datasets (JEE-2009 Scores and the Semantic Scholar Open Research Corpus) and one synthetic dataset (Section 4).
Researcher Affiliation	Academia	L. Elisa Celis Yale University Amit Kumar IIT Delhi Anay Mehrotra Yale University Nisheeth K. Vishnoi Yale University
Pseudocode	No	The paper describes its optimization-based model and theoretical characterizations but does not include any structured pseudocode or algorithm blocks.
Open Source Code	Yes	The code for this paper is available at https://github.com/AnayMehrotra/Bias-in-Evaluation-Processes.
Open Datasets	Yes	Dataset 1 (JEE-2009 scores). This dataset contains the scores, birth category (official SES label [135]), and (binary) gender of all students from JEE-2009 (384,977 total) [91]. Dataset 2 (Semantic Scholar Open Research Corpus). This dataset contains the list of authors, the year of publication, and the number of citations for 46,947,044 research papers on Semantic Scholar.
Dataset Splits	Yes	Table 1: TV distances between best-fit densities and real data (Section 4) with 80%-20% training and testing data split
Hardware Specification	Yes	All simulations were run on a MacBook Pro with 16 GB RAM and an Apple M2 Pro processor.
Software Dependencies	No	The paper mentions using the 'quad function in scipy' but does not provide specific version numbers for scipy or any other software dependencies.
Experiment Setup	Yes	For the grid search itself, we varied α over [10^{-4}, 10^{2}], τ over [10^{-1}, 10], and v_0 over Ω.
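
The Dataset Splits and Experiment Setup rows mention a total variation (TV) distance and a grid search over α and τ. The following is a minimal sketch of how such a search could look, not the authors' implementation. The function names (`tv_distance`, `grid_search`, `toy_loss`), the log-spaced grids, and the number of grid points are all assumptions made for illustration; the search over v_0 in Ω is omitted because Ω is not specified here.

```python
import itertools
import numpy as np


def tv_distance(p, q):
    """Total variation distance between two histograms on the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalize so both inputs are probability vectors.
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()


def grid_search(loss, alphas, taus):
    """Return (best_loss, best_alpha, best_tau) over the Cartesian grid."""
    best = None
    for alpha, tau in itertools.product(alphas, taus):
        value = loss(alpha, tau)
        if best is None or value < best[0]:
            best = (value, alpha, tau)
    return best


# Assumption: log-spaced grids spanning the quoted ranges;
# the grid resolution is not stated in the text above.
alphas = np.logspace(-4, 2, num=13)  # 10^-4 ... 10^2
taus = np.logspace(-1, 1, num=9)     # 10^-1 ... 10


def toy_loss(alpha, tau):
    """Hypothetical stand-in loss; a real fit would compare the model's
    density with held-out data, for example via tv_distance."""
    return np.log10(alpha) ** 2 + (np.log10(tau) - 0.5) ** 2


best_loss, best_alpha, best_tau = grid_search(toy_loss, alphas, taus)
```

In an actual fit, `toy_loss` would be replaced by a function that evaluates the model's density for a given (α, τ) and returns its TV distance to the training-split histogram, with the test split kept for the final comparison.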