Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Bias in Evaluation Processes: An Optimization-Based Model
Authors: L. Elisa Celis, Amit Kumar, Anay Mehrotra, Nisheeth K. Vishnoi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. Empirically, we evaluate our model's ability to emulate biases present in real-world evaluation processes using two real-world datasets (JEE-2009 Scores and the Semantic Scholar Open Research Corpus) and one synthetic dataset (Section 4). |
| Researcher Affiliation | Academia | L. Elisa Celis Yale University Amit Kumar IIT Delhi Anay Mehrotra Yale University Nisheeth K. Vishnoi Yale University |
| Pseudocode | No | The paper describes its optimization-based model and theoretical characterizations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for this paper is available at https://github.com/AnayMehrotra/Bias-in-Evaluation-Processes. |
| Open Datasets | Yes | Dataset 1 (JEE-2009 scores). This dataset contains the scores, birth category (official SES label [135]), and (binary) gender of all students from JEE-2009 (384,977 total) [91]. Dataset 2 (Semantic Scholar Open Research Corpus). This dataset contains the list of authors, the year of publication, and the number of citations for 46,947,044 research papers on Semantic Scholar. |
| Dataset Splits | Yes | Table 1: TV distances between best-fit densities and real data (Section 4) with 80%-20% training and testing data split |
| Hardware Specification | Yes | All simulations were run on a MacBook Pro with 16 GB RAM and an Apple M2 Pro processor. |
| Software Dependencies | No | The paper mentions using the 'quad function in scipy' but does not provide specific version numbers for scipy or any other software dependencies. |
| Experiment Setup | Yes | For the grid search itself, we varied α over $[10^{-4}, 10^{2}]$, τ over $[10^{-1}, 10]$, and $v_0$ over Ω. |