Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Bridging Human and LLM Judgments: Understanding and Narrowing the Gap

Authors: Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin, Gongjun Xu, Moulinath Banerjee, Yuekai Sun

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Validating our framework using six different LLM judges and queries from BigGen Bench and Chatbot Arena. We show that we can better align LLMs to human annotators using a few labeled data points and interpret the systematic differences between the two types of judges.
Researcher Affiliation	Academia	1 Department of Statistics, University of Michigan; 2 Institute of Foundation Models, MBZUAI
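Pseudocode	Yes	Model fitting via the logit trick. 1. For each example $(I_i, O_i)$, compute $P(Y^l_i = k \mid I_i, O_i)$ for all $k \in \{0, \dots, K\}$. 2. Compute $Z^l_i$ and cutoffs $\{\eta_k\}_{k=1}^K$ by solving $\{\eta_k\}_{k=1}^K, \{Z^l_i\}_{i=1}^n = \arg\min$ over $\eta_1 < \cdots < \eta_K \in \mathbb{R}$ and $z_1, \dots, z_n \in \mathbb{R}$ of the discrepancy between $p_k(\eta_1, \dots, \eta_K, z_i)$ and $P(Y^l_i = k \mid I_i, O_i)$. 3. Fit the ordinal logistic model to the human labels $\{Y^h_i\}_{i=1}^n$ using maximum likelihood, with $Z^h_i = (1/\beta) Z^l_i - (1/\beta) \gamma^\top X_i$ as inputs, i.e., $(\{\hat\alpha_k\}_{k=1}^K, \hat\beta, \hat\gamma) = \arg\max_{\alpha_1 < \cdots < \alpha_K,\ \beta \in \mathbb{R},\ \gamma \in \mathbb{R}^d} \sum_{i=1}^n \sum_{k=0}^K \mathbf{1}\{Y^h_i = k\} \log p_k(\alpha_1, \dots, \alpha_K, (1/\beta) Z^l_i - (1/\beta) \gamma^\top X_i)$.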
Open Source Code	Yes	Please check our GitHub repository: https://github.com/felipemaiapolo/bridge
Open Datasets	Yes	BigGen Bench (BGB): BGB [17] evaluates language model outputs based on detailed rubrics across five satisfaction levels... Chatbot Arena (CA): We use the dataset arena-human-preference-100k [40], derived from Chatbot Arena [6].
Dataset Splits	Yes	Using a fixed random seed, we split each dataset into training and testing sets with an 80:20 ratio... For a given sample size $n_{tr} \in \{20, 40, 80, 160, 320\}$, we randomly select $n_{tr}$ points from the full set of training queries to fit our models. Across all datasets, we perform this procedure using 10 different random seeds, and the reported results reflect averages and standard deviations across these splits.
Hardware Specification	No	The paper does not explicitly mention specific hardware (GPU/CPU models, memory, etc.) used for running its experiments. The self-reflection in the NeurIPS checklist states 'we fit a standard statistical model (ordinal logistic regression). This model is known to be cheap to fit.', implying no specific hardware details were provided.
Software Dependencies	No	We use the Python package statsmodels for this adjustment: https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html
Experiment Setup	Yes	The model can be fitted using non-linear least squares using human labels. The model fitting algorithm via the logit trick... Fit the ordinal logistic model to the human labels $\{Y^h_i\}_{i=1}^n$ using maximum likelihood... We consider two strategies for computing $P(Y^l = k \mid I, O)$. The first is based on log probabilities... As an alternative, we employ a chain-of-thought (CoT) prompting strategy, in which the LLM produces reasoning followed by a rating. In this case, we sample m outputs from the LLM and estimate $P(Y^l = k \mid I, O)$ via empirical frequencies... we regularize the output probabilities by adding a small constant (e.g., 0.01)... As we demonstrate in our experiments, even under the simplifying assumption $\gamma = 0$... For a given sample size $n_{tr} \in \{20, 40, 80, 160, 320\}$, we randomly select $n_{tr}$ points from the full set of training queries to fit our models.
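
Illustrative sketch (not the authors' code): the Pseudocode and Experiment Setup rows describe estimating LLM rating probabilities from sampled CoT outputs, regularizing them with a small constant, and matching them to ordinal logistic probabilities. The Python below is a minimal, standard-library illustration of those ideas under stated assumptions. The function names are hypothetical, the parameterization P(Y <= k) = sigmoid(cutoff_k - z) is the usual proportional-odds form and may differ from the paper's, renormalizing after adding the constant is an assumption, and the squared-error grid search over a single score with fixed cutoffs is a simplification of the joint fit over cutoffs and scores described in the Pseudocode row.

```python
import math
from collections import Counter


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ordinal_probs(cutoffs, z):
    """Class probabilities p_0..p_K for increasing cutoffs and latent score z.

    Assumed parameterization: P(Y <= k) = sigmoid(cutoffs[k] - z).
    """
    cdf = [sigmoid(c - z) for c in cutoffs] + [1.0]
    return [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]


def empirical_probs(ratings, num_levels, eps=0.01):
    """Empirical rating frequencies over levels 0..num_levels-1.

    Adds a small constant eps to every level, then renormalizes
    (the renormalization step is an assumption of this sketch).
    """
    counts = Counter(ratings)
    raw = [counts.get(k, 0) / len(ratings) + eps for k in range(num_levels)]
    total = sum(raw)
    return [r / total for r in raw]


def recover_latent_score(target, cutoffs, lo=-10.0, hi=10.0, steps=4001):
    """Grid search for the z whose ordinal probabilities best match target.

    Uses squared error with fixed cutoffs; both choices are assumptions.
    """
    best_z, best_loss = lo, float("inf")
    for j in range(steps):
        z = lo + (hi - lo) * j / (steps - 1)
        model = ordinal_probs(cutoffs, z)
        loss = sum((p - t) ** 2 for p, t in zip(model, target))
        if loss < best_loss:
            best_z, best_loss = z, loss
    return best_z


# Five rating levels (0..4) need four cutoffs; these values are made up.
cutoffs = [-1.0, 0.0, 1.0, 2.0]
sampled_ratings = [4, 4, 3, 4, 2]  # m = 5 hypothetical CoT ratings
smoothed = empirical_probs(sampled_ratings, num_levels=5)
z_hat = recover_latent_score(smoothed, cutoffs)
```

Smoothing keeps every level's probability strictly positive, so no level has zero estimated probability when only a few outputs are sampled.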
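
Illustrative sketch (not the authors' code): the Dataset Splits row describes a fixed-seed 80:20 split, followed by random subsamples of the training queries at several sizes, repeated over 10 seeds and summarized by averages and standard deviations. The Python below shows one way such a protocol could be organized. The function names, the `evaluate` callback (standing in for "fit a model on the subset and return a test metric"), the seed values, and the use of the sample standard deviation are all assumptions of this sketch.

```python
import random
import statistics


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and split into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def subsample_summary(train, evaluate, sizes=(20, 40, 80, 160, 320), num_seeds=10):
    """For each training size, average a metric over several random subsamples.

    `evaluate` is a hypothetical callback: it receives the sampled training
    subset and returns one number (for example, a test-set metric).
    """
    summary = {}
    for n_tr in sizes:
        scores = []
        for seed in range(num_seeds):
            subset = random.Random(seed).sample(train, n_tr)
            scores.append(evaluate(subset))
        summary[n_tr] = (statistics.mean(scores), statistics.stdev(scores))
    return summary


# Toy usage: integer "queries" and a placeholder metric (the subset mean).
queries = list(range(1000))
train, test = train_test_split(queries, seed=0)
summary = subsample_summary(train, evaluate=lambda subset: sum(subset) / len(subset))
```

Seeding a separate `random.Random` instance per run makes each subsample reproducible without touching the global random state.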