Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
Authors: Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin, Gongjun Xu, Moulinath Banerjee, Yuekai Sun
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Validating our framework using six different LLM judges and queries from BigGen Bench and Chatbot Arena. We show that we can better align LLMs to human annotators using a few labeled data points and interpret the systematic differences between the two types of judges. |
| Researcher Affiliation | Academia | ¹Department of Statistics, University of Michigan; ²Institute of Foundation Models, MBZUAI |
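| Pseudocode | Yes | Model fitting via the logit trick. 1. For each example $(I_i, O_i)$, compute $P(Y^l_i = k \mid I_i, O_i)$ for all $k \in \{0, \dots, K\}$. 2. Compute $Z^l_i$ and the cutoffs $\{\eta_k\}_{k=1}^K$ by solving for the $\eta_1 < \cdots < \eta_K \in \mathbb{R}$ and $z_1, \dots, z_n \in \mathbb{R}$ that minimize the discrepancy between $p_k(\eta_1, \dots, \eta_K, z_i)$ and $P(Y^l_i = k \mid I_i, O_i)$. 3. Fit the ordinal logistic model to the human labels $\{Y^h_i\}_{i=1}^n$ using maximum likelihood, with $Z^h_i = (1/\beta) Z^l_i - (1/\beta)\gamma^\top X_i$ as inputs, i.e., $(\{\hat\alpha_k\}_{k=1}^K, \hat\beta, \hat\gamma) = \arg\max_{\alpha_1 < \cdots < \alpha_K,\ \beta \in \mathbb{R},\ \gamma \in \mathbb{R}^d} \sum_{i=1}^n \sum_{k=0}^K \mathbb{1}\{Y^h_i = k\} \log p_k(\alpha_1, \dots, \alpha_K, (1/\beta) Z^l_i - (1/\beta)\gamma^\top X_i)$. |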
| Open Source Code | Yes | Please check our GitHub repository: https://github.com/felipemaiapolo/bridge |
| Open Datasets | Yes | BigGen Bench (BGB): BGB [17] evaluates language model outputs based on detailed rubrics across five satisfaction levels... Chatbot Arena (CA): We use the dataset arena-human-preference-100k [40], derived from Chatbot Arena [6]. |
| Dataset Splits | Yes | Using a fixed random seed, we split each dataset into training and testing sets with an 80:20 ratio... For a given sample size $n_{tr} \in \{20, 40, 80, 160, 320\}$, we randomly select $n_{tr}$ points from the full set of training queries to fit our models. Across all datasets, we perform this procedure using 10 different random seeds, and the reported results reflect averages and standard deviations across these splits. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware (GPU/CPU models, memory, etc.) used for running its experiments. The self-reflection in the NeurIPS checklist states 'we fit a standard statistical model (ordinal logistic regression). This model is known to be cheap to fit.', implying no specific hardware details were provided. |
| Software Dependencies | No | We use the Python package statsmodels for this adjustment: https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html |
| Experiment Setup | Yes | The model can be fitted using non-linear least squares using human labels. The model fitting algorithm via the logit trick... Fit the ordinal logistic model to the human labels $\{Y^h_i\}_{i=1}^n$ using maximum likelihood... We consider two strategies for computing $P(Y^l = k \mid I, O)$. The first is based on log probabilities... As an alternative, we employ a chain-of-thought (CoT) prompting strategy, in which the LLM produces reasoning followed by a rating. In this case, we sample $m$ outputs from the LLM and estimate $P(Y^l = k \mid I, O)$ via empirical frequencies... we regularize the output probabilities by adding a small constant (e.g., 0.01)... As we demonstrate in our experiments, even under the simplifying assumption $\gamma = 0$... For a given sample size $n_{tr} \in \{20, 40, 80, 160, 320\}$, we randomly select $n_{tr}$ points from the full set of training queries to fit our models. |
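
The chain-of-thought strategy quoted in the Experiment Setup row (sample m outputs, estimate the rating distribution by empirical frequencies, then add a small constant) can be illustrated with a minimal sketch. The function name `empirical_rating_probs`, the renormalization after adding the constant, and the example ratings are assumptions made for illustration; the quoted text only states that a small constant such as 0.01 is added.

```python
from collections import Counter


def empirical_rating_probs(ratings, num_levels, eps=0.01):
    """Estimate P(rating = k) for k = 0..num_levels-1 from sampled ratings.

    ratings    : list of integer ratings parsed from m sampled LLM outputs
    num_levels : number of rating levels (K + 1 in the table's notation)
    eps        : small constant added to every level (0.01 in the quoted example)
    """
    counts = Counter(ratings)
    m = len(ratings)
    # Empirical frequency of each level, plus the regularizing constant.
    smoothed = [counts.get(k, 0) / m + eps for k in range(num_levels)]
    # Assumption: renormalize so the smoothed values sum to one again.
    total = sum(smoothed)
    return [p / total for p in smoothed]


# Hypothetical example: five sampled ratings on a 0..4 scale.
probs = empirical_rating_probs([3, 3, 4, 2, 3], num_levels=5)
```

Adding the constant keeps every level strictly positive, so a rating that never appears among the m samples does not receive probability zero.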
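
The protocol quoted in the Dataset Splits row (a fixed-seed 80:20 split, followed by repeated subsampling of the training set) might look roughly like the sketch below. The function names, the seed value 0 for the fixed split, the dataset size of 1000, and the use of seeds 0 to 9 for the ten repetitions are all assumptions; the quoted text does not state them.

```python
import random


def train_test_split_indices(n, test_frac=0.2, seed=0):
    """Shuffle indices 0..n-1 once with a fixed seed and split them 80:20."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_frac)
    return idx[n_test:], idx[:n_test]  # (train, test)


def subsample(train_idx, n_tr, seed):
    """Draw n_tr training indices without replacement for one repetition."""
    rng = random.Random(seed)
    return rng.sample(train_idx, n_tr)


# Hypothetical dataset of 1000 queries.
train_idx, test_idx = train_test_split_indices(1000)

# One training subset per (sample size, seed) pair: 5 sizes x 10 seeds.
subsets = {
    (n_tr, seed): subsample(train_idx, n_tr, seed)
    for n_tr in (20, 40, 80, 160, 320)
    for seed in range(10)
}
```

Under this reading the test set stays fixed across repetitions, and the reported averages and standard deviations would be taken over the ten subsets at each sample size.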
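
Step 3 of the quoted pseudocode maximizes an ordinal logistic log-likelihood of the human labels. The sketch below only evaluates that objective for given parameters; it does not perform the optimization. It rests on two assumptions: the category probabilities follow the standard cumulative-logit form, with P(Y ≤ k) equal to the logistic function of (cutoff minus latent score), and the human latent score is (Z^l − γ·X)/β. The function names are hypothetical and are not taken from the authors' repository.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ordinal_probs(cutoffs, z):
    """P(Y = k), k = 0..K, for increasing cutoffs and latent score z.

    Assumed form: P(Y <= k) = sigmoid(cutoffs[k] - z).
    """
    cdf = [0.0] + [sigmoid(c - z) for c in cutoffs] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cutoffs) + 1)]


def human_log_likelihood(alphas, beta, gamma, z_llm, covariates, human_labels):
    """Log-likelihood of human labels given LLM latent scores and covariates."""
    total = 0.0
    for z_l, x, y in zip(z_llm, covariates, human_labels):
        shift = sum(g * xj for g, xj in zip(gamma, x))
        z_h = (z_l - shift) / beta  # assumed human latent score
        total += math.log(ordinal_probs(alphas, z_h)[y])
    return total


# Hypothetical example: K = 3 cutoffs, one covariate, gamma = 0.
ll = human_log_likelihood(
    alphas=[-1.0, 0.0, 1.0], beta=1.0, gamma=[0.0],
    z_llm=[0.0, 2.0], covariates=[[1.0], [1.0]], human_labels=[1, 3],
)
```

A fitting routine would pass this function to a numerical optimizer while enforcing increasing cutoffs. Setting `gamma` to zero corresponds to the simplifying assumption mentioned in the Experiment Setup row.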