Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Validating LLM-as-a-Judge Systems under Rating Indeterminacy

Authors: Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna M. Wallach, Steven Z. Wu, Alexandra Chouldechova

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive experiments involving 11 real-world rating tasks and 9 commercial LLMs, we show that standard validation approaches that rely upon forced-choice ratings select judge systems that are highly suboptimal, performing as much as 31% worse than judge systems selected by our approach that uses multi-label response set ratings to account for rating indeterminacy.
Researcher Affiliation Collaboration Luke Guerdan (Carnegie Mellon University); Solon Barocas (Microsoft Research); Kenneth Holstein (Carnegie Mellon University); Hanna Wallach (Microsoft Research); Zhiwei Steven Wu (Carnegie Mellon University); Alexandra Chouldechova (Microsoft Research)
Pseudocode No The paper describes a theoretical framework and models (e.g., probabilistic rating model) using text and diagrams (Figures 3, 4, 9), but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes We publicly release all code and data used in our experiments. (Footnote 2: See https://github.com/lguerdan/indeterminacy. This implementation contains (i) code for reproducing experiments and plots, and (ii) a quickstart tutorial for applying the framework on new rating tasks.)
Open Datasets Yes We leverage five datasets from JudgeBench [Tan et al., 2024]: SNLI [Bowman et al., 2015], MNLI [Williams et al., 2017], α-NLI [Nie et al., 2019], SummEval [Fabbri et al., 2021] and QAGS [Wang et al., 2020]. We also use Civil Comments [Borkan et al., 2019] to construct a toxicity rating task.
Dataset Splits Yes To match common LLM-as-a-judge meta-evaluation workflows that conduct analysis on a small corpus of items, we sub-sample all rating tasks to 200 ratings per item. We randomly sample items for all rating tasks apart from Civil Comments, which is sampled via a stratified random sampling approach to select comments with an observed agreement level in the range [0.2, 0.5].
Hardware Specification No The paper mentions the cost and time taken for experiments: 'The total cost of running all models was $199.76. Each rating task took 30 minutes to run when models were run in parallel, with the exception of Claude Sonnet 3.5, which took approximately 2 hours per rating task due to high API response latency.' However, it does not specify any particular hardware (GPU/CPU models, memory, etc.) used for these computations, likely because commercial LLMs were accessed via API.
Software Dependencies No The paper lists the commercial LLMs used in the experiments (e.g., GPT-{3.5-Turbo, 4o-Mini, o3-Mini}, Mistral-{Large, Small}, Claude-{3.5-Sonnet, 3-Haiku}, DeepSeek Chat, and Llama-3.3-70B-Instruct). Although it mentions a code release, the paper text itself does not explicitly list any ancillary software dependencies (libraries, frameworks, or specific versions) required to run the experimental analysis code.
Experiment Setup Yes We sample all models with a temperature of 1.0. For all models, we also limit max_tokens used for generation to 5. This low max token limit is feasible because only a few tokens are needed to provide a forced-choice or response set rating (e.g., A, BB). When using a reasoning-enabled model (e.g., o3-Mini), we set the max token length for the reasoning trace to 1024 tokens.
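The decoding settings quoted in the Experiment Setup row can be captured in a small configuration object. The following is a minimal sketch and not the authors' code: the class name `JudgeSamplingConfig`, the helper `config_for`, the set of reasoning-enabled model names, and the dictionary keys returned by `to_request_kwargs` are all hypothetical, and real provider APIs may name these parameters differently. Only the three numeric values (temperature 1.0, max_tokens 5, a 1024-token reasoning limit) come from the row above.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Optional

# Assumption: which models count as "reasoning-enabled". The row above names
# only o3-Mini as an example, so this set is illustrative, not exhaustive.
REASONING_MODELS: FrozenSet[str] = frozenset({"o3-mini"})


@dataclass(frozen=True)
class JudgeSamplingConfig:
    """Decoding settings as reported in the Experiment Setup row."""

    temperature: float = 1.0  # all models are sampled at temperature 1.0
    max_tokens: int = 5  # a rating such as "A" needs only a few tokens
    reasoning_max_tokens: Optional[int] = None  # set only for reasoning models

    def to_request_kwargs(self) -> Dict[str, Any]:
        """Return the settings as a plain dict (hypothetical key names)."""
        kwargs: Dict[str, Any] = {
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
        }
        if self.reasoning_max_tokens is not None:
            kwargs["reasoning_max_tokens"] = self.reasoning_max_tokens
        return kwargs


def config_for(model_name: str) -> JudgeSamplingConfig:
    """Pick the config for a model, adding the reasoning-trace limit if needed."""
    if model_name.lower() in REASONING_MODELS:
        return JudgeSamplingConfig(reasoning_max_tokens=1024)
    return JudgeSamplingConfig()


if __name__ == "__main__":
    for name in ("GPT-4o-Mini", "o3-Mini"):
        print(name, config_for(name).to_request_kwargs())
```

Keeping the reasoning-trace limit separate from `max_tokens` mirrors the row's distinction between the tokens spent on the final rating and the tokens spent on the reasoning trace.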