Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Distributional LLM-as-a-Judge

Authors: Luyu Chen, Zeyu Zhang, Haoran Tan, Quanyu Dai, Yang Hao, Zhenhua Dong, Xu Chen

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across various LLM backbones and evaluation tasks demonstrate that our framework significantly outperforms existing closed-source LLMs and conventional single-point alignment methods, with superior alignment quality, strong robustness, and competitive evaluation accuracy. We evaluate our framework using representative datasets [15] from three fundamental LLM-as-a-Judge applications: dataset labeling, quality evaluation, and pairwise preference prediction. The primary experimental results are summarized in Table 1.
Researcher Affiliation | Collaboration | (1) Gaoling School of Artificial Intelligence, Renmin University of China; (2) Huawei Noah's Ark Lab
Pseudocode | No | The paper describes the methodology using mathematical formulations and descriptive text, but it does not include a distinct section or figure explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code | Yes | We provide the source code and datasets in the supplemental material.
Open Datasets | Yes | Datasets. We evaluate our framework using representative datasets [15] from three fundamental LLM-as-a-Judge applications: dataset labeling, quality evaluation, and pairwise preference prediction. Dataset Labeling (SNLI [41]/MNLI [42])... Quality Evaluation (SummEval [35])... Pairwise Preference Prediction (MT-Bench [43]).
Dataset Splits | Yes | All datasets are split into training and test sets at an 8:2 ratio in our experiments.
Hardware Specification | Yes | We conduct all experiments on one NVIDIA A100-40G GPU.
Software Dependencies | No | Specifically, we employ the AdamW [47] optimizer with a learning rate of 5 × 10⁻⁵ and train each model for 2 epochs. To enhance model robustness, we incorporate adversarial training, setting the perturbation step size to 0.05 and performing 5 gradient ascent steps per training iteration. Additionally, we conduct a hyperparameter search for two critical parameters: the weight parameter α, chosen from the set {0, 0.2, 0.4, 0.6, 0.8, 1.0}, and the perturbation radius parameter ϵ, selected from the set {0.0, 0.05, 0.1, 0.15, 0.2, 0.25}.
Experiment Setup | Yes | Specifically, we employ the AdamW [47] optimizer with a learning rate of 5 × 10⁻⁵ and train each model for 2 epochs. To enhance model robustness, we incorporate adversarial training, setting the perturbation step size to 0.05 and performing 5 gradient ascent steps per training iteration. Additionally, we conduct a hyperparameter search for two critical parameters: the weight parameter α, chosen from the set {0, 0.2, 0.4, 0.6, 0.8, 1.0}, and the perturbation radius parameter ϵ, selected from the set {0.0, 0.05, 0.1, 0.15, 0.2, 0.25}.
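
The numeric settings quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the function names `split_train_test` and `hyperparameter_grid` are hypothetical, the seeded shuffle before splitting is an assumption, and the source does not say whether α and ϵ are searched jointly, so the full grid below is also an assumption.

```python
import itertools
import random

# Values quoted in the Experiment Setup row.
LEARNING_RATE = 5e-5      # AdamW learning rate
EPOCHS = 2                # training epochs per model
ADV_STEP_SIZE = 0.05      # adversarial perturbation step size
ADV_ASCENT_STEPS = 5      # gradient ascent steps per training iteration
ALPHAS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]        # weight parameter alpha
EPSILONS = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25]   # perturbation radius epsilon


def split_train_test(examples, train_fraction=0.8, seed=0):
    """Split examples into train/test at an 8:2 ratio.

    Assumption: the data is shuffled with a fixed seed first; the
    source only states the ratio.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def hyperparameter_grid():
    """Return one config dict per (alpha, epsilon) pair.

    Assumption: a joint (Cartesian) search over both candidate sets.
    """
    return [
        {
            "alpha": alpha,
            "epsilon": epsilon,
            "lr": LEARNING_RATE,
            "epochs": EPOCHS,
            "adv_step_size": ADV_STEP_SIZE,
            "adv_ascent_steps": ADV_ASCENT_STEPS,
        }
        for alpha, epsilon in itertools.product(ALPHAS, EPSILONS)
    ]
```

Under the joint-grid assumption this yields 6 × 6 = 36 candidate configurations; a separate search over each parameter would instead need at most 12 runs.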