Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Distributional LLM-as-a-Judge
Authors: Luyu Chen, Zeyu Zhang, Haoran Tan, Quanyu Dai, Yang Hao, Zhenhua Dong, Xu Chen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across various LLM backbones and evaluation tasks demonstrate that our framework significantly outperforms existing closed-source LLMs and conventional single-point alignment methods, with superior alignment quality, strong robustness, and competitive evaluation accuracy. We evaluate our framework using representative datasets [15] from three fundamental LLM-as-a-Judge applications: dataset labeling, quality evaluation, and pairwise preference prediction. The primary experimental results are summarized in Table 1. |
| Researcher Affiliation | Collaboration | ¹Gaoling School of Artificial Intelligence, Renmin University of China; ²Huawei Noah's Ark Lab |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and descriptive text, but it does not include a distinct section or figure explicitly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | We provide the source code and datasets in the supplemental material. |
| Open Datasets | Yes | Datasets. We evaluate our framework using representative datasets [15] from three fundamental LLM-as-a-Judge applications: dataset labeling, quality evaluation, and pairwise preference prediction. Dataset Labeling (SNLI [41]/MNLI [42])... Quality Evaluation (SummEval [35])... Pairwise Preference Prediction (MT-Bench [43]). |
| Dataset Splits | Yes | All datasets are split into training and test sets at an 8:2 ratio in our experiments. |
| Hardware Specification | Yes | We conduct all experiments on one NVIDIA A100-40G GPU. |
| Software Dependencies | No | Specifically, we employ the AdamW [47] optimizer with a learning rate of 5 × 10⁻⁵ and train each model for 2 epochs. To enhance model robustness, we incorporate adversarial training, setting the perturbation step size to 0.05 and performing 5 gradient ascent steps per training iteration. Additionally, we conduct a hyperparameter search for two critical parameters: the weight parameter α, chosen from the set {0, 0.2, 0.4, 0.6, 0.8, 1.0}, and the perturbation radius parameter ϵ, selected from the set {0.0, 0.05, 0.1, 0.15, 0.2, 0.25}. |
| Experiment Setup | Yes | Specifically, we employ the AdamW [47] optimizer with a learning rate of 5 × 10⁻⁵ and train each model for 2 epochs. To enhance model robustness, we incorporate adversarial training, setting the perturbation step size to 0.05 and performing 5 gradient ascent steps per training iteration. Additionally, we conduct a hyperparameter search for two critical parameters: the weight parameter α, chosen from the set {0, 0.2, 0.4, 0.6, 0.8, 1.0}, and the perturbation radius parameter ϵ, selected from the set {0.0, 0.05, 0.1, 0.15, 0.2, 0.25}. |
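
For readers trying to reproduce the reported setup, the quoted values can be collected into a small configuration sketch. This is not the authors' code: the constant and function names are hypothetical, the learning rate assumes the quoted value reads 5 × 10⁻⁵, the exhaustive α × ϵ grid is an assumption (the paper excerpt only says a search was conducted over each set), and the seeded shuffle before the 8:2 split is likewise an assumption.

```python
import itertools
import random

# Values quoted in the "Experiment Setup" row.
LEARNING_RATE = 5e-5            # assumes the quoted rate is 5 x 10^-5
EPOCHS = 2
PERTURBATION_STEP_SIZE = 0.05   # adversarial training step size
GRADIENT_ASCENT_STEPS = 5       # ascent steps per training iteration
ALPHA_GRID = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]        # weight parameter alpha
EPSILON_GRID = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25]   # perturbation radius epsilon


def hyperparameter_grid():
    """Enumerate every (alpha, epsilon) pair.

    Assumption: an exhaustive grid; the excerpt does not say whether the
    two parameters were searched jointly or one at a time.
    """
    return [
        {"alpha": alpha, "epsilon": epsilon}
        for alpha, epsilon in itertools.product(ALPHA_GRID, EPSILON_GRID)
    ]


def train_test_split(examples, train_fraction=0.8, seed=0):
    """Split examples into train and test sets at the quoted 8:2 ratio.

    Assumption: a seeded shuffle precedes the split; the excerpt does not
    describe how examples were assigned to each set.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Calling `hyperparameter_grid()` yields one dictionary per candidate configuration, and `train_test_split(dataset)` returns the two partitions; both would need to be matched against the supplemental material before being treated as the paper's actual procedure.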