Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Distributional LLM-as-a-Judge

Authors: Luyu Chen, Zeyu Zhang, Haoran Tan, Quanyu Dai, Yang Hao, Zhenhua Dong, Xu Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across various LLM backbones and evaluation tasks demonstrate that our framework significantly outperforms existing closed-source LLMs and conventional single-point alignment methods, with superior alignment quality, strong robustness, and competitive evaluation accuracy. We evaluate our framework using representative datasets [15] from three fundamental LLM-as-a-Judge applications: dataset labeling, quality evaluation, and pairwise preference prediction. The primary experimental results are summarized in Table 1.
Researcher Affiliation	Collaboration	1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Huawei Noah's Ark Lab
Pseudocode	No	The paper describes the methodology using mathematical formulations and descriptive text, but it does not include a distinct section or figure explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code	Yes	We provide the source code and datasets in the supplemental material.
Open Datasets	Yes	Datasets. We evaluate our framework using representative datasets [15] from three fundamental LLM-as-a-Judge applications: dataset labeling, quality evaluation, and pairwise preference prediction. Dataset Labeling (SNLI [41]/MNLI [42])... Quality Evaluation (SummEval [35])... Pairwise Preference Prediction (MT-Bench [43]).
Dataset Splits	Yes	All datasets are split into training and test sets at an 8:2 ratio in our experiments.
Hardware Specification	Yes	We conduct all experiments on one NVIDIA A100-40G GPU.
Software Dependencies	No	Specifically, we employ the AdamW [47] optimizer with a learning rate of 5 × 10⁻⁵ and train each model for 2 epochs. To enhance model robustness, we incorporate adversarial training, setting the perturbation step size to 0.05 and performing 5 gradient ascent steps per training iteration. Additionally, we conduct a hyperparameter search for two critical parameters: the weight parameter α, chosen from the set {0, 0.2, 0.4, 0.6, 0.8, 1.0}, and the perturbation radius parameter ϵ, selected from the set {0.0, 0.05, 0.1, 0.15, 0.2, 0.25}.
Experiment Setup	Yes	Specifically, we employ the AdamW [47] optimizer with a learning rate of 5 × 10⁻⁵ and train each model for 2 epochs. To enhance model robustness, we incorporate adversarial training, setting the perturbation step size to 0.05 and performing 5 gradient ascent steps per training iteration. Additionally, we conduct a hyperparameter search for two critical parameters: the weight parameter α, chosen from the set {0, 0.2, 0.4, 0.6, 0.8, 1.0}, and the perturbation radius parameter ϵ, selected from the set {0.0, 0.05, 0.1, 0.15, 0.2, 0.25}.
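
The Dataset Splits and Experiment Setup rows can be read as a small configuration. The sketch below is only an illustration of that reading, not the authors' code: the names `split_8_2` and `search_grid`, the shuffling step, and the seed are all hypothetical, and the learning rate is taken to be 5e-5. It enumerates the α/ϵ search grid and performs an 8:2 train/test split using only the Python standard library.

```python
import itertools
import random

# Values quoted in the Experiment Setup row (learning rate read as 5e-5).
LEARNING_RATE = 5e-5
EPOCHS = 2
PERTURBATION_STEP_SIZE = 0.05
GRADIENT_ASCENT_STEPS = 5
ALPHAS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]         # weight parameter alpha
EPSILONS = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25]    # perturbation radius epsilon


def split_8_2(examples, seed=0):
    """Split examples into train/test at an 8:2 ratio.

    Shuffling and the seed are assumptions; the source only states the ratio.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = len(items) * 8 // 10  # integer arithmetic avoids float rounding
    return items[:cut], items[cut:]


def search_grid():
    """Enumerate every (alpha, epsilon) pair from the two stated sets."""
    return [
        {
            "alpha": alpha,
            "epsilon": epsilon,
            "learning_rate": LEARNING_RATE,
            "epochs": EPOCHS,
            "perturbation_step_size": PERTURBATION_STEP_SIZE,
            "gradient_ascent_steps": GRADIENT_ASCENT_STEPS,
        }
        for alpha, epsilon in itertools.product(ALPHAS, EPSILONS)
    ]
```

Under this reading the grid contains 6 × 6 = 36 configurations. Whether the authors searched the full Cartesian product or tuned the two parameters separately is not stated in the quoted text.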