Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Multi-Agent Debate for LLM Judges with Adaptive Stability Detection

Authors: Tianyu Hu, Zhen Tan, Song Wang, Huaizhi Qu, Tianlong Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across diverse benchmarks and models demonstrate significant improvements in judgment accuracy over majority voting while maintaining computational efficiency.
Researcher Affiliation | Academia | Tianyu Hu (EMAIL); Zhen Tan, Arizona State University (EMAIL); Song Wang, University of Central Florida (EMAIL); Huaizhi Qu, UNC Chapel Hill (EMAIL); Tianlong Chen, UNC Chapel Hill (EMAIL)
Pseudocode | Yes | This procedure is summarized in Algorithm 1 in the appendix. See Algorithm 2 in the appendix.
Open Source Code | No | All the code and data are clearly referenced in the paper, and we will provide the code and data upon acceptance.
Open Datasets | Yes | We conduct experiments on datasets from diverse domains to evaluate the debate judge's performance, including: hallucination detection: TruthfulQA [Lin et al., 2022], alignment evaluation: JudgeBench [Tan et al., 2025a] and LLMBar [Zeng et al., 2024], and reasoning: BIG-Bench [Srivastava et al., 2023]. We also use multiple multi-modal datasets: MLLM-Judge [Chen et al., 2024a] and Judge Anything [Pu et al., 2025].
Dataset Splits | No | The paper mentions specific sampling strategies for some datasets, such as MLLM-Judge ("randomly sampling 1,000 entries from the 6,165 available") and TruthfulQA ("randomly select one correct and two incorrect answers"), and uses "all available instances" for LLMBar. However, it does not provide explicit training/validation/test splits (e.g., percentages or specific counts) for all datasets, as full reproduction would require, nor does it refer to standard predefined splits for all of the datasets mentioned.
Hardware Specification | Yes | For all experiments, we utilized a consistent hardware environment consisting of two NVIDIA Tesla A100 GPUs (40GB VRAM each) and two Intel Xeon 12-core CPUs operating at 3.0GHz with 256GB RAM. The system ran Ubuntu 20.04.5 LTS with CUDA 12.4.
Software Dependencies | No | The system ran Ubuntu 20.04.5 LTS with CUDA 12.4. For the closed-source model (Gemini-2.0-Flash), we use the Vertex AI platform with model gemini-2.0-flash-001 for all experiments. For open-source models (Gemma-3-4B, Qwen-2.5-7B, Qwen-2.5-VL-7B and Llama-3.1-8B), we deployed them using the vllm library. The paper explicitly lists CUDA 12.4. It mentions other software, such as vllm and the Vertex AI platform, but does not provide specific version numbers for these libraries or platforms, which the question requires for multiple key components.
Experiment Setup | Yes | All experiments maintain consistent hyperparameters unless otherwise specified, with a default sampling temperature of 1.0 to balance response diversity and coherence. Ensemble size is set to 7, and the maximum debate rounds are capped at 10. The max model length for all models was set to 16,000 tokens. The algorithm terminates when the log-likelihood improvement is less than a convergence threshold ϵ = 10^-6, or after a maximum of n = 100 iterations. The judgement accuracy modeling process halts once D_t < 0.05 for 2 consecutive rounds.
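
The two stopping criteria quoted under Experiment Setup can be illustrated with a minimal Python sketch. This is not the authors' code (which had not been released at the time of classification). The function names `has_converged` and `stopping_round` are hypothetical, and the per-round statistic D_t is taken as a precomputed input because this page does not define how it is calculated. Only the constants (10^-6, 100 iterations, 0.05, 2 consecutive rounds, 10 debate rounds) come from the quoted setup.

```python
from typing import Sequence


def has_converged(prev_ll: float, curr_ll: float, iteration: int,
                  eps: float = 1e-6, max_iter: int = 100) -> bool:
    """Hypothetical helper: stop the fitting loop when the log-likelihood
    improvement falls below eps, or once max_iter iterations are reached."""
    return (curr_ll - prev_ll) < eps or iteration >= max_iter


def stopping_round(d_values: Sequence[float], threshold: float = 0.05,
                   patience: int = 2, max_rounds: int = 10) -> int:
    """Hypothetical helper: return the 1-based debate round at which the
    process halts, given one stability statistic D_t per round.

    The debate stops as soon as D_t has stayed below `threshold` for
    `patience` consecutive rounds; otherwise it runs until the supplied
    values are exhausted or `max_rounds` is reached, whichever is first.
    """
    streak = 0       # consecutive rounds with D_t below the threshold
    last_round = 0   # last round actually examined
    for t, d in enumerate(d_values[:max_rounds], start=1):
        last_round = t
        streak = streak + 1 if d < threshold else 0
        if streak >= patience:
            return t
    return last_round
```

Under these assumptions, a single round below the threshold does not stop the debate: the streak counter resets whenever D_t rises back to 0.05 or above, so the process halts only after two such rounds in a row.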