Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
On Evaluating LLM Alignment by Evaluating LLMs as Judges
Authors: Yixin Liu, Pengfei Liu, Arman Cohan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows that our proposed benchmark, ALIGNEVAL, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs' generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs. |
| Researcher Affiliation | Academia | Yixin Liu¹, Pengfei Liu², Arman Cohan¹; ¹Yale University, ²Shanghai Jiao Tong University; EMAIL |
| Pseudocode | No | The paper describes methodologies and presents evaluation results, but it does not include any explicitly labeled pseudocode or algorithm blocks for its proposed method. It does, however, include a prompt template in Appendix B for evaluating LLMs as judges. |
| Open Source Code | Yes | ALIGNEVAL is available at https://github.com/yale-nlp/AlignEval. |
| Open Datasets | Yes | We select data sources for the instruction set (i.e., evaluation instances) I required to measure the GE-consistency: AlpacaEval [22, 9] and Arena-Hard [21], with 805 and 500 instructions, respectively. |
| Dataset Splits | No | The paper refers to instruction sets like AlpacaEval and Arena-Hard and describes filtering processes for evaluation instances. However, it does not specify explicit train/test/validation splits for its own experimental setup; rather, it evaluates LLMs on existing benchmarks or filtered subsets of these benchmarks. |
| Hardware Specification | No | The paper mentions, "We are grateful for the TPU compute support provided by the Google TRC program," but it does not provide specific details about the type of TPU, its configuration, or any other hardware specifications (e.g., GPU models, CPU types) used for the experiments. |
| Software Dependencies | No | The paper discusses various LLM models (e.g., GPT-4o, Claude-3.7-Sonnet) that are either used as oracles or are the subject of evaluation. However, it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, specific libraries) required to implement or reproduce the authors' own methodology. |
| Experiment Setup | Yes | To obtain the ranking of LLMs' generation capabilities, R(g), we apply the evaluation oracle J to evaluate the LLMs' outputs for the instruction set I. The evaluation is conducted in the manner of pairwise comparison... The prompt template used for the pairwise comparison is included in Appendix B... each output pair is evaluated twice by swapping the order of the two outputs. To derive the ranking of LLMs' evaluation capabilities, R(e), we propose to evaluate them using the evaluation result of the preference oracle as the ground-truth... We choose to use inter-annotator agreement, specifically Cohen's Kappa, as the main metric... |
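
The Experiment Setup row mentions two mechanics: each output pair is judged twice with the order swapped, and a judge's agreement with the preference oracle is measured with Cohen's Kappa. The following is a minimal sketch of both, not the authors' implementation. The `judge` callable, the `"A"`/`"B"` verdict labels, and the function names are hypothetical, and the excerpt does not say how the paper combines the two swapped verdicts, so the sketch simply keeps both.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def swapped_verdicts(judge, instruction, out_1, out_2):
    """Query a (hypothetical) judge twice with the outputs swapped.

    `judge(instruction, first, second)` is assumed to return "A" if it
    prefers `first` and "B" if it prefers `second`. Both verdicts are mapped
    back so that "A" always means `out_1` was preferred.
    """
    first = judge(instruction, out_1, out_2)
    second = judge(instruction, out_2, out_1)
    flip = {"A": "B", "B": "A"}
    return [first, flip[second]]


# Toy data (invented for illustration): oracle vs. judge verdicts on 6 pairs.
oracle = ["A", "A", "B", "B", "A", "B"]
judged = ["A", "B", "B", "B", "A", "A"]
kappa = cohens_kappa(oracle, judged)
```

On the toy data the two raters agree on 4 of 6 pairs while chance agreement is 0.5, so `kappa` comes out to one third. A judge that always favors the first position yields contradictory verdicts from `swapped_verdicts`, which is the position bias the swap is meant to expose.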