Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
Authors: Ruosen Li, Teerth Patel, Xinya Du
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on two benchmark datasets. We find that our approaches achieve higher accuracy and align better with human judgments. |
| Researcher Affiliation | Academia | Department of Computer Science, The University of Texas at Dallas EMAIL |
| Pseudocode | Yes | The detailed equivalent implementation of PR is shown in Algorithm 2 in Appendix E. For more details, please refer to Algorithm 1 in Appendix E. |
| Open Source Code | No | The paper does not contain any explicit statement or link confirming the release of their specific PRD methodology's source code. |
| Open Datasets | Yes | We select two meta-evaluation datasets, LFQA (Xu et al., 2023) and Vicuna80, with human annotations for pairwise comparisons, to measure the correlation between our evaluation methods and human judgments. |
| Dataset Splits | No | The paper describes using existing datasets (LFQA, Vicuna80, SummEval) for evaluation and how human annotations are used to determine preferences. However, it does not specify explicit training/validation/test splits for any models or experiments described, as the focus is on evaluating LLM evaluation methods rather than training new models. |
| Hardware Specification | No | For Vicuna-13b, we use the default version from Chiang et al. (2023). For all other API-based LLM models, we use specific versions of each, i.e., GPT-4-0613, GPT-3.5-turbo-0613, Claude-1, and Text-Bison@001 for GPT-4, GPT-3.5, Claude, and PaLM-2 respectively. The experiments rely on API access to these LLMs, and no specific hardware for their own computation is mentioned. |
| Software Dependencies | Yes | For all other API-based LLM models, we use specific versions of each, i.e., GPT-4-0613, GPT-3.5-turbo-0613, Claude-1, and Text-Bison@001 for GPT-4, GPT-3.5, Claude, and PaLM-2 respectively. |
| Experiment Setup | Yes | For discussions in the PD method, we set the maximum number of turns as 4. Moreover, the default temperature for all models is 0.2. |
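The pinned model versions and decoding settings quoted in the table above can be captured in a small configuration sketch. This is illustrative only: the paper links no source code, so the structure and names here are hypothetical, with only the values (version strings, temperature 0.2, 4 discussion turns) taken from the reported setup.

```python
# Hypothetical configuration sketch; values are from the paper's reported
# setup, but the structure and names are not from any released code.

# API model version pins (GPT-4, GPT-3.5, Claude, PaLM-2 respectively)
MODEL_VERSIONS = {
    "GPT-4": "GPT-4-0613",
    "GPT-3.5": "GPT-3.5-turbo-0613",
    "Claude": "Claude-1",
    "PaLM-2": "Text-Bison@001",
}

EVAL_CONFIG = {
    # Default sampling temperature for all models
    "temperature": 0.2,
    # PD method: maximum number of discussion turns
    "max_discussion_turns": 4,
}

if __name__ == "__main__":
    for family, pinned in MODEL_VERSIONS.items():
        print(f"{family}: {pinned}")
    print(EVAL_CONFIG)
```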