Prediction-Powered Ranking of Large Language Models

Authors: Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, Manuel Gomez Rodriguez

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using pairwise comparisons made by humans in the LMSYS Chatbot Arena platform and pairwise comparisons made by three strong large language models, we empirically demonstrate the effectivity of our framework and show that the rank-sets constructed using only pairwise comparisons by the strong large language models are often inconsistent with (the distribution of) human pairwise preferences.
Researcher Affiliation | Academia | Ivi Chatzi, Max Planck Institute for Software Systems, Kaiserslautern, Germany (ichatzi@mpi-sws.org); Eleni Straitouri, Max Planck Institute for Software Systems, Kaiserslautern, Germany (estraitouri@mpi-sws.org); Suhas Thejaswi, Max Planck Institute for Software Systems, Kaiserslautern, Germany (thejaswi@mpi-sws.org); Manuel Gomez Rodriguez, Max Planck Institute for Software Systems, Kaiserslautern, Germany (manuelgr@mpi-sws.org)
Pseudocode | Yes | Algorithm 1: It estimates θ̂ and Σ̂ using prediction-powered inference. (A generic estimator sketch is given after the table.)
Open Source Code | Yes | An open-source implementation of our methodology as well as the data on pairwise preferences of strong LLMs used in our experiments are available at https://github.com/Networks-Learning/prediction-powered-ranking.
Open Datasets | Yes | Our starting point is the Chatbot Arena dataset [12], which comprises 33,481 pairwise comparisons made by 13,383 humans about the responses given by 20 different LLMs to 26,968 unique queries. In what follows, we refer to each pair of responses to a query by two different LLMs and the query itself as an instance. As an initial pre-processing, we filter out any instance whose corresponding query is flagged as toxic or multiturn. Then, we gather pairwise comparisons made by three strong LLMs, namely GPT-3.5-turbo-0125 (GPT3.5), GPT-4-0125-preview (GPT4) and Claude-3-Opus-20240229 (CL3), about all the (pre-processed) instances from the Chatbot Arena dataset. (A pre-processing sketch is given after the table.)
Dataset Splits | No | To draw reliable conclusions, in each experiment, we construct rank-sets 1,000 times and, each time, we use a random set of N + n = 6,336 instances with an equal number of instances per pair of models, out of the 14,947 instances. The values of N and n vary across experiments and they define two random subsets, also with an equal number of instances per pair of models. The paper implies a train/test split but does not explicitly mention a separate validation split or its proportion.
Hardware Specification | Yes | Our experiments are executed on a compute server equipped with 2 AMD EPYC 7702 processors with 64 cores per processor and 2 TB of main memory.
Software Dependencies | Yes | Our algorithms are implemented in the Python 3.11.2 programming language using the NumPy and SciPy open-source libraries for efficient matrix operations. Further, we use the matplotlib package to facilitate visualizations of our results. (An environment check snippet is given after the table.)
Experiment Setup | Yes | To draw reliable conclusions, in each experiment, we construct rank-sets 1,000 times and, each time, we use a random set of N + n = 6,336 instances with an equal number of instances per pair of models, out of the 14,947 instances. The values of N and n vary across experiments and they define two random subsets, also with an equal number of instances per pair of models. (A sampling sketch is given after the table.)
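
The Pseudocode row refers to Algorithm 1, which computes the prediction-powered estimates θ̂ and Σ̂. The snippet below is a minimal sketch of a generic prediction-powered mean estimator in that spirit, not a reproduction of the paper's Algorithm 1; the variable names (y_human, f_human, f_llm) and the plug-in covariance are our own assumptions.

```python
import numpy as np

def ppi_estimate(y_human, f_human, f_llm):
    """Prediction-powered point estimate and plug-in covariance (sketch).

    y_human : (n, d) array of human preference indicators (small labeled set)
    f_human : (n, d) array of strong-LLM preferences on the same n instances
    f_llm   : (N, d) array of strong-LLM preferences on the large unlabeled set
    """
    n, N = len(y_human), len(f_llm)
    # Rectifier: how much the LLM judge deviates from humans, measured on
    # the labeled set.
    rectifier = f_human - y_human
    # Point estimate: LLM-based mean, debiased by the estimated rectifier mean.
    theta_hat = f_llm.mean(axis=0) - rectifier.mean(axis=0)
    # Plug-in covariance: the two sample means come from disjoint sets,
    # so their covariance contributions add.
    sigma_hat = np.cov(f_llm, rowvar=False) / N + np.cov(rectifier, rowvar=False) / n
    return theta_hat, sigma_hat
```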
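
The Open Datasets row describes filtering the Chatbot Arena instances before gathering strong-LLM comparisons. The sketch below illustrates that filtering step under assumed file and column names ("turn", "toxic", "model_a", "model_b", "winner"), which may differ from the released dataset schema.

```python
import pandas as pd

# Sketch of the pre-processing described in the Open Datasets row:
# keep only single-turn, non-toxic Chatbot Arena instances. File and column
# names are assumptions.
arena = pd.read_json("chatbot_arena_conversations.json")

mask = (arena["turn"] == 1) & (~arena["toxic"].astype(bool))
instances = arena.loc[mask, ["question_id", "model_a", "model_b", "winner"]]

print(f"{len(instances)} instances remain after filtering")
```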
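
The Software Dependencies row lists Python 3.11.2, NumPy, SciPy, and matplotlib. The snippet below is a simple local sanity check of that stack; it prints whatever versions are installed, not necessarily the exact versions the authors used.

```python
# Check that the software stack quoted in the Software Dependencies row
# is importable and report the locally installed versions.
import sys
import numpy
import scipy
import matplotlib

print("Python    ", sys.version.split()[0])
print("NumPy     ", numpy.__version__)
print("SciPy     ", scipy.__version__)
print("matplotlib", matplotlib.__version__)
```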
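
The Dataset Splits and Experiment Setup rows describe drawing, in each of 1,000 repetitions, a random set of N + n = 6,336 instances with an equal number of instances per pair of models, split into a human-labeled subset of size n and an LLM-only subset of size N. The sketch below illustrates one such repetition; the helper name, the ids_by_pair structure, and the placeholder value of n are assumptions, since the paper varies N and n across experiments.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_one_repetition(ids_by_pair, total=6336, n_labeled=1000):
    """One repetition of the sampling scheme (sketch).

    ids_by_pair maps each pair of models to the instance ids available for
    that pair; each pair is assumed to have enough instances. `total` is
    N + n, and `n_labeled` is a placeholder for n.
    """
    per_pair = total // len(ids_by_pair)            # equal instances per pair
    labeled_per_pair = n_labeled // len(ids_by_pair)

    labeled, unlabeled = [], []
    for pair, ids in ids_by_pair.items():
        chosen = rng.choice(np.asarray(ids), size=per_pair, replace=False)
        labeled.extend(chosen[:labeled_per_pair])    # n: human-labeled subset
        unlabeled.extend(chosen[labeled_per_pair:])  # N: LLM-only subset
    return labeled, unlabeled

# Each of the 1,000 repetitions would redraw a split like this:
# labeled, unlabeled = sample_one_repetition(ids_by_pair)
```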