Elo Uncovered: Robustness and Best Practices in Language Model Evaluation

Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, Marzieh Fadaee

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an extensive evaluation of Elo behavior across simulated and real-world scenarios, demonstrating that individual Elo computations can exhibit significant volatility. We show that both axioms [reliability and transitivity] are not always satisfied, raising questions about the reliability of current comparative evaluations of LLMs.
Researcher Affiliation | Industry | Meriem Boubdir (Cohere For AI, meri.boubdir@gmail.com); Edward Kim (Cohere, edward@cohere.com); Beyza Ermis (Cohere For AI, beyza@cohere.com); Sara Hooker (Cohere For AI, sarahooker@cohere.com); Marzieh Fadaee (Cohere For AI, marzieh@cohere.com)
Pseudocode | No | The paper provides mathematical formulas for the Elo algorithm (Equations 1 and 2) but does not include a distinct pseudocode block or algorithm section. (A minimal sketch of the standard Elo update is given after this table.)
Open Source Code | No | The paper's NeurIPS checklist states that 'We provide thorough details on generating synthetic human feedback data... These comprehensive instructions should enable other researchers to effectively replicate our experiments and verify our results.' However, it does not provide a direct link to a code repository or explicitly state that the source code for the methodology is released.
Open Datasets | Yes | We use the LMSYS Chatbot Arena dataset [34], an open-source collection of human preference data derived from unique users' interactions with two distinct models responding to a set of user-defined prompts.
Dataset Splits | No | The paper describes generating synthetic data and sampling from real-world data (e.g., 'sampling a fixed number, Nsample, for each pair'), but it does not specify traditional training, validation, and test dataset splits as typically used for training machine learning models.
Hardware Specification | Yes | We generated completions in batches sized between 8 and 50, depending on the size of each model evaluated, using an Nvidia A100 GPU with 40GB memory for efficient computation. Inference was performed in a Bfloat16 setting to reduce memory usage, as detailed in referenced work. (An illustrative bfloat16 inference snippet is given after this table.)
Software Dependencies | No | The paper mentions a 'Bfloat16 setting' for inference but does not specify any software names with version numbers (e.g., Python, PyTorch, CUDA, or specific libraries).
Experiment Setup | Yes | Experimental Setup: To quantify the effect of match-up ordering, we generate a baseline sequence of Ngames = 1000 match outcomes... Nperms is varied from a minimum of 1 to a maximum of 10k... We extend our previous approach by conducting tests across a range of winning probabilities and multiple K-factor values (1, 8, 16, 32, 64). (A sketch of this ordering experiment is given after this table.)
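
Since the paper gives the Elo update only as Equations 1 and 2 without pseudocode, here is a minimal Python sketch of the standard Elo update those equations describe. The 400-point logistic scale and the K = 32 default are conventional choices, not values quoted from the paper.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of model A against model B (logistic curve, cf. Eq. 1)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, s_a: float, k: float = 32) -> tuple[float, float]:
    """One rating update after a single match (cf. Eq. 2).
    s_a is A's observed score: 1.0 for a win, 0.0 for a loss, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Example: a 1000-rated model beats a 1200-rated model once with K = 32.
print(elo_update(1000.0, 1200.0, 1.0))  # A gains ~24 points, B loses ~24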
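
The paper reports batched bfloat16 inference on an A100 40GB but, as noted in the Software Dependencies row, does not name its software stack. The snippet below is purely illustrative and assumes Hugging Face transformers with PyTorch; "model-name" is a hypothetical placeholder, not a model from the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "model-name"  # hypothetical placeholder for one of the evaluated models

# Load weights in bfloat16 to reduce memory use on a single 40GB A100.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batched padding
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# Batch sizes of 8 to 50 are reported in the paper; 8 is used here as an example.
prompts = ["Explain the Elo rating system in one sentence."] * 8
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)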
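
For the match-up ordering experiment quoted in the Experiment Setup row, the following self-contained sketch shows one way such a test could be reproduced: fix Ngames = 1000 simulated outcomes between two models, then recompute the final Elo rating under repeated random re-orderings of those same outcomes. The winning probability, initial rating of 1000, and use of random.shuffle are assumptions for illustration, not the authors' code.

import random

def simulate_order_sensitivity(p_win: float = 0.6, n_games: int = 1000,
                               n_perms: int = 100, k: float = 16,
                               r_init: float = 1000.0) -> list[float]:
    """Return model A's final Elo under n_perms random re-orderings of one
    fixed set of n_games outcomes against model B; the spread of the returned
    values reflects ordering-induced volatility."""
    outcomes = [1.0 if random.random() < p_win else 0.0 for _ in range(n_games)]
    finals = []
    for _ in range(n_perms):
        random.shuffle(outcomes)  # permute the match-up ordering
        r_a, r_b = r_init, r_init
        for s_a in outcomes:  # sequential Elo updates: order matters
            e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
            r_a, r_b = (r_a + k * (s_a - e_a),
                        r_b + k * ((1.0 - s_a) - (1.0 - e_a)))
        finals.append(r_a)
    return finals

# Example: the min-max spread comes purely from re-ordering the same outcomes.
ratings = simulate_order_sensitivity(p_win=0.6, k=16)
print(round(min(ratings), 1), round(max(ratings), 1))

Sweeping k over the paper's reported values (1, 8, 16, 32, 64) and varying p_win would reproduce the kind of sensitivity analysis the quoted setup describes.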